Lustre / LU-3142

recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor

Details


    Description

      While running recovery-mds-scale test_failover_mds, the dd operation failed on one of the client nodes as follows:

      2013-04-08 22:25:26: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + cd /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + FREE_SPACE=12963076
      + BLKS=2916692
      + echo 'Free disk space is 12963076, 4k blocks to dd is 2916692'
      + load_pid=8739
      + wait 8739
      + dd bs=4k count=2916692 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file
      dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
      295176+0 records in
      295175+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2013-04-08 22:27:28: dd failed'
      + echo '2013-04-08 22:27:28: dd failed'
      2013-04-08 22:27:28: dd failed
      

      Maloo report: https://maloo.whamcloud.com/test_sets/68bce4aa-a1bb-11e2-bdac-52540035b04c

      Attachments

        Activity

          pjones Peter Jones added a comment -

          James, could you please open a new ticket to track this bug? It sounds like what you are describing is a regression introduced into this work (which was included in 2.4.0) by the work from LU-2193 (which landed after 2.4.0), so it would be easier to create a new ticket and link it to the two related tickets. It can get really confusing trying to work out the state of a given bug when its commits span release boundaries.


          simmonsja James A Simmons added a comment -

          Created a patch to fix this issue:

          http://review.whamcloud.com/#change,6566
          simmonsja James A Simmons added a comment (edited) -

          I found it. Patch http://review.whamcloud.com/#change,4501 removed the fid_build_from_res_name function. Creating patch and testing...


          simmonsja James A Simmons added a comment -

          I don't know how it got past your build system, but patch http://review.whamcloud.com/#change,6102 is missing the
          function fid_build_from_res_name:

          lustre-2.4.0/lustre/mdt/mdt_handler.c: In function ‘mdt_intent_layout’:
          lustre-2.4.0/lustre/mdt/mdt_handler.c:3754: error: implicit declaration of function ‘fid_build_from_res_name’

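          For reference, based on its name and its call site in mdt_intent_layout(), the missing helper rebuilds a file FID from the LDLM lock resource name it was packed into (likely the inverse of the fid_build_reg_res_name() packing helper). Below is a minimal standalone sketch of that round trip; the struct layouts, word offsets, and function names here are simplified assumptions for illustration, not the actual Lustre definitions.

          /* Sketch only: simplified stand-ins for the Lustre FID and LDLM
           * resource-name types, plus pack/unpack helpers showing what a
           * fid_build_from_res_name()-style function does. */
          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          struct lu_fid {              /* simplified FID: sequence, object id, version */
                  uint64_t f_seq;
                  uint32_t f_oid;
                  uint32_t f_ver;
          };

          struct ldlm_res_id {         /* simplified resource name: four 64-bit words */
                  uint64_t name[4];
          };

          #define RES_ID_SEQ_OFF     0 /* assumed word holding the FID sequence */
          #define RES_ID_VER_OID_OFF 1 /* assumed word holding (version << 32) | oid */

          /* Pack a FID into a resource name (the forward direction). */
          static void fid_to_res_name(const struct lu_fid *fid, struct ldlm_res_id *res)
          {
                  memset(res, 0, sizeof(*res));
                  res->name[RES_ID_SEQ_OFF]     = fid->f_seq;
                  res->name[RES_ID_VER_OID_OFF] = ((uint64_t)fid->f_ver << 32) | fid->f_oid;
          }

          /* The inverse: rebuild the FID from the resource name, which is what
           * an intent handler needs when it only has the lock resource. */
          static void fid_from_res_name(struct lu_fid *fid, const struct ldlm_res_id *res)
          {
                  fid->f_seq = res->name[RES_ID_SEQ_OFF];
                  fid->f_oid = (uint32_t)res->name[RES_ID_VER_OID_OFF];
                  fid->f_ver = (uint32_t)(res->name[RES_ID_VER_OID_OFF] >> 32);
          }

          int main(void)
          {
                  struct lu_fid fid = { .f_seq = 0x200000401ULL, .f_oid = 7, .f_ver = 0 };
                  struct ldlm_res_id res;
                  struct lu_fid out;

                  fid_to_res_name(&fid, &res);
                  fid_from_res_name(&out, &res);
                  printf("seq=%#llx oid=%u ver=%u\n",
                         (unsigned long long)out.f_seq, out.f_oid, out.f_ver);
                  return 0;
          }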
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.4.0
          Resolution New: Fixed
          Status Original: Open New: Resolved
          pjones Peter Jones added a comment -

          Efficiency of the solution will be improved under LU-2818. The current solution is sufficient for 2.4.


          adilger Andreas Dilger added a comment -

          Not closing this bug until the redundant getxattr call has been cleaned up per inspection comments.
          adilger Andreas Dilger made changes -
          Labels Original: MB New: LB
          yujian Jian Yu added a comment -

          A patch for the master branch to gather logs from the passive server nodes in a failover configuration: http://review.whamcloud.com/6112


          hongchao.zhang Hongchao Zhang added a comment -

          This bug should be caused by the wrong "MD" size on the MDT:

          00000004:00000001:21.0:1366291257.871077:0:4881:0:(mdd_object.c:268:mdd_xattr_get()) Process entered
          00000004:00000001:21.0:1366291257.871080:0:4881:0:(lod_object.c:373:lod_xattr_get()) Process entered
          00000004:00000001:21.0:1366291257.871085:0:4881:0:(lod_object.c:377:lod_xattr_get()) Process leaving (rc=176 : 176 : b0)
          00000004:00000001:21.0:1366291257.871087:0:4881:0:(mdd_object.c:281:mdd_xattr_get()) Process leaving (rc=176 : 176 : b0)
          00000004:00020000:21.0:1366291257.871089:0:4881:0:(mdt_lvb.c:158:mdt_lvbo_fill()) lustre-MDT0000: expected 176 actual 128.
          00000004:00000001:21.0:1366291257.880256:0:4881:0:(mdt_lvb.c:159:mdt_lvbo_fill()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)

          1. The default value of mdt_device->mdt_max_mdsize is 128 bytes (see the size check sketched below):
             #define MAX_MD_SIZE (sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data))

          2. Before failover, the MD size grew to 176 bytes = sizeof(struct lov_mds_md) + 6 * sizeof(struct lov_ost_data),
             and mdt_device->mdt_max_mdsize was updated accordingly (see mdt_attr_get_lov, which updates mdt_max_mdsize while handling "getattr" requests).

          3. After failover, the new MDT doesn't know the actual mdt_max_mdsize and still uses the default value. When the client then calls ll_layout_refresh
             to fetch the MD, the MDT fails with -ERANGE because no "getattr" request has yet updated mdt_device->mdt_max_mdsize.
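          To make the 128-vs-176 arithmetic above concrete, here is a small standalone size check. The struct definitions are simplified assumptions mirroring the LOV EA wire layout (a 32-byte header plus 24 bytes per stripe object), not copied from the Lustre headers.

          /* Sketch only: assumed layouts reproducing the sizes seen in the
           * mdt_lvbo_fill() log above (expected 176, default 128). */
          #include <stdint.h>
          #include <stdio.h>

          struct lov_ost_data_v1 {     /* per-stripe object descriptor: 24 bytes */
                  uint64_t l_object_id;
                  uint64_t l_object_seq;
                  uint32_t l_ost_gen;
                  uint32_t l_ost_idx;
          };

          struct lov_mds_md_v1 {       /* layout (LOV EA) header: 32 bytes */
                  uint32_t lmm_magic;
                  uint32_t lmm_pattern;
                  uint64_t lmm_object_id;
                  uint64_t lmm_object_seq;
                  uint32_t lmm_stripe_size;
                  uint16_t lmm_stripe_count;
                  uint16_t lmm_layout_gen;
                  struct lov_ost_data_v1 lmm_objects[]; /* one entry per stripe */
          };

          /* Assumed equivalent of the MAX_MD_SIZE default quoted in item 1. */
          #define MAX_MD_SIZE_DEFAULT \
                  (sizeof(struct lov_mds_md_v1) + 4 * sizeof(struct lov_ost_data_v1))

          int main(void)
          {
                  size_t six_stripes = sizeof(struct lov_mds_md_v1) +
                                       6 * sizeof(struct lov_ost_data_v1);

                  printf("default mdt_max_mdsize = %zu\n", MAX_MD_SIZE_DEFAULT); /* 128 */
                  printf("6-stripe layout size   = %zu\n", six_stripes);         /* 176 */
                  return 0;
          }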

          the patch is tracked at http://review.whamcloud.com/#change,6102


          People

            hongchao.zhang Hongchao Zhang
            yujian Jian Yu
            Votes: 0
            Watchers: 7
