Lustre / LU-3142

recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor

Details


    Description

      While running recovery-mds-scale test_failover_mds, the dd operation failed on one of the client nodes as follows:

      2013-04-08 22:25:26: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + cd /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + FREE_SPACE=12963076
      + BLKS=2916692
      + echo 'Free disk space is 12963076, 4k blocks to dd is 2916692'
      + load_pid=8739
      + wait 8739
      + dd bs=4k count=2916692 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file
      dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
      295176+0 records in
      295175+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2013-04-08 22:27:28: dd failed'
      + echo '2013-04-08 22:27:28: dd failed'
      2013-04-08 22:27:28: dd failed
      

      Maloo report: https://maloo.whamcloud.com/test_sets/68bce4aa-a1bb-11e2-bdac-52540035b04c
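
      For manual reproduction outside autotest, the load can be recreated from the trace above. A minimal sketch follows; the directory name is taken from this report, while the free-space arithmetic (90% of available space in 4k blocks) and the awk field used to parse the lfs df summary line are assumptions reconstructed from the numbers in the trace, not from the test script itself:

      # Hedged reconstruction of the dd load shown in the trace above
      TESTDIR=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      mkdir -p $TESTDIR
      /usr/bin/lfs setstripe -c -1 $TESTDIR            # stripe the file over all OSTs
      cd $TESTDIR
      # Available KB from the lfs df summary line (field position is an assumption)
      FREE_SPACE=$(/usr/bin/lfs df $TESTDIR | awk '/summary/ {print $4}')
      BLKS=$((FREE_SPACE * 9 / 10 / 4))                # ~90% of free space, in 4k blocks
      dd bs=4k count=$BLKS status=noxfer if=/dev/zero of=$TESTDIR/dd-file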

      Attachments

        Activity

          [LU-3142] recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
          yujian Jian Yu added a comment -

          it could be related to http://review.whamcloud.com/5820

          Hi Hongchao, build http://build.whamcloud.com/job/lustre-master/1381/ does not contain the above patch.

          yujian Jian Yu added a comment -

          Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build it might have been caused by commit http://review.whamcloud.com/5820.

          I did not find -EBADF(-9) or -EBADFD(-77) in the console logs. Due to TT-1107, the console logs were not gathered completely in the Maloo report. Please refer to the attached tarball. I'm sure this was build http://build.whamcloud.com/job/lustre-master/1381/.

          The debug patch in http://review.whamcloud.com/#change,6013 has been waiting for test resources for 3 days. I'll have to start a manual test run to reproduce this issue.


          adilger Andreas Dilger added a comment -

          Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build, it might have been caused by commit http://review.whamcloud.com/5820.
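
          For context on the question above: EBADF (9) and EBADFD (77) carry different message strings on Linux, and dd's "Bad file descriptor" text corresponds to EBADF. A quick, hedged way to check the strings locally (assumes a Python interpreter on the node; illustration only, not taken from this report's logs):

          python -c 'import os; print(os.strerror(9))'      # Bad file descriptor           (EBADF)
          python -c 'import os; print(os.strerror(77))'     # File descriptor in bad state  (EBADFD)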

          hongchao.zhang Hongchao Zhang added a comment -

          The logs on the MDS don't contain any valid info about Lustre.

          The error "Bad file descriptor" (-EBADFD) is not a common error: there is only one place in Lustre that returns it (in ll_statahead_interpret), and in the Linux kernel it appears only in the following modules:

          drivers/: isdn, net, macintosh, ieee1394, atm, media, usb
          fs/: jffs2, ncpfs
          net/: iucv, atm, 9p, bluetooth
          sound/: core, drivers, usb

          So this error could come from driver modules, or be triggered from user space.

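          The module list in the comment above can be cross-checked with a plain source search. A hedged sketch (the checkout paths are assumptions, not from this report):

          # Where EBADFD appears; run from the top of the respective source trees
          grep -rl EBADFD lustre/                       # expect only the file containing ll_statahead_interpret
          grep -rl EBADFD drivers/ fs/ net/ sound/      # kernel subsystems, matching the list above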
          yujian Jian Yu added a comment -

          The recovery-*-scale tests on the master branch had been blocked by LU-2008. After that issue was fixed 2 days ago, autotest started running the hard failover tests. I submitted http://review.whamcloud.com/6013 to reproduce the issue.

          pjones Peter Jones added a comment -

          Hongchao

          Could you please look into this one?

          Thanks

          Peter


          adilger Andreas Dilger added a comment -

          Are we able to pass any MDS failovers, or do they fail 100% of the time? It appears that this test failed immediately on the first MDS failover, but we don't have any useful logs from the MDS, so it is difficult to know why the OSTs were evicted.

          People

            hongchao.zhang Hongchao Zhang
            yujian Jian Yu
            Votes: 0
            Watchers: 7
