Lustre / LU-2181

failure conf-sanity test_23a: umount -f client hung in stat() when MDS down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 5219

    Description

      This issue was created by maloo for yujian <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/140674a6-16b2-11e2-962d-52540035b04c.

      Lustre Tag: v2_3_0_RC3
      Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
      Distro/Arch: RHEL6.3/x86_64(server), FC15/x86_64(client)
      Network: TCP
      ENABLE_QUOTA=yes

      The sub-test test_23a hung while unmounting the client:

      == conf-sanity test 23a: interrupt client during recovery mount delay ================================ 02:41:31 (1350294091)
      start mds service on fat-amd-2
      Starting mds1:   /dev/sdc5 /mnt/mds1
      Started lustre-MDT0000
      start ost1 service on fat-amd-3
      Starting ost1:   /dev/sdc5 /mnt/ost1
      Started lustre-OST0000
      mount lustre on /mnt/lustre.....
      Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
      Stopping /mnt/mds1 (opts:) on fat-amd-2
      Stopping client /mnt/lustre (opts: -f)
      

      Stack trace on client:

      [ 5526.947537] umount          S ffff880316bb3170     0  7395   7009 0x00000080
      [ 5526.954596]  ffff8803136e57c8 0000000000000082 00000001004fdeea ffff88030af44560
      [ 5526.962037]  ffff8803136e5fd8 ffff8803136e5fd8 0000000000013840 0000000000013840
      [ 5526.969479]  ffff880323191720 ffff88030af44560 0000000000000000 0000000000000286
      [ 5526.976921] Call Trace:
      [ 5526.979396]  [<ffffffffa054a570>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
      [ 5526.986517]  [<ffffffff8147461a>] schedule_timeout+0xa7/0xde
      [ 5526.992168]  [<ffffffff81060b58>] ? process_timeout+0x0/0x10
      [ 5526.997829]  [<ffffffffa02ae761>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [ 5527.004550]  [<ffffffffa0555a9c>] ptlrpc_set_wait+0x2ec/0x8c0 [ptlrpc]
      [ 5527.011066]  [<ffffffff8104df76>] ? default_wake_function+0x0/0x14
      [ 5527.017270]  [<ffffffffa05560e8>] ptlrpc_queue_wait+0x78/0x230 [ptlrpc]
      [ 5527.023900]  [<ffffffffa05386c5>] ldlm_cli_enqueue+0x2f5/0x7b0 [ptlrpc]
      [ 5527.030528]  [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
      [ 5527.037408]  [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
      [ 5527.044186]  [<ffffffffa0744e55>] mdc_enqueue+0x505/0x1590 [mdc]
      [ 5527.050196]  [<ffffffffa02b9578>] ? libcfs_log_return+0x28/0x40 [libcfs]
      [ 5527.056885]  [<ffffffffa074609e>] ? mdc_revalidate_lock+0x1be/0x1d0 [mdc]
      [ 5527.063661]  [<ffffffffa0746270>] mdc_intent_lock+0x1c0/0x5c0 [mdc]
      [ 5527.069932]  [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
      [ 5527.076734]  [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
      [ 5527.083601]  [<ffffffffa09eed8b>] lmv_intent_lookup+0x3bb/0x11c0 [lmv]
      [ 5527.090136]  [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
      [ 5527.096913]  [<ffffffffa09f12f0>] lmv_intent_lock+0x310/0x370 [lmv]
      [ 5527.103190]  [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
      [ 5527.109982]  [<ffffffffa08e0944>] __ll_inode_revalidate_it+0x214/0xd90 [lustre]
      [ 5527.117295]  [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
      [ 5527.124084]  [<ffffffffa08e1764>] ll_inode_revalidate_it+0x44/0x1a0 [lustre]
      [ 5527.131136]  [<ffffffffa08e1903>] ll_getattr_it+0x43/0x170 [lustre]
      [ 5527.137408]  [<ffffffffa08e1a64>] ll_getattr+0x34/0x40 [lustre]
      [ 5527.143317]  [<ffffffff81125113>] vfs_getattr+0x45/0x63
      [ 5527.148535]  [<ffffffff8112517e>] vfs_fstatat+0x4d/0x63
      [ 5527.153751]  [<ffffffff811251cf>] vfs_stat+0x1b/0x1d
      [ 5527.158709]  [<ffffffff811252ce>] sys_newstat+0x1a/0x33
      [ 5527.163927]  [<ffffffff81129f89>] ? path_put+0x1f/0x23
      [ 5527.169059]  [<ffffffff8109fa08>] ? audit_syscall_entry+0x145/0x171
      [ 5527.175315]  [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b
      

      Info required for matching: conf-sanity 23a

          Activity

            adilger Andreas Dilger added a comment - Closing bug, since the test exception is only enforced for SLES11SP2, and will naturally expire.

            bogl Bob Glossman (Inactive) added a comment - Andreas, the conf-sanity.sh exception as it is now only skips the test on SP2. It runs fine on SP3 and always has. That being the case, there is no harm in leaving the exception in there forever. However, it could indeed be removed, as we no longer build or test SP2 in master.

            adilger Andreas Dilger added a comment - Bob, any idea if this update has made it into SP2 or SP3 (whatever we are currently testing master on)? Then we could remove the exception from conf-sanity.sh.

            bogl Bob Glossman (Inactive) added a comment - Thanks for the info, Jay. I suspect the important part of that patch is the util-linux rpm. When that patch becomes part of a regular sles11 sp2 update and no longer needs to be specially downloaded and applied, we can probably go ahead and take out the test skips added to work around the problem. We've been waiting for that to happen.

            jaylan Jay Lan (Inactive) added a comment - SUSE pointed me to this patch:
            http://download.novell.com/Download?buildid=G4nSHdRyeOI~

            The patch consists of 7 rpms:
            libblkid1-2.19.1-6.33.35.1.x86_64.rpm
            libblkid1-32bit-2.19.1-6.33.35.1.x86_64.rpm
            libuuid1-2.19.1-6.33.35.1.x86_64.rpm
            libuuid1-32bit-2.19.1-6.33.35.1.x86_64.rpm
            util-linux-2.19.1-6.33.35.1.x86_64.rpm
            util-linux-lang-2.19.1-6.33.35.1.x86_64.rpm
            uuid-runtime-2.19.1-6.33.35.1.x86_64.rpm

            I am not sure I need all of them, but I installed them anyway. With this rpm set installed, the test passed!
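
The package list above suggests one way to tell whether a node already carries the updated umount. The following is a minimal sketch, assuming that util-linux 2.19.1-6.33.35.1 (taken from the rpm names above) is the first fixed build; the ticket only confirms that the test passed with it. The helper names `version_ge` and `util_linux_has_fix` are hypothetical, and GNU `sort -V` only approximates rpm version ordering.

```shell
# Assumed first fixed build, copied from the rpm names in the comment above.
FIXED_UTIL_LINUX="2.19.1-6.33.35.1"

# Return 0 if version $1 is greater than or equal to version $2.
# Uses GNU sort -V; the smaller of the two versions sorts first.
version_ge() {
	[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Return 0 if the installed util-linux is at least the assumed fixed build.
# Returns 1 if rpm is missing or the package is not installed.
util_linux_has_fix() {
	local installed
	installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' util-linux 2>/dev/null) || return 1
	version_ge "$installed" "$FIXED_UTIL_LINUX"
}
```

A test script could call `util_linux_has_fix` before deciding whether the SLES11 SP2 skips are still needed.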

            bogl Bob Glossman (Inactive) added a comment - Added subtest 45 to the sles11 sp2 skip list:
            http://review.whamcloud.com/4884

            bogl Bob Glossman (Inactive) added a comment - I suspect Sarah hit this problem due to running tests with SLOW=yes. By default SLOW=no and test 45 gets skipped. This probably needs to be fixed by adding 45 to the sles11 always-skip list at the top of conf-sanity.sh.
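
The always-skip idea described here can be sketched as follows. This is an illustration only, not the actual conf-sanity.sh code: the function names `is_sles11_sp2` and `build_always_except` are hypothetical, the release file layout (`VERSION = 11`, `PATCHLEVEL = 2` in /etc/SuSE-release) is an assumption, and the subtest list "23a 45" is simply the two subtests named in this ticket.

```shell
# Return 0 if the given release file describes SLES 11 SP2.
# The file path is a parameter so the check can be tried on a sample file.
is_sles11_sp2() {
	local release_file=${1:-/etc/SuSE-release}
	[ -r "$release_file" ] || return 1
	local vers patchlev
	vers=$(awk '/^VERSION/ {print $3}' "$release_file")
	patchlev=$(awk '/^PATCHLEVEL/ {print $3}' "$release_file")
	[ "$vers" = "11" ] && [ "$patchlev" = "2" ]
}

# Print the subtests to skip unconditionally, independent of SLOW.
build_always_except() {
	local release_file=$1
	local except=""
	if is_sles11_sp2 "$release_file"; then
		# LU-2181: umount -f can hang in stat() on SLES11 SP2
		except="23a 45"
	fi
	echo "$except"
}
```

Because the list is built without consulting SLOW, a subtest such as 45 stays skipped on SP2 even when a run sets SLOW=yes.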

            sarah Sarah Liu added a comment - conf-sanity test_45 also hit this error on sles11 sp2 client:
            https://maloo.whamcloud.com/test_sets/3ed08c6c-46dc-11e2-b16f-52540035b04c

            adilger Andreas Dilger added a comment - This bug needs to stay open for tracking until SLES11 has the fix to umount to remove the stat() call when -f is given.
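
While an unfixed umount is still in use, a test harness can at least bound how long a forced unmount is allowed to block. The sketch below is an assumption-laden illustration, not part of the Lustre test framework: `run_with_limit` is a hypothetical helper, it relies on GNU coreutils `timeout` (which exits with status 124 when the limit is reached), and it only helps if the blocked process still responds to signals.

```shell
# Run a command with a time limit in seconds.
# Returns the command's exit status, or 124 if the limit was reached.
run_with_limit() {
	local limit=$1
	shift
	timeout "$limit" "$@"
	local rc=$?
	if [ "$rc" -eq 124 ]; then
		echo "timed out after ${limit}s: $*" >&2
	fi
	return "$rc"
}
```

For example, `run_with_limit 60 umount -f /mnt/lustre` would report a timeout instead of hanging the whole test run; the 60-second limit is an arbitrary illustrative value.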

            bogl Bob Glossman (Inactive) added a comment - Patch to disable the tests I know about for sles11 sp2:
            http://review.whamcloud.com/4639

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
