LU-9908: conf-sanity test_41b: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.10.1
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c07d5048-8871-11e7-b93b-5254006e85c2.

      The sub-test test_41b failed with the following error:

      test failed to respond and timed out
      

      test hangs & fails during client umount of lustre. can't find root cause(s). have looked for OOPs or Panics with stack traces and can't find any.

      A search of test history shows several similar failures on SLES12 SP2 recently.

      Info required for matching: conf-sanity 41b
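
      For anyone trying to reproduce this outside of autotest, the sub-test can be run on its own through the usual test-framework entry points. A minimal sketch, assuming an already-configured test cluster (the install path and log directory are placeholders, not taken from this report):

          # run only conf-sanity test_41b
          cd /usr/lib64/lustre/tests        # or lustre/tests in a source tree
          ONLY=41b bash conf-sanity.sh

          # or via the auster wrapper, keeping logs for later triage
          ./auster -v -d /tmp/test_logs conf-sanity --only 41b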

    Activity
            sarah Sarah Liu added a comment -

            Another one on the b2_10 branch in 2.10.1 RC1 testing with a SLES12 SP2 client:
            https://testing.hpdd.intel.com/test_sets/eaf41ec6-9c1f-11e7-b778-5254006e85c2


            gerrit Gerrit Updater added a comment -

            Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/29108
            Subject: LU-9908 tests: force umount client in test_70e & 41b
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 8d3a564a82a2cdef304638b339886f2de991bdca

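            For context, a change like this normally lands in the cleanup path of the affected sub-tests. A rough sketch of what "force umount client" means in test-framework terms, assuming the stock umount_client() helper accepts a force flag (the exact hunk should be read from the review itself):

                # end of test_41b / test_70e (sketch, not the actual patch hunk)
                # previously: umount_client $MOUNT || error "unable to umount client"
                umount_client $MOUNT -f || error "unable to umount client"
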
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/d2071558-9ca6-11e7-b778-5254006e85c2

            bogl Bob Glossman (Inactive) added a comment -

            The patch https://review.whamcloud.com/28767 changes (fixes?) test 70e, but does nothing for the similar failures seen in test 41b.

            Here's another seen on b2_10 in 41b:
            https://testing.hpdd.intel.com/test_sets/dc983726-98dd-11e7-ba20-5254006e85c2

            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/8ec26fc2-935a-11e7-b74a-5254006e85c2

            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/28767
            Subject: LU-9908 tests: force umount client in test_70e
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 77ef8e40b0457fac025475b99fb898dd5904a2fd

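            For anyone who wants to try one of these patches on a local tree before it lands, the standard Gerrit change refs apply (change 28767, patch set 1, as listed above; the b2_10 change 29108 follows the same pattern):

                # fetch patch set 1 of change 28767 and apply it to the current branch
                git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/67/28767/1
                git cherry-pick FETCH_HEAD
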
            bogl Bob Glossman (Inactive) added a comment - this fail seems to be reproducing. Here's another one: https://testing.hpdd.intel.com/test_sets/515eed38-8c23-11e7-b94a-5254006e85c2
            ys Yang Sheng added a comment -

            Looks like a client hang:

            19:32:01:[15548.934823] Leftover inexact backtrace:
            19:32:01:[15548.934823] 
            19:32:01:[15548.934826] umount          S 0000000000000000     0 22377  22376 0x00000000
            19:32:01:[15548.934827]  ffff88007ae87a78 ffff8800641a1300 ffff88007c1b5800 ffff88007ae88000
            19:32:01:[15548.934828]  ffff88007ae87ab0 00000001003a2c78 ffff88007fc0e040 0000000000000000
            19:32:01:[15548.934829]  ffff88007ae87a90 ffffffff815e4c45 ffff88007fc0e040 ffff88007ae87b38
            19:32:01:[15548.934830] Call Trace:
            19:32:01:[15548.934832]  [<ffffffff815e4c45>] schedule+0x35/0x80
            19:32:01:[15548.934833]  [<ffffffff815e74d3>] schedule_timeout+0x163/0x2d0
            19:32:01:[15548.934857]  [<ffffffffa0994c7b>] ptlrpc_set_wait+0x1cb/0x850 [ptlrpc]
            19:32:01:[15548.934881]  [<ffffffffa0995378>] ptlrpc_queue_wait+0x78/0x210 [ptlrpc]
            19:32:01:[15548.934889]  [<ffffffffa0ac851b>] mdc_statfs+0xab/0x2e0 [mdc]
            19:32:01:[15548.934898]  [<ffffffffa092a1ce>] lmv_statfs+0x26e/0xa30 [lmv]
            19:32:01:[15548.934917]  [<ffffffffa0c3bbeb>] ll_statfs_internal+0xeb/0xe00 [lustre]
            19:32:01:[15548.934929]  [<ffffffffa0c3c97b>] ll_statfs+0x7b/0x160 [lustre]
            19:32:01:[15548.934932]  [<ffffffff8122dc13>] statfs_by_dentry+0x93/0x110
            19:32:01:[15548.934935]  [<ffffffff8122dca6>] vfs_statfs+0x16/0xb0
            19:32:01:[15548.934937]  [<ffffffff8122dd80>] user_statfs+0x40/0x70
            19:32:01:[15548.934939]  [<ffffffff8122ddc0>] SYSC_statfs+0x10/0x30
            19:32:01:[15548.934941]  [<ffffffff815e872e>] entry_SYSCALL_64_fastpath+0x12/0x6d
            19:32:01:[15548.936314] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x6d
            19:32:01:[15548.936314] 
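
            When a node is still reachable in this state, a bit more context on the hung umount can usually be collected before autotest gives up and kills the session. A minimal sketch using standard facilities (the PID is the one from the trace above; sysrq availability and the lctl parameter names are assumptions about the test node):

                # stack of the stuck umount task (PID taken from the console log)
                cat /proc/22377/stack

                # dump all task states to the console for a fuller picture
                echo t > /proc/sysrq-trigger

                # Lustre-side view of the MDC import the statfs RPC is waiting on
                lctl get_param mdc.*.import mdc.*.state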
            
            bogl Bob Glossman (Inactive) added a comment (edited) -

            Failures like this are happening in more places than test 41b.
            Here's a similar-looking failure in test 70e:
            https://testing.hpdd.intel.com/test_sets/b3eda294-8905-11e7-b45f-5254006e85c2

            Once again it hangs during a client umount.
            I can't find any panic or oops on any node.
            Autotest times out an hour later and kills everything.

            Since conf-sanity on SLES12 SP2 is run so rarely, this failure may have been lurking for quite a long time.


            bogl Bob Glossman (Inactive) added a comment -

            This failure didn't reproduce on retest, so it's not a 100% failure.
            It may still fail at a high rate on SLES12, though.


    People

      Assignee: Yang Sheng
      Reporter: Maloo
      Votes: 0
      Watchers: 6
