[LU-9908] conf-sanity test_41b: test failed to respond and timed out
Created: 24/Aug/17  Updated: 24/Oct/17  Resolved: 16/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Yang Sheng
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Blocker
is blocking LU-9469 conf-sanity test_61: test failed to r... Resolved
Severity: 3

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c07d5048-8871-11e7-b93b-5254006e85c2.

The sub-test test_41b failed with the following error:

test failed to respond and timed out

The test hangs and fails during client umount of Lustre. I can't find the root cause(s): I looked for oopses or panics with stack traces and couldn't find any.

A history search shows several similar failures on sles12sp2 recently.

Info required for matching: conf-sanity 41b



 Comments   
Comment by Bob Glossman (Inactive) [ 24/Aug/17 ]

This failure didn't reproduce on retest, so it doesn't fail 100% of the time.
It may still fail at a high rate on sles12, though.

Comment by Bob Glossman (Inactive) [ 24/Aug/17 ]

Failures like this are happening in more places than test 41b.
Here's a similar-looking failure in test 70e:
https://testing.hpdd.intel.com/test_sets/b3eda294-8905-11e7-b45f-5254006e85c2

Once again it hangs during a client umount.
I can't find any panic or oops on any node.
Autotest times out an hour later and kills the run.

Since conf-sanity is tested so little on sles12sp2, this failure may have been lurking for quite a long time.

Comment by Yang Sheng [ 25/Aug/17 ]

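Looks like a client hang. The umount task is sleeping in the statfs path, waiting on an RPC (ptlrpc_set_wait called from mdc_statfs):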

19:32:01:[15548.934823] Leftover inexact backtrace:
19:32:01:[15548.934823] 
19:32:01:[15548.934826] umount          S 0000000000000000     0 22377  22376 0x00000000
19:32:01:[15548.934827]  ffff88007ae87a78 ffff8800641a1300 ffff88007c1b5800 ffff88007ae88000
19:32:01:[15548.934828]  ffff88007ae87ab0 00000001003a2c78 ffff88007fc0e040 0000000000000000
19:32:01:[15548.934829]  ffff88007ae87a90 ffffffff815e4c45 ffff88007fc0e040 ffff88007ae87b38
19:32:01:[15548.934830] Call Trace:
19:32:01:[15548.934832]  [<ffffffff815e4c45>] schedule+0x35/0x80
19:32:01:[15548.934833]  [<ffffffff815e74d3>] schedule_timeout+0x163/0x2d0
19:32:01:[15548.934857]  [<ffffffffa0994c7b>] ptlrpc_set_wait+0x1cb/0x850 [ptlrpc]
19:32:01:[15548.934881]  [<ffffffffa0995378>] ptlrpc_queue_wait+0x78/0x210 [ptlrpc]
19:32:01:[15548.934889]  [<ffffffffa0ac851b>] mdc_statfs+0xab/0x2e0 [mdc]
19:32:01:[15548.934898]  [<ffffffffa092a1ce>] lmv_statfs+0x26e/0xa30 [lmv]
19:32:01:[15548.934917]  [<ffffffffa0c3bbeb>] ll_statfs_internal+0xeb/0xe00 [lustre]
19:32:01:[15548.934929]  [<ffffffffa0c3c97b>] ll_statfs+0x7b/0x160 [lustre]
19:32:01:[15548.934932]  [<ffffffff8122dc13>] statfs_by_dentry+0x93/0x110
19:32:01:[15548.934935]  [<ffffffff8122dca6>] vfs_statfs+0x16/0xb0
19:32:01:[15548.934937]  [<ffffffff8122dd80>] user_statfs+0x40/0x70
19:32:01:[15548.934939]  [<ffffffff8122ddc0>] SYSC_statfs+0x10/0x30
19:32:01:[15548.934941]  [<ffffffff815e872e>] entry_SYSCALL_64_fastpath+0x12/0x6d
19:32:01:[15548.936314] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x6d
19:32:01:[15548.936314] 
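
When the same hang has to be found across many console logs, a trace like the one above can be summarized mechanically. The following is a minimal Python sketch, not part of the Lustre test framework: the names FRAME_RE, parse_frames, and first_module_frame are hypothetical, and the regular expression assumes frames are printed in the `[<address>] symbol+0xoff/0xlen [module]` form seen in this log.

```python
import re

# Matches frames such as:
#   [<ffffffffa0994c7b>] ptlrpc_set_wait+0x1cb/0x850 [ptlrpc]
# The trailing "[module]" is optional; core-kernel symbols have none.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\][ \t]+"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module) pairs in log order; module is None for core-kernel symbols."""
    return [(m.group("func"), m.group("module")) for m in FRAME_RE.finditer(log_text)]


def first_module_frame(frames):
    """Return the first frame that belongs to a loadable module, or None."""
    for func, module in frames:
        if module is not None:
            return func, module
    return None


# Excerpt copied from the console log in this ticket.
SAMPLE = """\
19:32:01:[15548.934830] Call Trace:
19:32:01:[15548.934832]  [<ffffffff815e4c45>] schedule+0x35/0x80
19:32:01:[15548.934833]  [<ffffffff815e74d3>] schedule_timeout+0x163/0x2d0
19:32:01:[15548.934857]  [<ffffffffa0994c7b>] ptlrpc_set_wait+0x1cb/0x850 [ptlrpc]
19:32:01:[15548.934881]  [<ffffffffa0995378>] ptlrpc_queue_wait+0x78/0x210 [ptlrpc]
19:32:01:[15548.934889]  [<ffffffffa0ac851b>] mdc_statfs+0xab/0x2e0 [mdc]
"""

if __name__ == "__main__":
    frames = parse_frames(SAMPLE)
    print(first_module_frame(frames))
```

For this excerpt the first module frame is ptlrpc_set_wait in ptlrpc, which is consistent with the umount task waiting on an RPC issued from the statfs path.
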
Comment by Bob Glossman (Inactive) [ 28/Aug/17 ]

This failure seems to be reproducing. Here's another one:
https://testing.hpdd.intel.com/test_sets/515eed38-8c23-11e7-b94a-5254006e85c2

Comment by Gerrit Updater [ 28/Aug/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/28767
Subject: LU-9908 tests: force umount client in test_70e
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 77ef8e40b0457fac025475b99fb898dd5904a2fd
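
The ticket does not show the contents of the patch, only its subject. As a generic illustration of what forcing a client umount can look like, here is a minimal Python sketch under stated assumptions: it tries a plain `umount` with a bounded wait and retries with `umount -f` if that wait expires. This fallback is a different technique from the patch, which by its subject forces the umount directly in the test script. The names umount_cmd and umount_with_fallback, the 60-second timeout, and the mount point /mnt/lustre are all hypothetical.

```python
import subprocess


def umount_cmd(mountpoint, force=False):
    """Build the umount(8) argument list; -f requests a forced unmount."""
    cmd = ["umount"]
    if force:
        cmd.append("-f")
    cmd.append(mountpoint)
    return cmd


def umount_with_fallback(mountpoint, timeout_s=60, run=subprocess.run):
    """Try a normal unmount, then a forced one if the first attempt times out.

    Returns the list of commands that were attempted. A non-zero exit
    status raises subprocess.CalledProcessError (because check=True);
    RuntimeError is raised if the forced attempt also times out.
    """
    attempted = []
    for force in (False, True):
        cmd = umount_cmd(mountpoint, force)
        attempted.append(cmd)
        try:
            run(cmd, check=True, timeout=timeout_s)
            return attempted
        except subprocess.TimeoutExpired:
            continue  # fall through to the forced attempt, if any remains
    raise RuntimeError("umount of %s timed out, even with -f" % mountpoint)


if __name__ == "__main__":
    # Hypothetical mount point; requires root and a mounted file system.
    print(umount_with_fallback("/mnt/lustre"))
```

The `run` parameter exists only so the logic can be exercised without actually unmounting anything. Bounding the wait also avoids depending on an outer harness timeout to end a hung run.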

Comment by Bob Glossman (Inactive) [ 07/Sep/17 ]

Another on master:
https://testing.hpdd.intel.com/test_sets/8ec26fc2-935a-11e7-b74a-5254006e85c2

Comment by Bob Glossman (Inactive) [ 14/Sep/17 ]

The patch https://review.whamcloud.com/28767 changes (fixes?) test 70e, but does nothing for the similar failures seen in test 41b.

Here's another seen on b2_10 in 41b:
https://testing.hpdd.intel.com/test_sets/dc983726-98dd-11e7-ba20-5254006e85c2

Comment by Bob Glossman (Inactive) [ 18/Sep/17 ]

Another on master:
https://testing.hpdd.intel.com/test_sets/d2071558-9ca6-11e7-b778-5254006e85c2

Comment by Gerrit Updater [ 20/Sep/17 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/29108
Subject: LU-9908 tests: force umount client in test_70e & 41b
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 8d3a564a82a2cdef304638b339886f2de991bdca

Comment by Sarah Liu [ 20/Sep/17 ]

Another one on the b2_10 branch, in 2.10.1 RC1 testing with a SLES12sp2 client:
https://testing.hpdd.intel.com/test_sets/eaf41ec6-9c1f-11e7-b778-5254006e85c2

Comment by Gerrit Updater [ 16/Oct/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28767/
Subject: LU-9908 tests: force umount client in test 70e, 41b, and 105
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 20787a89ad7dc99290c4d9cff8247b0a532e92b9

Comment by Peter Jones [ 16/Oct/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 24/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29108/
Subject: LU-9908 tests: force umount client in test 70e, 41b, and 105
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 1361b447bb68757f6f41ce7d5cbec954c727c31d
