[LU-2181] failure conf-sanity test_23a: umount -f client hung in stat() when MDS down
Created: 15/Oct/12  Updated: 13/Apr/23  Resolved: 01/Oct/14
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 5219 |
| Description |
This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run:
https://maloo.whamcloud.com/test_sets/140674a6-16b2-11e2-962d-52540035b04c

Lustre Tag: v2_3_0_RC3

The sub-test test_23a hung at unmounting the client:

== conf-sanity test 23a: interrupt client during recovery mount delay ================================ 02:41:31 (1350294091)
start mds service on fat-amd-2
Starting mds1: /dev/sdc5 /mnt/mds1
Started lustre-MDT0000
start ost1 service on fat-amd-3
Starting ost1: /dev/sdc5 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
Stopping /mnt/mds1 (opts:) on fat-amd-2
Stopping client /mnt/lustre (opts: -f)

Stack trace on client:

[ 5526.947537] umount S ffff880316bb3170 0 7395 7009 0x00000080
[ 5526.954596] ffff8803136e57c8 0000000000000082 00000001004fdeea ffff88030af44560
[ 5526.962037] ffff8803136e5fd8 ffff8803136e5fd8 0000000000013840 0000000000013840
[ 5526.969479] ffff880323191720 ffff88030af44560 0000000000000000 0000000000000286
[ 5526.976921] Call Trace:
[ 5526.979396] [<ffffffffa054a570>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
[ 5526.986517] [<ffffffff8147461a>] schedule_timeout+0xa7/0xde
[ 5526.992168] [<ffffffff81060b58>] ? process_timeout+0x0/0x10
[ 5526.997829] [<ffffffffa02ae761>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[ 5527.004550] [<ffffffffa0555a9c>] ptlrpc_set_wait+0x2ec/0x8c0 [ptlrpc]
[ 5527.011066] [<ffffffff8104df76>] ? default_wake_function+0x0/0x14
[ 5527.017270] [<ffffffffa05560e8>] ptlrpc_queue_wait+0x78/0x230 [ptlrpc]
[ 5527.023900] [<ffffffffa05386c5>] ldlm_cli_enqueue+0x2f5/0x7b0 [ptlrpc]
[ 5527.030528] [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
[ 5527.037408] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.044186] [<ffffffffa0744e55>] mdc_enqueue+0x505/0x1590 [mdc]
[ 5527.050196] [<ffffffffa02b9578>] ? libcfs_log_return+0x28/0x40 [libcfs]
[ 5527.056885] [<ffffffffa074609e>] ? mdc_revalidate_lock+0x1be/0x1d0 [mdc]
[ 5527.063661] [<ffffffffa0746270>] mdc_intent_lock+0x1c0/0x5c0 [mdc]
[ 5527.069932] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.076734] [<ffffffffa0536d90>] ? ldlm_completion_ast+0x0/0x6f0 [ptlrpc]
[ 5527.083601] [<ffffffffa09eed8b>] lmv_intent_lookup+0x3bb/0x11c0 [lmv]
[ 5527.090136] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.096913] [<ffffffffa09f12f0>] lmv_intent_lock+0x310/0x370 [lmv]
[ 5527.103190] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.109982] [<ffffffffa08e0944>] __ll_inode_revalidate_it+0x214/0xd90 [lustre]
[ 5527.117295] [<ffffffffa0905cc0>] ? ll_md_blocking_ast+0x0/0x710 [lustre]
[ 5527.124084] [<ffffffffa08e1764>] ll_inode_revalidate_it+0x44/0x1a0 [lustre]
[ 5527.131136] [<ffffffffa08e1903>] ll_getattr_it+0x43/0x170 [lustre]
[ 5527.137408] [<ffffffffa08e1a64>] ll_getattr+0x34/0x40 [lustre]
[ 5527.143317] [<ffffffff81125113>] vfs_getattr+0x45/0x63
[ 5527.148535] [<ffffffff8112517e>] vfs_fstatat+0x4d/0x63
[ 5527.153751] [<ffffffff811251cf>] vfs_stat+0x1b/0x1d
[ 5527.158709] [<ffffffff811252ce>] sys_newstat+0x1a/0x33
[ 5527.163927] [<ffffffff81129f89>] ? path_put+0x1f/0x23
[ 5527.169059] [<ffffffff8109fa08>] ? audit_syscall_entry+0x145/0x171
[ 5527.175315] [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b

Info required for matching: conf-sanity 23a
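For reference, the failure reduces to the following sequence, reconstructed from the log above (device paths and node names are taken from it; this is a sketch, not the test script itself):

    # On the MDS node (fat-amd-2): start the MDT.
    mount -t lustre /dev/sdc5 /mnt/mds1
    # On the client: mount the filesystem while the MDS is still up.
    mount -t lustre -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
    # Back on the MDS node: stop the MDS.
    umount /mnt/mds1
    # On the client: the forced unmount now hangs, because newer umount
    # binaries stat() the mount point first, and that stat() blocks
    # forever waiting on the dead MDS (see the stack trace above).
    umount -f /mnt/lustre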
| Comments |
| Comment by Jian Yu [ 15/Oct/12 ] |
Lustre Tag: v2_3_0_RC3
conf-sanity test 34b also hit the same issue:
| Comment by Jian Yu [ 15/Oct/12 ] |
Lustre Tag: v2_3_0_RC3
conf-sanity test 45 also hit the same issue:
| Comment by Peter Jones [ 15/Oct/12 ] |
Yang Sheng is looking into this one.
| Comment by Yang Sheng [ 15/Oct/12 ] |
From a first investigation, it looks like the client loops forever in ptlrpc_set_wait(). This may be related to commit 7223b4746c71bf450c178ed21ddf99a0e3e26a1a, but I still do not understand why it only impacts the FC15 client.
| Comment by Sarah Liu [ 03/Nov/12 ] |
Found another failure on a SLES11 SP2 client. The client trace shows:

[20995.079218] umount S 0000000000000001 0 6319 5616 0x00000000
[20995.079218] ffff880062751878 0000000000000082 0000000000000000 ffff880062750010
[20995.079218] 0000000000011180 0000000000011180 ffff880062751fd8 ffff880062751fd8
[20995.079218] 0000000000011180 ffff88006c426340 ffff880000000028 ffff880026852440
[20995.079218] Call Trace:
[20995.079218] [<ffffffff81441570>] schedule_timeout+0x1b0/0x2a0
[20995.079218] [<ffffffffa080bfbb>] ptlrpc_set_wait+0x2eb/0x920 [ptlrpc]
[20995.079218] [<ffffffffa080c666>] ptlrpc_queue_wait+0x76/0x230 [ptlrpc]
[20995.079218] [<ffffffffa07ef83b>] ldlm_cli_enqueue+0x23b/0x790 [ptlrpc]
[20995.079218] [<ffffffffa09928c8>] mdc_enqueue+0x2b8/0xcb0 [mdc]
[20995.079218] [<ffffffffa0993468>] mdc_intent_lock+0x1a8/0xd40 [mdc]
[20995.079218] [<ffffffffa09497a7>] lmv_intent_lookup+0x3b7/0x11b0 [lmv]
[20995.079218] [<ffffffffa094a8e2>] lmv_intent_lock+0x342/0x3a0 [lmv]
[20995.079218] [<ffffffffa0b17ea0>] __ll_inode_revalidate_it+0x550/0xd70 [lustre]
[20995.079218] [<ffffffffa0b18959>] ll_inode_revalidate_it+0x49/0x1c0 [lustre]
[20995.079218] [<ffffffffa0b18b0e>] ll_getattr_it+0x3e/0x160 [lustre]
[20995.079218] [<ffffffffa0b18c5f>] ll_getattr+0x2f/0x40 [lustre]
[20995.079218] [<ffffffff81154631>] vfs_fstatat+0x81/0x90
[20995.079218] [<ffffffff8115478f>] sys_newstat+0x1f/0x50
[20995.079218] [<ffffffff8144ac92>] system_call_fastpath+0x16/0x1b
[20995.079218] [<00007f60d1ef9335>] 0x7f60d1ef9334
| Comment by Bob Glossman (Inactive) [ 20/Nov/12 ] |
I have seen this repeatedly on SLES11 SP2 clients too, so it's not just the FC15 client. I'm very suspicious that this may be due to version skew in the umount command. On el6, which doesn't fail, umount comes from util-linux-ng 2.17.2; on SLES11 SP2 it comes from util-linux 2.19.1; I don't know what the version is on FC15.

Running strace on a 'umount -f' with the MDS down or unreachable, on el6 the first significant syscall I see is umount():

....
) = 0

On SLES11 SP2 the first significant syscall is stat() on the mount point:

....

It appears to be the stat() call on the mount point that gets into the permanent client loop YS has described, so the umount command never reaches the umount() syscall. In earlier versions of the umount command, as seen on el6 clients, there is no stat() call and the umount() call succeeds even with the MDS down.
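For anyone repeating the comparison, a minimal sketch (the mount point path is an assumption; the egrep just narrows the trace to the relevant calls):

    # With the MDS down, check whether umount -f issues a stat() on the
    # mount point before it ever reaches the umount syscall.
    strace -f umount -f /mnt/lustre 2>&1 | egrep 'stat|umount'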
| Comment by Andreas Dilger [ 20/Nov/12 ] |
Bob, the util-linux sources are at https://github.com/karelzak/util-linux and would be a good place to investigate where and why the stat() call was added before umount() is called.

As I was typing this, Oleg already checked and found the following patch, which fixes the problem:
https://github.com/karelzak/util-linux/commit/6d5d2b5fd342308bb2db6b82d89ac22ef7374184

We need to get this patch into the upstream SLES release if it isn't there already. Peter will be contacting SuSE about this; I'll contact Red Hat about the same. In the meantime, can you please add this subtest to the ALWAYS_EXCEPT list for SLES11 SP2?
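To check which umount a client carries, something like this sketch works (the owning package is util-linux-ng on el6 and util-linux on SLES11 SP2, per the versions Bob listed above):

    # Print the version string, e.g. "umount (util-linux-ng 2.17.2)" on el6
    # or "umount (util-linux 2.19.1)" on SLES11 SP2.
    umount -V
    # Or query the owning package; the package name varies by distribution.
    rpm -qf "$(which umount)"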
| Comment by Bob Glossman (Inactive) [ 20/Nov/12 ] |
Andreas, I don't think just adding this subtest to ALWAYS_EXCEPT is sufficient. I know of at least one other test that does umount with the MDS down, conf-sanity 34b, and there are probably others I don't know about offhand.
| Comment by Bob Glossman (Inactive) [ 20/Nov/12 ] |
Patch to disable the tests I know about for SLES11 SP2:
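The change itself isn't quoted here, but a conf-sanity.sh exclusion of this sort typically looks something like the sketch below; the version-detection method and the exact subtest list are assumptions:

    # Sketch only: skip the umount-with-MDS-down subtests on SLES11 SP2,
    # where the stat() in newer util-linux umount hangs against a dead MDS.
    if grep -q 'Enterprise Server 11' /etc/SuSE-release 2>/dev/null &&
       grep -q 'PATCHLEVEL = 2' /etc/SuSE-release 2>/dev/null; then
            ALWAYS_EXCEPT="$ALWAYS_EXCEPT 23a 34b"
    fi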
| Comment by Andreas Dilger [ 27/Nov/12 ] |
This bug needs to stay open for tracking until SLES11 has the umount fix that removes the stat() call when -f is given.
| Comment by Sarah Liu [ 18/Dec/12 ] |
conf-sanity test_45 also hit this error on a SLES11 SP2 client:
| Comment by Bob Glossman (Inactive) [ 18/Dec/12 ] |
I suspect Sarah hit this problem because she ran the tests with SLOW=yes. By default SLOW=no and test 45 gets skipped. This probably needs to be fixed by adding 45 to the SLES11 always-skip list at the top of conf-sanity.sh.
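The SLOW gating works roughly like this sketch (variable names follow the Lustre test-framework convention; treat the exact wiring as an assumption):

    # Sketch: subtests listed in EXCEPT_SLOW only run when the tester
    # exports SLOW=yes; with the default SLOW=no, test 45 is skipped
    # before it can reach the umount hang.
    SLOW=${SLOW:-no}
    EXCEPT_SLOW="45"
    [ "$SLOW" = "no" ] && ALWAYS_EXCEPT="$ALWAYS_EXCEPT $EXCEPT_SLOW"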
| Comment by Bob Glossman (Inactive) [ 21/Dec/12 ] |
Added subtest 45 to the SLES11 SP2 skip list.
| Comment by Jay Lan (Inactive) [ 14/Jan/13 ] |
SUSE pointed me to this patch:
The patch consists of 7 patches:
I am not sure I needed all of them, but I installed them anyway. With this RPM set installed, the test passed!
| Comment by Bob Glossman (Inactive) [ 14/Jan/13 ] |
Thanks for the info, Jay. I suspect the important part of that patch set is the util-linux rpm. When that fix becomes part of a regular SLES11 SP2 update and no longer needs to be specially downloaded and applied, we can probably take out the test skips that were added to work around the problem. We've been waiting for that to happen.
| Comment by Andreas Dilger [ 01/Oct/14 ] |
Bob, any idea if this update has made it into SP2 or SP3 (whatever we are currently testing master on)? Then we could remove the exception from conf-sanity.sh. |
| Comment by Bob Glossman (Inactive) [ 01/Oct/14 ] |
Andreas, the conf-sanity.sh exception as it stands only skips the test on SP2; it runs fine on SP3 and always has. That being the case, there is no harm in leaving the exception there indefinitely. However, it could indeed be removed, as we no longer build or test SP2 on master.
| Comment by Andreas Dilger [ 01/Oct/14 ] |
Closing bug, since the test exception is only enforced for SLES11 SP2 and will naturally expire.