[LU-4413] Test failure on test suite conf-sanity, subtest test_56 Created: 23/Dec/13  Updated: 29/May/14  Resolved: 20/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Major
Reporter: Maloo Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: HB, mn4

Issue Links:
Duplicate
duplicates LU-4508 conf-sanity test_58 test_72 (umount /... Closed
is duplicated by LU-4349 conf-sanity test_47: test failed to r... Resolved
Related
is related to LU-4349 conf-sanity test_47: test failed to r... Resolved
is related to LU-5107 MDS oops during mount with latest lus... Resolved
Severity: 3
Rank (Obsolete): 12112

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/d5f1f7c2-6a01-11e3-9248-52540035b04c
https://maloo.whamcloud.com/test_sets/e4cbde82-6a35-11e3-8e21-52540035b04c

The sub-test test_56 failed with the following error:

test failed to respond and timed out

Info required for matching: conf-sanity 56



 Comments   
Comment by Nathaniel Clark [ 23/Dec/13 ]

Seems most prevalent during review-zfs but also occurred during review-dne:
https://maloo.whamcloud.com/test_sets/57ec0c52-6874-11e3-9675-52540035b04c

Comment by Jodi Levi (Inactive) [ 08/Jan/14 ]

Nathaniel,
Are you still seeing this issue?

Comment by Zhenyu Xu [ 16/Jan/14 ]

another hit: https://maloo.whamcloud.com/test_sets/dcc8e2b2-7e48-11e3-908b-52540035b04c

Comment by Andreas Dilger [ 17/Jan/14 ]

This was hit 15 times in the past 4 weeks.

Comment by Nathaniel Clark [ 24/Jan/14 ]

Seeing this on review-dne a lot (like LU-4508)
conf-sanity/72 https://maloo.whamcloud.com/test_sets/e70aef8c-8464-11e3-bab5-52540035b04c
conf-sanity/58 https://maloo.whamcloud.com/test_sets/990ad45a-84ff-11e3-86af-52540035b04c
conf-sanity/56 https://maloo.whamcloud.com/test_sets/6498f574-84cf-11e3-8da9-52540035b04c

Comment by Andreas Dilger [ 24/Jan/14 ]

If you do a search for TIMEOUT on conf-sanity for master, there are problems in conf-sanity on many different tests - 58, 160a, 17n, 72, 80f, 56, 12, 28a, 70b, 90, 161a, 47, 0, 2b, 50h, 230d, 133f - just on a single page of results.

This is a major problem and needs to be investigated as the highest priority.

Comment by Andreas Dilger [ 24/Jan/14 ]

Update from Di on this topic:

[1/24/2014, 2:08:11 PM] Wang Di: here is the umount process stack
[1/24/2014, 2:08:13 PM] Wang Di: umount        S 0000000000000000     0 25182  25181 0x00000080
 ffff8800647cdc68 0000000000000086 0000000000000000 ffffffff8150f3fa
 00000000ffffff0a 0000000000000282 0000000000000000 dead000000200200
 ffff880037eb2638 ffff8800647cdfd8 000000000000fb88 ffff880037eb2638
Call Trace:
 [<ffffffff8150f3fa>] ? schedule_timeout+0x19a/0x2e0
 [<ffffffffa0b6326d>] ptlrpc_disconnect_import+0x77d/0x7a0 [ptlrpc]
 [<ffffffff81063990>] ? default_wake_function+0x0/0x20
 [<ffffffffa06ea4e8>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa0df94e9>] lwp_disconnect+0x69/0x160 [osp]
 [<ffffffffa0dfb250>] lwp_process_config+0x150/0x260 [osp]
 [<ffffffffa0937e85>] lustre_disconnect_lwp+0x1c5/0xe40 [obdclass]
 [<ffffffffa093ec40>] server_put_super+0x630/0xf60 [obdclass]
 [<ffffffff8119d906>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8118366b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81183756>] kill_anon_super+0x16/0x60
 [<ffffffffa0906506>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81183ef7>] deactivate_super+0x57/0x80
 [<ffffffff811a21ef>] mntput_no_expire+0xbf/0x110
 [<ffffffff811a2c5b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

It seems the LWP is waiting for recovery, but the LWP should not be in a recovery state at all. I think the point is that the state of the LWP should only ever be FULL, CLOSED, or DISCONN, which guarantees the LWP never enters a recovery state, but somehow someone has changed the LWP to some other state. Trying to figure out why.
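
For illustration only, a minimal sketch of that invariant, using stand-in names (the real import-state enum in the Lustre tree has more states than shown here):

#include <stdbool.h>

/* Stand-in states for illustration only; the real Lustre import-state enum
 * has more values (e.g. replay/recover states) that a LWP import is never
 * expected to enter. */
enum lwp_imp_state {
        LWP_IMP_CLOSED,
        LWP_IMP_DISCONN,
        LWP_IMP_CONNECTING,
        LWP_IMP_FULL,
};

/* The invariant described above: a LWP import should only ever be FULL,
 * CLOSED or DISCONN, so it can never be "in recovery" and a disconnect
 * should never end up waiting for recovery to finish. */
static bool lwp_import_state_is_expected(enum lwp_imp_state state)
{
        return state == LWP_IMP_FULL ||
               state == LWP_IMP_CLOSED ||
               state == LWP_IMP_DISCONN;
}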

Comment by Andreas Dilger [ 25/Jan/14 ]

Di, would it be possible to just change ptlrpc_disconnect_import() to not wait for recovery if obd_no_recov is set (this is done in ptlrpc_pinger_del_import())? Otherwise, it seems LWP connections will never reconnect because the import was removed from the pinger via ptlrpc_pinger_del_import() in lwp_disconnect(). That means nothing is going to try to reconnect the import and it will just wait there forever.

I've pushed http://review.whamcloud.com/8996 which may or may not fix this problem. I think what is happening is that the MDS is trying to unmount while it still has a failed LWP connection to another MDS that was recently stopped. It is racy whether that connection has recovered yet or not, but if it hasn't, then lwp_disconnect()->ptlrpc_pinger_del_import() means the connection will never be recovered, and ptlrpc_disconnect_import() will wait forever for that to happen.
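
A minimal sketch of the kind of guard being proposed, using stand-in structures and helper names rather than the real ptlrpc types (the actual change is in the patch under review above):

/* Illustration-only stand-ins; the real code uses struct obd_device and
 * struct obd_import from the Lustre tree. */
struct sketch_obd_device {
        int obd_no_recov;   /* mirrors the real obd_no_recov flag, set on shutdown */
};

struct sketch_import {
        struct sketch_obd_device *imp_obd;
        int                       imp_in_recovery;
};

/* Hypothetical helpers standing in for the real "is this import still in
 * recovery?" check and the wait on the import's recovery waitqueue. */
static int import_in_recovery(struct sketch_import *imp)
{
        return imp->imp_in_recovery;
}

static void wait_for_recovery_to_finish(struct sketch_import *imp)
{
        /* in the real code this is a wait on the import's recovery waitqueue */
        (void)imp;
}

/* Sketch of the proposed behaviour: skip the recovery wait entirely when the
 * device is being torn down (obd_no_recov), because once the import has been
 * removed from the pinger nothing will ever reconnect it. */
static void disconnect_import_sketch(struct sketch_import *imp)
{
        if (import_in_recovery(imp) && !imp->imp_obd->obd_no_recov)
                wait_for_recovery_to_finish(imp);

        /* ... then send the DISCONNECT request and do the final cleanup ... */
}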

Comment by Di Wang [ 25/Jan/14 ]

Yes, I can see from the debug log that it is a race condition, i.e. MDS2 is doing umount while its LWP connection to MDS1 has not been set up yet (the import is in the CONNECTING state), and in the meantime MDS1 has been umounted. But here is one thing I do not understand.

1. MDS2 is trying to set up a connection with MDS1, which means there must be a connect RPC in flight or queued somewhere in ptlrpcd.
2. Before lwp_disconnect() calls ptlrpc_disconnect_import(), it is supposed to call ptlrpc_deactivate_import(), which IMHO will drain all of the RPCs related to this import; the connect interpret callback should then set the import to DISCONNECTED. At that point ptlrpc_disconnect_import() should be woken up instead of waiting for a non-recovery state.

If what I understand is correct, then we probably need to find out why the connect RPC is not being interrupted during the umount process.
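
As a rough sketch of the teardown order described in point 2 (the function names below are illustrative stand-ins, not the real Lustre call chain):

/* Illustrative stand-ins only; not the real Lustre functions. */
struct lwp_imp_sketch;

/* Step 1: mark the import invalid and abort every RPC that references it,
 * including the in-flight connect RPC. */
static void lwp_deactivate_sketch(struct lwp_imp_sketch *imp) { (void)imp; }

/* Step 2 happens asynchronously: the aborted connect RPC's interpret
 * callback runs, moves the import to DISCONNECTED and wakes any waiter. */

/* Step 3: with the import already DISCONNECTED, this should return promptly
 * instead of waiting for a recovery that can never complete. */
static void lwp_disconnect_sketch(struct lwp_imp_sketch *imp) { (void)imp; }

static void lwp_umount_teardown_sketch(struct lwp_imp_sketch *imp)
{
        lwp_deactivate_sketch(imp);
        lwp_disconnect_sketch(imp);
}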

Comment by Di Wang [ 25/Jan/14 ]

Ah, the connect RPC was sent to the ptlrpcd_rcv thread, but ptlrpcd_rcv is stuck at:

ptlrpcd_rcv   S 0000000000000001     0 15409      2 0x00000080
 ffff88005ef2b8d0 0000000000000046 0000000000000000 00000000fffffffe
 ffff88005ef2b890 0000000000000000 ffff880079ddea80 ffff880079ddeab0
 ffff8800377a7af8 ffff88005ef2bfd8 000000000000fb88 ffff8800377a7af8
Call Trace:
 [<ffffffff8150f3f2>] schedule_timeout+0x192/0x2e0
 [<ffffffff810811e0>] ? process_timeout+0x0/0x10
 [<ffffffffa0b3f99a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
 [<ffffffff81063990>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b49c06>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
 [<ffffffffa0b3ffa7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
 [<ffffffffa0292bea>] seq_client_rpc+0x12a/0x910 [fid]
 [<ffffffffa06efa81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa02937cf>] seq_client_alloc_seq+0x3ff/0x480 [fid]
 [<ffffffffa02929c3>] ? seq_fid_alloc_prep+0x43/0xc0 [fid]
 [<ffffffffa02938c5>] seq_client_get_seq+0x75/0x1f0 [fid]
 [<ffffffff81063990>] ? default_wake_function+0x0/0x20
 [<ffffffffa0914263>] ? lu_context_init+0xa3/0x240 [obdclass]
 [<ffffffffa0debe12>] osp_init_pre_fid+0x3c2/0x500 [osp]
 [<ffffffffa0de1b3f>] osp_import_event+0xaf/0x4e0 [osp]
 [<ffffffffa0b63524>] ptlrpc_activate_import+0xb4/0x260 [ptlrpc]
 [<ffffffffa0b68d22>] ptlrpc_connect_interpret+0x1912/0x2160 [ptlrpc]
 [<ffffffffa0b3de44>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
 [<ffffffffa0b6a4cb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
 [<ffffffffa0b6ab1b>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
 [<ffffffff81063990>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b6a7e0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

So the whole process looks like a deadlock to me:

1. MDS1 is being umounted.
2. The LWP on MDS2 is trying to reconnect to MDS1, which adds a connect RPC to ptlrpcd_rcv and sets the import to CONNECTING.
3. MDS2 is then umounted, and the LWP tries to abort all of the RPCs related to the import. But ptlrpcd_rcv is stuck as shown above, i.e. MDS2 is trying to allocate a sequence from one of the OSTs, but the OST cannot get a super sequence from MDT0 because MDT0 has been umounted, so that RPC is stuck.
4. Since the connect RPC never gets a chance to be handled in ptlrpcd_rcv, the LWP import stays in CONNECTING forever.

So ptlrpc_disconnect_import() will wait on the recovering import forever.
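
Purely as an illustration of this pattern (not Lustre code), a self-contained pthread sketch of how a single completion thread blocking inside one callback leaves the waiter on the umount path hung forever; all names are made up, compile with -pthread:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ost_replied;      /* stands in for the OST reply that never arrives */
static bool import_disconn;   /* stands in for the LWP import leaving CONNECTING */

/* Stands in for ptlrpcd_rcv: it is the only thread that could interpret the
 * connect reply and move the import out of CONNECTING, but it first blocks
 * inside another callback waiting for an OST reply that will never come. */
static void *completion_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!ost_replied)                     /* seq_client_rpc analogue */
                pthread_cond_wait(&cond, &lock); /* blocks forever */
        import_disconn = true;                   /* never reached */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, completion_thread, NULL);

        /* Stands in for the umount path in ptlrpc_disconnect_import(): it
         * waits for a state change that only the blocked thread can make,
         * so the umount hangs, which the test run reports as a timeout. */
        pthread_mutex_lock(&lock);
        while (!import_disconn)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        pthread_join(tid, NULL);
        printf("never reached\n");
        return 0;
}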

Comment by Di Wang [ 25/Jan/14 ]

http://review.whamcloud.com/8997

Comment by nasf (Inactive) [ 26/Jan/14 ]

What happens if the OST tries to get a sequence from MDT0 when MDT0 has been umounted? It will either block or return a failure; I think the latter is reasonable. If the OST gets a failure from MDT0, it should forward the failure back to MDS2, so MDS2 also should not be blocked forever. Am I missing anything?

Comment by Di Wang [ 26/Jan/14 ]

Hmm, right now the OST cannot tell whether MDT0 is being umounted or simply has not been set up yet. So the OST will retry if the request returns -ENOTREADY or -EBUSY. But I agree something needs to be fixed here, probably not in this patch.
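
A hedged sketch of the two behaviours being contrasted here, with stand-in names and error codes (the real sequence-allocation code lives in the FID client and OST and may differ):

#include <errno.h>

#ifndef ENOTREADY
#define ENOTREADY 1024   /* stand-in value; not a standard errno on Linux */
#endif

/* Hypothetical stand-in for asking MDT0 for a super sequence; here it always
 * reports "not ready", as it would once MDT0 has been umounted. */
static int request_super_sequence(void)
{
        return -ENOTREADY;
}

/* Current behaviour as described above: keep retrying on "not ready"/"busy",
 * which (with the stub above) spins forever once MDT0 is gone. */
static int get_sequence_retry_forever(void)
{
        int rc;

        do {
                rc = request_super_sequence();
        } while (rc == -ENOTREADY || rc == -EBUSY);

        return rc;
}

/* The alternative being suggested: bound the retries and forward the failure
 * back to the caller (MDS2) so it is not blocked forever. */
static int get_sequence_bounded(int max_retries)
{
        int rc = -ENOTREADY;

        for (int i = 0; i < max_retries; i++) {
                rc = request_super_sequence();
                if (rc != -ENOTREADY && rc != -EBUSY)
                        break;
        }

        return rc;      /* the caller sees the error instead of hanging */
}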

Comment by Jodi Levi (Inactive) [ 20/Mar/14 ]

Patches landed to Master. New tickets will be opened if additional fixes are required.
