[LU-4413] Test failure on test suite conf-sanity, subtest test_56 Created: 23/Dec/13 Updated: 29/May/14 Resolved: 20/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, mn4 | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 12112 | ||||||||||||||||||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: The sub-test test_56 failed with the following error:
Info required for matching: conf-sanity 56 |
| Comments |
| Comment by Nathaniel Clark [ 23/Dec/13 ] |
|
Seems most prevalent during review-zfs but also occurred during review-dne: |
| Comment by Jodi Levi (Inactive) [ 08/Jan/14 ] |
|
Nathaniel, |
| Comment by Zhenyu Xu [ 16/Jan/14 ] |
|
another hit: https://maloo.whamcloud.com/test_sets/dcc8e2b2-7e48-11e3-908b-52540035b04c |
| Comment by Andreas Dilger [ 17/Jan/14 ] |
|
This was hit 15 times in the past 4 weeks. |
| Comment by Nathaniel Clark [ 24/Jan/14 ] |
|
Seeing this on review-dne a lot (like |
| Comment by Andreas Dilger [ 24/Jan/14 ] |
|
If you do a search for TIMEOUT on conf-sanity for master, there are problems in conf-sanity on many different tests - 58, 160a, 17n, 72, 80f, 56, 12, 28a, 70b, 90, 161a, 47, 0, 2b, 12, 50h, 230d, 133f, just on a single page of results. This is a major problem and needs to be investigated as the highest priority. |
| Comment by Andreas Dilger [ 24/Jan/14 ] |
|
Update from Di on this topic:
|
| Comment by Andreas Dilger [ 25/Jan/14 ] |
|
Di, would it be possible to just change ptlrpc_disconnect_import() to not wait for recovery if obd_no_recov is set (this is done in ptlrpc_pinger_del_import())? Otherwise, it seems LWP connections will never reconnect because the import was removed from the pinger via ptlrpc_pinger_del_import() in lwp_disconnect(). That means nothing is going to try and reconnect the import and it will just wait there forever. I've pushed http://review.whamcloud.com/8996 which may or may not fix this problem. I think what is happening is that the MDS is trying to unmount while it still has a failed LWP connection to another MDS that was recently stopped. It is racy if that connection has recovered yet or not, but if it hasn't then lwp_disconnect->ptlrpc_pinger_del_import() mean that the connection will never be recovered, and ptlrpc_disconnect_import() will wait forever for that to happen. |
| Comment by Di Wang [ 25/Jan/14 ] |
|
Yes, I can see the from the debug log that it is a racy condition, i.e. MDS2 is doing umount, while its lwp connection is not been setup MDS1 (import is in CONNECTING state), at the mean time MDS1 has been umounted. But here is one thing I do not understand. 1. MDS2 is trying to setup connection with MDS1, which mean there must be a connecting RPC inflight or somewhere in ptlrpcd. If what I understand is correct, then we probably need find out why the connecting RPC is not being interrupted during the umount process. |
| Comment by Di Wang [ 25/Jan/14 ] |
|
Ah, the connecting RPC is being to ptlrpcd_rcv thread, but ptlrpcd_rcv is stuck at ptlrpcd_rcv S 0000000000000001 0 15409 2 0x00000080 ffff88005ef2b8d0 0000000000000046 0000000000000000 00000000fffffffe ffff88005ef2b890 0000000000000000 ffff880079ddea80 ffff880079ddeab0 ffff8800377a7af8 ffff88005ef2bfd8 000000000000fb88 ffff8800377a7af8 Call Trace: [<ffffffff8150f3f2>] schedule_timeout+0x192/0x2e0 [<ffffffff810811e0>] ? process_timeout+0x0/0x10 [<ffffffffa0b3f99a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc] [<ffffffff81063990>] ? default_wake_function+0x0/0x20 [<ffffffffa0b49c06>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc] [<ffffffffa0b3ffa7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc] [<ffffffffa0292bea>] seq_client_rpc+0x12a/0x910 [fid] [<ffffffffa06efa81>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [<ffffffffa02937cf>] seq_client_alloc_seq+0x3ff/0x480 [fid] [<ffffffffa02929c3>] ? seq_fid_alloc_prep+0x43/0xc0 [fid] [<ffffffffa02938c5>] seq_client_get_seq+0x75/0x1f0 [fid] [<ffffffff81063990>] ? default_wake_function+0x0/0x20 [<ffffffffa0914263>] ? lu_context_init+0xa3/0x240 [obdclass] [<ffffffffa0debe12>] osp_init_pre_fid+0x3c2/0x500 [osp] [<ffffffffa0de1b3f>] osp_import_event+0xaf/0x4e0 [osp] [<ffffffffa0b63524>] ptlrpc_activate_import+0xb4/0x260 [ptlrpc] [<ffffffffa0b68d22>] ptlrpc_connect_interpret+0x1912/0x2160 [ptlrpc] [<ffffffffa0b3de44>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc] [<ffffffffa0b6a4cb>] ptlrpcd_check+0x53b/0x560 [ptlrpc] [<ffffffffa0b6ab1b>] ptlrpcd+0x33b/0x3f0 [ptlrpc] [<ffffffff81063990>] ? default_wake_function+0x0/0x20 [<ffffffffa0b6a7e0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc] [<ffffffff81096a36>] kthread+0x96/0xa0 [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffff810969a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 So the whole process looks like a deadlock to me 1. MDS1 is being amounted 4. Since the connecting RPC does not get the chance to be handled in ptlrpcd_rcv, then the import of LWP will be connecting for ever. So ptlrpc_disconnect_import will wait the recovering import forever. |
| Comment by Di Wang [ 25/Jan/14 ] |
| Comment by nasf (Inactive) [ 26/Jan/14 ] |
|
What happened if the OST try to get sequence from MDT0 which has been amount? It will block or return failure. I think the later is reasonable. If OST got failure from MDT0, it should forward the failure back to the MDS2, so the MDS2 also should not be blocked for ever. Missed anything? |
| Comment by Di Wang [ 26/Jan/14 ] |
|
Hmm, right now, OST can not tell whether MDT0 is being umounted or not being setup yet. So OST will retry if the request return -ENOTREADY or EBUSY. But I agree something needs to be fixed here, probably not in this patch. |
| Comment by Jodi Levi (Inactive) [ 20/Mar/14 ] |
|
Patches landed to Master. New tickets will be opened if additional fixes are required. |