[LU-10931] failed peer discovery still taking too long Created: 19/Apr/18 Updated: 30/Jan/22 Resolved: 17/Sep/19 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | John Hammond | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On master, when running conf-sanity I often see mount stuck in the following stack:

n:lustre-release# stack1 llog
29833 llog_process_th
[<ffffffffc06be64b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
[<ffffffffc06be930>] LNetPrimaryNID+0x70/0x1a0 [lnet]
[<ffffffffc0990ade>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[<ffffffffc098518c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[<ffffffffc09580c2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
[<ffffffffc09587c3>] client_import_add_conn+0x13/0x20 [ptlrpc]
[<ffffffffc074efa9>] class_add_conn+0x419/0x680 [obdclass]
[<ffffffffc0750bc6>] class_process_config+0x19b6/0x27e0 [obdclass]
[<ffffffffc0753644>] class_config_llog_handler+0x934/0x14d0 [obdclass]
[<ffffffffc0717904>] llog_process_thread+0x834/0x1550 [obdclass]
[<ffffffffc071902f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
[<ffffffff810b252f>] kthread+0xcf/0xe0
[<ffffffff816b8798>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

conf-sanity has some tests that use bogus NIDs like 1.2.3.4 and 4.3.2.1. These are actually real IPv4 addresses, but AFAICT they just discard all packets. I can see that the discovery thread cancels discovery on these peers, but the llog_process_thread seems to stay in lnet_discover_peer_locked() for up to 60 seconds afterward. Looking at the code, I can't see how it would get woken up in this case. Why doesn't lnet_peer_cancel_discovery() wake up the waiters on lp_dc_waitq? Or why don't we use schedule_timeout() with the discovery/transaction timeout in lnet_discover_peer_locked()? |
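As a minimal sketch of the bounded-wait alternative the description asks about: the waiter sleeps on the discovery waitqueue (lp_dc_waitq in the ticket) but gives up after the discovery/transaction timeout instead of sleeping indefinitely. This is not the LNet implementation; the struct, flag, and function names below are simplified stand-ins invented for illustration.

```c
#include <linux/wait.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

#define DEMO_PEER_DISCOVERING	0x1	/* stand-in for a "still discovering" state flag */

struct demo_peer {
	wait_queue_head_t	dp_dc_waitq;	/* discovery completion waitqueue */
	unsigned int		dp_state;	/* discovery state flags */
};

/* Wait for discovery to finish, but never longer than @timeout_secs. */
static int demo_discover_peer_bounded(struct demo_peer *lp, int timeout_secs)
{
	long rc;

	rc = wait_event_interruptible_timeout(lp->dp_dc_waitq,
				!(lp->dp_state & DEMO_PEER_DISCOVERING),
				timeout_secs * HZ);
	if (rc == 0)
		return -ETIMEDOUT;	/* timed out: caller can fall back to the known NID */
	if (rc < 0)
		return rc;		/* interrupted by a signal */
	return 0;			/* discovery completed (or was cancelled and woke us) */
}
```

With this pattern a cancelled discovery that never wakes the waitqueue still unblocks the caller once the timeout expires, rather than leaving llog processing stuck.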
| Comments |
| Comment by James Nunez (Inactive) [ 12/Jun/19 ] |
|
Starting on 2019-06-08, we are seeing recovery-small test 136 hang for DNE testing with the following on an MDS:

[ 237.181455] Pid: 4761, comm: mdt00_005 3.10.0-957.12.2.el7_lustre.x86_64 #1 SMP Wed Jun 5 06:59:00 UTC 2019
[ 237.181455] Call Trace:
[ 237.181469] [<ffffffffc0ccf894>] lnet_discover_peer_locked+0x124/0x3d0 [lnet]
[ 237.181476] [<ffffffffc0ccfbb0>] LNetPrimaryNID+0x70/0x1a0 [lnet]
[ 237.181500] [<ffffffffc10675fe>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[ 237.181525] [<ffffffffc106b664>] ptlrpc_send_reply+0x394/0x840 [ptlrpc]
[ 237.181548] [<ffffffffc106bbdb>] ptlrpc_send_error+0x9b/0x1b0 [ptlrpc]
[ 237.181573] [<ffffffffc106bd00>] ptlrpc_error+0x10/0x20 [ptlrpc]
[ 237.181613] [<ffffffffc10da898>] tgt_request_handle+0xad8/0x15c0 [ptlrpc]
[ 237.181641] [<ffffffffc107e7ee>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 237.181666] [<ffffffffc10822dc>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
[ 237.181668] [<ffffffff9dcc1d21>] kthread+0xd1/0xe0
[ 237.181670] [<ffffffff9e375c37>] ret_from_fork_nospec_end+0x0/0x39
[ 237.181675] [<ffffffffffffffff>] 0xffffffffffffffff
[ 237.201321] Lustre: Skipped 2 previous similar messages
[ 240.228165] INFO: task mount.lustre:4570 blocked for more than 120 seconds.
[ 240.228901] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.229677] mount.lustre D ffff96b36ea2a080 0 4570 4569 0x00000080
[ 240.230546] Call Trace:
[ 240.230813] [<ffffffff9e368e19>] schedule+0x29/0x70
[ 240.231317] [<ffffffff9e366921>] schedule_timeout+0x221/0x2d0
[ 240.231899] [<ffffffff9dcdc926>] ? select_task_rq_fair+0x5a6/0x760
[ 240.232679] [<ffffffff9e3691cd>] wait_for_completion+0xfd/0x140
[ 240.233310] [<ffffffff9dcd6ae0>] ? wake_up_state+0x20/0x20
[ 240.233901] [<ffffffffc0d4bae4>] llog_process_or_fork+0x244/0x450 [obdclass]
[ 240.234615] [<ffffffffc0d4bd04>] llog_process+0x14/0x20 [obdclass]
[ 240.235315] [<ffffffffc0d7eca5>] class_config_parse_llog+0x125/0x350 [obdclass]
[ 240.236061] [<ffffffffc0ffafd0>] mgc_process_cfg_log+0x790/0xc40 [mgc]
[ 240.236724] [<ffffffffc0ffe4b9>] mgc_process_log+0x3d9/0x8f0 [mgc]
[ 240.237393] [<ffffffffc0fff14f>] ? config_recover_log_add+0x13f/0x280 [mgc]
[ 240.238188] [<ffffffffc0d86f00>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 240.238976] [<ffffffffc0fffb1b>] mgc_process_config+0x88b/0x13f0 [mgc]
[ 240.239643] [<ffffffffc0d8ab18>] lustre_process_log+0x2d8/0xad0 [obdclass]
[ 240.240394] [<ffffffffc0c3b177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 240.241070] [<ffffffffc0d75839>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 240.241793] [<ffffffffc0db9184>] server_start_targets+0x13a4/0x2a20 [obdclass]
[ 240.242578] [<ffffffffc0d75961>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[ 240.243321] [<ffffffffc0d86f00>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 240.244132] [<ffffffffc0dbb8cc>] server_fill_super+0x10cc/0x1890 [obdclass]
[ 240.244857] [<ffffffffc0c3b177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 240.245524] [<ffffffffc0d90798>] lustre_fill_super+0x328/0x950 [obdclass]
[ 240.246317] [<ffffffffc0d90470>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 240.247052] [<ffffffff9de457cf>] mount_nodev+0x4f/0xb0
[ 240.247599] [<ffffffffc0d88968>] lustre_mount+0x38/0x60 [obdclass]
[ 240.248301] [<ffffffff9de4634e>] mount_fs+0x3e/0x1b0
[ 240.248848] [<ffffffff9de63ec7>] vfs_kern_mount+0x67/0x110
[ 240.249404] [<ffffffff9de664ef>] do_mount+0x1ef/0xce0
[ 240.249962] [<ffffffff9de3e7aa>] ? __check_object_size+0x1ca/0x250
[ 240.250584] [<ffffffff9de1caec>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 240.251268] [<ffffffff9de67323>] SyS_mount+0x83/0xd0
[ 240.251845] [<ffffffff9e375ddb>] system_call_fastpath+0x22/0x27
[ 240.252445] [<ffffffff9e375d21>] ? system_call_after_swapgs+0xae/0x146
[ 241.277138] Lustre: mdt00_006: service thread pid 4762 was inactive for 200.432 seconds. The thread might be hung, or it might only be slow and will resume later.

See the following for more logs |
| Comment by Amir Shehata (Inactive) [ 12/Jun/19 ] |
| Comment by Gerrit Updater [ 24/Jun/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35301 |
| Comment by Gerrit Updater [ 25/Jun/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35325 |
| Comment by Andreas Dilger [ 28/Jun/19 ] |
|
Closing as a duplicate because patch https://review.whamcloud.com/35191 "LU-12424 lnet: prevent loop in LNetPrimaryNID()" has landed and should resolve the issue. |
| Comment by James Nunez (Inactive) [ 28/Jun/19 ] |
|
I tested the patch for LU-12424, https://review.whamcloud.com/#/c/35191/, but we are still seeing this issue. The patch https://review.whamcloud.com/#/c/35301/ is just a duplicate of Amir's LU-12424 patch; I ran recovery-small several times with it and still see this hang. Two failures with the LU-12424 patch are at

Amir - Would you please comment on this issue and on whether the 35191 patch should fix it? |
| Comment by Gerrit Updater [ 02/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35325/ |
| Comment by James Nunez (Inactive) [ 02/Jul/19 ] |
|
Patch 35325 puts recovery-small test 136 on the ALWAYS_EXCEPT list; it does not fix this issue. Please keep this ticket open until a fix is found and the test is removed from the ALWAYS_EXCEPT list. |
| Comment by Gerrit Updater [ 03/Jul/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35408 |
| Comment by Amir Shehata (Inactive) [ 05/Jul/19 ] |
|
If LNetMDUnlink() is called on an MD with md->md_refcount > 0, then the EQ callback isn't called. What's happening here is that the response times out before the send completes, so we still hold a refcount on the MD and the unlink gets dropped on the floor. The send then completes, but because we've already timed out, the REPLY for the GET is dropped. Now we're left with a peer that is in: |
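A simplified sketch of the behaviour described above, using stand-in types rather than the real LNet structures: when unlink is requested while the MD still has an outstanding operation (refcount > 0), the unlink is only recorded as pending and no event callback fires, so anything waiting on that callback (such as discovery sleeping on lp_dc_waitq) is never woken. All names below are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_md {
	int	md_refcount;			/* in-flight operations using this MD */
	bool	md_zombie;			/* unlink requested but deferred */
	void	(*md_eq_cb)(struct demo_md *md);/* "unlinked" event callback */
};

static void demo_md_unlink(struct demo_md *md)
{
	if (md->md_refcount > 0) {
		/* The response already timed out but the send has not
		 * completed: the MD cannot be freed yet, so just mark it
		 * and return.  Crucially, no event is generated here. */
		md->md_zombie = true;
		return;
	}
	/* No users left: generate the unlink event and release the MD. */
	if (md->md_eq_cb != NULL)
		md->md_eq_cb(md);
}

/* Called when the send finally completes and drops its reference. */
static void demo_md_put(struct demo_md *md)
{
	if (--md->md_refcount == 0 && md->md_zombie) {
		/* The deferred unlink can now finish.  Whether an event is
		 * generated at this point is exactly the gap described in
		 * the comment above: if it is not, and the REPLY was already
		 * dropped because of the timeout, the waiter stays stuck. */
		if (md->md_eq_cb != NULL)
			md->md_eq_cb(md);
	}
}
```

This is only meant to illustrate the race window; the actual change landed for this ticket is patch https://review.whamcloud.com/35444.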
| Comment by Gerrit Updater [ 08/Jul/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35444 |
| Comment by Gerrit Updater [ 21/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35444/ |
| Comment by Peter Jones [ 21/Aug/19 ] |
|
Landed for 2.13 |
| Comment by James Nunez (Inactive) [ 21/Aug/19 ] |
|
Reopening this ticket because recovery-small test 136 is still on the ALWAYS_EXCEPT list. We need to submit a patch to remove this test from the list and confirm that the fix addresses the issues revealed by this test. |
| Comment by Gerrit Updater [ 28/Aug/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35949 |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35949/ |
| Comment by James Nunez (Inactive) [ 17/Sep/19 ] |
|
recovery-small test 136 is running again for the master branch. |
| Comment by Gerrit Updater [ 20/Dec/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45898 |
| Comment by Gerrit Updater [ 30/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45898/ |