[LU-10931] failed peer discovery still taking too long

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      On master, when running conf-sanity I often see mount stuck in the following stack
      trace:

      n:lustre-release# stack1 llog
      29833 llog_process_th
      [<ffffffffc06be64b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
      [<ffffffffc06be930>] LNetPrimaryNID+0x70/0x1a0 [lnet]
      [<ffffffffc0990ade>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
      [<ffffffffc098518c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
      [<ffffffffc09580c2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
      [<ffffffffc09587c3>] client_import_add_conn+0x13/0x20 [ptlrpc]
      [<ffffffffc074efa9>] class_add_conn+0x419/0x680 [obdclass]
      [<ffffffffc0750bc6>] class_process_config+0x19b6/0x27e0 [obdclass]
      [<ffffffffc0753644>] class_config_llog_handler+0x934/0x14d0 [obdclass]
      [<ffffffffc0717904>] llog_process_thread+0x834/0x1550 [obdclass]
      [<ffffffffc071902f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
      [<ffffffff810b252f>] kthread+0xcf/0xe0
      [<ffffffff816b8798>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      conf-sanity has some tests that use bogus NIDs like 1.2.3.4 and 4.3.2.1. These are actually real IPv4 addresses, but AFAICT they just discard all packets. I can see that the discovery thread cancels discovery on these peers, but the llog_process_thread seems to stay in lnet_discover_peer_locked() for up to 60 seconds afterwards. Looking at the code I can't see how it would get woken up in this case. Why doesn't lnet_peer_cancel_discovery() wake up the waiters on lp_dc_waitq? Or why don't we use schedule_timeout() with the discovery/transaction timeout in lnet_discover_peer_locked()?
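
      For reference, below is a minimal sketch of the bounded wait the last question is suggesting; it is not the actual lnet_discover_peer_locked() code. lp_dc_waitq and LNET_PEER_DISCOVERING are names used elsewhere in this ticket, lnet_net_lock()/lnet_net_unlock() and cfs_time_seconds() are existing LNet/libcfs helpers, and discover_peer_wait_sketch() and discovery_timeout_secs are hypothetical names used only for illustration.

      /*
       * Hedged sketch: bound the wait on lp->lp_dc_waitq with the
       * discovery/transaction timeout instead of sleeping indefinitely.
       * Assumed to be called with lnet_net_lock(cpt) held, as the
       * "_locked" suffix of the real function suggests.
       */
      static int discover_peer_wait_sketch(struct lnet_peer *lp, int cpt,
                                           unsigned int discovery_timeout_secs)
      {
              DEFINE_WAIT(wait);
              long remaining = cfs_time_seconds(discovery_timeout_secs);

              while (lp->lp_state & LNET_PEER_DISCOVERING) {
                      prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
                      lnet_net_unlock(cpt);

                      /* Woken by the discovery thread, a signal, or the timeout. */
                      remaining = schedule_timeout(remaining);

                      lnet_net_lock(cpt);
                      finish_wait(&lp->lp_dc_waitq, &wait);

                      if (remaining == 0)
                              return -ETIMEDOUT;      /* give up instead of hanging */
                      if (signal_pending(current))
                              return -EINTR;
              }
              return 0;
      }

      Either this kind of timeout or an explicit wake_up() of lp_dc_waitq from lnet_peer_cancel_discovery() would avoid the 60-second stall; which of the two is preferable is exactly the question above.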

          Activity


            If LNetMDUnlink() is called on an MD with md->md_refcount > 0, the EQ callback isn't called. What's happening here is that the response times out before the send completes, so we still hold a refcount on the MD and the unlink gets dropped on the floor. The send then completes, but because we've already timed out, the REPLY for the GET is dropped. Now we're left with a peer that is in:
            LNET_PEER_MULTI_RAIL
            LNET_PEER_DISCOVERING
            LNET_PEER_PING_SENT
            but no more events are coming to it, and the discovery never completes.

            ashehata Amir Shehata (Inactive) added a comment
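
            To make the sequence above concrete, here is a small self-contained sketch (plain userspace C, not LNet code) of the state machine getting stuck. All names in it are hypothetical stand-ins modelled on this comment: an MD that is still referenced when unlink is requested defers its callback, the later send completion is discarded because the response already timed out, and the peer is left with the discovering/ping-sent bits set and nothing left to clear them.

            #include <stdio.h>

            /* Hypothetical stand-ins for the peer state bits named above. */
            #define PEER_DISCOVERING 0x1
            #define PEER_PING_SENT   0x2

            struct sketch_md {
                    int refcount;        /* outstanding send still holds a reference */
                    int unlink_pending;  /* unlink was requested while busy */
                    int reply_timed_out; /* the response timer already fired */
            };

            static unsigned int peer_state = PEER_DISCOVERING | PEER_PING_SENT;

            /* Unlink request: with refcount > 0 the callback is deferred, not run. */
            static void md_unlink(struct sketch_md *md)
            {
                    if (md->refcount > 0) {
                            md->unlink_pending = 1;  /* "dropped on the floor" */
                            return;
                    }
                    peer_state = 0;                  /* callback would clean up here */
            }

            /* Send completion: the REPLY is discarded because we already timed out. */
            static void send_complete(struct sketch_md *md)
            {
                    md->refcount--;
                    if (md->reply_timed_out)
                            return;                  /* no event ever reaches discovery */
                    peer_state = 0;
            }

            int main(void)
            {
                    struct sketch_md md = { .refcount = 1, .reply_timed_out = 1 };

                    md_unlink(&md);      /* response timeout triggers the unlink first */
                    send_complete(&md);  /* the send only finishes afterwards */

                    /* Still DISCOVERING | PING_SENT, with no more events pending. */
                    printf("peer_state = 0x%x\n", peer_state);
                    return 0;
            }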

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35408
            Subject: LU-10931 tests: checking debug status for recovery-small
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 492cebf6912af3840cdbd39a815e5fc5e41bb96c

            gerrit Gerrit Updater added a comment
            jamesanunez James Nunez (Inactive) added a comment - edited

            Patch 35325 puts recovery-small test 136 on the ALWAYS_EXCEPT list; it does not fix this issue. Please keep this ticket open until a fix is found and the test is removed from the ALWAYS_EXCEPT list.


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35325/
            Subject: LU-10931 tests: stop running recovery-small 136
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b511c65f468698aeb6e68f091ee3f30ec27830d0

            gerrit Gerrit Updater added a comment

            I tested the patch for LU-12424, https://review.whamcloud.com/#/c/35191/, but we are still seeing this issue.

            The patch https://review.whamcloud.com/#/c/35301/ is just a duplicate of Amir's LU-12424 patch; I ran recovery-small several times with it and still see this hang. Two failures with the LU-12424 patch are at
            https://testing.whamcloud.com/test_sessions/6a97872a-fc69-444f-afbf-d44bb93d11cf
            https://testing.whamcloud.com/test_sessions/6c9c2bb5-43c9-4ddd-a3a7-e5da94403e1a

            Amir, would you please comment on this issue and on whether the 35191 patch should fix it?

            jamesanunez James Nunez (Inactive) added a comment

            Closing as a duplicate, because patch https://review.whamcloud.com/35191 "LU-12424 lnet: prevent loop in LNetPrimaryNID()" has landed and should resolve the issue.

            adilger Andreas Dilger added a comment

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35325
            Subject: LU-10931 tests: stop running recovery-small 136
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 38b6a60c213999d03e4ae5000a2bfdb7db1121b6

            gerrit Gerrit Updater added a comment

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35301
            Subject: LU-10931 lnet: run testing against patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 08e1512871834b218f4f321743f9679e0a1b55e7

            gerrit Gerrit Updater added a comment

            Starting on 2019-06-08, we have been seeing recovery-small test 136 hang during DNE testing, with the following on an MDS:

            [  237.181455] Pid: 4761, comm: mdt00_005 3.10.0-957.12.2.el7_lustre.x86_64 #1 SMP Wed Jun 5 06:59:00 UTC 2019
            [  237.181455] Call Trace:
            [  237.181469]  [<ffffffffc0ccf894>] lnet_discover_peer_locked+0x124/0x3d0 [lnet]
            [  237.181476]  [<ffffffffc0ccfbb0>] LNetPrimaryNID+0x70/0x1a0 [lnet]
            [  237.181500]  [<ffffffffc10675fe>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
            [  237.181525]  [<ffffffffc106b664>] ptlrpc_send_reply+0x394/0x840 [ptlrpc]
            [  237.181548]  [<ffffffffc106bbdb>] ptlrpc_send_error+0x9b/0x1b0 [ptlrpc]
            [  237.181573]  [<ffffffffc106bd00>] ptlrpc_error+0x10/0x20 [ptlrpc]
            [  237.181613]  [<ffffffffc10da898>] tgt_request_handle+0xad8/0x15c0 [ptlrpc]
            [  237.181641]  [<ffffffffc107e7ee>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
            [  237.181666]  [<ffffffffc10822dc>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
            [  237.181668]  [<ffffffff9dcc1d21>] kthread+0xd1/0xe0
            [  237.181670]  [<ffffffff9e375c37>] ret_from_fork_nospec_end+0x0/0x39
            [  237.181675]  [<ffffffffffffffff>] 0xffffffffffffffff
            [  237.201321] Lustre: Skipped 2 previous similar messages
            [  240.228165] INFO: task mount.lustre:4570 blocked for more than 120 seconds.
            [  240.228901] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [  240.229677] mount.lustre    D ffff96b36ea2a080     0  4570   4569 0x00000080
            [  240.230546] Call Trace:
            [  240.230813]  [<ffffffff9e368e19>] schedule+0x29/0x70
            [  240.231317]  [<ffffffff9e366921>] schedule_timeout+0x221/0x2d0
            [  240.231899]  [<ffffffff9dcdc926>] ? select_task_rq_fair+0x5a6/0x760
            [  240.232679]  [<ffffffff9e3691cd>] wait_for_completion+0xfd/0x140
            [  240.233310]  [<ffffffff9dcd6ae0>] ? wake_up_state+0x20/0x20
            [  240.233901]  [<ffffffffc0d4bae4>] llog_process_or_fork+0x244/0x450 [obdclass]
            [  240.234615]  [<ffffffffc0d4bd04>] llog_process+0x14/0x20 [obdclass]
            [  240.235315]  [<ffffffffc0d7eca5>] class_config_parse_llog+0x125/0x350 [obdclass]
            [  240.236061]  [<ffffffffc0ffafd0>] mgc_process_cfg_log+0x790/0xc40 [mgc]
            [  240.236724]  [<ffffffffc0ffe4b9>] mgc_process_log+0x3d9/0x8f0 [mgc]
            [  240.237393]  [<ffffffffc0fff14f>] ? config_recover_log_add+0x13f/0x280 [mgc]
            [  240.238188]  [<ffffffffc0d86f00>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
            [  240.238976]  [<ffffffffc0fffb1b>] mgc_process_config+0x88b/0x13f0 [mgc]
            [  240.239643]  [<ffffffffc0d8ab18>] lustre_process_log+0x2d8/0xad0 [obdclass]
            [  240.240394]  [<ffffffffc0c3b177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [  240.241070]  [<ffffffffc0d75839>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [  240.241793]  [<ffffffffc0db9184>] server_start_targets+0x13a4/0x2a20 [obdclass]
            [  240.242578]  [<ffffffffc0d75961>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
            [  240.243321]  [<ffffffffc0d86f00>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
            [  240.244132]  [<ffffffffc0dbb8cc>] server_fill_super+0x10cc/0x1890 [obdclass]
            [  240.244857]  [<ffffffffc0c3b177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [  240.245524]  [<ffffffffc0d90798>] lustre_fill_super+0x328/0x950 [obdclass]
            [  240.246317]  [<ffffffffc0d90470>] ? lustre_common_put_super+0x270/0x270 [obdclass]
            [  240.247052]  [<ffffffff9de457cf>] mount_nodev+0x4f/0xb0
            [  240.247599]  [<ffffffffc0d88968>] lustre_mount+0x38/0x60 [obdclass]
            [  240.248301]  [<ffffffff9de4634e>] mount_fs+0x3e/0x1b0
            [  240.248848]  [<ffffffff9de63ec7>] vfs_kern_mount+0x67/0x110
            [  240.249404]  [<ffffffff9de664ef>] do_mount+0x1ef/0xce0
            [  240.249962]  [<ffffffff9de3e7aa>] ? __check_object_size+0x1ca/0x250
            [  240.250584]  [<ffffffff9de1caec>] ? kmem_cache_alloc_trace+0x3c/0x200
            [  240.251268]  [<ffffffff9de67323>] SyS_mount+0x83/0xd0
            [  240.251845]  [<ffffffff9e375ddb>] system_call_fastpath+0x22/0x27
            [  240.252445]  [<ffffffff9e375d21>] ? system_call_after_swapgs+0xae/0x146
            [  241.277138] Lustre: mdt00_006: service thread pid 4762 was inactive for 200.432 seconds. The thread might be hung, or it might only be slow and will resume later. 
            

            See the following for more logs
            https://testing.whamcloud.com/test_sets/c36ae980-8cc7-11e9-abe3-52540065bddc
            https://testing.whamcloud.com/test_sets/0eeb6f58-8bfe-11e9-a77a-52540065bddc

            jamesanunez James Nunez (Inactive) added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: jhammond John Hammond
              Votes: 0
              Watchers: 10
