[LU-13218] conf-sanity test 98 hangs in socknal_sd00_01: lnet_nid2peerni_locked Created: 07/Feb/20  Updated: 24/Mar/22  Resolved: 24/Mar/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-15616 sanity-lnet test_226: Timeout occurre... Resolved
duplicates LU-15618 ksock_conn ref leak on shutdown Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_98 hangs for review-dne-zfs-part-3 for the patch https://review.whamcloud.com/37445/ for LU-12593. We've seen this test hang with the errors below once only so far.

Looking at the hang at https://testing.whamcloud.com/test_sets/469de552-4869-11ea-b58e-52540065bddc, in the client1 console log we see LNET issues

[19629.856282] Lustre: Unmounted lustre-client
[19659.244701] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
[19659.384715] Key type lgssc unregistered
[19661.272912] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19665.275764] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19673.280446] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19689.288850] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19721.304636] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19785.335214] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19809.073214] INFO: task socknal_sd00_01:4386 blocked for more than 120 seconds.
[19809.074558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19809.075866] socknal_sd00_01 D ffff8e17ffd1ac80     0  4386      2 0x00000080
[19809.077194] Call Trace:
[19809.077665]  [<ffffffffb0581929>] schedule_preempt_disabled+0x29/0x70
[19809.078735]  [<ffffffffb057f8b7>] __mutex_lock_slowpath+0xc7/0x1d0
[19809.079762]  [<ffffffffb057ec8f>] mutex_lock+0x1f/0x2f
[19809.080741]  [<ffffffffc0891e31>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
[19809.082031]  [<ffffffffc087ed01>] lnet_parse+0x791/0x11f0 [lnet]
[19809.083044]  [<ffffffffc0916838>] ksocknal_process_receive+0x498/0xde0 [ksocklnd]
[19809.084277]  [<ffffffffc0917626>] ksocknal_scheduler+0x206/0xd50 [ksocklnd]
[19809.085499]  [<ffffffffafec72e0>] ? wake_up_atomic_t+0x30/0x30
[19809.086567]  [<ffffffffc0917420>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[19809.087734]  [<ffffffffafec61f1>] kthread+0xd1/0xe0
[19809.088612]  [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
[19809.089668]  [<ffffffffb058dd37>] ret_from_fork_nospec_begin+0x21/0x21
[19809.090850]  [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40
[19913.395403] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect
[19929.146713] INFO: task socknal_sd00_01:4386 blocked for more than 120 seconds.


 Comments   
Comment by Amir Shehata (Inactive) [ 11/Feb/20 ]

This looks like a shutdown race condition. We call shutdown while a message from a peer is arriving. We lock the api mutex during shutdown. Then the receiving thread tries to lock it to create a peer NI, leading to this deadlock. the Shudown is waiting for the lnd thread to finish.

We shouldn't be trying to create any new NIs while we're shutting down.

Comment by Jeremy Filizetti [ 24/Jun/20 ]

I've seen this same thing with ko2iblnd on 2.12.5 yesterday as well.

Comment by Chris Horn [ 24/Mar/22 ]

Looks like LU-15616 or LU-15618 (or both) were hit.

Generated at Sat Feb 10 02:59:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.