[LU-13218] conf-sanity test 98 hangs in socknal_sd00_01: lnet_nid2peerni_locked Created: 07/Feb/20 Updated: 24/Mar/22 Resolved: 24/Mar/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
conf-sanity test_98 hangs for review-dne-zfs-part-3 for the patch https://review.whamcloud.com/37445/ for Looking at the hang at https://testing.whamcloud.com/test_sets/469de552-4869-11ea-b58e-52540065bddc, in the client1 console log we see LNET issues [19629.856282] Lustre: Unmounted lustre-client [19659.244701] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true [19659.384715] Key type lgssc unregistered [19661.272912] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19665.275764] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19673.280446] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19689.288850] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19721.304636] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19785.335214] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19809.073214] INFO: task socknal_sd00_01:4386 blocked for more than 120 seconds. [19809.074558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [19809.075866] socknal_sd00_01 D ffff8e17ffd1ac80 0 4386 2 0x00000080 [19809.077194] Call Trace: [19809.077665] [<ffffffffb0581929>] schedule_preempt_disabled+0x29/0x70 [19809.078735] [<ffffffffb057f8b7>] __mutex_lock_slowpath+0xc7/0x1d0 [19809.079762] [<ffffffffb057ec8f>] mutex_lock+0x1f/0x2f [19809.080741] [<ffffffffc0891e31>] lnet_nid2peerni_locked+0x71/0x150 [lnet] [19809.082031] [<ffffffffc087ed01>] lnet_parse+0x791/0x11f0 [lnet] [19809.083044] [<ffffffffc0916838>] ksocknal_process_receive+0x498/0xde0 [ksocklnd] [19809.084277] [<ffffffffc0917626>] ksocknal_scheduler+0x206/0xd50 [ksocklnd] [19809.085499] [<ffffffffafec72e0>] ? wake_up_atomic_t+0x30/0x30 [19809.086567] [<ffffffffc0917420>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd] [19809.087734] [<ffffffffafec61f1>] kthread+0xd1/0xe0 [19809.088612] [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40 [19809.089668] [<ffffffffb058dd37>] ret_from_fork_nospec_begin+0x21/0x21 [19809.090850] [<ffffffffafec6120>] ? insert_kthread_work+0x40/0x40 [19913.395403] LNet: 5951:0:(socklnd.c:2550:ksocknal_shutdown()) waiting for 1 peers to disconnect [19929.146713] INFO: task socknal_sd00_01:4386 blocked for more than 120 seconds. |
| Comments |
| Comment by Amir Shehata (Inactive) [ 11/Feb/20 ] |
|
This looks like a shutdown race condition. We call shutdown while a message from a peer is arriving. We lock the api mutex during shutdown. Then the receiving thread tries to lock it to create a peer NI, leading to this deadlock. the Shudown is waiting for the lnd thread to finish. We shouldn't be trying to create any new NIs while we're shutting down. |
| Comment by Jeremy Filizetti [ 24/Jun/20 ] |
|
I've seen this same thing with ko2iblnd on 2.12.5 yesterday as well. |
| Comment by Chris Horn [ 24/Mar/22 ] |