[LU-15616] sanity-lnet test_226: Timeout occurred after 112 minutes, last suite running was sanity-lnet Created: 04/Mar/22  Updated: 23/Sep/22  Resolved: 03/Apr/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-15650 sanity-lnet: test_102 Timeout occurre... Resolved
is duplicated by LU-12148 conf-sanity test_64: timed out Closed
is duplicated by LU-13218 conf-sanity test 98 hangs in socknal_... Closed
Related
is related to LU-15705 sanity-lnet test_103: hang in lnetctl... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <chris.horn@hpe.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/01cd6eae-ac03-48f6-980d-3977224dfaad

test_226 failed with the following error:

Timeout occurred after 112 minutes, last suite running was sanity-lnet

LNetNIFini() and the discovery thread appear to have hit a deadlock:

[Thu Mar  3 19:42:41 2022] INFO: task lnet_discovery:424118 blocked for more than 120 seconds.
[Thu Mar  3 19:42:41 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_lustre.x86_64 #1
[Thu Mar  3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar  3 19:42:41 2022] lnet_discovery  D    0 424118      2 0x80004080
[Thu Mar  3 19:42:41 2022] Call Trace:
[Thu Mar  3 19:42:41 2022]  __schedule+0x2c4/0x700
[Thu Mar  3 19:42:41 2022]  schedule+0x38/0xa0
[Thu Mar  3 19:42:41 2022]  schedule_preempt_disabled+0xa/0x10
[Thu Mar  3 19:42:41 2022]  __mutex_lock.isra.5+0x2d0/0x4a0
[Thu Mar  3 19:42:41 2022]  lnet_peer_discovery+0x929/0x16c0 [lnet]
[Thu Mar  3 19:42:41 2022]  ? finish_wait+0x80/0x80
[Thu Mar  3 19:42:41 2022]  ? lnet_peer_merge_data+0xff0/0xff0 [lnet]
[Thu Mar  3 19:42:41 2022]  kthread+0x112/0x130
[Thu Mar  3 19:42:41 2022]  ? kthread_flush_work_fn+0x10/0x10
[Thu Mar  3 19:42:41 2022]  ret_from_fork+0x35/0x40
[Thu Mar  3 19:42:41 2022] INFO: task lnetctl:428295 blocked for more than 120 seconds.
[Thu Mar  3 19:42:41 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_lustre.x86_64 #1
[Thu Mar  3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar  3 19:42:41 2022] lnetctl         D    0 428295 428283 0x00004080
[Thu Mar  3 19:42:41 2022] Call Trace:
[Thu Mar  3 19:42:41 2022]  __schedule+0x2c4/0x700
[Thu Mar  3 19:42:41 2022]  ? __wake_up_common_lock+0x89/0xc0
[Thu Mar  3 19:42:41 2022]  schedule+0x38/0xa0
[Thu Mar  3 19:42:41 2022]  lnet_peer_discovery_stop+0x112/0x260 [lnet]
[Thu Mar  3 19:42:41 2022]  ? finish_wait+0x80/0x80
[Thu Mar  3 19:42:41 2022]  LNetNIFini+0x5e/0x100 [lnet]
[Thu Mar  3 19:42:41 2022]  lnet_ioctl+0x220/0x260 [lnet]
[Thu Mar  3 19:42:41 2022]  notifier_call_chain+0x47/0x70
[Thu Mar  3 19:42:41 2022]  blocking_notifier_call_chain+0x3e/0x60
[Thu Mar  3 19:42:41 2022]  libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[Thu Mar  3 19:42:41 2022]  do_vfs_ioctl+0xa4/0x640
[Thu Mar  3 19:42:41 2022]  ? syscall_trace_enter+0x1d3/0x2c0
[Thu Mar  3 19:42:41 2022]  ksys_ioctl+0x60/0x90
[Thu Mar  3 19:42:41 2022]  __x64_sys_ioctl+0x16/0x20
[Thu Mar  3 19:42:41 2022]  do_syscall_64+0x5b/0x1a0
[Thu Mar  3 19:42:41 2022]  entry_SYSCALL_64_after_hwframe+0x65/0xca

LNetNIFini() holds ln_api_mutex and is waiting for the discovery thread to stop, but the discovery thread must take that same mutex before it can make progress and exit, so neither side can advance.
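
To make the cycle concrete, here is a minimal userspace model of the hang (hypothetical names and pthreads stand-ins, not the actual LNet code): the shutdown side takes the API mutex and then waits for the worker to exit, while the worker must take that same mutex before it can finish.

/* deadlock_model.c -- build with: cc deadlock_model.c -lpthread
 * shutdown side == LNetNIFini() + lnet_peer_discovery_stop()
 * worker        == lnet_peer_discovery()
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *discovery_worker(void *arg)
{
        /* Stuck here, like lnet_peer_discovery() in __mutex_lock():
         * the mutex is already owned by the shutdown side below. */
        pthread_mutex_lock(&api_mutex);
        pthread_mutex_unlock(&api_mutex);
        return NULL;
}

int main(void)
{
        pthread_t worker;

        pthread_mutex_lock(&api_mutex);         /* LNetNIFini() takes ln_api_mutex */
        pthread_create(&worker, NULL, discovery_worker, NULL);
        printf("waiting for worker while holding the mutex...\n");
        /* lnet_peer_discovery_stop(): wait for the worker to exit while
         * still holding the mutex it needs -- guaranteed to hang. */
        pthread_join(worker, NULL);
        pthread_mutex_unlock(&api_mutex);       /* never reached */
        return 0;
}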


 Comments   
Comment by Chris Horn [ 04/Mar/22 ]

LU-12233 was another issue with deadlock on shutdown.

I've seen other variations of this. For example, an LNet shutdown initiated by lustre_rmmod:

crash_x86_64> bt
PID: 524217  TASK: ffff8fc35f65af80  CPU: 2   COMMAND: "rmmod"
 #0 [ffffa02f069abc68] __schedule at ffffffffae54e1d4
 #1 [ffffa02f069abd00] schedule at ffffffffae54e648
 #2 [ffffa02f069abd10] schedule_timeout at ffffffffae551cd3
 #3 [ffffa02f069abda8] kiblnd_shutdown at ffffffffc0fa67e4 [ko2iblnd]
 #4 [ffffa02f069abe28] lnet_shutdown_lndni at ffffffffc0ef0e43 [lnet]
 #5 [ffffa02f069abe60] lnet_shutdown_lndnet at ffffffffc0ef10ac [lnet]
 #6 [ffffa02f069abe88] lnet_shutdown_lndnets at ffffffffc0ef2029 [lnet]
 #7 [ffffa02f069abed0] LNetNIFini at ffffffffc0ef3ccb [lnet]
 #8 [ffffa02f069abed8] ptlrpc_exit at ffffffffc14e772e [ptlrpc]
 #9 [ffffa02f069abee0] __x64_sys_delete_module at ffffffffadd8a0f9
#10 [ffffa02f069abf38] do_syscall_64 at ffffffffadc0420b
#11 [ffffa02f069abf50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
    RIP: 00007f91fc94e72b  RSP: 00007ffcdb74c588  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 0000557518d633d0  RCX: 00007f91fc94e72b
    RDX: 000000000000000a  RSI: 0000000000000800  RDI: 0000557518d63438
    RBP: 0000000000000000   R8: 00007ffcdb74b501   R9: 0000000000000000
    R10: 00007f91fc9c2200  R11: 0000000000000206  R12: 00007ffcdb74c7b0
    R13: 00007ffcdb74e5f9  R14: 0000557518d612a0  R15: 0000557518d633d0
    ORIG_RAX: 00000000000000b0  CS: 0033  SS: 002b
crash_x86_64>

LNetNIFini() takes ln_api_mutex:

int
LNetNIFini(void)
{
        mutex_lock(&the_lnet.ln_api_mutex);
        ...

The shutdown seems to be stuck in kiblnd_shutdown():

*hornc@cflosbld09 LUS-10766 $ grep kiblnd_shutdown dk.log.fmt
00000800:00000200:1.0:1645008262.725728:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008265.797688:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008272.965750:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008288.325727:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008320.069708:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008384.581724:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:2.0:1645008514.629693:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:2.0:1645008775.749732:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:8.0:1645009299.013721:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:8.0:1645010346.565723:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:18.0:1645012442.693692:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:4.0:1645016635.973730:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:5.0:1645025023.557725:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
*hornc@cflosbld09 LUS-10766 $
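
The roughly doubling gaps between those messages suggest a wait loop with exponential backoff, along these lines (a sketch only, not the exact kiblnd_shutdown() source):

        /* Sketch: block in kiblnd_shutdown() until the last peer_ni
         * reference on this net is dropped, logging and doubling the
         * sleep each pass -- which would produce the timestamp pattern
         * in the dk log above. */
        int interval = 1;

        while (atomic_read(&net->ibn_npeers) != 0) {
                CDEBUG(D_NET, "%s: waiting for %d peers to disconnect\n",
                       libcfs_nid2str(ni->ni_nid),
                       atomic_read(&net->ibn_npeers));
                schedule_timeout_uninterruptible(cfs_time_seconds(interval));
                interval *= 2;
        }

If those peers can never disconnect (for example, because the thread servicing them is itself blocked on ln_api_mutex), this loop spins forever with the mutex still held.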

The hung tasks in the kernel log come from other processes running lctl commands:

crash_x86_64> ps | grep lctl
  620605  620604   6  ffff8fc2bf4c5f00  UN   0.0   32392   2656  lctl
  621757  621756   5  ffff8fad312797c0  UN   0.0   32392   2828  lctl
crash_x86_64> bt 620605
PID: 620605  TASK: ffff8fc2bf4c5f00  CPU: 6   COMMAND: "lctl"
 #0 [ffffa02f082c7c38] __schedule at ffffffffae54e1d4
 #1 [ffffa02f082c7cd0] schedule at ffffffffae54e648
 #2 [ffffa02f082c7ce0] schedule_preempt_disabled at ffffffffae54e98a
 #3 [ffffa02f082c7ce8] __mutex_lock at ffffffffae550690
 #4 [ffffa02f082c7d78] LNetNIInit at ffffffffc0ef5634 [lnet]
 #5 [ffffa02f082c7dd0] lnet_ioctl at ffffffffc0f16fdf [lnet]
 #6 [ffffa02f082c7de8] notifier_call_chain at ffffffffadd06197
 #7 [ffffa02f082c7e18] blocking_notifier_call_chain at ffffffffadd068be
 #8 [ffffa02f082c7e40] libcfs_psdev_ioctl at ffffffffc0c20cb6 [libcfs]
 #9 [ffffa02f082c7e80] do_vfs_ioctl at ffffffffadf2e084
#10 [ffffa02f082c7ef8] ksys_ioctl at ffffffffadf2e6c0
#11 [ffffa02f082c7f30] __x64_sys_ioctl at ffffffffadf2e706
#12 [ffffa02f082c7f38] do_syscall_64 at ffffffffadc0420b
#13 [ffffa02f082c7f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
    RIP: 00007fa4dc43563b  RSP: 00007ffee2603018  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 00007fa4ddbb8b20  RCX: 00007fa4dc43563b
    RDX: 00007ffee2603060  RSI: 00000000c0086532  RDI: 0000000000000003
    RBP: 00000000c0086532   R8: 00007fa4ddde5960   R9: 0000000000000003
    R10: 000000000000000f  R11: 0000000000000206  R12: 000055fd4563253b
    R13: 00007ffee2603060  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash_x86_64> bt 621757
PID: 621757  TASK: ffff8fad312797c0  CPU: 5   COMMAND: "lctl"
 #0 [ffffa02f06727c38] __schedule at ffffffffae54e1d4
 #1 [ffffa02f06727cd0] schedule at ffffffffae54e648
 #2 [ffffa02f06727ce0] schedule_preempt_disabled at ffffffffae54e98a
 #3 [ffffa02f06727ce8] __mutex_lock at ffffffffae550690
 #4 [ffffa02f06727d78] LNetNIInit at ffffffffc0ef5634 [lnet]
 #5 [ffffa02f06727dd0] lnet_ioctl at ffffffffc0f16fdf [lnet]
 #6 [ffffa02f06727de8] notifier_call_chain at ffffffffadd06197
 #7 [ffffa02f06727e18] blocking_notifier_call_chain at ffffffffadd068be
 #8 [ffffa02f06727e40] libcfs_psdev_ioctl at ffffffffc0c20cb6 [libcfs]
 #9 [ffffa02f06727e80] do_vfs_ioctl at ffffffffadf2e084
#10 [ffffa02f06727ef8] ksys_ioctl at ffffffffadf2e6c0
#11 [ffffa02f06727f30] __x64_sys_ioctl at ffffffffadf2e706
#12 [ffffa02f06727f38] do_syscall_64 at ffffffffadc0420b
#13 [ffffa02f06727f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
    RIP: 00007f881e90c63b  RSP: 00007ffca55ea0f8  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007f882008fb20  RCX: 00007f881e90c63b
    RDX: 00007ffca55ea140  RSI: 00000000c0086532  RDI: 0000000000000003
    RBP: 00000000c0086532   R8: 00007f88202bc960   R9: 0000000000000003
    R10: 000000000000000f  R11: 0000000000000202  R12: 0000561639d7753b
    R13: 00007ffca55ea140  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash_x86_64>

Both lctl commands are blocked trying to acquire ln_api_mutex. Since the mutex is held by the shutdown path, and the shutdown itself is stuck in kiblnd_shutdown(), the lctl commands hang as well.
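
The generic remedy for this class of hang is to never wait for another context while holding ln_api_mutex: publish the stop request through state the worker can observe without the mutex, drop the mutex, and only then wait. In the userspace model from the description, the shutdown side would become something like the following (again a sketch of the pattern, not the actual change made by the patch below):

#include <stdatomic.h>

static atomic_bool shutting_down;

static void shutdown_path_fixed(pthread_t worker)
{
        pthread_mutex_lock(&api_mutex);
        /* Publish the stop request through state the worker can read
         * without taking api_mutex. */
        atomic_store(&shutting_down, true);
        pthread_mutex_unlock(&api_mutex);       /* drop the mutex BEFORE waiting */
        pthread_join(worker, NULL);             /* worker can now run and exit */
}

Correspondingly, the worker would poll atomic_load(&shutting_down) instead of unconditionally taking the mutex, so it can notice the stop request and exit even while shutdown is in progress.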

Comment by Gerrit Updater [ 07/Mar/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46727
Subject: LU-15616 lnet: ln_api_mutex deadlocks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 60722c29610275f533b4ca8e28fa9852c570a52d

Comment by Chris Horn [ 01/Apr/22 ]

One more example of this with 2.12:

<root@ncn-w001>ps axOT | grep -Pi '^\s*\S+\s+\S+\s+D'
1117335 ?        D      0:00 [kiblnd_sd_02_00]
1142272 ?        D      0:00 lnetctl lnet unconfigure
<root@ncn-w001>cat /proc/1117335/stack
[<0>] lnet_nid2peerni_locked+0x53/0x170 [lnet]
[<0>] lnet_parse+0x4c9/0x12e0 [lnet]
[<0>] kiblnd_handle_rx+0x1bd/0x630 [ko2iblnd]
[<0>] kiblnd_scheduler+0xfc8/0x1020 [ko2iblnd]
[<0>] kthread+0x10d/0x130
[<0>] ret_from_fork+0x22/0x40
<root@ncn-w001>cat /proc/1142272/stack
[<0>] kiblnd_shutdown+0xfa/0x490 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x265/0x460 [lnet]
[<0>] lnet_shutdown_lndnet+0x67/0xd0 [lnet]
[<0>] lnet_shutdown_lndnets+0x119/0x2d0 [lnet]
[<0>] LNetNIFini+0x8b/0x100 [lnet]
[<0>] lnet_ioctl+0x222/0x250 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x3e/0x60
[<0>] libcfs_ioctl+0xab/0x4a0 [libcfs]
[<0>] libcfs_psdev_ioctl+0xba/0xd0 [libcfs]
[<0>] do_vfs_ioctl+0xa0/0x680
[<0>] ksys_ioctl+0x70/0x80
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1e0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
<root@ncn-w001>

Here the ko2iblnd scheduler thread is blocked in lnet_nid2peerni_locked(), whose slow path serializes peer_ni creation on ln_api_mutex. With that mutex held by "lnetctl lnet unconfigure", the peer being serviced can never disconnect, so kiblnd_shutdown() waits forever.

Comment by Gerrit Updater [ 03/Apr/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46727/
Subject: LU-15616 lnet: ln_api_mutex deadlocks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 22de0bd145b649768b16dd42559d326af3c13200

Comment by Peter Jones [ 03/Apr/22 ]

Landed for 2.15
