[LU-15616] sanity-lnet test_226: Timeout occurred after 112 minutes, last suite running was sanity-lnet
| Created: | 04/Mar/22 | Updated: | 23/Sep/22 | Resolved: | 03/Apr/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Chris Horn <chris.horn@hpe.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/01cd6eae-ac03-48f6-980d-3977224dfaad

test_226 failed with the following error:

Timeout occurred after 112 minutes, last suite running was sanity-lnet

LNetNIFini() and the discovery thread appear to have hit a deadlock:

[Thu Mar 3 19:42:41 2022] INFO: task lnet_discovery:424118 blocked for more than 120 seconds.
[Thu Mar 3 19:42:41 2022] Tainted: G OE --------- - - 4.18.0-240.22.1.el8_lustre.x86_64 #1
[Thu Mar 3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar 3 19:42:41 2022] lnet_discovery D 0 424118 2 0x80004080
[Thu Mar 3 19:42:41 2022] Call Trace:
[Thu Mar 3 19:42:41 2022] __schedule+0x2c4/0x700
[Thu Mar 3 19:42:41 2022] schedule+0x38/0xa0
[Thu Mar 3 19:42:41 2022] schedule_preempt_disabled+0xa/0x10
[Thu Mar 3 19:42:41 2022] __mutex_lock.isra.5+0x2d0/0x4a0
[Thu Mar 3 19:42:41 2022] lnet_peer_discovery+0x929/0x16c0 [lnet]
[Thu Mar 3 19:42:41 2022] ? finish_wait+0x80/0x80
[Thu Mar 3 19:42:41 2022] ? lnet_peer_merge_data+0xff0/0xff0 [lnet]
[Thu Mar 3 19:42:41 2022] kthread+0x112/0x130
[Thu Mar 3 19:42:41 2022] ? kthread_flush_work_fn+0x10/0x10
[Thu Mar 3 19:42:41 2022] ret_from_fork+0x35/0x40
[Thu Mar 3 19:42:41 2022] INFO: task lnetctl:428295 blocked for more than 120 seconds.
[Thu Mar 3 19:42:41 2022] Tainted: G OE --------- - - 4.18.0-240.22.1.el8_lustre.x86_64 #1
[Thu Mar 3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar 3 19:42:41 2022] lnetctl D 0 428295 428283 0x00004080
[Thu Mar 3 19:42:41 2022] Call Trace:
[Thu Mar 3 19:42:41 2022] __schedule+0x2c4/0x700
[Thu Mar 3 19:42:41 2022] ? __wake_up_common_lock+0x89/0xc0
[Thu Mar 3 19:42:41 2022] schedule+0x38/0xa0
[Thu Mar 3 19:42:41 2022] lnet_peer_discovery_stop+0x112/0x260 [lnet]
[Thu Mar 3 19:42:41 2022] ? finish_wait+0x80/0x80
[Thu Mar 3 19:42:41 2022] LNetNIFini+0x5e/0x100 [lnet]
[Thu Mar 3 19:42:41 2022] lnet_ioctl+0x220/0x260 [lnet]
[Thu Mar 3 19:42:41 2022] notifier_call_chain+0x47/0x70
[Thu Mar 3 19:42:41 2022] blocking_notifier_call_chain+0x3e/0x60
[Thu Mar 3 19:42:41 2022] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[Thu Mar 3 19:42:41 2022] do_vfs_ioctl+0xa4/0x640
[Thu Mar 3 19:42:41 2022] ? syscall_trace_enter+0x1d3/0x2c0
[Thu Mar 3 19:42:41 2022] ksys_ioctl+0x60/0x90
[Thu Mar 3 19:42:41 2022] __x64_sys_ioctl+0x16/0x20
[Thu Mar 3 19:42:41 2022] do_syscall_64+0x5b/0x1a0
[Thu Mar 3 19:42:41 2022] entry_SYSCALL_64_after_hwframe+0x65/0xca

LNetNIFini() has the ln_api_mutex and is waiting for the discovery thread to stop. The discovery thread needs the mutex to progress. |
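The deadlock can be illustrated outside the kernel. The following is a minimal userspace analogue (plain C with pthreads, not LNet code; api_mutex stands in for ln_api_mutex and the worker for the lnet_discovery kthread): the stop path holds the mutex while waiting for a worker that needs that same mutex to notice the stop request, so both sides block forever.

/*
 * deadlock_demo.c - userspace analogue of the LNetNIFini()/lnet_discovery
 * deadlock (illustration only, not LNet code).
 * Build: cc -pthread deadlock_demo.c -o deadlock_demo
 * The program hangs by design, mirroring the hung-task traces above.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER; /* stands in for ln_api_mutex */
static bool stop_requested;

/* stands in for the lnet_discovery kthread: every pass needs api_mutex */
static void *discovery_worker(void *arg)
{
	for (;;) {
		sleep(1);                          /* "discovery work" */
		pthread_mutex_lock(&api_mutex);    /* blocks forever once main() holds the mutex */
		bool stop = stop_requested;
		pthread_mutex_unlock(&api_mutex);
		if (stop)
			break;
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, discovery_worker, NULL);
	sleep(2);

	/* stands in for LNetNIFini() -> lnet_peer_discovery_stop(): the stop
	 * request is made and the join happens with api_mutex still held */
	pthread_mutex_lock(&api_mutex);
	stop_requested = true;
	pthread_join(tid, NULL);                   /* deadlock: worker needs api_mutex to exit */
	pthread_mutex_unlock(&api_mutex);

	puts("never reached");
	return 0;
}

The analogue hangs after a couple of seconds with main() stuck in pthread_join() and the worker stuck in pthread_mutex_lock(), matching the lnetctl and lnet_discovery hung-task stacks respectively.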
| Comments |
| Comment by Chris Horn [ 04/Mar/22 ] |
|
I've seen other variations of this. For example:

crash_x86_64> bt
PID: 524217 TASK: ffff8fc35f65af80 CPU: 2 COMMAND: "rmmod"
#0 [ffffa02f069abc68] __schedule at ffffffffae54e1d4
#1 [ffffa02f069abd00] schedule at ffffffffae54e648
#2 [ffffa02f069abd10] schedule_timeout at ffffffffae551cd3
#3 [ffffa02f069abda8] kiblnd_shutdown at ffffffffc0fa67e4 [ko2iblnd]
#4 [ffffa02f069abe28] lnet_shutdown_lndni at ffffffffc0ef0e43 [lnet]
#5 [ffffa02f069abe60] lnet_shutdown_lndnet at ffffffffc0ef10ac [lnet]
#6 [ffffa02f069abe88] lnet_shutdown_lndnets at ffffffffc0ef2029 [lnet]
#7 [ffffa02f069abed0] LNetNIFini at ffffffffc0ef3ccb [lnet]
#8 [ffffa02f069abed8] ptlrpc_exit at ffffffffc14e772e [ptlrpc]
#9 [ffffa02f069abee0] __x64_sys_delete_module at ffffffffadd8a0f9
#10 [ffffa02f069abf38] do_syscall_64 at ffffffffadc0420b
#11 [ffffa02f069abf50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
RIP: 00007f91fc94e72b RSP: 00007ffcdb74c588 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: 0000557518d633d0 RCX: 00007f91fc94e72b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000557518d63438
RBP: 0000000000000000 R8: 00007ffcdb74b501 R9: 0000000000000000
R10: 00007f91fc9c2200 R11: 0000000000000206 R12: 00007ffcdb74c7b0
R13: 00007ffcdb74e5f9 R14: 0000557518d612a0 R15: 0000557518d633d0
ORIG_RAX: 00000000000000b0 CS: 0033 SS: 002b
crash_x86_64>
LNetNIFini() takes ln_api_mutex:

LNetNIFini(void)
{
	mutex_lock(&the_lnet.ln_api_mutex);
The shutdown seems to be stuck in kiblnd_shutdown():

hornc@cflosbld09 LUS-10766 $ grep kiblnd_shutdown dk.log.fmt
00000800:00000200:1.0:1645008262.725728:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008265.797688:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008272.965750:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008288.325727:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008320.069708:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:1.0:1645008384.581724:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:2.0:1645008514.629693:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:2.0:1645008775.749732:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:8.0:1645009299.013721:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:8.0:1645010346.565723:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:18.0:1645012442.693692:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:4.0:1645016635.973730:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
00000800:00000200:5.0:1645025023.557725:0:524217:0:(o2iblnd.c:3050:kiblnd_shutdown()) 10.12.0.50@o2ib4000: waiting for 1 peers to disconnect
hornc@cflosbld09 LUS-10766 $

The hung tasks in the kern log come from someone running lctl commands:

crash_x86_64> ps | grep lctl
620605 620604 6 ffff8fc2bf4c5f00 UN 0.0 32392 2656 lctl
621757 621756 5 ffff8fad312797c0 UN 0.0 32392 2828 lctl
crash_x86_64> bt 620605
PID: 620605 TASK: ffff8fc2bf4c5f00 CPU: 6 COMMAND: "lctl"
#0 [ffffa02f082c7c38] __schedule at ffffffffae54e1d4
#1 [ffffa02f082c7cd0] schedule at ffffffffae54e648
#2 [ffffa02f082c7ce0] schedule_preempt_disabled at ffffffffae54e98a
#3 [ffffa02f082c7ce8] __mutex_lock at ffffffffae550690
#4 [ffffa02f082c7d78] LNetNIInit at ffffffffc0ef5634 [lnet]
#5 [ffffa02f082c7dd0] lnet_ioctl at ffffffffc0f16fdf [lnet]
#6 [ffffa02f082c7de8] notifier_call_chain at ffffffffadd06197
#7 [ffffa02f082c7e18] blocking_notifier_call_chain at ffffffffadd068be
#8 [ffffa02f082c7e40] libcfs_psdev_ioctl at ffffffffc0c20cb6 [libcfs]
#9 [ffffa02f082c7e80] do_vfs_ioctl at ffffffffadf2e084
#10 [ffffa02f082c7ef8] ksys_ioctl at ffffffffadf2e6c0
#11 [ffffa02f082c7f30] __x64_sys_ioctl at ffffffffadf2e706
#12 [ffffa02f082c7f38] do_syscall_64 at ffffffffadc0420b
#13 [ffffa02f082c7f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
RIP: 00007fa4dc43563b RSP: 00007ffee2603018 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: 00007fa4ddbb8b20 RCX: 00007fa4dc43563b
RDX: 00007ffee2603060 RSI: 00000000c0086532 RDI: 0000000000000003
RBP: 00000000c0086532 R8: 00007fa4ddde5960 R9: 0000000000000003
R10: 000000000000000f R11: 0000000000000206 R12: 000055fd4563253b
R13: 00007ffee2603060 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
crash_x86_64> bt 621757
PID: 621757 TASK: ffff8fad312797c0 CPU: 5 COMMAND: "lctl"
#0 [ffffa02f06727c38] __schedule at ffffffffae54e1d4
#1 [ffffa02f06727cd0] schedule at ffffffffae54e648
#2 [ffffa02f06727ce0] schedule_preempt_disabled at ffffffffae54e98a
#3 [ffffa02f06727ce8] __mutex_lock at ffffffffae550690
#4 [ffffa02f06727d78] LNetNIInit at ffffffffc0ef5634 [lnet]
#5 [ffffa02f06727dd0] lnet_ioctl at ffffffffc0f16fdf [lnet]
#6 [ffffa02f06727de8] notifier_call_chain at ffffffffadd06197
#7 [ffffa02f06727e18] blocking_notifier_call_chain at ffffffffadd068be
#8 [ffffa02f06727e40] libcfs_psdev_ioctl at ffffffffc0c20cb6 [libcfs]
#9 [ffffa02f06727e80] do_vfs_ioctl at ffffffffadf2e084
#10 [ffffa02f06727ef8] ksys_ioctl at ffffffffadf2e6c0
#11 [ffffa02f06727f30] __x64_sys_ioctl at ffffffffadf2e706
#12 [ffffa02f06727f38] do_syscall_64 at ffffffffadc0420b
#13 [ffffa02f06727f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
RIP: 00007f881e90c63b RSP: 00007ffca55ea0f8 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 00007f882008fb20 RCX: 00007f881e90c63b
RDX: 00007ffca55ea140 RSI: 00000000c0086532 RDI: 0000000000000003
RBP: 00000000c0086532 R8: 00007f88202bc960 R9: 0000000000000003
R10: 000000000000000f R11: 0000000000000202 R12: 0000561639d7753b
R13: 00007ffca55ea140 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
crash_x86_64>
The lctl commands attempt to acquire the ln_api_mutex. Since it is held by the shutdown code, and the shutdown is stuck, the lctl commands hang as well. |
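A common way to break this class of deadlock is to publish the stop request under the mutex but release the mutex before any unbounded wait, so the blocked side can acquire it, observe the request, and exit. The sketch below (userspace C with pthreads, hypothetical names; an illustration of the pattern only, not a description of what the change at https://review.whamcloud.com/46727 actually does) shows the stop completing once the wait happens with the mutex dropped.

/*
 * stop_fix_demo.c - generic illustration of avoiding the deadlock by not
 * holding the shared mutex across the wait for the worker to exit.
 * Hypothetical names; not the actual LNet fix.
 * Build: cc -pthread stop_fix_demo.c -o stop_fix_demo
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER; /* ln_api_mutex stand-in */
static bool stop_requested;

static void *worker(void *arg)
{
	for (;;) {
		sleep(1);
		pthread_mutex_lock(&api_mutex);   /* succeeds: the stop path dropped the mutex */
		bool stop = stop_requested;
		pthread_mutex_unlock(&api_mutex);
		if (stop)
			break;
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	sleep(2);

	pthread_mutex_lock(&api_mutex);
	stop_requested = true;                    /* publish the stop request ... */
	pthread_mutex_unlock(&api_mutex);         /* ... then drop the mutex ... */

	pthread_join(tid, NULL);                  /* ... and only then wait: no deadlock */

	pthread_mutex_lock(&api_mutex);
	/* the rest of the teardown can retake the mutex here */
	pthread_mutex_unlock(&api_mutex);

	puts("worker stopped, teardown complete");
	return 0;
}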
| Comment by Gerrit Updater [ 07/Mar/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46727 |
| Comment by Chris Horn [ 01/Apr/22 ] |
|
One more example of this with 2.12:

<root@ncn-w001>ps axOT | grep -Pi '^\s*\S+\s+\S+\s+D'
1117335 ? D 0:00 [kiblnd_sd_02_00]
1142272 ? D 0:00 lnetctl lnet unconfigure
<root@ncn-w001>cat /proc/1117335/stack
[<0>] lnet_nid2peerni_locked+0x53/0x170 [lnet]
[<0>] lnet_parse+0x4c9/0x12e0 [lnet]
[<0>] kiblnd_handle_rx+0x1bd/0x630 [ko2iblnd]
[<0>] kiblnd_scheduler+0xfc8/0x1020 [ko2iblnd]
[<0>] kthread+0x10d/0x130
[<0>] ret_from_fork+0x22/0x40
<root@ncn-w001>cat /proc/1142272/stack
[<0>] kiblnd_shutdown+0xfa/0x490 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x265/0x460 [lnet]
[<0>] lnet_shutdown_lndnet+0x67/0xd0 [lnet]
[<0>] lnet_shutdown_lndnets+0x119/0x2d0 [lnet]
[<0>] LNetNIFini+0x8b/0x100 [lnet]
[<0>] lnet_ioctl+0x222/0x250 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x3e/0x60
[<0>] libcfs_ioctl+0xab/0x4a0 [libcfs]
[<0>] libcfs_psdev_ioctl+0xba/0xd0 [libcfs]
[<0>] do_vfs_ioctl+0xa0/0x680
[<0>] ksys_ioctl+0x70/0x80
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1e0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
<root@ncn-w001 |
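For reference, the repeated "waiting for N peers to disconnect" messages quoted in the earlier dk.log excerpt come from kiblnd_shutdown() polling until all of the NI's peers have disconnected, and the stacks above show the lnetctl thread sitting in that wait while a kiblnd scheduler thread is itself blocked in lnet_parse(). A self-contained sketch of that shape follows (the counter, the one-second poll, and the idea that a blocked receive thread is what keeps the count from draining are illustrative assumptions, not the ko2iblnd source):

/*
 * drain_wait_demo.c - illustration of the shutdown-side wait seen in the
 * dk.log excerpt above: teardown polls a peer/reference count once a second
 * and cannot finish while another thread still holds a reference.
 * Thread roles and names are assumptions for illustration, not ko2iblnd code.
 * Build: cc -pthread drain_wait_demo.c -o drain_wait_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int npeers = 1;          /* stands in for the per-NI peer count */
static pthread_mutex_t stall = PTHREAD_MUTEX_INITIALIZER;

/* stands in for a receive/scheduler thread that would drop the last
 * reference, but is itself blocked (here: on a mutex held by main) */
static void *rx_handler(void *arg)
{
	pthread_mutex_lock(&stall);    /* blocked, so npeers never drains */
	atomic_fetch_sub(&npeers, 1);
	pthread_mutex_unlock(&stall);
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int waited = 0;

	pthread_mutex_lock(&stall);    /* keep rx_handler() blocked for a while */
	pthread_create(&tid, NULL, rx_handler, NULL);

	/* shutdown side: same shape as the dk.log messages above */
	while (atomic_load(&npeers) != 0) {
		printf("waiting for %d peers to disconnect\n", atomic_load(&npeers));
		sleep(1);
		if (++waited == 5)
			pthread_mutex_unlock(&stall);   /* unblock, so the demo can finish */
	}

	pthread_join(tid, NULL);
	puts("all peers disconnected, shutdown can proceed");
	return 0;
}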
| Comment by Gerrit Updater [ 03/Apr/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46727/ |
| Comment by Peter Jones [ 03/Apr/22 ] |
|
Landed for 2.15 |