Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.12.5
-
router node
lustre-2.12.5_4.llnl-1.1mofed.ch6.x86_64
3.10.0-1127.18.2.1chaos.ch6.x86_64
-
3
-
9223372036854775807
Description
lctl and lnetctl commands hang.
console log messages on router:
[Thu Aug 6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds [Thu Aug 6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds [Thu Aug 6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8 [Thu Aug 6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds [Thu Aug 6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8
Watchdog complains about hung kernel tasks:
[Thu Aug 6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds. [Thu Aug 6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Aug 6 13:47:28 2020] kworker/43:1 D ffff98cc7df2e2a0 0 584 2 0x00000080 [Thu Aug 6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm] [Thu Aug 6 13:47:28 2020] Call Trace: [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90 [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250 [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42 [Thu Aug 6 13:47:28 2020] [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet] [Thu Aug 6 13:47:28 2020] [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet] [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd] [Thu Aug 6 13:47:28 2020] [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0 [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd] [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd] [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd] [Thu Aug 6 13:47:28 2020] [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0 [Thu Aug 6 13:47:28 2020] [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm] [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm] [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm] [Thu Aug 6 13:47:28 2020] [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90 [Thu Aug 6 13:47:29 2020] [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0 [Thu Aug 6 13:47:29 2020] [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0 [Thu Aug 6 13:47:29 2020] [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0 [Thu Aug 6 13:47:29 2020] [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430 [Thu Aug 6 13:47:29 2020] [<ffffffffa94cca01>] kthread+0xd1/0xe0 [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40 [Thu Aug 6 13:47:29 2020] [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21 [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
utility commands hang on router with these stacks:
[root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done UID PID PPID C STIME TTY TIME CMD root 21845 1 0 13:49 ? 00:00:00 lctl list_nids [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet] [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet] [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70 [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70 [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20 [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs] [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs] [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0 [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0 [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a [<ffffffffffffffff>] 0xffffffffffffffff UID PID PPID C STIME TTY TIME CMD root 25481 1 0 14:07 ? 00:00:00 lctl ping 192.168.128.4@o2ib36 [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet] [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet] [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70 [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70 [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20 [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs] [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs] [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0 [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0 [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a [<ffffffffffffffff>] 0xffffffffffffffff UID PID PPID C STIME TTY TIME CMD root 18055 25762 0 14:28 pts/3 00:00:00 lnetctl lnet configure [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet] [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet] [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70 [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70 [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20 [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs] [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs] [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0 [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0 [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a [<ffffffffffffffff>] 0xffffffffffffffff
The router node is a dual-socket AMD EPYC 7401 24-Core Processor. The internal fabric is managed by a Mellanox SM and Lustre was built against MOFED.