Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.14.0
-
TOSS 4.3 (based on RHEL 8.5)
4.18.0-348.7.1.1toss.t4.x86_64
lustre 2.14.0_10.llnl
-
3
Description
Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on our Lustre servers, we have had a problem with LNet.
lnetctl ping between nodes fails when InfiniBand is the underlying network. There is no indication of a problem with IB itself:
- "ping" (the unix utility) between the two nodes via IPoIB is successful, in either direction
- ib_write_bw between the two nodes via the IB network is successful, in either direction
When LNet starts, it begins reporting the following on the console:
LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110
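For readers triaging similar console output, here is a minimal sketch of pulling the peer NID and return code out of that message. The regular expression is written against the single line quoted above and is an assumption about the message format, not something taken from the LNet source. The helper name parse_recovery_failure and the small errno table are hypothetical. On Linux, errno 110 is ETIMEDOUT, so the recovery ping to the peer timed out rather than being refused.

```python
import re

# Linux errno values, hardcoded so the sketch does not depend on the host OS.
# Only a few values are listed; this table is illustrative, not exhaustive.
LINUX_ERRNO = {5: "EIO", 110: "ETIMEDOUT", 113: "EHOSTUNREACH"}

# Assumed message shape, based only on the console line quoted in this report.
RECOVERY_RE = re.compile(
    r"peer NI \((?P<nid>[^)]+)\) recovery failed with (?P<rc>-?\d+)"
)

def parse_recovery_failure(line):
    """Return the peer NID, return code, and errno name, or None if no match."""
    match = RECOVERY_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "nid": match.group("nid"),
        "rc": rc,
        # Kernel return codes are negative errno values, so negate before lookup.
        "errno_name": LINUX_ERRNO.get(-rc, "unknown"),
    }

if __name__ == "__main__":
    sample = (
        "LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) "
        "peer NI (172.19.1.54@o2ib100) recovery failed with -110"
    )
    print(parse_recovery_failure(sample))
```

Counting matches per NID over a full console log would show whether the timeouts are confined to one peer or affect every o2ib peer.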
Eventually, we see the following on the console:
INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
      Tainted: P OE --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:2 state:D stack: 0 pid: 5350 ppid: 2 flags:0x80004080
Workqueue: rdma_cm cma_work_handler [rdma_cm]
Call Trace:
 __schedule+0x2c0/0x770
 schedule+0x4c/0xc0
 schedule_preempt_disabled+0x11/0x20
 __mutex_lock.isra.6+0x343/0x550
 rdma_connect+0x1e/0x40 [rdma_cm]
 kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
 ? __switch_to_asm+0x41/0x70
 cma_cm_event_handler+0x25/0xf0 [rdma_cm]
 cma_work_handler+0x5a/0xb0 [rdma_cm]
 process_one_work+0x1ae/0x3a0
 worker_thread+0x3c/0x3c0
 ? create_worker+0x1a0/0x1a0
 kthread+0x12f/0x150
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x1f/0x40