[LU-15534] failed to ping 172.19.1.27@o2ib100: Input/output error Created: 08/Feb/22 Updated: 11/Feb/22 Resolved: 10/Feb/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Gian-Carlo Defazio | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | | |
| Environment: | TOSS 4.3 (based on RHEL 8.5) |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on our Lustre servers, we've had an issue with LNet: we are unable to successfully lnetctl ping between nodes when InfiniBand is the underlying network. There is no indication of problems with IB itself.
When LNet starts, it begins reporting the following on the console:

LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110

Eventually, we see the following on the console:

INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
Tainted: P OE --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:2 state:D stack: 0 pid: 5350 ppid: 2 flags:0x80004080
Workqueue: rdma_cm cma_work_handler [rdma_cm]
Call Trace:
 __schedule+0x2c0/0x770
 schedule+0x4c/0xc0
 schedule_preempt_disabled+0x11/0x20
 __mutex_lock.isra.6+0x343/0x550
 rdma_connect+0x1e/0x40 [rdma_cm]
 kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
 ? __switch_to_asm+0x41/0x70
 cma_cm_event_handler+0x25/0xf0 [rdma_cm]
 cma_work_handler+0x5a/0xb0 [rdma_cm]
 process_one_work+0x1ae/0x3a0
 worker_thread+0x3c/0x3c0
 ? create_worker+0x1a0/0x1a0
 kthread+0x12f/0x150
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x1f/0x40
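For reference, the failing test is a plain lnetctl ping to a peer NID. A representative invocation follows (a sketch: the NIDs are from our garter nodes, and the error block is paraphrased from the ticket title, so exact lnetctl output formatting may vary by version):

[root@garter5:~]# lnetctl ping 172.19.1.138@o2ib100
manage:
    - ping:
          errno: -5
          descr: "failed to ping 172.19.1.138@o2ib100: Input/output error" |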
| Comments |
| Comment by Gian-Carlo Defazio [ 08/Feb/22 ] |
|
For my notes, the local ticket is at https://lc.llnl.gov/jira/browse/TOSS-5521 |
| Comment by Gian-Carlo Defazio [ 08/Feb/22 ] |
|
So far we've seen this issue only with the RHEL 8.5 kernel and Lustre 2.14. The previous version of TOSS, TOSS 4.2-4, is based on RHEL 8.4 and doesn't have this issue. We also haven't seen it on any TOSS 3 systems, which are based on RHEL 7.x and run Lustre 2.12 or 2.10. |
| Comment by Gian-Carlo Defazio [ 08/Feb/22 ] |
|
This was first noticed on our new storage hardware, which includes the garter cluster.

[root@garter1:~]# ibstat
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.31.1014
    Hardware version: 0
    Node GUID: 0x0c42a103008ee90a
    System image GUID: 0x0c42a103008ee90a
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 391
        LMC: 0
        SM lid: 363
        Capability mask: 0x2659e848
        Port GUID: 0x0c42a103008ee90a
        Link layer: InfiniBand

The subnet manager listed, orelic1, is correct:

[root@garter1:~]# ibnetdiscover | grep "lid 363"
[2] "H-506b4b0300da6764"[1](506b4b0300da6764) # "orelic1 mlx5_0" lid 363 4xEDR
[1](506b4b0300da6764) "S-248a0703006d13c0"[2] # lid 363 lmc 0 "SwitchIB Mellanox Technologies" lid 352 4xEDR

The issue is also present on the boa cluster, which has the same hardware as garter.

[root@boai:defazio1]# ibstat
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.31.1014
    Hardware version: 0
    Node GUID: 0x0c42a10300dace36
    System image GUID: 0x0c42a10300dace36
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 228
        LMC: 0
        SM lid: 5
        Capability mask: 0x2659e848
        Port GUID: 0x0c42a10300dace36
        Link layer: InfiniBand

It also has the correct subnet manager, zrelic1:

[root@boai:defazio1]# ibnetdiscover | grep "lid 5 "
[5] "H-7cfe9003000f382e"[1](7cfe9003000f382e) # "zrelic1 mlx5_0" lid 5 4xEDR
[1](7cfe9003000f382e) "S-7cfe900300b67590"[5] # lid 5 lmc 0 "SwitchIB Mellanox Technologies" lid 23 4xEDR

An older cluster, slag, has the same issue as well.

[root@slag3:~]# ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.28.2006
    Hardware version: 0
    Node GUID: 0x506b4b0300c23712
    System image GUID: 0x506b4b0300c23712
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 359
        LMC: 0
        SM lid: 363
        Capability mask: 0x2659e848
        Port GUID: 0x506b4b0300c23712
        Link layer: InfiniBand |
| Comment by Olaf Faaland [ 08/Feb/22 ] |
|
We do not see this issue with |
| Comment by Peter Jones [ 08/Feb/22 ] |
|
Serguei, could you please assist with this one? Peter |
| Comment by Serguei Smirnov [ 08/Feb/22 ] |
|
Hi, could you please provide net debug for the failing ping test?

lctl set_param debug=+net
<--- run test --->
lctl dk > log.txt
lctl set_param debug=-net

Also, could you please provide the configuration script log? Thanks, Serguei. |
| Comment by Gian-Carlo Defazio [ 08/Feb/22 ] |
|
Hi Serguei, I've uploaded some files. 2.14.0_10.llnl-x86_64-build.log.gz is the full build log, and 2.14.0_10.llnl-x86_64-config.log.gz is the same log trimmed down to just the configure portion. I did an lnetctl ping from garter5 (172.19.1.137@o2ib100) to garter6 (172.19.1.138@o2ib100) and included the debug logs for both nodes in garter5_ping-send_2022-02-08_10-53-41 and garter6_ping-receive_2022-02-08_10-53-48. |
| Comment by Serguei Smirnov [ 08/Feb/22 ] |
|
This looks very similar to an existing issue. Which MOFED version are you using? Thanks, Serguei
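P.S. If MOFED is installed, ofed_info can report its version (a sketch; the version string below is illustrative only, and in-kernel OFED does not ship ofed_info):

# ofed_info -s
MLNX_OFED_LINUX-5.4-1.0.3.0: |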
| Comment by Gian-Carlo Defazio [ 08/Feb/22 ] |
|
We are using OFED. |
| Comment by Gian-Carlo Defazio [ 09/Feb/22 ] |
|
Looking at the source for the two kernels, I do not see rdma_connect_locked() in the 4.18.0-305.19.1.el8_4 kernel used to build TOSS 4.2-4, but I do see it in the 4.18.0-348.2.1.el8_5 kernel used to build TOSS 4.3-1.
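A quick way to repeat that check (a sketch; the paths assume extracted source trees for the two kernels, and the declaration lives in include/rdma/rdma_cm.h on kernels that have it):

# Present in the RHEL 8.5 kernel behind TOSS 4.3-1:
grep -n "rdma_connect_locked" linux-4.18.0-348.2.1.el8_5/include/rdma/rdma_cm.h
# Absent from the RHEL 8.4 kernel behind TOSS 4.2-4 (no output expected):
grep -n "rdma_connect_locked" linux-4.18.0-305.19.1.el8_4/include/rdma/rdma_cm.h

This also lines up with the hung task above: on kernels that add rdma_connect_locked(), rdma_connect() takes the CM handler mutex itself, so calling it from the CM event callback (the kiblnd_cm_callback() frame in the trace) deadlocks on that mutex, and the callback path needs the _locked variant. |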
| Comment by Gian-Carlo Defazio [ 10/Feb/22 ] |
|
Applying the patch fixed the issue. Thanks! |
| Comment by Gian-Carlo Defazio [ 10/Feb/22 ] |
|
The issue was fixed by an existing patch that was landed for 2.15. Our local 2.12 branch already had the b2_12 backport of the patch and never experienced this issue.
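For anyone checking other branches: assuming the fix is the rdma_connect_locked() compatibility change discussed above, a simple way to see whether a given Lustre tree already carries it is to grep the o2iblnd driver for the new symbol (a sketch, run from a source checkout):

# No output means the branch still calls rdma_connect() unconditionally.
grep -rn "rdma_connect_locked" lnet/klnds/o2iblnd/ |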