[LU-15534] failed to ping 172.19.1.27@o2ib100: Input/output error Created: 08/Feb/22  Updated: 11/Feb/22  Resolved: 10/Feb/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Gian-Carlo Defazio Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

TOSS 4.3 (based on RHEL 8.5)
4.18.0-348.7.1.1toss.t4.x86_64
lustre 2.14.0_10.llnl


Attachments: File 2.14.0_10.llnl-x86_64-build.log.gz     File 2.14.0_10.llnl-x86_64-config.log.gz     HTML File garter5_ping-send_2022-02-08_10-53-41     HTML File garter6_ping-receive_2022-02-08_10-53-48    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on our Lustre servers, we've had an issue with LNet.

We are unable to successfully run lnetctl ping between nodes when InfiniBand is the underlying network. There is no indication of problems with IB itself:

  • "ping" (the unix utility) between the two nodes via IPoIB is successful, in either direction
  • ib_write_bw between the two nodes via the IB network is successful, in either direction

When LNet starts, it begins reporting the following on the console (-110 is -ETIMEDOUT):

LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110

Eventually, we see the following on the console:

INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
      Tainted: P           OE    --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:2  state:D stack:    0 pid: 5350 ppid:     2 flags:0x80004080
Workqueue: rdma_cm cma_work_handler [rdma_cm]
Call Trace:
 __schedule+0x2c0/0x770
 schedule+0x4c/0xc0
 schedule_preempt_disabled+0x11/0x20
 __mutex_lock.isra.6+0x343/0x550
 rdma_connect+0x1e/0x40 [rdma_cm]
 kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
 ? __switch_to_asm+0x41/0x70
 cma_cm_event_handler+0x25/0xf0 [rdma_cm]
 cma_work_handler+0x5a/0xb0 [rdma_cm]
 process_one_work+0x1ae/0x3a0
 worker_thread+0x3c/0x3c0
 ? create_worker+0x1a0/0x1a0
 kthread+0x12f/0x150
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x1f/0x40 
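
In hindsight (see the LU-14488 discussion in the comments below), this trace is consistent with a self-deadlock rather than a fabric problem: on kernels that provide rdma_connect_locked() (upstream v5.10, backported to the RHEL 8.5 4.18.0-348 series), rdma_connect() itself takes the CM ID's handler mutex, and the rdma_cm core invokes the ko2iblnd callback with that mutex already held. A minimal C sketch of the sequence, with the rdma_cm internals reduced to their essentials (illustrative, not the actual kernel source):

/* What cma_work_handler() -> cma_cm_event_handler() does, roughly:
 * the ULP callback runs with the CM ID's handler_mutex held. */
mutex_lock(&id_priv->handler_mutex);
ret = id->event_handler(id, &event);   /* -> kiblnd_cm_callback() */
mutex_unlock(&id_priv->handler_mutex);

/* Inside kiblnd_cm_callback(), Lustre 2.14.0 then calls: */
rdma_connect(cmid, &conn_param);
/* ...which on this kernel starts by taking the same mutex: */
mutex_lock(&id_priv->handler_mutex);   /* already held by this thread:
                                        * blocks forever, matching the
                                        * __mutex_lock frame above */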


 Comments   
Comment by Gian-Carlo Defazio [ 08/Feb/22 ]

For my notes, the local ticket is at https://lc.llnl.gov/jira/browse/TOSS-5521

Comment by Gian-Carlo Defazio [ 08/Feb/22 ]

So far we've seen this issue only with the RHEL 8.5 kernel and Lustre 2.14.

The previous version of TOSS, TOSS 4.2-4, is based on RHEL 8.4 and doesn't have this issue. We also haven't seen it on any TOSS 3 systems, which are based on RHEL 7.x and run Lustre 2.12 or 2.10.

Comment by Gian-Carlo Defazio [ 08/Feb/22 ]

This was first noticed on our new storage hardware, which includes the garter cluster.

[root@garter1:~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.1014
        Hardware version: 0
        Node GUID: 0x0c42a103008ee90a
        System image GUID: 0x0c42a103008ee90a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 391
                LMC: 0
                SM lid: 363
                Capability mask: 0x2659e848
                Port GUID: 0x0c42a103008ee90a
                Link layer: InfiniBand

The subnet manager listed, orelic1, is correct:

[root@garter1:~]# ibnetdiscover | grep "lid 363"
[2]     "H-506b4b0300da6764"[1](506b4b0300da6764)               # "orelic1 mlx5_0" lid 363 4xEDR
[1](506b4b0300da6764)   "S-248a0703006d13c0"[2]         # lid 363 lmc 0 "SwitchIB Mellanox Technologies" lid 352 4xEDR

The issue is also present on the boa cluster, which has the same hardware as garter:

[root@boai:defazio1]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.1014
        Hardware version: 0
        Node GUID: 0x0c42a10300dace36
        System image GUID: 0x0c42a10300dace36
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 228
                LMC: 0
                SM lid: 5
                Capability mask: 0x2659e848
                Port GUID: 0x0c42a10300dace36
                Link layer: InfiniBand

It also has the correct subnet manager, zrelic1:

[root@boai:defazio1]# ibnetdiscover | grep "lid 5 "
[5]     "H-7cfe9003000f382e"[1](7cfe9003000f382e)               # "zrelic1 mlx5_0" lid 5 4xEDR
[1](7cfe9003000f382e)   "S-7cfe900300b67590"[5]         # lid 5 lmc 0 "SwitchIB Mellanox Technologies" lid 23 4xEDR

An older cluster, slag, has the issue as well:

[root@slag3:~]# ibstat
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.28.2006
        Hardware version: 0
        Node GUID: 0x506b4b0300c23712
        System image GUID: 0x506b4b0300c23712
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 359
                LMC: 0
                SM lid: 363
                Capability mask: 0x2659e848
                Port GUID: 0x506b4b0300c23712
                Link layer: InfiniBand
Comment by Olaf Faaland [ 08/Feb/22 ]

We do not see this issue with:
RHEL 8.5
kernel 4.18.0-348.7.1.1toss.t4.x86_64
lustre-2.12.8_1.llnl-1.t4.x86_64

Comment by Peter Jones [ 08/Feb/22 ]

Serguei

Could you please assist with this one?

Peter

Comment by Serguei Smirnov [ 08/Feb/22 ]

Hi,

Could you please provide net debug for the failing ping test?

lctl set_param debug=+net
<--- run test --->
lctl dk > log.txt
lctl set_param debug=-net

Also, could you please provide the configuration script log?

Thanks,

Serguei.

Comment by Gian-Carlo Defazio [ 08/Feb/22 ]

Hi Serguei,

I've uploaded some files.

2.14.0_10.llnl-x86_64-build.log.gz is the full build log, and 2.14.0_10.llnl-x86_64-config.log.gz is from the same build but trimmed to just the configure portion.

I ran an lnetctl ping from garter5 (172.19.1.137@o2ib100) to garter6 (172.19.1.138@o2ib100) and included the debug logs for both nodes in garter5_ping-send_2022-02-08_10-53-41 and garter6_ping-receive_2022-02-08_10-53-48.

Comment by Serguei Smirnov [ 08/Feb/22 ]

This looks very similar to LU-14488, the fix for which appears in 2.12.7.

Which MOFED version are you using?

Thanks,

Serguei

Comment by Gian-Carlo Defazio [ 08/Feb/22 ]

We are using OFED, not MOFED.

Comment by Gian-Carlo Defazio [ 09/Feb/22 ]

LU-14488 looks promising.

Looking at the source for the two kernels, I do not see rdma_connect_locked() in the 4.18.0-305.19.1.el8_4 kernel used to build TOSS 4.2-4, but I do see it in the 4.18.0-348.2.1.el8_5 kernel used to build TOSS 4.3-1.
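
For reference, the LU-14488 change follows the usual Lustre kernel-compatibility pattern: a configure-time probe detects whether the kernel exports rdma_connect_locked(), and ko2iblnd calls it from the CM event handler when it exists. A minimal sketch of that pattern; the macro spelling HAVE_RDMA_CONNECT_LOCKED and the helper name kiblnd_connect_compat are illustrative shorthand here, so see the actual patch for the real configure check and call site:

#include <rdma/rdma_cm.h>

/* Compat wrapper for connecting from inside a CM event handler.
 * On kernels with rdma_connect_locked() (v5.10+, RHEL 8.5's
 * 4.18.0-348), plain rdma_connect() takes the handler mutex itself
 * and would self-deadlock when called from the handler. */
static int kiblnd_connect_compat(struct rdma_cm_id *cmid,
                                 struct rdma_conn_param *cp)
{
#ifdef HAVE_RDMA_CONNECT_LOCKED
        /* safe in handler context: assumes handler_mutex is held */
        return rdma_connect_locked(cmid, cp);
#else
        /* older kernels: rdma_connect() does not take handler_mutex */
        return rdma_connect(cmid, cp);
#endif
}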

Comment by Gian-Carlo Defazio [ 10/Feb/22 ]

Applying the LU-14488 patch to our local 2.14 branch solved the issue. It looks like it was indeed LU-14488.

Thanks!

Comment by Gian-Carlo Defazio [ 10/Feb/22 ]

The issue was fixed by an existing patch that had already landed for 2.15.

Our local 2.12 branch already had the b2_12 backport of the patch and never experienced this issue.
