Lustre / LU-15534

failed to ping 172.19.1.27@o2ib100: Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Environment: TOSS 4.3 (based on RHEL 8.5)
      4.18.0-348.7.1.1toss.t4.x86_64
      lustre 2.14.0_10.llnl
    • Severity: 3

    Description

      Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on Lustre servers, we've had an issue with LNet.

      We are unable to successfully lnetctl ping between nodes when using InfiniBand as the underlying network. There is no indication of problems with IB itself (example commands below):

      • "ping" (the unix utility) between the two nodes via IPoIB is successful, in either direction
      • ib_write_bw between the two nodes via the IB network is successful, in either direction
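
      For illustration, the checks looked roughly like this (NID taken from this ticket's summary; exact output omitted; the ib_write_bw listener must be started on the remote node first):

      lnetctl ping 172.19.1.27@o2ib100   # fails: Input/output error
      ping -c 3 172.19.1.27              # IPoIB ping succeeds, either direction
      ib_write_bw                        # perftest listener on the remote node
      ib_write_bw 172.19.1.27            # RDMA bandwidth test succeeds, either direction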

      When LNet starts, it begins reporting the following on the console (error -110 is -ETIMEDOUT):

      LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110
      

      Eventually, we see the following hung-task warning on the console; note that the trace shows rdma_connect() blocked on a mutex from inside an rdma_cm event callback:

      INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
            Tainted: P           OE    --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/u128:2  state:D stack:    0 pid: 5350 ppid:     2 flags:0x80004080
      Workqueue: rdma_cm cma_work_handler [rdma_cm]
      Call Trace:
       __schedule+0x2c0/0x770
       schedule+0x4c/0xc0
       schedule_preempt_disabled+0x11/0x20
       __mutex_lock.isra.6+0x343/0x550
       rdma_connect+0x1e/0x40 [rdma_cm]
       kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
       ? __switch_to_asm+0x41/0x70
       cma_cm_event_handler+0x25/0xf0 [rdma_cm]
       cma_work_handler+0x5a/0xb0 [rdma_cm]
       process_one_work+0x1ae/0x3a0
       worker_thread+0x3c/0x3c0
       ? create_worker+0x1a0/0x1a0
       kthread+0x12f/0x150
       ? kthread_flush_work_fn+0x10/0x10
       ret_from_fork+0x1f/0x40 

      Attachments

        Activity

          [LU-15534] failed to ping 172.19.1.27@o2ib100: Input/output error

          defazio Gian-Carlo Defazio added a comment -

          The issue was fixed by an existing patch that was landed for 2.15.

          Our local 2.12 branch already had the b2_12 backport of the patch and never experienced this issue.
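
          A quick way to check whether a branch already carries the backport (assuming, per Lustre convention, that the commit subject starts with the LU number):

          # hypothetical check from a checkout of the local b2_12-based branch
          git log --oneline | grep LU-14488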

          defazio Gian-Carlo Defazio added a comment -

          Applying LU-14488 to our local 2.14 branch solved the issue, so it was indeed LU-14488.

          Thanks!
          defazio Gian-Carlo Defazio added a comment - edited

          LU-14488 looks promising.

          Looking at the source for the two kernels, I do not see rdma_connect_locked() in the 4.18.0-305.19.1.el8_4 kernel used to build TOSS 4.2-4, but I do see it in the 4.18.0-348.2.1.el8_5 kernel used to build TOSS 4.3-1.
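
          For reference, a sketch of that check (rdma_connect_locked() is declared in include/rdma/rdma_cm.h when present; kernel-devel trees assumed unpacked under /usr/src/kernels):

          grep -n rdma_connect_locked /usr/src/kernels/4.18.0-305.19.1.el8_4.x86_64/include/rdma/rdma_cm.h   # no match
          grep -n rdma_connect_locked /usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64/include/rdma/rdma_cm.h    # declaration present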


          defazio Gian-Carlo Defazio added a comment -

          We are using OFED.

          ssmirnov Serguei Smirnov added a comment -

          This looks very similar to LU-14488, the fix for which appears in 2.12.7.

          Which MOFED version are you using?

          Thanks,

          Serguei
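
          On a MOFED install, the version string can be read with the ofed_info utility (ships with MOFED; not present with in-distro OFED):

          ofed_info -s   # prints the MLNX_OFED release string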

          defazio Gian-Carlo Defazio added a comment -

          Hi Serguei,

          I've uploaded some files.

          2.14.0_10.llnl-x86_64-build.log.gz is the full build log, and 2.14.0_10.llnl-x86_64-config.log.gz is from the same build but contains only the configure portion.

          I did an lnetctl ping from garter5 (172.19.1.137@o2ib100) to garter6 (172.19.1.138@o2ib100) and included the debug logs for both in garter5_ping-send_2022-02-08_10-53-41 and garter6_ping-receive_2022-02-08_10-53-48.
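
          For reference, the capture on the sending node presumably followed Serguei's sequence (NIDs from this comment; exact invocation assumed):

          lctl set_param debug=+net
          lnetctl ping 172.19.1.138@o2ib100   # fails with Input/output error
          lctl dk > garter5_ping-send.log     # dump (and clear) the kernel debug buffer
          lctl set_param debug=-net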


          ssmirnov Serguei Smirnov added a comment -

          Hi,

          Could you please provide net debug for the failing ping test?

          lctl set_param debug=+net
          <--- run test --->
          lctl dk > log.txt
          lctl set_param debug=-net

          Also, could you please provide the configuration script log?

          Thanks,

          Serguei.
          pjones Peter Jones added a comment -

          Serguei

          Could you please assist with this one?

          Peter

          ofaaland Olaf Faaland added a comment -

          We do not see this issue with
          RHEL 8.5
          kernel 4.18.0-348.7.1.1toss.t4.x86_64
          lustre-2.12.8_1.llnl-1.t4.x86_64


          defazio Gian-Carlo Defazio added a comment -

          This was first noticed on our new storage hardware, which includes the garter cluster.

          [root@garter1:~]# ibstat
          CA 'mlx5_0'
                  CA type: MT4119
                  Number of ports: 1
                  Firmware version: 16.31.1014
                  Hardware version: 0
                  Node GUID: 0x0c42a103008ee90a
                  System image GUID: 0x0c42a103008ee90a
                  Port 1:
                          State: Active
                          Physical state: LinkUp
                          Rate: 100
                          Base lid: 391
                          LMC: 0
                          SM lid: 363
                          Capability mask: 0x2659e848
                          Port GUID: 0x0c42a103008ee90a
                          Link layer: InfiniBand
          

          The subnet manager listed, orelic1, is correct:

          [root@garter1:~]# ibnetdiscover | grep "lid 363"
          [2]     "H-506b4b0300da6764"[1](506b4b0300da6764)               # "orelic1 mlx5_0" lid 363 4xEDR
          [1](506b4b0300da6764)   "S-248a0703006d13c0"[2]         # lid 363 lmc 0 "SwitchIB Mellanox Technologies" lid 352 4xEDR
          

          The issue is also present on the boa cluster, which has the same hardware as garter:

          [root@boai:defazio1]# ibstat
          CA 'mlx5_0'
                  CA type: MT4119
                  Number of ports: 1
                  Firmware version: 16.31.1014
                  Hardware version: 0
                  Node GUID: 0x0c42a10300dace36
                  System image GUID: 0x0c42a10300dace36
                  Port 1:
                          State: Active
                          Physical state: LinkUp
                          Rate: 100
                          Base lid: 228
                          LMC: 0
                          SM lid: 5
                          Capability mask: 0x2659e848
                          Port GUID: 0x0c42a10300dace36
                          Link layer: InfiniBand
          

          It also has the correct subnet manager, zrelic1:

          [root@boai:defazio1]# ibnetdiscover | grep "lid 5 "
          [5]     "H-7cfe9003000f382e"[1](7cfe9003000f382e)               # "zrelic1 mlx5_0" lid 5 4xEDR
          [1](7cfe9003000f382e)   "S-7cfe900300b67590"[5]         # lid 5 lmc 0 "SwitchIB Mellanox Technologies" lid 23 4xEDR
          

          An older cluster, slag, has the same issue:

          [root@slag3:~]# ibstat
          CA 'mlx5_0'
                  CA type: MT4115
                  Number of ports: 1
                  Firmware version: 12.28.2006
                  Hardware version: 0
                  Node GUID: 0x506b4b0300c23712
                  System image GUID: 0x506b4b0300c23712
                  Port 1:
                          State: Active
                          Physical state: LinkUp
                          Rate: 100
                          Base lid: 359
                          LMC: 0
                          SM lid: 363
                          Capability mask: 0x2659e848
                          Port GUID: 0x506b4b0300c23712
                          Link layer: InfiniBand
          

          People

            Assignee: ssmirnov Serguei Smirnov
            Reporter: defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 4
