[LU-15534] failed to ping 172.19.1.27@o2ib100: Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Environment: TOSS 4.3 (based on RHEL 8.5)
      4.18.0-348.7.1.1toss.t4.x86_64
      lustre 2.14.0_10.llnl
    • Severity: 3

    Description

      Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on Lustre servers, we've had an issue with LNet.

      We are unable to successfully lnetctl ping between nodes when using InfiniBand as the underlying network. There is no indication of problems with IB:

      • "ping" (the Unix utility) between the two nodes via IPoIB is successful, in either direction
      • ib_write_bw between the two nodes via the IB network is successful, in either direction

      When LNet starts, it begins reporting the following on the console:

      LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110
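
      For triaging console messages like the one above, a small Python sketch can pull out the peer NID and translate the return code. The regular expression and the helper name parse_recovery_failure are illustrative assumptions, not part of Lustre. The errno table lists only the two Linux values relevant to this report: 5 is EIO (the "Input/output error" in the title) and 110 is ETIMEDOUT.

```python
import re

# Linux errno values relevant to this report; kernel logs print them negated.
LINUX_ERRNO = {5: "EIO", 110: "ETIMEDOUT"}

# Hypothetical pattern matching the LNet recovery-failure console message.
RECOVERY_RE = re.compile(
    r"peer NI \((?P<nid>[^)]+)\) recovery failed with (?P<rc>-?\d+)"
)


def parse_recovery_failure(line):
    """Return (nid, rc, errno_name) for a recovery-failure line, else None."""
    match = RECOVERY_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return match.group("nid"), rc, LINUX_ERRNO.get(abs(rc), "unknown")


line = (
    "LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) "
    "peer NI (172.19.1.54@o2ib100) recovery failed with -110"
)
print(parse_recovery_failure(line))
# prints ('172.19.1.54@o2ib100', -110, 'ETIMEDOUT')
```

      A return code of -110 means the recovery ping to the peer timed out rather than being actively refused.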
      

      Eventually, we see the following on the console:

      INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
            Tainted: P           OE    --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/u128:2  state:D stack:    0 pid: 5350 ppid:     2 flags:0x80004080
      Workqueue: rdma_cm cma_work_handler [rdma_cm]
      Call Trace:
       __schedule+0x2c0/0x770
       schedule+0x4c/0xc0
       schedule_preempt_disabled+0x11/0x20
       __mutex_lock.isra.6+0x343/0x550
       rdma_connect+0x1e/0x40 [rdma_cm]
       kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
       ? __switch_to_asm+0x41/0x70
       cma_cm_event_handler+0x25/0xf0 [rdma_cm]
       cma_work_handler+0x5a/0xb0 [rdma_cm]
       process_one_work+0x1ae/0x3a0
       worker_thread+0x3c/0x3c0
       ? create_worker+0x1a0/0x1a0
       kthread+0x12f/0x150
       ? kthread_flush_work_fn+0x10/0x10
       ret_from_fork+0x1f/0x40 
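
      The trace shows rdma_connect() waiting on a mutex while running inside a CM event callback (kiblnd_cm_callback() called from cma_cm_event_handler()). One plausible reading, consistent with the rdma_connect_locked() discussion in the comments, is a self-deadlock: the callback already runs with a handler mutex held, and the connect call tries to take the same mutex again. The Python sketch below models only that locking pattern in user space. All names are hypothetical stand-ins, and a timeout replaces the indefinite block seen in the kernel.

```python
import threading

# Hypothetical stand-in for a per-connection handler mutex.
# threading.Lock is non-reentrant, like a kernel mutex.
handler_mutex = threading.Lock()


def connect_locking(timeout=0.2):
    """Models a connect call that takes the handler mutex itself."""
    if not handler_mutex.acquire(timeout=timeout):
        # In the kernel there is no timeout: the task stays in state D.
        return "blocked"
    try:
        return "connected"
    finally:
        handler_mutex.release()


def connect_locked():
    """Models a variant that expects the caller to already hold the mutex."""
    assert handler_mutex.locked()
    return "connected"


def event_handler(connect):
    """Models an event callback that runs with the handler mutex held."""
    with handler_mutex:
        return connect()


print(event_handler(connect_locking))  # prints "blocked"
print(event_handler(connect_locked))   # prints "connected"
```

      Outside a callback, connect_locking() succeeds because nothing else holds the lock, which is why the same call can be safe in one context and hang in another.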

        Activity

          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-21 [ JFC-21 ]
          defazio Gian-Carlo Defazio made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Closed [ 6 ]

          defazio Gian-Carlo Defazio added a comment:

          The issue was fixed by an existing patch that was landed for 2.15.

          Our local 2.12 branch already had the b2_12 backport of the patch and never experienced this issue.

          defazio Gian-Carlo Defazio made changes -
          Labels Original: llnl topllnl New: llnl

          defazio Gian-Carlo Defazio added a comment:

          Applying LU-14488 to our local 2.14 branch solved the issue. It looks like it was LU-14488.

          Thanks!

          defazio Gian-Carlo Defazio added a comment (edited):

          LU-14488 looks promising.

          Looking at the source for the 2 kernels, I do not see rdma_connect_locked() in the 4.18.0-305.19.1.el8_4 kernel used to build TOSS 4.2-4, but I do see it in the 4.18.0-348.2.1.el8_5 kernel used to build TOSS 4.3-1.
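
          A check like the one described above can be approximated with a short script that scans the text of a kernel header for a declaration of rdma_connect_locked(). This is only a sketch: the header snippets are illustrative stand-ins, the helper name declares_symbol is hypothetical, and a plain regular expression can be fooled by comments or macros. On a real tree one would read the header file (assumed here to be include/rdma/rdma_cm.h) and pass its contents in.

```python
import re


def declares_symbol(header_text, symbol="rdma_connect_locked"):
    """Return True if header_text appears to declare or call `symbol`."""
    pattern = r"\b" + re.escape(symbol) + r"\s*\("
    return re.search(pattern, header_text) is not None


# Illustrative header fragments, not copied from any specific kernel release.
old_header = (
    "int rdma_connect(struct rdma_cm_id *id,\n"
    "\t\t struct rdma_conn_param *conn_param);\n"
)
new_header = old_header + (
    "int rdma_connect_locked(struct rdma_cm_id *id,\n"
    "\t\t\tstruct rdma_conn_param *conn_param);\n"
)

print(declares_symbol(old_header))  # prints False
print(declares_symbol(new_header))  # prints True
```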


          defazio Gian-Carlo Defazio added a comment:

          We are using OFED.


          ssmirnov Serguei Smirnov added a comment:

          This looks very similar to LU-14488, the fix for which appears in 2.12.7.

          Which MOFED version are you using?

          Thanks,

          Serguei

          defazio Gian-Carlo Defazio added a comment:

          Hi Serguei,

          I've uploaded some files.

          2.14.0_10.llnl-x86_64-build.log.gz is the full build log, and 2.14.0_10.llnl-x86_64-config.log.gz is from the same build with everything but the configure portion removed.

          I ran an lnetctl ping from garter5 (172.19.1.137@o2ib100) to garter6 (172.19.1.138@o2ib100) and included the debug logs for both in garter5_ping-send_2022-02-08_10-53-41 and garter6_ping-receive_2022-02-08_10-53-48.

          defazio Gian-Carlo Defazio made changes -
          Attachment New: 2.14.0_10.llnl-x86_64-config.log.gz [ 42264 ]

          People

            Assignee: ssmirnov Serguei Smirnov
            Reporter: defazio Gian-Carlo Defazio