  1. Lustre
  2. LU-15534

failed to ping 172.19.1.27@o2ib100: Input/output error


Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.14.0
    • TOSS 4.3 (based on RHEL 8.5)
      4.18.0-348.7.1.1toss.t4.x86_64
      lustre 2.14.0_10.llnl
    • 3

    Description

      Since installing TOSS 4.3-X (a RHEL 8.5 derivative) on our Lustre servers, we've had an issue with LNet.

      lnetctl ping fails between nodes when InfiniBand is the underlying network. There is no indication of problems with IB itself:

      • "ping" (the unix utility) between the two nodes via IPoIB is successful, in either direction
      • ib_write_bw between the two nodes via the IB network is successful, in either direction

      When LNet starts, it begins reporting the following on the console:

      LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.1.54@o2ib100) recovery failed with -110
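
      When triaging many such console lines, it can help to pull out the peer NID and the return code. The following is a minimal sketch, not part of the original report: the regular expression and the helper name parse_recovery_failure are hypothetical, and the small errno table assumes Linux numbering, where 110 is ETIMEDOUT and 5 is EIO (the "Input/output error" in the issue title).

```python
import re

# Linux errno values relevant to this report (assumption: Linux numbering).
LINUX_ERRNO = {5: "EIO", 110: "ETIMEDOUT"}

# Hypothetical pattern for the recovery-failure console line quoted above.
RECOVERY_RE = re.compile(
    r"LNetError: (?P<pid>\d+):\d+:"
    r"\((?P<src>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"peer NI \((?P<nid>[^)]+)\) recovery failed with (?P<rc>-?\d+)"
)


def parse_recovery_failure(line):
    """Return the fields of a 'recovery failed' line, or None if no match."""
    m = RECOVERY_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "pid": int(m.group("pid")),
        "source": m.group("src"),
        "function": m.group("func"),
        "nid": m.group("nid"),
        "rc": rc,
        # Kernel return codes are negated errno values.
        "errno_name": LINUX_ERRNO.get(-rc, "unknown"),
    }


sample = (
    "LNetError: 24429:0:(lib-move.c:3756:lnet_handle_recovery_reply()) "
    "peer NI (172.19.1.54@o2ib100) recovery failed with -110"
)
info = parse_recovery_failure(sample)
print(info["nid"], info["rc"], info["errno_name"])
```

      Read this way, the -110 in the message says that the recovery ping to the peer NI timed out rather than being actively refused.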
      

      Eventually, we see the following on the console:

      INFO: task kworker/u128:2:5350 blocked for more than 120 seconds.
            Tainted: P           OE    --------- -t - 4.18.0-348.7.1.1toss.t4.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/u128:2  state:D stack:    0 pid: 5350 ppid:     2 flags:0x80004080
      Workqueue: rdma_cm cma_work_handler [rdma_cm]
      Call Trace:
       __schedule+0x2c0/0x770
       schedule+0x4c/0xc0
       schedule_preempt_disabled+0x11/0x20
       __mutex_lock.isra.6+0x343/0x550
       rdma_connect+0x1e/0x40 [rdma_cm]
       kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
       ? __switch_to_asm+0x41/0x70
       cma_cm_event_handler+0x25/0xf0 [rdma_cm]
       cma_work_handler+0x5a/0xb0 [rdma_cm]
       process_one_work+0x1ae/0x3a0
       worker_thread+0x3c/0x3c0
       ? create_worker+0x1a0/0x1a0
       kthread+0x12f/0x150
       ? kthread_flush_work_fn+0x10/0x10
       ret_from_fork+0x1f/0x40 
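
      To make the trace above easier to read, the sketch below splits each frame into symbol and module and drops the frames the kernel marks with "?" as unreliable. It is an illustrative helper only: the frame pattern and the name parse_frames are assumptions, not anything shipped with Lustre or the kernel.

```python
import re

# Frame lines look like " ? symbol+0xOFF/0xSIZE [module]". A leading "?"
# marks an address the unwinder could not confirm as part of the call chain.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace):
    """Parse a hung-task call trace into a list of frame dictionaries."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append({
                "symbol": m.group("symbol"),
                "module": m.group("module") or "kernel",
                "reliable": m.group("unreliable") is None,
            })
    return frames


TRACE = """\
 __schedule+0x2c0/0x770
 schedule+0x4c/0xc0
 schedule_preempt_disabled+0x11/0x20
 __mutex_lock.isra.6+0x343/0x550
 rdma_connect+0x1e/0x40 [rdma_cm]
 kiblnd_cm_callback+0x14ee/0x2230 [ko2iblnd]
 ? __switch_to_asm+0x41/0x70
 cma_cm_event_handler+0x25/0xf0 [rdma_cm]
 cma_work_handler+0x5a/0xb0 [rdma_cm]
 process_one_work+0x1ae/0x3a0
 worker_thread+0x3c/0x3c0
 ? create_worker+0x1a0/0x1a0
 kthread+0x12f/0x150
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x1f/0x40
"""

frames = parse_frames(TRACE)
# Keep only reliable frames that belong to a loadable module.
module_frames = [
    (f["symbol"], f["module"])
    for f in frames
    if f["reliable"] and f["module"] != "kernel"
]
for symbol, module in module_frames:
    print(f"{module}: {symbol}")
```

      The module frames, most recent first, are rdma_connect (rdma_cm), kiblnd_cm_callback (ko2iblnd), cma_cm_event_handler (rdma_cm) and cma_work_handler (rdma_cm). In other words, the rdma_cm worker is waiting on a mutex inside rdma_connect, which ko2iblnd called from within an rdma_cm event callback.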

      Attachments

        1. 2.14.0_10.llnl-x86_64-build.log.gz
          60 kB
          Gian-Carlo Defazio
        2. 2.14.0_10.llnl-x86_64-config.log.gz
          7 kB
          Gian-Carlo Defazio
        3. garter5_ping-send_2022-02-08_10-53-41
          309 kB
          Gian-Carlo Defazio
        4. garter6_ping-receive_2022-02-08_10-53-48
          264 kB
          Gian-Carlo Defazio

          People

            ssmirnov Serguei Smirnov
            defazio Gian-Carlo Defazio
