Lustre / LU-13892

LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.5
    • Environment: router node
      lustre-2.12.5_4.llnl-1.1mofed.ch6.x86_64
      3.10.0-1127.18.2.1chaos.ch6.x86_64
    • Severity: 3

    Description

      lctl and lnetctl commands hang.

      Console log messages on the router:

      [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 
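
      The c/oc/rc counters in the kiblnd_check_conns() message are the connection's
      credits, outstanding credits, and reserved credits at the time the RDMA timed
      out. Under normal conditions the peer and credit state can be inspected from
      userspace as in the sketch below (the NID is taken from the log above); on this
      router the utility commands themselves hang once the problem triggers, as shown
      further down.

      # sketch: inspect peer state/credits for one of the timed-out peers
      lnetctl peer show --nid 172.19.1.217@o2ib100 -v
      # local NI statistics and credits on the router
      lnetctl net show -v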

      Watchdog complains about hung kernel tasks:

      [Thu Aug  6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
      [Thu Aug  6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Aug  6 13:47:28 2020] kworker/43:1    D ffff98cc7df2e2a0     0   584      2 0x00000080
      [Thu Aug  6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
      [Thu Aug  6 13:47:28 2020] Call Trace:
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cca01>] kthread+0xd1/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
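
      The ib_cm worker is blocked in mutex_lock() inside lnet_nid2peerni_locked(),
      which in 2.12 takes the LNet API mutex on the peer-creation slow path; the hung
      lctl/lnetctl commands shown below appear to be waiting on the same mutex in
      LNetNIInit()/lnet_configure(). A sketch for enumerating all uninterruptible
      (D-state) tasks and capturing their kernel stacks, the same way the per-PID
      stacks below were gathered:

      # list D-state tasks and dump each one's kernel stack
      for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
          ps -fp "$pid"
          cat /proc/"$pid"/stack
          echo
      done
      # or ask the kernel to log all blocked tasks to dmesg
      echo w > /proc/sysrq-trigger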
      

      Utility commands hang on the router with these stacks:

      [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done       
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      21845      1  0 13:49 ?        00:00:00 lctl list_nids
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      25481      1  0 14:07 ?        00:00:00 lctl ping 192.168.128.4@o2ib36
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      18055  25762  0 14:28 pts/3    00:00:00 lnetctl lnet configure
      [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
      [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      The router node is a dual-socket AMD EPYC 7401 (24 cores per socket) system. The internal fabric is managed by a Mellanox subnet manager (SM), and Lustre was built against MOFED.
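
      For reference, a hypothetical LNet router module configuration consistent with
      the two networks seen in this ticket (o2ib100, where the timed-out peers live,
      and o2ib36, the ping target), assuming the router is directly attached to both.
      The interface names ib0/ib1 are assumptions, not taken from this report:

      # /etc/modprobe.d/lnet.conf (sketch)
      options lnet networks="o2ib100(ib0),o2ib36(ib1)" forwarding="enabled"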

      Attachments

        1. dk.corona86.1596747314.txt
          235 kB
        2. dk.corona86.1596747437.txt
          1.32 MB
        3. dk.corona86.1596748295.txt
          6.05 MB
        4. dmesg.corona86.1596747297.txt
          271 kB
        5. dmesg.corona86.1596748292.txt
          272 kB
        6. dmesg.corona86.1596750724.txt
          276 kB
        7. pass10.sens100.retry1.check1.timeout100.tgz
          18.63 MB
        8. pass11.sens0.retry0.check1.timeout50.tgz
          26.60 MB
        9. pass12.dmesg.corona85.1600988107.out
          838 kB
        10. pass5.check_1.tgz
          15.15 MB
        11. pass6.check_0.tgz
          19.43 MB
        12. pass9.check_1.health_0.tgz
          20.33 MB


          People

            Assignee: Serguei Smirnov (ssmirnov)
            Reporter: Olaf Faaland (ofaaland)
            Votes: 0
            Watchers: 8
