Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13892

LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.12.6
    • Lustre 2.12.5
    • router node
      lustre-2.12.5_4.llnl-1.1mofed.ch6.x86_64
      3.10.0-1127.18.2.1chaos.ch6.x86_64
    • 3
    • 9223372036854775807

    Description

      lctl and lnetctl commands hang.

      console log messages on router:

      [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 

      Watchdog complains about hung kernel tasks:

      [Thu Aug  6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
      [Thu Aug  6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Aug  6 13:47:28 2020] kworker/43:1    D ffff98cc7df2e2a0     0   584      2 0x00000080
      [Thu Aug  6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
      [Thu Aug  6 13:47:28 2020] Call Trace:
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cca01>] kthread+0xd1/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      

      utility commands hang on router with these stacks:

      [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done       
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      21845      1  0 13:49 ?        00:00:00 lctl list_nids
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      25481      1  0 14:07 ?        00:00:00 lctl ping 192.168.128.4@o2ib36
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      18055  25762  0 14:28 pts/3    00:00:00 lnetctl lnet configure
      [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
      [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      The router node is a dual-socket AMD EPYC 7401 24-Core Processor. The internal fabric is managed by a Mellanox SM and Lustre was built against MOFED.

      Attachments

        1. dk.corona86.1596747314.txt
          235 kB
        2. dk.corona86.1596747437.txt
          1.32 MB
        3. dk.corona86.1596748295.txt
          6.05 MB
        4. dmesg.corona86.1596747297.txt
          271 kB
        5. dmesg.corona86.1596748292.txt
          272 kB
        6. dmesg.corona86.1596750724.txt
          276 kB
        7. pass10.sens100.retry1.check1.timeout100.tgz
          18.63 MB
        8. pass11.sens0.retry0.check1.timeout50.tgz
          26.60 MB
        9. pass12.dmesg.corona85.1600988107.out
          838 kB
        10. pass5.check_1.tgz
          15.15 MB
        11. pass6.check_0.tgz
          19.43 MB
        12. pass9.check_1.health_0.tgz
          20.33 MB

        Activity

          [LU-13892] LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
          ofaaland Olaf Faaland made changes -
          Description Original: lctl and lnetctl commands hang.

          console log messages on router:
          {noformat}
          [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426: )) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 {noformat}
          Watchdog complains about hung kernel tasks:
          {noformat}
          [Thu Aug 6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
          [Thu Aug 6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          [Thu Aug 6 13:47:28 2020] kworker/43:1 D ffff98cc7df2e2a0 0 584 2 0x00000080
          [Thu Aug 6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
          [Thu Aug 6 13:47:28 2020] Call Trace:
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cca01>] kthread+0xd1/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          [Thu Aug 6 13:47:29 2020] [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          {noformat}
          utility commands hang on router with these stacks:
          {noformat}
          [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done
          UID PID PPID C STIME TTY TIME CMD
          root 21845 1 0 13:49 ? 00:00:00 lctl list_nids
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 25481 1 0 14:07 ? 00:00:00 lctl ping 192.168.128.4@o2ib36
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 18055 25762 0 14:28 pts/3 00:00:00 lnetctl lnet configure
          [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
          [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff
          {noformat}
          The router node is a dual-socket AMD EPYC 7401 24-Core Processor. The internal fabric is managed by a Mellanox SM and Lustre was built against MOFED.
          New: lctl and lnetctl commands hang.

          console log messages on router:
          {noformat}
          [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 {noformat}
          Watchdog complains about hung kernel tasks:
          {noformat}
          [Thu Aug 6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
          [Thu Aug 6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          [Thu Aug 6 13:47:28 2020] kworker/43:1 D ffff98cc7df2e2a0 0 584 2 0x00000080
          [Thu Aug 6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
          [Thu Aug 6 13:47:28 2020] Call Trace:
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cca01>] kthread+0xd1/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          [Thu Aug 6 13:47:29 2020] [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          {noformat}
          utility commands hang on router with these stacks:
          {noformat}
          [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done
          UID PID PPID C STIME TTY TIME CMD
          root 21845 1 0 13:49 ? 00:00:00 lctl list_nids
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 25481 1 0 14:07 ? 00:00:00 lctl ping 192.168.128.4@o2ib36
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 18055 25762 0 14:28 pts/3 00:00:00 lnetctl lnet configure
          [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
          [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff
          {noformat}
          The router node is a dual-socket AMD EPYC 7401 24-Core Processor. The internal fabric is managed by a Mellanox SM and Lustre was built against MOFED.
          ruth.klundt@gmail.com Ruth Klundt (Inactive) made changes -
          Description Original: lctl and lnetctl commands hang.

          console log messages on router:
          {noformat}
          [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 {noformat}

          Watchdog complains about hung kernel tasks:
          {noformat}
          [Thu Aug 6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
          [Thu Aug 6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          [Thu Aug 6 13:47:28 2020] kworker/43:1 D ffff98cc7df2e2a0 0 584 2 0x00000080
          [Thu Aug 6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
          [Thu Aug 6 13:47:28 2020] Call Trace:
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cca01>] kthread+0xd1/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          [Thu Aug 6 13:47:29 2020] [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          {noformat}
          utility commands hang on router with these stacks:
          {noformat}
          [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done
          UID PID PPID C STIME TTY TIME CMD
          root 21845 1 0 13:49 ? 00:00:00 lctl list_nids
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 25481 1 0 14:07 ? 00:00:00 lctl ping 192.168.128.4@o2ib36
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 18055 25762 0 14:28 pts/3 00:00:00 lnetctl lnet configure
          [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
          [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff
          {noformat}

          The router node is a dual-socket AMD EPYC 7401 24-Core Processor. The internal fabric is managed by a Mellanox SM and Lustre was built against MOFED.
          New: lctl and lnetctl commands hang.

          console log messages on router:
          {noformat}
          [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426: )) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
          [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 {noformat}
          Watchdog complains about hung kernel tasks:
          {noformat}
          [Thu Aug 6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
          [Thu Aug 6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          [Thu Aug 6 13:47:28 2020] kworker/43:1 D ffff98cc7df2e2a0 0 584 2 0x00000080
          [Thu Aug 6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
          [Thu Aug 6 13:47:28 2020] Call Trace:
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
          [Thu Aug 6 13:47:28 2020] [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
          [Thu Aug 6 13:47:28 2020] [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
          [Thu Aug 6 13:47:28 2020] [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cca01>] kthread+0xd1/0xe0
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          [Thu Aug 6 13:47:29 2020] [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
          [Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
          {noformat}
          utility commands hang on router with these stacks:
          {noformat}
          [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done
          UID PID PPID C STIME TTY TIME CMD
          root 21845 1 0 13:49 ? 00:00:00 lctl list_nids
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 25481 1 0 14:07 ? 00:00:00 lctl ping 192.168.128.4@o2ib36
          [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
          [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff

          UID PID PPID C STIME TTY TIME CMD
          root 18055 25762 0 14:28 pts/3 00:00:00 lnetctl lnet configure
          [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
          [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
          [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
          [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
          [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
          [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
          [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
          [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
          [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
          [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
          [<ffffffffffffffff>] 0xffffffffffffffff
          {noformat}
          The router node is a dual-socket AMD EPYC 7401 24-Core Processor. The internal fabric is managed by a Mellanox SM and Lustre was built against MOFED.
          ofaaland Olaf Faaland made changes -
          Labels Original: llnl topllnl New: llnl
          pjones Peter Jones made changes -
          Link New: This issue is related to JFC-20 [ JFC-20 ]
          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-21 [ JFC-21 ]
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.12.6 [ 14707 ]
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          Landed for 2.12.6

          pjones Peter Jones added a comment - Landed for 2.12.6

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40172/
          Subject: LU-13892 lnet: lock-up during router check
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: 877d95b582db3d182d13dac4947c1f43b0e851dc

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40172/ Subject: LU-13892 lnet: lock-up during router check Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 877d95b582db3d182d13dac4947c1f43b0e851dc

          Hi Serguei,
          OK, thanks.

          ofaaland Olaf Faaland added a comment - Hi Serguei, OK, thanks.

          Hi Olaf,

          It didn't look to me that master needed this fix. There were lots of changes on master with the introduction of MRR, I guess one of these changes moved out the "wait_router_start" from under the api_mutex.

          Thanks,

          Serguei.

          ssmirnov Serguei Smirnov added a comment - Hi Olaf, It didn't look to me that master needed this fix. There were lots of changes on master with the introduction of MRR, I guess one of these changes moved out the "wait_router_start" from under the api_mutex. Thanks, Serguei.

          People

            ssmirnov Serguei Smirnov
            ofaaland Olaf Faaland
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: