LU-13892

LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.12.6
    • Affects Version: Lustre 2.12.5
    • Environment: router node
      lustre-2.12.5_4.llnl-1.1mofed.ch6.x86_64
      3.10.0-1127.18.2.1chaos.ch6.x86_64

    Description

      lctl and lnetctl commands hang.

      console log messages on router:

      [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 

      Watchdog complains about hung kernel tasks:

      [Thu Aug  6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
      [Thu Aug  6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Aug  6 13:47:28 2020] kworker/43:1    D ffff98cc7df2e2a0     0   584      2 0x00000080
      [Thu Aug  6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
      [Thu Aug  6 13:47:28 2020] Call Trace:
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cca01>] kthread+0xd1/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      

      utility commands hang on router with these stacks:

      [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done       
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      21845      1  0 13:49 ?        00:00:00 lctl list_nids
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      25481      1  0 14:07 ?        00:00:00 lctl ping 192.168.128.4@o2ib36
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      18055  25762  0 14:28 pts/3    00:00:00 lnetctl lnet configure
      [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
      [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      

The router node is a dual-socket AMD EPYC 7401 (24-core) system. The internal fabric is managed by a Mellanox subnet manager (SM), and Lustre was built against MOFED.

      Attachments

        1. dk.corona86.1596747314.txt
          235 kB
          Olaf Faaland
        2. dk.corona86.1596747437.txt
          1.32 MB
          Olaf Faaland
        3. dk.corona86.1596748295.txt
          6.05 MB
          Olaf Faaland
        4. dmesg.corona86.1596747297.txt
          271 kB
          Olaf Faaland
        5. dmesg.corona86.1596748292.txt
          272 kB
          Olaf Faaland
        6. dmesg.corona86.1596750724.txt
          276 kB
          Olaf Faaland
        7. pass10.sens100.retry1.check1.timeout100.tgz
          18.63 MB
          Olaf Faaland
        8. pass11.sens0.retry0.check1.timeout50.tgz
          26.60 MB
          Olaf Faaland
        9. pass12.dmesg.corona85.1600988107.out
          838 kB
          Olaf Faaland
        10. pass5.check_1.tgz
          15.15 MB
          Olaf Faaland
        11. pass6.check_0.tgz
          19.43 MB
          Olaf Faaland
        12. pass9.check_1.health_0.tgz
          20.33 MB
          Olaf Faaland

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.12.6


gerrit Gerrit Updater added a comment -

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40172/
Subject: LU-13892 lnet: lock-up during router check
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 877d95b582db3d182d13dac4947c1f43b0e851dc

ofaaland Olaf Faaland added a comment -

Hi Serguei,
OK, thanks.

ssmirnov Serguei Smirnov added a comment -

Hi Olaf,

It didn't look to me like master needed this fix. There were lots of changes on master with the introduction of MRR; I guess one of those changes moved the "wait_router_start" out from under the api_mutex.

Thanks,

Serguei.
ofaaland Olaf Faaland added a comment -

Serguei,

Why is the "lock-up during router check" patch against b2_12 and not master?

Thanks

gerrit Gerrit Updater added a comment -

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40172
Subject: LU-13892 lnet: lock-up during router check
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: db1905826983024a1bb4808c5ae93d6d32446b12
ofaaland Olaf Faaland added a comment -

Hi Serguei,

The patch looks to be working properly.

Thanks!

ssmirnov Serguei Smirnov added a comment -

Hi Olaf,

You're correct about the idea behind this patch. I was waiting for it to get reviewed first, but if you've already looked at the change and are comfortable with it, maybe you can give it a try? If the issue can still be reproduced, please capture the lnet debug log.

Thanks,

Serguei.

ofaaland Olaf Faaland added a comment -

Hi Serguei,

Looking at a crash dump, I found that the "lnetctl configure --all" process launched from the lnet.service file was holding the_lnet.ln_api_mutex, and it was in lnet_wait_known_routerstate(). Based on your patch and what you've said earlier, I think you figured this out a long time ago; I just mention it in case you had any doubt.

Thanks

          Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40059
          Subject: LU-13892 lnet: lock-up during router check
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 74097295693793ca3566dccfe96abf3a7e56a659

          gerrit Gerrit Updater added a comment - Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40059 Subject: LU-13892 lnet: lock-up during router check Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 74097295693793ca3566dccfe96abf3a7e56a659

People

    ssmirnov Serguei Smirnov
    ofaaland Olaf Faaland
    Votes: 0
    Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: