Lustre / LU-13892

LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.12.6
    • Affects Version: Lustre 2.12.5
    • Environment: router node
      lustre-2.12.5_4.llnl-1.1mofed.ch6.x86_64
      3.10.0-1127.18.2.1chaos.ch6.x86_64
    • Severity: 3

    Description

      lctl and lnetctl commands hang.

      Console log messages on the router:

      [Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
      [Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      [Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 

      Watchdog complains about hung kernel tasks:

      [Thu Aug  6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
      [Thu Aug  6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Aug  6 13:47:28 2020] kworker/43:1    D ffff98cc7df2e2a0     0   584      2 0x00000080
      [Thu Aug  6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
      [Thu Aug  6 13:47:28 2020] Call Trace:
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
      [Thu Aug  6 13:47:28 2020]  [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cca01>] kthread+0xd1/0xe0
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
      [Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
      

      Utility commands hang on the router with these stacks:

      [root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done       
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      21845      1  0 13:49 ?        00:00:00 lctl list_nids
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      25481      1  0 14:07 ?        00:00:00 lctl ping 192.168.128.4@o2ib36
      [<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
      [<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      UID         PID   PPID  C STIME TTY          TIME CMD
      root      18055  25762  0 14:28 pts/3    00:00:00 lnetctl lnet configure
      [<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
      [<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
      [<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
      [<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
      [<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
      [<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
      [<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
      [<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      The router node has dual-socket AMD EPYC 7401 24-core processors. The internal fabric is managed by a Mellanox subnet manager (SM), and Lustre was built against MOFED.

      Attachments

        1. dk.corona86.1596747314.txt
          235 kB
        2. dk.corona86.1596747437.txt
          1.32 MB
        3. dk.corona86.1596748295.txt
          6.05 MB
        4. dmesg.corona86.1596747297.txt
          271 kB
        5. dmesg.corona86.1596748292.txt
          272 kB
        6. dmesg.corona86.1596750724.txt
          276 kB
        7. pass10.sens100.retry1.check1.timeout100.tgz
          18.63 MB
        8. pass11.sens0.retry0.check1.timeout50.tgz
          26.60 MB
        9. pass12.dmesg.corona85.1600988107.out
          838 kB
        10. pass5.check_1.tgz
          15.15 MB
        11. pass6.check_0.tgz
          19.43 MB
        12. pass9.check_1.health_0.tgz
          20.33 MB

        Activity


          ofaaland Olaf Faaland added a comment -

          Hi Serguei,
          OK, thanks.

          ssmirnov Serguei Smirnov added a comment -

          Hi Olaf,

          It didn't look to me like master needed this fix. There were a lot of changes on master with the introduction of MRR; I guess one of them moved the "wait_router_start" step out from under the api_mutex.

          Thanks,

          Serguei.
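          As a rough illustration of the ordering Serguei describes, the sketch below is a minimal, self-contained pthreads analogy: configuration work is done under the mutex, but the wait for "router state" happens only after the mutex is dropped, so the thread that reports that state can take the lock and make progress. This is not the b2_12 patch or Lustre code, and every name in it (api_mutex, router_state_known, handler) is invented for the example.

          /* Analogy only -- not Lustre code. Build with: gcc -pthread fix_sketch.c */
          #include <pthread.h>
          #include <stdatomic.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;
          static atomic_bool router_state_known;

          /* Stand-in for the path that learns about routers; it needs the mutex. */
          static void *handler(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&api_mutex);
              atomic_store(&router_state_known, true);
              pthread_mutex_unlock(&api_mutex);
              return NULL;
          }

          int main(void)
          {
              pthread_t tid;

              pthread_mutex_lock(&api_mutex);
              /* ... configuration work that genuinely needs the mutex ... */
              pthread_create(&tid, NULL, handler, NULL);
              pthread_mutex_unlock(&api_mutex);   /* drop it BEFORE waiting */

              /* The wait no longer blocks the handler, so it eventually succeeds. */
              while (!atomic_load(&router_state_known))
                  sleep(1);

              pthread_join(tid, NULL);
              printf("router state known; configuration can complete\n");
              return 0;
          }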
          ofaaland Olaf Faaland added a comment -

          Serguei,

          Why is the "lock-up during router check" patch against b2_12 and not master?

          thanks


          gerrit Gerrit Updater added a comment -

          Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40172
          Subject: LU-13892 lnet: lock-up during router check
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: db1905826983024a1bb4808c5ae93d6d32446b12
          ofaaland Olaf Faaland added a comment -

          Hi Serguei,

          The patch looks to be working properly.

          Thanks!

          ssmirnov Serguei Smirnov added a comment -

          Hi Olaf,

          You're correct about the idea behind this patch. I was waiting for it to get reviewed first, but if you've already looked at the change and are comfortable with it, maybe you can give it a try? If the issue can still be reproduced, please capture the lnet debug log.

          Thanks,

          Serguei.

          ofaaland Olaf Faaland added a comment -

          Hi Serguei,

          Looking at a crash dump, I found that the "lnetctl configure --all" process started from the lnet.service file was the one holding the_lnet.ln_api_mutex, and it was in lnet_wait_known_routerstate().  Based on your patch and what you've said earlier, I think you figured this out a long time ago; I just mention it in case you had any doubt.

          Thanks
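          To make the hang easier to picture, here is a small, self-contained pthreads program that reproduces the same pattern Olaf describes: one thread holds a mutex (standing in for the_lnet.ln_api_mutex) while polling for router state, and the thread that would report that state (the equivalent of the cm_work_handler -> lnet_parse() -> lnet_nid2peerni_locked() path in the stacks above) blocks on the same mutex, so the poll never completes. It is an analogy only; the names are invented, and the give-up counter exists just so the demo terminates, whereas the real kernel path keeps sleeping and the hung-task watchdog fires instead.

          /* Analogy only -- not Lustre code. Build with: gcc -pthread deadlock_sketch.c */
          #include <pthread.h>
          #include <stdatomic.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <unistd.h>

          static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;
          static atomic_bool router_state_known;

          /* Stand-in for the message-handling path that must take the mutex. */
          static void *handler(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&api_mutex);     /* blocks: "configure" holds it */
              atomic_store(&router_state_known, true);
              pthread_mutex_unlock(&api_mutex);
              return NULL;
          }

          /* Stand-in for "lnetctl lnet configure --all". */
          int main(void)
          {
              pthread_t tid;
              int i;

              pthread_mutex_lock(&api_mutex);     /* held for the whole configure step */
              pthread_create(&tid, NULL, handler, NULL);

              /* Poll for router state with the mutex still held. */
              for (i = 0; i < 5; i++) {
                  if (atomic_load(&router_state_known)) {
                      printf("router state known\n");
                      pthread_mutex_unlock(&api_mutex);
                      return 0;
                  }
                  sleep(1);
              }

              printf("deadlocked: handler never got api_mutex\n");
              exit(1);                            /* give up; handler is still blocked */
          }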

          gerrit Gerrit Updater added a comment -

          Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40059
          Subject: LU-13892 lnet: lock-up during router check
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 74097295693793ca3566dccfe96abf3a7e56a659

          ofaaland Olaf Faaland added a comment -

          Hi Serguei,

          Yes, a debug patch would be fine.  Please just push it to gerrit so it goes through testing to show it doesn't itself blow things up, and so I can keep a record of where it came from.

          I've attached dmesg output including SysRq w content.  It's called pass12.dmesg.corona85.1600988107.out

          Thank you.

          ssmirnov Serguei Smirnov added a comment -

          Hi Olaf,

          Good to know this is not because of the Linux routing settings. SysRq w output would be helpful. Regarding next steps, if you'd be ok with a debug patch, that would be great, as I still haven't seen this happen anywhere else.

          Thanks,

          Serguei.

          ofaaland Olaf Faaland added a comment -

          Hi Serguei,

          Reproduced even with the recommended arp settings.  The node was booted, and as you can see, lnetctl was hung.  I forgot to set debug=-1, so I can do that and reproduce if it's helpful. I did capture SysRq w; if you want that, let me know.

          [root@corona81:~]# pdsh -w e85 systemctl status lnet
          e85: * lnet.service - lnet management
          e85:    Loaded: loaded (/etc/systemd/system/lnet.service; enabled; vendor preset: disabled)
          e85:    Active: activating (start) since Thu 2020-09-24 15:57:18 PDT; 3min 14s ago
          e85:   Process: 21710 ExecStart=/bin/grep --with-filename . /sys/module/lnet/parameters/lnet_health_sensitivity (code=exited, status=0/SUCCESS)
          e85:   Process: 21697 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
          e85:   Process: 21684 ExecStart=/tftpboot/dumps/set-arp-lnet set (code=exited, status=0/SUCCESS)
          e85:   Process: 21676 ExecStart=/bin/bash -c echo lnet for Lustre: Build Version: $(rpm -q lustre) | /usr/bin/logger -t lnet.service -p local7.info (code=exited, status=0/SUCCESS)
          e85:  Main PID: 21712 (lnetctl)
          e85:    CGroup: /system.slice/lnet.service
          e85:            `-21712 /usr/sbin/lnetctl lnet configure --all
          e85: 
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.arp_filter = 0
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.arp_announce = 2
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.rp_filter = 0
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_ignore = 1
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_filter = 0
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_announce = 2
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.rp_filter = 0
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.all.rp_filter = 0
          e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.all.accept_local = 1
          e85: Sep 24 15:57:18 corona85 grep[21710]: /sys/module/lnet/parameters/lnet_health_sensitivity:100

          What next? Another pass with debug=-1? Another pass with debug from both this router node and the peers described in its routes? A debug patch?

          thanks


          People

            Assignee: ssmirnov Serguei Smirnov
            Reporter: ofaaland Olaf Faaland
            Votes: 0
            Watchers: 8
