Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14652

LNet router stuck generating RDMA tx timeout

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.12.6
    • None
    • CentOS 7.9 (3.10.0-1160.24.1.el7.x86_64) on routers, Lustre 2.12.6
    • 3
    • 9223372036854775807

    Description

      Hello, we are still seeing weird behavior with LNet over IB with 2.12+. We have tried to upgrade clients and routers to 2.13 and then 2.14 without success. We went back to 2.12.6 LTS, but we are still seeing occasional kiblnd errors and timeout. The IB fabrics are healthy, sometimes a little bit of congestion but no discard. I'm starting to suspect a deeper problem with LNet/ko2iblnd, where sometimes credits are exhausted? We didn't have with 2.10. To me, the problem seems similar to the one reported in LU-14026 by LLNL.

      We do have the following setup:

      Fir (serves /scratch) o2ib7 — 8 x lnet routers (IB) — Sherlock v3 (o2ib3)

      Last night, one of the 8 routers (sh03-fir06) started to have problems. I've taken traces so that we can investigate.

      Router NIDs:

      [root@sh03-fir06 ~]# lctl list_nids
      10.51.0.116@o2ib3
      10.0.10.237@o2ib7
      

      LNet config on the router (we have discovery enabled and using a few Multi-Rail nodes on o2ib3):

      [root@sh03-fir06 ~]# lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 50
          health_sensitivity: 0
          recovery_interval: 1
      

      Fir Lustre servers on o2ib7 started to exhibit the following errors:

      fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
      fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
      fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9480cfc4c800
      fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
      fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
      fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
      fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
      fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff945be5538400
      fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94886b2a9000
      fir-io7-s2: Apr 28 23:26:17 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94428fd56000
      fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
      fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
      fir-io7-s2: Apr 28 23:26:43 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9462486d2400
      fir-io7-s2: Apr 28 23:26:51 fir-io7-s2 kernel: LustreError: 68967:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff949f1a1a3850 x1696974434037056/t0(0) o4->12f8a639-7e97-4157-8d89-6e1e00a728eb@10.51.13.20@o2ib3:363/0 lens 488/448 e 0 to 0 dl 1619677703 ref 1 fl Interpret:/0/0 rc 0/0
      fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
      fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
      fir-io6-s1: Apr 28 23:24:18 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff91913b76a800
      fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
      fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
      fir-io6-s1: Apr 28 23:24:36 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911f0108a400
      fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
      fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
      fir-io6-s1: Apr 28 23:25:08 fir-io6-s1 kernel: LNet: 24015:0:(o2iblnd_cb.c:2081:kiblnd_close_conn_locked()) Closing conn to 10.0.10.237@o2ib7: error -110(waiting)
      

      One interesting thing about that router, vs. the 7 others, is that it had a lot of refs (> 3000) in {{/sys/kernel/debug/lnet/nis }} and tx stuck at -367. The high refs count is similar to an issue we noticed with 2.14 routers reported in LU-14584, and we thought that maybe this was a CPT issue. Here, it happened with Lustre 2.12.6 routers.

      nid                      status alive refs peer  rtr   max    tx   min
      0@lo                       down     0    2    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      0@lo                       down     0    0    0    0     0     0     0
      10.51.0.116@o2ib3            up     0    1    8    0    64    64    47
      10.51.0.116@o2ib3            up     0    0    8    0    64    64    47
      10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
      10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
      10.51.0.116@o2ib3            up     0    0    8    0    64    64    41
      10.51.0.116@o2ib3            up     0    0    8    0    64    64    48
      10.51.0.116@o2ib3            up     0 3493    8    0    64  -367  -367
      10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
      10.0.10.237@o2ib7            up     0 3318    8    0    64    64    48
      10.0.10.237@o2ib7            up     0 3062    8    0    64    64    50
      10.0.10.237@o2ib7            up     0 6202    8    0    64    64    47
      10.0.10.237@o2ib7            up     0 3032    8    0    64    64    49
      10.0.10.237@o2ib7            up     0 6082    8    0    64    64    48
      10.0.10.237@o2ib7            up     0 5467    8    0    64    64    48
      10.0.10.237@o2ib7            up     0 3115    8    0    64    64    50
      10.0.10.237@o2ib7            up     0 3206    8    0    64    64    48
      

      I took traces on this router at the time of the problem. I'm attaching a zip files sh03-fir06-20210428.zip with the output of:

      • lnetctl stats show
      • lnetctl peer show
      • cat /sys/kernel/debug/lnet/nis
      • cat /sys/kernel/debug/lnet/peers
      • kernel logs
      • short dk logs with +net enabled, just in case that would show something interesting

      Rebooting the router fixed the problem.

      Attachments

        1. fir-io1-s1_lctl_ping_sync.log.gz
          13.23 MB
        2. fir-io1-s2_dk_net_dlmtrace_20211026.log.gz
          8.44 MB
        3. fir-io1-s2_kernel_sysrq-l_20211026.log.gz
          15 kB
        4. image-2021-11-09-15-37-32-277.png
          image-2021-11-09-15-37-32-277.png
          370 kB
        5. net_show_client_2.12.txt
          2 kB
        6. net_show_client_2.13.txt
          2 kB
        7. net_show_routers.txt
          31 kB
        8. net_show_servers.txt
          49 kB
        9. oak-io2-s1.dknet10.gz
          5.32 MB
        10. oak-io2-s1.dknet9.gz
          4.13 MB
        11. oak-io2-s1.kern.log
          112 kB
        12. sh02-14n15.dknet1.gz
          790 kB
        13. sh02-14n15.dknet2.gz
          350 kB
        14. sh02-14n15.dknet3.gz
          549 kB
        15. sh02-fir03.20211020.peers.txt
          63 kB
        16. sh02-fir03-debug-lctl_ping_sync.log.gz
          1.62 MB
        17. sh02-fir03-kern_sysrq-l-t_20211026.log.gz
          51 kB
        18. sh02-fir04-dknet.log.gz
          5.88 MB
        19. sh03-01n14.dknet.gz
          2.99 MB
        20. sh03-fir06-20210428.zip
          10.23 MB

        Activity

          People

            ssmirnov Serguei Smirnov
            sthiell Stephane Thiell
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated: