Lustre / LU-15295

LNet tx/rx timeout with routers


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.7
    • Labels: None
    • Environment: CentOS 7.9
    • Severity: 3

    Description

      I wanted to report another problem related to LNet, but I think it could be different from the high refcount issue on stuck routers reported in LU-14652 (not seen here) and hopefully fixed by the patch provided in LU-15234 by Chris Horn. In this case, as seen today, we get many LNet timeout errors, leading to client disconnects/reconnects, slowness, and tickets from users. We have transaction_timeout set to 50 (no retry), at_min=75 and timeout=100. The issue seems to be related to a single router between two IB HDR fabrics: o2ib5 (Oak, Lustre servers) and o2ib3 (Sherlock 3, compute nodes / clients). This is the topology:

      *@o2ib5 servers <> 10.0.2.[216-217]@o2ib5 / 10.51.0.[132-131]@o2ib3 routers <> *@o2ib3 clients
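
      For reference, the routing for this topology is configured roughly as follows (an illustrative sketch only; the actual server-side routes are attached as oak_oss_lnetctl_route.txt):

      # on the o2ib3 clients: reach o2ib5 through both routers
      lnetctl route add --net o2ib5 --gateway 10.51.0.131@o2ib3
      lnetctl route add --net o2ib5 --gateway 10.51.0.132@o2ib3
      # on the o2ib5 servers: reach o2ib3 through the routers' o2ib5 NIDs
      lnetctl route add --net o2ib3 --gateway 10.0.2.216@o2ib5
      lnetctl route add --net o2ib3 --gateway 10.0.2.217@o2ib5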

      From the clients, the problem is typically noticed with errors like these:

      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (0): c: 0, oc: 1, rc: 8
      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (1): c: 0, oc: 2, rc: 8
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
      Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (6): c: 0, oc: 1, rc: 8
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: 2552:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1638340518/real 1638340524]  req@ffffa0b58e61ec00 x1711246854620928/t0(0) o400->oak-
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection to oak-OST0059 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: Skipped 50 previous similar messages
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      

      The problem can be seen on the servers with errors like the following, for example on OSS 10.0.2.118@o2ib5:

      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff997bbb3a9000
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 3, status -103, desc ffff99c2245edc00
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 34635:0:(ldlm_lib.c:3362:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff99d6f1953050 x1711214487574208/t0(0) o4->f876dd06-7d3a-b247-9185-2267ed691876@10.51.2.43@o2ib3:532/0 lens 488/448 e 0 to 0 dl 1638340717 ref 1 fl Interpret:/0/0 rc 0/0
      

      The suspect router itself (10.51.0.132@o2ib3) shows many Rx failures towards o2ib3 (the clients):

      00000800:00000100:15.0:1638339585.687821:0:28664:0:(o2iblnd_cb.c:2881:kiblnd_check_reconnect()) 10.51.2.43@o2ib3: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
      00000800:00000100:17.0:1638339585.687858:0:4020:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.51.2.43@o2ib3 failed: 5
      

      and also RDMA timeouts towards the servers on o2ib5:

      00000800:00020000:31.0:1638339631.216651:0:4019:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue_nocred, 0 seconds
      00000800:00020000:31.0:1638339631.216653:0:4019:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.0.2.114@o2ib5 (6): c: 0, oc: 0, rc: 7
      00000800:00000100:31.0:1638339631.216655:0:4019:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 10.0.2.114@o2ib5: error -110(sending_nocred)(waiting)
      

      I'm attaching the +net debug kernel (dk) log of the router containing the messages above as sh03-oak02.dk.log.gz.
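
      For reference, the debug log was captured on the router roughly like this (a sketch; the destination path is just an example):

      # on the suspect router: enable +net debugging, wait for the timeouts to reproduce,
      # then dump and compress the kernel debug buffer
      lctl set_param debug=+net
      lctl dk > /tmp/sh03-oak02.dk.log
      gzip /tmp/sh03-oak02.dk.log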

      If I check the peers on o2ib3 (the clients), the queued messages are mostly towards router 10.51.0.132@o2ib3 (I checked multiple times and the pattern is consistent):

      # clush -u 5 -w @island:3\&@compute "lctl get_param peers | awk '/up/ && \$10 > 0 {print}'" | sort -k3nr
      sh03-05n57: 10.51.0.132@o2ib3         574    up    -1     8     8     8  -562 -4853 173368
      sh03-06n37: 10.51.0.132@o2ib3         572    up    -1     8     8     8  -560 -5437 168720
      sh03-04n38: 10.51.0.132@o2ib3         569    up    -1     8     8     8  -557 -13788 167536
      sh03-02n56: 10.51.0.132@o2ib3         479    up    -1     8     8     8  -461 -9506 6428640
      sh03-04n38: 10.51.0.131@o2ib3         476    up    -1     8     8     8  -464 -13727 140896
      sh03-06n66: 10.51.0.132@o2ib3         434    up    -1     8     8     8  -422 -9398 129056
      sh03-02n48: 10.51.0.132@o2ib3         384    up    -1     8     8     8  -372 -5318 113072
      sh03-06n66: 10.51.0.131@o2ib3         384    up    -1     8     8     8  -372 -9363 113664
      sh03-12n11: 10.51.0.132@o2ib3         380    up    -1     8     8     8  -368 -5749 116984
      sh03-15n10: 10.51.0.132@o2ib3         318    up    -1     8     8     8  -306 -4950 92944
      sh03-02n42: 10.51.0.132@o2ib3         302    up    -1     8     8     8  -281 -9604 11618384
      sh03-05n14: 10.51.0.132@o2ib3         294    up    -1     8     8     8  -282 -1379 88440
      sh03-05n62: 10.51.0.132@o2ib3         273    up    -1     8     8     8  -261 -5518 79888
      sh03-02n01: 10.51.0.132@o2ib3         266    up    -1     8     8     8  -254 -3913 77552
      sh03-12n20: 10.51.0.132@o2ib3         264    up    -1     8     8     8  -252 -9347 77488
      sh03-02n41: 10.51.0.132@o2ib3         230    up    -1     8     8     8  -214 -1963 6356352
      sh03-02n53: 10.51.0.132@o2ib3         219    up    -1     8     8     8  -207 -6578 63936
      sh03-16n22: 10.51.0.132@o2ib3         199    up    -1     8     8     8  -187 -9558 57720
      sh03-02n57: 10.51.0.132@o2ib3         195    up    -1     8     8     8  -182 -5138 3200560
      sh03-06n21: 10.51.0.132@o2ib3         135    up    -1     8     8     8  -123 -4791 38776
      sh03-02n56: 10.51.0.131@o2ib3         127    up    -1     8     8     8  -109 -9505 6329512
      sh03-02n54: 10.51.0.132@o2ib3         112    up    -1     8     8     8   -91 -14602 4173560
      sh03-02n49: 10.51.0.132@o2ib3          95    up    -1     8     8     8   -78 -5757 10510032
      sh03-02n66: 10.51.0.132@o2ib3          92    up    -1     8     8     8   -51 -1255 49290688
      sh03-05n29: 10.51.0.132@o2ib3          91    up    -1     8     8     8   -79 -6567 25752
      sh03-01n07: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -2130 25456
      sh03-04n48: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -4891 25456
      sh03-06n55: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -2319 25456
      sh03-06n32: 10.51.0.132@o2ib3          88    up    -1     8     8     8   -76 -6881 24864
      sh03-05n01: 10.51.0.132@o2ib3          85    up    -1     8     8     8   -73 -9823 23976
      sh03-02n72: 10.51.0.132@o2ib3          84    up    -1     8     8     8   -72 -6336 23680
      sh03-06n36: 10.51.0.132@o2ib3          84    up    -1     8     8     8   -71 -9006 58137
      sh03-04n24: 10.51.0.132@o2ib3          82    up    -1     8     8     8   -70 -4829 23088
      sh03-02n50: 10.51.0.132@o2ib3          81    up    -1     8     8     8   -37 -12602 45092400
      sh03-01n16: 10.51.0.132@o2ib3          79    up    -1     8     8     8   -67 -12919 22200
      sh03-02n59: 10.51.0.132@o2ib3          79    up    -1     8     8     8   -67 -11661 22200
      sh03-02n55: 10.51.0.132@o2ib3          78    up    -1     8     8     8   -63 -1779 2429624
      sh03-02n13: 10.51.0.132@o2ib3          76    up    -1     8     8     8   -64 -7355 21312
      sh03-06n33: 10.51.0.132@o2ib3          74    up    -1     8     8     8   -62 -17726 20952
      sh03-05n22: 10.51.0.132@o2ib3          73    up    -1     8     8     8   -61 -10495 20424
      sh03-01n18: 10.51.0.132@o2ib3          72    up    -1     8     8     8   -60 -5547 20128
      sh03-04n42: 10.51.0.132@o2ib3          72    up    -1     8     8     8   -60 -5387 20128
      sh03-02n44: 10.51.0.132@o2ib3          71    up    -1     8     8     8   -59 -1764 21064
      sh03-06n03: 10.51.0.132@o2ib3          71    up    -1     8     8     8   -59 -5982 20096
      sh03-06n26: 10.51.0.132@o2ib3          70    up    -1     8     8     8   -58 -6808 19536
      sh03-04n44: 10.51.0.132@o2ib3          63    up    -1     8     8     8   -51 -6094 17464
      sh03-12n24: 10.51.0.132@o2ib3          61    up    -1     8     8     8   -49 -4673 16872
      sh03-02n47: 10.51.0.132@o2ib3          55    up    -1     8     8     8   -27 -4385 26218872
      sh03-02n39: 10.51.0.132@o2ib3          52    up    -1     8     8     8   -40 -4968 14208
      sh03-02n54: 10.51.0.131@o2ib3          51    up    -1     8     8     8   -39 -14431 4158208
      sh03-06n35: 10.51.0.132@o2ib3          48    up    -1     8     8     8   -36 -4772 13024
      sh03-04n22: 10.51.0.132@o2ib3          45    up    -1     8     8     8   -29 -10100 7350704
      sh03-04n19: 10.51.0.132@o2ib3          41    up    -1     8     8     8   -29 -5526 10952
      sh03-05n02: 10.51.0.132@o2ib3          41    up    -1     8     8     8   -29 -2591 11056
      sh03-06n37: 10.51.0.131@o2ib3          38    up    -1     8     8     8   -26 -5438 10064
      sh03-05n57: 10.51.0.131@o2ib3          36    up    -1     8     8     8   -24 -4852 9736
      sh03-06n14: 10.51.0.132@o2ib3          34    up    -1     8     8     8   -22 -5228 9672
      sh03-04n36: 10.51.0.132@o2ib3          32    up    -1     8     8     8   -20 -6917 8288
      sh03-02n48: 10.51.0.131@o2ib3          31    up    -1     8     8     8   -19 -5319 7992
      sh03-02n42: 10.51.0.131@o2ib3          28    up    -1     8     8     8    -8 -9221 5858432
      sh03-05n04: 10.51.0.132@o2ib3          25    up    -1     8     8     8   -13 -5320 6216
      sh03-05n37: 10.51.0.132@o2ib3          25    up    -1     8     8     8   -13 -4483 6576
      sh03-02n66: 10.51.0.131@o2ib3          24    up    -1     8     8     8     7 -1886 1048648
      sh03-06n23: 10.51.0.132@o2ib3          23    up    -1     8     8     8   -11 -17351 5624
      sh03-02n50: 10.51.0.131@o2ib3          21    up    -1     8     8     8     5 -12627 3145944
      sh03-04n69: 10.51.0.132@o2ib3          20    up    -1     8     8     8    -8 -5736 4736
      sh03-05n25: 10.51.0.132@o2ib3          20    up    -1     8     8     8    -8 -7456 4736
      sh03-16n22: 10.51.0.131@o2ib3          19    up    -1     8     8     8    -7 -10269 4560
      sh03-06n30: 10.51.0.132@o2ib3          12    up    -1     8     8     8     0 -4972 2368
      sh03-06n34: 10.51.0.132@o2ib3          12    up    -1     8     8     8     0 -6922 2368
      sh03-02n46: 10.51.0.132@o2ib3           9    up    -1     8     8     8     4 -7325 4194592
      sh03-06n64: 10.51.0.132@o2ib3           7    up    -1     8     8     8     5 -2482 888
      sh03-14n24: 10.51.0.132@o2ib3           5    up    -1     8     8     8     7 -13642 4320
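
      On the routers themselves, the forwarding state and counters can be checked with something like this (sketch; the router-side outputs are also attached):

      # on 10.51.0.132@o2ib3 (and 10.51.0.131@o2ib3):
      lnetctl routing show   # confirm routing is enabled and check router buffer credits
      lnetctl net show -v    # per-NI credits/stats (attached as routers_lnetctl_net_show.txt)
      lnetctl stats show     # global counters (attached as routers_stats_show.txt)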
      

      From a client again, this is the peer state of the router:

      [root@sh03-04n19 ~]# lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
      peer:
          - primary nid: 10.51.0.132@o2ib3
            Multi-Rail: True
            peer state: 75
            peer ni:
              - nid: 10.51.0.132@o2ib3
                state: up
                max_ni_tx_credits: 8
                available_tx_credits: 8
                min_tx_credits: -5526
                tx_q_num_of_buf: 0
                available_rtr_credits: 8
                min_rtr_credits: 8
                refcount: 4
                statistics:
                    send_count: 19788794
                    recv_count: 21223791
                    drop_count: 0
                sent_stats:
                    put: 19537434
                    get: 251360
                    reply: 0
                    ack: 0
                    hello: 0
                received_stats:
                    put: 20948617
                    get: 170582
                    reply: 104585
                    ack: 7
                    hello: 0
                dropped_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                health stats:
                    health value: 1000
                    dropped: 14
                    timeout: 17
                    error: 0
                    network timeout: 0
              - nid: 10.0.2.217@o2ib5
                state: NA
                max_ni_tx_credits: 0
                available_tx_credits: 0
                min_tx_credits: 0
                tx_q_num_of_buf: 0
                available_rtr_credits: 0
                min_rtr_credits: 0
                refcount: 2
                statistics:
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                sent_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                received_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                dropped_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                health stats:
                    health value: 1000
                    dropped: 0
                    timeout: 0
                    error: 0
                    network timeout: 0
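
      For completeness, the state of the routes through this gateway as seen from a client can also be checked with something like (sketch):

      # o2ib5 routes and their status from the client's point of view
      lnetctl route show --net o2ib5 -v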
      

      Our LNet config is homogeneous across nodes, with:

      [root@sh03-04n19 ~]# lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 50
          health_sensitivity: 0
          recovery_interval: 1
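
      These globals are applied at LNet configuration time with the equivalent of the following (a rough sketch, not our exact tooling):

      lnetctl set transaction_timeout 50
      lnetctl set retry_count 0
      lnetctl set health_sensitivity 0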
      

      And we're pretty much using the default tunables:

      [root@sh03-04n19 ~]# lnetctl net show -v
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
                statistics:
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                tunables:
                    peer_timeout: 0
                    peer_credits: 0
                    peer_buffer_credits: 0
                    credits: 0
                dev cpt: 0
                tcp bonding: 0
                CPT: "[0,1]"
          - net type: o2ib3
            local NI(s):
              - nid: 10.51.4.19@o2ib3
                status: up
                interfaces:
                    0: ib0
                statistics:
                    send_count: 267775369
                    recv_count: 266668619
                    drop_count: 59
                tunables:
                    peer_timeout: 180
                    peer_credits: 8
                    peer_buffer_credits: 0
                    credits: 256
                    peercredits_hiw: 4
                    map_on_demand: 0
                    concurrent_sends: 8
                    fmr_pool_size: 512
                    fmr_flush_trigger: 384
                    fmr_cache: 1
                    ntx: 512
                    conns_per_peer: 1
                lnd tunables:
                dev cpt: 1
                tcp bonding: 0
                CPT: "[0,1]"
      

      Note that discovery is disabled on the routers.
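
      On the routers this is done with the equivalent of the following (sketch; the modprobe file path is just an example):

      lnetctl set discovery 0
      # or persistently, e.g. in /etc/modprobe.d/lnet.conf:
      #   options lnet lnet_peer_discovery_disabled=1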

      The storage system at o2ib5 is connected to 3 other fabrics through routers without issues at the moment (1 x tcp and 2 other IB fabrics with even more clients), so it's definitely a problem specific to the o2ib3 routers or clients, likely 10.51.0.132@o2ib3. Let me know if you have any ideas on how to troubleshoot these problems. We're now running 2.12.7 everywhere. Thanks!

      Attachments

        1. clients_lnetctl_net_show.txt
          7 kB
        2. clients_stats_show.txt
          2 kB
        3. oak_oss_lnetctl_net_show.txt
          29 kB
        4. oak_oss_lnetctl_route.txt
          7 kB
        5. oak_oss_peer_show_131.txt
          23 kB
        6. oak_oss_peer_show_132.txt
          23 kB
        7. oak_oss_stats_show.txt
          8 kB
        8. routers_lnetctl_net_show.txt
          8 kB
        9. routers_stats_show.txt
          1 kB
        10. sh03-oak02.dk.log.gz
          4.36 MB


          People

            Assignee: ssmirnov (Serguei Smirnov)
            Reporter: sthiell (Stephane Thiell)
            Votes: 0
            Watchers: 4
