Lustre / LU-15295

LNet tx/rx timeout with routers


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.7
    • Labels: None
    • Environment: CentOS 7.9
    • Severity: 3

    Description

      I wanted to report another problem related to LNet, but I think it could be different from the high refcount issue on stuck routers reported in LU-14652 (not seen here) and hopefully fixed by the patch provided in LU-15234 by Chris Horn. In this case, as seen today, we get many LNet timeout errors, leading to client disconnects/reconnects, slowness, and tickets from users. We have transaction_timeout set to 50 (no retry), at_min=75 and timeout=100. The issue seems to be related to a single router between two IB HDR fabrics: o2ib5 (Oak, Lustre servers) and o2ib3 (Sherlock 3, compute nodes / clients). This is the topology:

      *@o2ib5 servers <> 10.0.2.[216-217]@o2ib5 / 10.51.0.[132-131]@o2ib3 routers <> *@o2ib3 clients
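
      For reference, the routing for this topology is configured roughly as follows (an illustrative sketch only; the actual server-side routes are attached as oak_oss_lnetctl_route.txt):

      # on the o2ib3 clients: reach o2ib5 through both routers
      lnetctl route add --net o2ib5 --gateway 10.51.0.131@o2ib3
      lnetctl route add --net o2ib5 --gateway 10.51.0.132@o2ib3
      # on the o2ib5 servers: reach o2ib3 through the routers' o2ib5 NIDs
      lnetctl route add --net o2ib3 --gateway 10.0.2.216@o2ib5
      lnetctl route add --net o2ib3 --gateway 10.0.2.217@o2ib5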

      From the clients, the problem is typically noticed with errors like these:

      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (0): c: 0, oc: 1, rc: 8
      Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (1): c: 0, oc: 2, rc: 8
      Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
      Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (6): c: 0, oc: 1, rc: 8
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: 2552:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1638340518/real 1638340524]  req@ffffa0b58e61ec00 x1711246854620928/t0(0) o400->oak-
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection to oak-OST0059 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: Skipped 50 previous similar messages
      Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      

      The problem can be seen on the servers with errors like the following, for example on OSS 10.0.2.118@o2ib5:

      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff997bbb3a9000
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 3, status -103, desc ffff99c2245edc00
      Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 34635:0:(ldlm_lib.c:3362:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff99d6f1953050 x1711214487574208/t0(0) o4->f876dd06-7d3a-b247-9185-2267ed691876@10.51.2.43@o2ib3:532/0 lens 488/448 e 0 to 0 dl 1638340717 ref 1 fl Interpret:/0/0 rc 0/0
      

      The suspect router itself (10.51.0.132@o2ib3) shows many Rx failures towards o2ib3 (the clients):

      00000800:00000100:15.0:1638339585.687821:0:28664:0:(o2iblnd_cb.c:2881:kiblnd_check_reconnect()) 10.51.2.43@o2ib3: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
      00000800:00000100:17.0:1638339585.687858:0:4020:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.51.2.43@o2ib3 failed: 5
      

      and also RDMA timeouts towards the servers on o2ib5:

      00000800:00020000:31.0:1638339631.216651:0:4019:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue_nocred, 0 seconds
      00000800:00020000:31.0:1638339631.216653:0:4019:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.0.2.114@o2ib5 (6): c: 0, oc: 0, rc: 7
      00000800:00000100:31.0:1638339631.216655:0:4019:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 10.0.2.114@o2ib5: error -110(sending_nocred)(waiting)
      

      I'm attaching the +net debug kernel (dk) log of the router containing the messages above as sh03-oak02.dk.log.gz.
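
      For reference, the debug log was captured on the router roughly like this (a sketch; the destination path is just an example):

      # on the suspect router: enable +net debugging, wait for the timeouts to reproduce,
      # then dump and compress the kernel debug buffer
      lctl set_param debug=+net
      lctl dk > /tmp/sh03-oak02.dk.log
      gzip /tmp/sh03-oak02.dk.log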

      If I check the peers on o2ib3 (the clients), the queued messages are mostly towards router 10.51.0.132@o2ib3 (I checked multiple times and the pattern is consistent):

      # clush -u 5 -w @island:3\&@compute "lctl get_param peers | awk '/up/ && \$10 > 0 {print}'" | sort -k3nr
      sh03-05n57: 10.51.0.132@o2ib3         574    up    -1     8     8     8  -562 -4853 173368
      sh03-06n37: 10.51.0.132@o2ib3         572    up    -1     8     8     8  -560 -5437 168720
      sh03-04n38: 10.51.0.132@o2ib3         569    up    -1     8     8     8  -557 -13788 167536
      sh03-02n56: 10.51.0.132@o2ib3         479    up    -1     8     8     8  -461 -9506 6428640
      sh03-04n38: 10.51.0.131@o2ib3         476    up    -1     8     8     8  -464 -13727 140896
      sh03-06n66: 10.51.0.132@o2ib3         434    up    -1     8     8     8  -422 -9398 129056
      sh03-02n48: 10.51.0.132@o2ib3         384    up    -1     8     8     8  -372 -5318 113072
      sh03-06n66: 10.51.0.131@o2ib3         384    up    -1     8     8     8  -372 -9363 113664
      sh03-12n11: 10.51.0.132@o2ib3         380    up    -1     8     8     8  -368 -5749 116984
      sh03-15n10: 10.51.0.132@o2ib3         318    up    -1     8     8     8  -306 -4950 92944
      sh03-02n42: 10.51.0.132@o2ib3         302    up    -1     8     8     8  -281 -9604 11618384
      sh03-05n14: 10.51.0.132@o2ib3         294    up    -1     8     8     8  -282 -1379 88440
      sh03-05n62: 10.51.0.132@o2ib3         273    up    -1     8     8     8  -261 -5518 79888
      sh03-02n01: 10.51.0.132@o2ib3         266    up    -1     8     8     8  -254 -3913 77552
      sh03-12n20: 10.51.0.132@o2ib3         264    up    -1     8     8     8  -252 -9347 77488
      sh03-02n41: 10.51.0.132@o2ib3         230    up    -1     8     8     8  -214 -1963 6356352
      sh03-02n53: 10.51.0.132@o2ib3         219    up    -1     8     8     8  -207 -6578 63936
      sh03-16n22: 10.51.0.132@o2ib3         199    up    -1     8     8     8  -187 -9558 57720
      sh03-02n57: 10.51.0.132@o2ib3         195    up    -1     8     8     8  -182 -5138 3200560
      sh03-06n21: 10.51.0.132@o2ib3         135    up    -1     8     8     8  -123 -4791 38776
      sh03-02n56: 10.51.0.131@o2ib3         127    up    -1     8     8     8  -109 -9505 6329512
      sh03-02n54: 10.51.0.132@o2ib3         112    up    -1     8     8     8   -91 -14602 4173560
      sh03-02n49: 10.51.0.132@o2ib3          95    up    -1     8     8     8   -78 -5757 10510032
      sh03-02n66: 10.51.0.132@o2ib3          92    up    -1     8     8     8   -51 -1255 49290688
      sh03-05n29: 10.51.0.132@o2ib3          91    up    -1     8     8     8   -79 -6567 25752
      sh03-01n07: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -2130 25456
      sh03-04n48: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -4891 25456
      sh03-06n55: 10.51.0.132@o2ib3          90    up    -1     8     8     8   -78 -2319 25456
      sh03-06n32: 10.51.0.132@o2ib3          88    up    -1     8     8     8   -76 -6881 24864
      sh03-05n01: 10.51.0.132@o2ib3          85    up    -1     8     8     8   -73 -9823 23976
      sh03-02n72: 10.51.0.132@o2ib3          84    up    -1     8     8     8   -72 -6336 23680
      sh03-06n36: 10.51.0.132@o2ib3          84    up    -1     8     8     8   -71 -9006 58137
      sh03-04n24: 10.51.0.132@o2ib3          82    up    -1     8     8     8   -70 -4829 23088
      sh03-02n50: 10.51.0.132@o2ib3          81    up    -1     8     8     8   -37 -12602 45092400
      sh03-01n16: 10.51.0.132@o2ib3          79    up    -1     8     8     8   -67 -12919 22200
      sh03-02n59: 10.51.0.132@o2ib3          79    up    -1     8     8     8   -67 -11661 22200
      sh03-02n55: 10.51.0.132@o2ib3          78    up    -1     8     8     8   -63 -1779 2429624
      sh03-02n13: 10.51.0.132@o2ib3          76    up    -1     8     8     8   -64 -7355 21312
      sh03-06n33: 10.51.0.132@o2ib3          74    up    -1     8     8     8   -62 -17726 20952
      sh03-05n22: 10.51.0.132@o2ib3          73    up    -1     8     8     8   -61 -10495 20424
      sh03-01n18: 10.51.0.132@o2ib3          72    up    -1     8     8     8   -60 -5547 20128
      sh03-04n42: 10.51.0.132@o2ib3          72    up    -1     8     8     8   -60 -5387 20128
      sh03-02n44: 10.51.0.132@o2ib3          71    up    -1     8     8     8   -59 -1764 21064
      sh03-06n03: 10.51.0.132@o2ib3          71    up    -1     8     8     8   -59 -5982 20096
      sh03-06n26: 10.51.0.132@o2ib3          70    up    -1     8     8     8   -58 -6808 19536
      sh03-04n44: 10.51.0.132@o2ib3          63    up    -1     8     8     8   -51 -6094 17464
      sh03-12n24: 10.51.0.132@o2ib3          61    up    -1     8     8     8   -49 -4673 16872
      sh03-02n47: 10.51.0.132@o2ib3          55    up    -1     8     8     8   -27 -4385 26218872
      sh03-02n39: 10.51.0.132@o2ib3          52    up    -1     8     8     8   -40 -4968 14208
      sh03-02n54: 10.51.0.131@o2ib3          51    up    -1     8     8     8   -39 -14431 4158208
      sh03-06n35: 10.51.0.132@o2ib3          48    up    -1     8     8     8   -36 -4772 13024
      sh03-04n22: 10.51.0.132@o2ib3          45    up    -1     8     8     8   -29 -10100 7350704
      sh03-04n19: 10.51.0.132@o2ib3          41    up    -1     8     8     8   -29 -5526 10952
      sh03-05n02: 10.51.0.132@o2ib3          41    up    -1     8     8     8   -29 -2591 11056
      sh03-06n37: 10.51.0.131@o2ib3          38    up    -1     8     8     8   -26 -5438 10064
      sh03-05n57: 10.51.0.131@o2ib3          36    up    -1     8     8     8   -24 -4852 9736
      sh03-06n14: 10.51.0.132@o2ib3          34    up    -1     8     8     8   -22 -5228 9672
      sh03-04n36: 10.51.0.132@o2ib3          32    up    -1     8     8     8   -20 -6917 8288
      sh03-02n48: 10.51.0.131@o2ib3          31    up    -1     8     8     8   -19 -5319 7992
      sh03-02n42: 10.51.0.131@o2ib3          28    up    -1     8     8     8    -8 -9221 5858432
      sh03-05n04: 10.51.0.132@o2ib3          25    up    -1     8     8     8   -13 -5320 6216
      sh03-05n37: 10.51.0.132@o2ib3          25    up    -1     8     8     8   -13 -4483 6576
      sh03-02n66: 10.51.0.131@o2ib3          24    up    -1     8     8     8     7 -1886 1048648
      sh03-06n23: 10.51.0.132@o2ib3          23    up    -1     8     8     8   -11 -17351 5624
      sh03-02n50: 10.51.0.131@o2ib3          21    up    -1     8     8     8     5 -12627 3145944
      sh03-04n69: 10.51.0.132@o2ib3          20    up    -1     8     8     8    -8 -5736 4736
      sh03-05n25: 10.51.0.132@o2ib3          20    up    -1     8     8     8    -8 -7456 4736
      sh03-16n22: 10.51.0.131@o2ib3          19    up    -1     8     8     8    -7 -10269 4560
      sh03-06n30: 10.51.0.132@o2ib3          12    up    -1     8     8     8     0 -4972 2368
      sh03-06n34: 10.51.0.132@o2ib3          12    up    -1     8     8     8     0 -6922 2368
      sh03-02n46: 10.51.0.132@o2ib3           9    up    -1     8     8     8     4 -7325 4194592
      sh03-06n64: 10.51.0.132@o2ib3           7    up    -1     8     8     8     5 -2482 888
      sh03-14n24: 10.51.0.132@o2ib3           5    up    -1     8     8     8     7 -13642 4320
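
      On the routers themselves, the forwarding state and counters can be checked with something like this (sketch; the router-side outputs are also attached):

      # on 10.51.0.132@o2ib3 (and 10.51.0.131@o2ib3):
      lnetctl routing show   # confirm routing is enabled and check router buffer credits
      lnetctl net show -v    # per-NI credits/stats (attached as routers_lnetctl_net_show.txt)
      lnetctl stats show     # global counters (attached as routers_stats_show.txt)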
      

      From a client again, this is the peer state of the router:

      [root@sh03-04n19 ~]# lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
      peer:
          - primary nid: 10.51.0.132@o2ib3
            Multi-Rail: True
            peer state: 75
            peer ni:
              - nid: 10.51.0.132@o2ib3
                state: up
                max_ni_tx_credits: 8
                available_tx_credits: 8
                min_tx_credits: -5526
                tx_q_num_of_buf: 0
                available_rtr_credits: 8
                min_rtr_credits: 8
                refcount: 4
                statistics:
                    send_count: 19788794
                    recv_count: 21223791
                    drop_count: 0
                sent_stats:
                    put: 19537434
                    get: 251360
                    reply: 0
                    ack: 0
                    hello: 0
                received_stats:
                    put: 20948617
                    get: 170582
                    reply: 104585
                    ack: 7
                    hello: 0
                dropped_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                health stats:
                    health value: 1000
                    dropped: 14
                    timeout: 17
                    error: 0
                    network timeout: 0
              - nid: 10.0.2.217@o2ib5
                state: NA
                max_ni_tx_credits: 0
                available_tx_credits: 0
                min_tx_credits: 0
                tx_q_num_of_buf: 0
                available_rtr_credits: 0
                min_rtr_credits: 0
                refcount: 2
                statistics:
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                sent_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                received_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                dropped_stats:
                    put: 0
                    get: 0
                    reply: 0
                    ack: 0
                    hello: 0
                health stats:
                    health value: 1000
                    dropped: 0
                    timeout: 0
                    error: 0
                    network timeout: 0
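
      For completeness, the state of the routes through this gateway as seen from a client can also be checked with something like (sketch):

      # o2ib5 routes and their status from the client's point of view
      lnetctl route show --net o2ib5 -v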
      

      Our LNet config is homogeneous across nodes, with:

      [root@sh03-04n19 ~]# lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 50
          health_sensitivity: 0
          recovery_interval: 1
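
      These globals are applied at LNet configuration time with the equivalent of the following (a rough sketch, not our exact tooling):

      lnetctl set transaction_timeout 50
      lnetctl set retry_count 0
      lnetctl set health_sensitivity 0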
      

      And we're pretty much using the default tunables:

      [root@sh03-04n19 ~]# lnetctl net show -v
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
                statistics:
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                tunables:
                    peer_timeout: 0
                    peer_credits: 0
                    peer_buffer_credits: 0
                    credits: 0
                dev cpt: 0
                tcp bonding: 0
                CPT: "[0,1]"
          - net type: o2ib3
            local NI(s):
              - nid: 10.51.4.19@o2ib3
                status: up
                interfaces:
                    0: ib0
                statistics:
                    send_count: 267775369
                    recv_count: 266668619
                    drop_count: 59
                tunables:
                    peer_timeout: 180
                    peer_credits: 8
                    peer_buffer_credits: 0
                    credits: 256
                    peercredits_hiw: 4
                    map_on_demand: 0
                    concurrent_sends: 8
                    fmr_pool_size: 512
                    fmr_flush_trigger: 384
                    fmr_cache: 1
                    ntx: 512
                    conns_per_peer: 1
                lnd tunables:
                dev cpt: 1
                tcp bonding: 0
                CPT: "[0,1]"
      

      Note that discovery is disabled on the routers.
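
      On the routers this is done with the equivalent of the following (sketch; the modprobe file path is just an example):

      lnetctl set discovery 0
      # or persistently, e.g. in /etc/modprobe.d/lnet.conf:
      #   options lnet lnet_peer_discovery_disabled=1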

      The storage system at o2ib5 is connected to 3 other fabrics through routers without issues at the moment (1 x tcp and 2 other IB fabrics with even more clients), so it's definitely a problem specific to the o2ib3 routers or clients, likely 10.51.0.132@o2ib3. Let me know if you have any ideas on how to troubleshoot these problems. We're now running 2.12.7 everywhere. Thanks!

      Attachments

        1. clients_lnetctl_net_show.txt
          7 kB
        2. clients_stats_show.txt
          2 kB
        3. oak_oss_lnetctl_net_show.txt
          29 kB
        4. oak_oss_lnetctl_route.txt
          7 kB
        5. oak_oss_peer_show_131.txt
          23 kB
        6. oak_oss_peer_show_132.txt
          23 kB
        7. oak_oss_stats_show.txt
          8 kB
        8. routers_lnetctl_net_show.txt
          8 kB
        9. routers_stats_show.txt
          1 kB
        10. sh03-oak02.dk.log.gz
          4.36 MB


          People

            Assignee: ssmirnov (Serguei Smirnov)
            Reporter: sthiell (Stephane Thiell)
            Votes: 0
            Watchers: 4
