Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Lustre 2.12.7
- Environment: CentOS 7.9
- Severity: 3
Description
I wanted to report another problem related to LNet, but I think it could be different from the high refcount issue on stuck routers reported in LU-14652 (not seen here) and hopefully fixed by the patch provided by Chris Horn in LU-15234. In this case, as seen today, we get many LNet timeout errors, leading to client disconnects/reconnects, slowness, and tickets from users. We have transaction_timeout set to 50 (no retry), at_min=75, and timeout=100. The issue seems to be related to a single router between two IB HDR fabrics: o2ib5 (Oak, Lustre servers) and o2ib3 (Sherlock 3, compute nodes / clients). This is the topology:
*@o2ib5 servers <> 10.0.2.[216-217]@o2ib5 / 10.51.0.[132-131]@o2ib3 routers <> *@o2ib3 clients
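For context, this topology maps to LNet routes on the clients roughly as follows (a sketch, assuming a standard two-gateway setup; the exact configuration on our nodes may differ in details):
lnetctl route add --net o2ib5 --gateway 10.51.0.132@o2ib3
lnetctl route add --net o2ib5 --gateway 10.51.0.131@o2ib3
The servers have the corresponding routes back to o2ib3 through the routers' o2ib5 NIDs (10.0.2.[216-217]@o2ib5).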
From the clients, the problem is typically noticed with errors like these:
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (0): c: 0, oc: 1, rc: 8
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (1): c: 0, oc: 2, rc: 8
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (6): c: 0, oc: 1, rc: 8
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: 2552:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1638340518/real 1638340524] req@ffffa0b58e61ec00 x1711246854620928/t0(0) o400->oak-
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection to oak-OST0059 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: Skipped 50 previous similar messages
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
The problem can be seen on the servers with the following errors, for example on OSS 10.0.2.118@o2ib5:
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff997bbb3a9000
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 3, status -103, desc ffff99c2245edc00
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 34635:0:(ldlm_lib.c:3362:target_bulk_io()) @@@ network error on bulk WRITE req@ffff99d6f1953050 x1711214487574208/t0(0) o4->f876dd06-7d3a-b247-9185-2267ed691876@10.51.2.43@o2ib3:532/0 lens 488/448 e 0 to 0 dl 1638340717 ref 1 fl Interpret:/0/0 rc 0/0
The suspect router itself (10.51.0.132@o2ib3) shows lots of Rx failures towards o2ib3 (the clients):
00000800:00000100:15.0:1638339585.687821:0:28664:0:(o2iblnd_cb.c:2881:kiblnd_check_reconnect()) 10.51.2.43@o2ib3: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:17.0:1638339585.687858:0:4020:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.51.2.43@o2ib3 failed: 5
and also RDMA timeouts against o2ib5 (the servers):
00000800:00020000:31.0:1638339631.216651:0:4019:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue_nocred, 0 seconds
00000800:00020000:31.0:1638339631.216653:0:4019:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.0.2.114@o2ib5 (6): c: 0, oc: 0, rc: 7
00000800:00000100:31.0:1638339631.216655:0:4019:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 10.0.2.114@o2ib5: error -110(sending_nocred)(waiting)
I'm attaching the +net dk log of the router, containing the messages above, as sh03-oak02.dk.log.gz.
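For reference, a +net debug log like the attached one can be captured on the router with something along these lines (a sketch; the output path is arbitrary):
lctl set_param debug=+net
lctl clear
# ... wait for the errors to reproduce ...
lctl dk > /tmp/sh03-oak02.dk.log
gzip /tmp/sh03-oak02.dk.log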
If I check the peers on o2ib3 (the clients), the queued messages are mostly toward router 10.51.0.132@o2ib3 (I checked multiple times and the pattern is consistent):
# clush -u 5 -w @island:3\&@compute "lctl get_param peers | awk '/up/ && \$10 > 0 {print}'" | sort -k3nr
sh03-05n57: 10.51.0.132@o2ib3 574 up -1 8 8 8 -562 -4853 173368
sh03-06n37: 10.51.0.132@o2ib3 572 up -1 8 8 8 -560 -5437 168720
sh03-04n38: 10.51.0.132@o2ib3 569 up -1 8 8 8 -557 -13788 167536
sh03-02n56: 10.51.0.132@o2ib3 479 up -1 8 8 8 -461 -9506 6428640
sh03-04n38: 10.51.0.131@o2ib3 476 up -1 8 8 8 -464 -13727 140896
sh03-06n66: 10.51.0.132@o2ib3 434 up -1 8 8 8 -422 -9398 129056
sh03-02n48: 10.51.0.132@o2ib3 384 up -1 8 8 8 -372 -5318 113072
sh03-06n66: 10.51.0.131@o2ib3 384 up -1 8 8 8 -372 -9363 113664
sh03-12n11: 10.51.0.132@o2ib3 380 up -1 8 8 8 -368 -5749 116984
sh03-15n10: 10.51.0.132@o2ib3 318 up -1 8 8 8 -306 -4950 92944
sh03-02n42: 10.51.0.132@o2ib3 302 up -1 8 8 8 -281 -9604 11618384
sh03-05n14: 10.51.0.132@o2ib3 294 up -1 8 8 8 -282 -1379 88440
sh03-05n62: 10.51.0.132@o2ib3 273 up -1 8 8 8 -261 -5518 79888
sh03-02n01: 10.51.0.132@o2ib3 266 up -1 8 8 8 -254 -3913 77552
sh03-12n20: 10.51.0.132@o2ib3 264 up -1 8 8 8 -252 -9347 77488
sh03-02n41: 10.51.0.132@o2ib3 230 up -1 8 8 8 -214 -1963 6356352
sh03-02n53: 10.51.0.132@o2ib3 219 up -1 8 8 8 -207 -6578 63936
sh03-16n22: 10.51.0.132@o2ib3 199 up -1 8 8 8 -187 -9558 57720
sh03-02n57: 10.51.0.132@o2ib3 195 up -1 8 8 8 -182 -5138 3200560
sh03-06n21: 10.51.0.132@o2ib3 135 up -1 8 8 8 -123 -4791 38776
sh03-02n56: 10.51.0.131@o2ib3 127 up -1 8 8 8 -109 -9505 6329512
sh03-02n54: 10.51.0.132@o2ib3 112 up -1 8 8 8 -91 -14602 4173560
sh03-02n49: 10.51.0.132@o2ib3 95 up -1 8 8 8 -78 -5757 10510032
sh03-02n66: 10.51.0.132@o2ib3 92 up -1 8 8 8 -51 -1255 49290688
sh03-05n29: 10.51.0.132@o2ib3 91 up -1 8 8 8 -79 -6567 25752
sh03-01n07: 10.51.0.132@o2ib3 90 up -1 8 8 8 -78 -2130 25456
sh03-04n48: 10.51.0.132@o2ib3 90 up -1 8 8 8 -78 -4891 25456
sh03-06n55: 10.51.0.132@o2ib3 90 up -1 8 8 8 -78 -2319 25456
sh03-06n32: 10.51.0.132@o2ib3 88 up -1 8 8 8 -76 -6881 24864
sh03-05n01: 10.51.0.132@o2ib3 85 up -1 8 8 8 -73 -9823 23976
sh03-02n72: 10.51.0.132@o2ib3 84 up -1 8 8 8 -72 -6336 23680
sh03-06n36: 10.51.0.132@o2ib3 84 up -1 8 8 8 -71 -9006 58137
sh03-04n24: 10.51.0.132@o2ib3 82 up -1 8 8 8 -70 -4829 23088
sh03-02n50: 10.51.0.132@o2ib3 81 up -1 8 8 8 -37 -12602 45092400
sh03-01n16: 10.51.0.132@o2ib3 79 up -1 8 8 8 -67 -12919 22200
sh03-02n59: 10.51.0.132@o2ib3 79 up -1 8 8 8 -67 -11661 22200
sh03-02n55: 10.51.0.132@o2ib3 78 up -1 8 8 8 -63 -1779 2429624
sh03-02n13: 10.51.0.132@o2ib3 76 up -1 8 8 8 -64 -7355 21312
sh03-06n33: 10.51.0.132@o2ib3 74 up -1 8 8 8 -62 -17726 20952
sh03-05n22: 10.51.0.132@o2ib3 73 up -1 8 8 8 -61 -10495 20424
sh03-01n18: 10.51.0.132@o2ib3 72 up -1 8 8 8 -60 -5547 20128
sh03-04n42: 10.51.0.132@o2ib3 72 up -1 8 8 8 -60 -5387 20128
sh03-02n44: 10.51.0.132@o2ib3 71 up -1 8 8 8 -59 -1764 21064
sh03-06n03: 10.51.0.132@o2ib3 71 up -1 8 8 8 -59 -5982 20096
sh03-06n26: 10.51.0.132@o2ib3 70 up -1 8 8 8 -58 -6808 19536
sh03-04n44: 10.51.0.132@o2ib3 63 up -1 8 8 8 -51 -6094 17464
sh03-12n24: 10.51.0.132@o2ib3 61 up -1 8 8 8 -49 -4673 16872
sh03-02n47: 10.51.0.132@o2ib3 55 up -1 8 8 8 -27 -4385 26218872
sh03-02n39: 10.51.0.132@o2ib3 52 up -1 8 8 8 -40 -4968 14208
sh03-02n54: 10.51.0.131@o2ib3 51 up -1 8 8 8 -39 -14431 4158208
sh03-06n35: 10.51.0.132@o2ib3 48 up -1 8 8 8 -36 -4772 13024
sh03-04n22: 10.51.0.132@o2ib3 45 up -1 8 8 8 -29 -10100 7350704
sh03-04n19: 10.51.0.132@o2ib3 41 up -1 8 8 8 -29 -5526 10952
sh03-05n02: 10.51.0.132@o2ib3 41 up -1 8 8 8 -29 -2591 11056
sh03-06n37: 10.51.0.131@o2ib3 38 up -1 8 8 8 -26 -5438 10064
sh03-05n57: 10.51.0.131@o2ib3 36 up -1 8 8 8 -24 -4852 9736
sh03-06n14: 10.51.0.132@o2ib3 34 up -1 8 8 8 -22 -5228 9672
sh03-04n36: 10.51.0.132@o2ib3 32 up -1 8 8 8 -20 -6917 8288
sh03-02n48: 10.51.0.131@o2ib3 31 up -1 8 8 8 -19 -5319 7992
sh03-02n42: 10.51.0.131@o2ib3 28 up -1 8 8 8 -8 -9221 5858432
sh03-05n04: 10.51.0.132@o2ib3 25 up -1 8 8 8 -13 -5320 6216
sh03-05n37: 10.51.0.132@o2ib3 25 up -1 8 8 8 -13 -4483 6576
sh03-02n66: 10.51.0.131@o2ib3 24 up -1 8 8 8 7 -1886 1048648
sh03-06n23: 10.51.0.132@o2ib3 23 up -1 8 8 8 -11 -17351 5624
sh03-02n50: 10.51.0.131@o2ib3 21 up -1 8 8 8 5 -12627 3145944
sh03-04n69: 10.51.0.132@o2ib3 20 up -1 8 8 8 -8 -5736 4736
sh03-05n25: 10.51.0.132@o2ib3 20 up -1 8 8 8 -8 -7456 4736
sh03-16n22: 10.51.0.131@o2ib3 19 up -1 8 8 8 -7 -10269 4560
sh03-06n30: 10.51.0.132@o2ib3 12 up -1 8 8 8 0 -4972 2368
sh03-06n34: 10.51.0.132@o2ib3 12 up -1 8 8 8 0 -6922 2368
sh03-02n46: 10.51.0.132@o2ib3 9 up -1 8 8 8 4 -7325 4194592
sh03-06n64: 10.51.0.132@o2ib3 7 up -1 8 8 8 5 -2482 888
sh03-14n24: 10.51.0.132@o2ib3 5 up -1 8 8 8 7 -13642 4320
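For reference, in the standard 2.12 peers format the columns are nid, refs, state, last, max, rtr, min, tx, min and queue, so the awk filter on $10 above (which runs on each node, before clush prepends the hostname) selects peers with a non-zero queue, and the final sort -k3nr orders the combined output by refcount.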
From a client again (sh03-04n19), this is the peer state of the router; note the very negative min_tx_credits:
[root@sh03-04n19 ~]# lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
peer:
- primary nid: 10.51.0.132@o2ib3
Multi-Rail: True
peer state: 75
peer ni:
- nid: 10.51.0.132@o2ib3
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -5526
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 4
statistics:
send_count: 19788794
recv_count: 21223791
drop_count: 0
sent_stats:
put: 19537434
get: 251360
reply: 0
ack: 0
hello: 0
received_stats:
put: 20948617
get: 170582
reply: 104585
ack: 7
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 14
timeout: 17
error: 0
network timeout: 0
- nid: 10.0.2.217@o2ib5
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
Our LNet config is homogeneous, with:
[root@sh03-04n19 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
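For reference, globals like these can be set at runtime with lnetctl (or persisted in an lnet.conf loaded via lnetctl import); a minimal sketch using the values shown above:
lnetctl set transaction_timeout 50
lnetctl set retry_count 0
lnetctl set health_sensitivity 0
lnetctl set recovery_interval 1
lnetctl set discovery 1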
And we're pretty much using the default tunables:
[root@sh03-04n19 ~]# lnetctl net show -v
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
dev cpt: 0
tcp bonding: 0
CPT: "[0,1]"
- net type: o2ib3
local NI(s):
- nid: 10.51.4.19@o2ib3
status: up
interfaces:
0: ib0
statistics:
send_count: 267775369
recv_count: 266668619
drop_count: 59
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
peercredits_hiw: 4
map_on_demand: 0
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
lnd tunables:
dev cpt: 1
tcp bonding: 0
CPT: "[0,1]"
Note that discovery is disabled on the routers.
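For reference, discovery can be disabled on the routers either at runtime or via the lnet module parameter; a minimal sketch:
lnetctl set discovery 0
# or persistently, e.g. in /etc/modprobe.d/lnet.conf:
# options lnet lnet_peer_discovery_disabled=1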
The storage system at o2ib5 is connected to 3 other fabrics through routers without issues at the moment (1 x tcp and 2 other IB fabrics with even more clients), so the problem is definitely specific to the o2ib3 routers or clients, most likely 10.51.0.132@o2ib3. Let me know if you have any ideas on how to troubleshoot this. We're now running 2.12.7 everywhere. Thanks!