[LU-15295] LNet tx/rx timeout with routers Created: 01/Dec/21 Updated: 24/Feb/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.9 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I wanted to report a problem also related to LNet, but I think it could be different from the high refs count issue on stuck routers reported in LU-14652 (not seen here) and hopefully fixed by the patch provided there.

The topology is:
*@o2ib5 servers <> 10.0.2.[216-217]@o2ib5 / 10.51.0.[132-131]@o2ib3 routers <> *@o2ib3 clients

From the clients, the problem is typically noticed with errors like these:

Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (0): c: 0, oc: 1, rc: 8
Nov 30 22:25:26 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (1): c: 0, oc: 2, rc: 8
Nov 30 22:28:54 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Nov 30 22:35:24 sh03-04n19.int kernel: LNetError: 2146:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.51.0.132@o2ib3 (6): c: 0, oc: 1, rc: 8
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: 2552:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1638340518/real 1638340524] req@ffffa0b58e61ec00 x1711246854620928/t0(0) o400->oak-
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection to oak-OST0059 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: Skipped 50 previous similar messages
Nov 30 22:35:24 sh03-04n19.int kernel: Lustre: oak-OST0059-osc-ffffa0b9ecf67800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)

The problem can be seen on the servers with the following errors, for example from an OSS 10.0.2.118@o2ib5:

Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff997bbb3a9000
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff99a6d5b5c400
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 17387:0:(events.c:455:server_bulk_callback()) event type 3, status -103, desc ffff99c2245edc00
Nov 30 22:37:05 oak-io5-s2 kernel: LustreError: 34635:0:(ldlm_lib.c:3362:target_bulk_io()) @@@ network error on bulk WRITE req@ffff99d6f1953050 x1711214487574208/t0(0) o4->f876dd06-7d3a-b247-9185-2267ed691876@10.51.2.43@o2ib3:532/0 lens 488/448 e 0 to 0 dl 1638340717 ref 1 fl Interpret:/0/0 rc 0/0

The suspect router itself (10.51.0.132@o2ib3) shows lots of Rx failed messages towards o2ib3 (the clients):

00000800:00000100:15.0:1638339585.687821:0:28664:0:(o2iblnd_cb.c:2881:kiblnd_check_reconnect()) 10.51.2.43@o2ib3: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:17.0:1638339585.687858:0:4020:0:(o2iblnd_cb.c:507:kiblnd_rx_complete()) Rx from 10.51.2.43@o2ib3 failed: 5

and also RDMA timeouts against the servers:

00000800:00020000:31.0:1638339631.216651:0:4019:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue_nocred, 0 seconds
00000800:00020000:31.0:1638339631.216653:0:4019:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.0.2.114@o2ib5 (6): c: 0, oc: 0, rc: 7
00000800:00000100:31.0:1638339631.216655:0:4019:0:(o2iblnd_cb.c:2101:kiblnd_close_conn_locked()) Closing conn to 10.0.2.114@o2ib5: error -110(sending_nocred)(waiting)

I'm attaching a +net dk log of the router containing the messages above as sh03-oak02.dk.log.gz.

If I check the peers on o2ib3 (the clients), the queued messages are mainly with router 10.51.0.132@o2ib3 (I checked multiple times and it's clear):

# clush -u 5 -w @island:3\&@compute "lctl get_param peers | awk '/up/ && \$10 > 0 {print}'" | sort -k3nr
sh03-05n57: 10.51.0.132@o2ib3 574 up -1 8 8 8 -562 -4853 173368
sh03-06n37: 10.51.0.132@o2ib3 572 up -1 8 8 8 -560 -5437 168720
sh03-04n38: 10.51.0.132@o2ib3 569 up -1 8 8 8 -557 -13788 167536
sh03-02n56: 10.51.0.132@o2ib3 479 up -1 8 8 8 -461 -9506 6428640
sh03-04n38: 10.51.0.131@o2ib3 476 up -1 8 8 8 -464 -13727 140896
sh03-06n66: 10.51.0.132@o2ib3 434 up -1 8 8 8 -422 -9398 129056
sh03-02n48: 10.51.0.132@o2ib3 384 up -1 8 8 8 -372 -5318 113072
sh03-06n66: 10.51.0.131@o2ib3 384 up -1 8 8 8 -372 -9363 113664
sh03-12n11: 10.51.0.132@o2ib3 380 up -1 8 8 8 -368 -5749 116984
sh03-15n10: 10.51.0.132@o2ib3 318 up -1 8 8 8 -306 -4950 92944
sh03-02n42: 10.51.0.132@o2ib3 302 up -1 8 8 8 -281 -9604 11618384
sh03-05n14: 10.51.0.132@o2ib3 294 up -1 8 8 8 -282 -1379 88440
sh03-05n62: 10.51.0.132@o2ib3 273 up -1 8 8 8 -261 -5518 79888
sh03-02n01: 10.51.0.132@o2ib3 266 up -1 8 8 8 -254 -3913 77552
sh03-12n20: 10.51.0.132@o2ib3 264 up -1 8 8 8 -252 -9347 77488
sh03-02n41: 10.51.0.132@o2ib3 230 up -1 8 8 8 -214 -1963 6356352
sh03-02n53: 10.51.0.132@o2ib3 219 up -1 8 8 8 -207 -6578 63936
sh03-16n22: 10.51.0.132@o2ib3 199 up -1 8 8 8 -187 -9558 57720
sh03-02n57: 10.51.0.132@o2ib3 195 up -1 8 8 8 -182 -5138 3200560
sh03-06n21: 10.51.0.132@o2ib3 135 up -1 8 8 8 -123 -4791 38776
sh03-02n56: 10.51.0.131@o2ib3 127 up -1 8 8 8 -109 -9505 6329512
sh03-02n54: 10.51.0.132@o2ib3 112 up -1 8 8 8 -91 -14602 4173560
sh03-02n49: 10.51.0.132@o2ib3 95 up -1 8 8 8 -78 -5757 10510032
sh03-02n66: 10.51.0.132@o2ib3 92 up -1 8 8 8 -51 -1255 49290688
sh03-05n29: 10.51.0.132@o2ib3 91 up -1 8 8 8 -79 -6567 25752
sh03-01n07: 10.51.0.132@o2ib3 90 up -1 8 8 8 -78 -2130 25456
sh03-04n48: 10.51.0.132@o2ib3 90 up -1 8 8 8 -78 -4891 25456
sh03-06n55: 10.51.0.132@o2ib3 90 up -1 8 8 8 -78 -2319 25456
sh03-06n32: 10.51.0.132@o2ib3 88 up -1 8 8 8 -76 -6881 24864
sh03-05n01: 10.51.0.132@o2ib3 85 up -1 8 8 8 -73 -9823 23976
sh03-02n72: 10.51.0.132@o2ib3 84 up -1 8 8 8 -72 -6336 23680
sh03-06n36: 10.51.0.132@o2ib3 84 up -1 8 8 8 -71 -9006 58137
sh03-04n24: 10.51.0.132@o2ib3 82 up -1 8 8 8 -70 -4829 23088
sh03-02n50: 10.51.0.132@o2ib3 81 up -1 8 8 8 -37 -12602 45092400
sh03-01n16: 10.51.0.132@o2ib3 79 up -1 8 8 8 -67 -12919 22200
sh03-02n59: 10.51.0.132@o2ib3 79 up -1 8 8 8 -67 -11661 22200
sh03-02n55: 10.51.0.132@o2ib3 78 up -1 8 8 8 -63 -1779 2429624
sh03-02n13: 10.51.0.132@o2ib3 76 up -1 8 8 8 -64 -7355 21312
sh03-06n33: 10.51.0.132@o2ib3 74 up -1 8 8 8 -62 -17726 20952
sh03-05n22: 10.51.0.132@o2ib3 73 up -1 8 8 8 -61 -10495 20424
sh03-01n18: 10.51.0.132@o2ib3 72 up -1 8 8 8 -60 -5547 20128
sh03-04n42: 10.51.0.132@o2ib3 72 up -1 8 8 8 -60 -5387 20128
sh03-02n44: 10.51.0.132@o2ib3 71 up -1 8 8 8 -59 -1764 21064
sh03-06n03: 10.51.0.132@o2ib3 71 up -1 8 8 8 -59 -5982 20096
sh03-06n26: 10.51.0.132@o2ib3 70 up -1 8 8 8 -58 -6808 19536
sh03-04n44: 10.51.0.132@o2ib3 63 up -1 8 8 8 -51 -6094 17464
sh03-12n24: 10.51.0.132@o2ib3 61 up -1 8 8 8 -49 -4673 16872
sh03-02n47: 10.51.0.132@o2ib3 55 up -1 8 8 8 -27 -4385 26218872
sh03-02n39: 10.51.0.132@o2ib3 52 up -1 8 8 8 -40 -4968 14208
sh03-02n54: 10.51.0.131@o2ib3 51 up -1 8 8 8 -39 -14431 4158208
sh03-06n35: 10.51.0.132@o2ib3 48 up -1 8 8 8 -36 -4772 13024
sh03-04n22: 10.51.0.132@o2ib3 45 up -1 8 8 8 -29 -10100 7350704
sh03-04n19: 10.51.0.132@o2ib3 41 up -1 8 8 8 -29 -5526 10952
sh03-05n02: 10.51.0.132@o2ib3 41 up -1 8 8 8 -29 -2591 11056
sh03-06n37: 10.51.0.131@o2ib3 38 up -1 8 8 8 -26 -5438 10064
sh03-05n57: 10.51.0.131@o2ib3 36 up -1 8 8 8 -24 -4852 9736
sh03-06n14: 10.51.0.132@o2ib3 34 up -1 8 8 8 -22 -5228 9672
sh03-04n36: 10.51.0.132@o2ib3 32 up -1 8 8 8 -20 -6917 8288
sh03-02n48: 10.51.0.131@o2ib3 31 up -1 8 8 8 -19 -5319 7992
sh03-02n42: 10.51.0.131@o2ib3 28 up -1 8 8 8 -8 -9221 5858432
sh03-05n04: 10.51.0.132@o2ib3 25 up -1 8 8 8 -13 -5320 6216
sh03-05n37: 10.51.0.132@o2ib3 25 up -1 8 8 8 -13 -4483 6576
sh03-02n66: 10.51.0.131@o2ib3 24 up -1 8 8 8 7 -1886 1048648
sh03-06n23: 10.51.0.132@o2ib3 23 up -1 8 8 8 -11 -17351 5624
sh03-02n50: 10.51.0.131@o2ib3 21 up -1 8 8 8 5 -12627 3145944
sh03-04n69: 10.51.0.132@o2ib3 20 up -1 8 8 8 -8 -5736 4736
sh03-05n25: 10.51.0.132@o2ib3 20 up -1 8 8 8 -8 -7456 4736
sh03-16n22: 10.51.0.131@o2ib3 19 up -1 8 8 8 -7 -10269 4560
sh03-06n30: 10.51.0.132@o2ib3 12 up -1 8 8 8 0 -4972 2368
sh03-06n34: 10.51.0.132@o2ib3 12 up -1 8 8 8 0 -6922 2368
sh03-02n46: 10.51.0.132@o2ib3 9 up -1 8 8 8 4 -7325 4194592
sh03-06n64: 10.51.0.132@o2ib3 7 up -1 8 8 8 5 -2482 888
sh03-14n24: 10.51.0.132@o2ib3 5 up -1 8 8 8 7 -13642 4320
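For reference, the $10 field in the awk filter above is the per-peer queue size, assuming the usual "nid refs state last max rtr min tx min queue" column layout of lctl get_param peers. A minimal single-node sketch to rank peers by queued bytes instead of refs would be something like:

lctl get_param -n peers | awk '$3 == "up" && $10 > 0 {print $10, $1}' | sort -rn | head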
From a client again, this is the peer state of the router:
[root@sh03-04n19 ~]# lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
peer:
- primary nid: 10.51.0.132@o2ib3
Multi-Rail: True
peer state: 75
peer ni:
- nid: 10.51.0.132@o2ib3
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -5526
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 4
statistics:
send_count: 19788794
recv_count: 21223791
drop_count: 0
sent_stats:
put: 19537434
get: 251360
reply: 0
ack: 0
hello: 0
received_stats:
put: 20948617
get: 170582
reply: 104585
ack: 7
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 14
timeout: 17
error: 0
network timeout: 0
- nid: 10.0.2.217@o2ib5
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
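As an aside, basic reachability of the suspect router from a client can still be checked with a plain LNet-level ping, e.g. (either form should work):

lctl ping 10.51.0.132@o2ib3
lnetctl ping 10.51.0.132@o2ib3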
Our lnet config is homogeneous, with:
[root@sh03-04n19 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
And pretty much using default tunables:
[root@sh03-04n19 ~]# lnetctl net show -v
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
dev cpt: 0
tcp bonding: 0
CPT: "[0,1]"
- net type: o2ib3
local NI(s):
- nid: 10.51.4.19@o2ib3
status: up
interfaces:
0: ib0
statistics:
send_count: 267775369
recv_count: 266668619
drop_count: 59
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
peercredits_hiw: 4
map_on_demand: 0
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
lnd tunables:
dev cpt: 1
tcp bonding: 0
CPT: "[0,1]"
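For reference, the o2ib3 tunables shown above normally come from the ko2iblnd module options; a hypothetical /etc/modprobe.d/ko2iblnd.conf matching these values (an illustration only, not our actual file) would look like:

options ko2iblnd peer_credits=8 peer_credits_hiw=4 concurrent_sends=8 credits=256 ntx=512 map_on_demand=0 conns_per_peer=1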
Note that discovery is disabled on the routers. The storage system at o2ib5 is connected to 3 other fabrics through routers without issues at the moment (1 x tcp and 2 other IB fabrics with even more clients), so it's definitely a problem with the o2ib3 routers or clients, likely 10.51.0.132@o2ib3. Let me know if you have any ideas on how to troubleshoot these problems. We're now running 2.12.7 everywhere. Thanks! |
| Comments |
| Comment by Peter Jones [ 01/Dec/21 ] |
|
Serguei, what do you suggest here? Peter |
| Comment by Serguei Smirnov [ 01/Dec/21 ] |
|
Hi,

If I understand correctly, there are two routers between o2ib3 and o2ib5, but the issue happens mostly with "132"?

If the router enters this "bad" state, can you still "lnetctl ping" it from clients and servers? Is the router able to recover on its own, or does it just continue to cause problems until rebooted?

It would also be useful to take a look at

lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4
lnetctl peer show --nid 10.51.0.131@o2ib3 -v 4
lnetctl route show -v 4

from the servers and several clients, in order to check if there's any difference in how they see the routers, and

lnetctl global show
lnetctl stats show
lnetctl net show -v 4

from the two routers, the servers and at least one "typical" client (you already provided the client "global" and "net" dumps above).

If you are using lnet or ko2iblnd module parameter settings that are different from the defaults, please attach the corresponding conf files.

Thanks,
Serguei. |
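In case it helps, one way to gather all of this in one pass is with clush, using the router and server groups from this ticket (a sketch; adjust the node lists and add the client nodes as needed):

clush -b -w sh03-oak[01-02] 'lnetctl global show; lnetctl stats show; lnetctl net show -v 4'
clush -b -w oak-io[1-6]-s[1-2] 'lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4; lnetctl peer show --nid 10.51.0.131@o2ib3 -v 4; lnetctl route show -v 4'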
| Comment by Stephane Thiell [ 01/Dec/21 ] |
|
Thanks Serguei,

Yes, I see more RDMA timeout errors and queued messages towards "132". I'm still able to lctl ping it, unlike when we hit the problem of a high refs count that never decreases, described in LU-14652.

Today, the problem reported in this ticket is still present between o2ib5 and o2ib3. We haven't rebooted the two routers yet. We would like to investigate more and understand what could be wrong.

While gathering the output you asked for, I noticed one difference. We have 12 OSSes, and from one of them (oak-io1-s1) the two routers are seen as non-MR peers, while on the 11 other OSSes the two routers are seen as MR peers (see the discovery-check sketch at the end of this comment).

Output of: lnetctl peer show --nid 10.51.0.132@o2ib3 -v 4 is available in oak_oss_peer_show_132.txt
Output of: lnetctl peer show --nid 10.51.0.131@o2ib3 -v 4 is available in oak_oss_peer_show_131.txt

From oak-io1-s1, if I ping the routers on their o2ib5 interface, I see non-MR peers, like this:
[root@oak-io1-s1 ~]# lnetctl peer show --nid 10.0.2.217@o2ib5 -v 4
peer:
- primary nid: 10.0.2.217@o2ib5
Multi-Rail: False
peer state: 0
peer ni:
- nid: 10.0.2.217@o2ib5
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -286
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 4
statistics:
send_count: 120315781
recv_count: 110271823
drop_count: 0
sent_stats:
put: 120240966
get: 74815
reply: 0
ack: 0
hello: 0
received_stats:
put: 101581707
get: 93
reply: 74563
ack: 8615460
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 20
timeout: 0
error: 0
network timeout: 0
oak-io1-s1 doesn't seem to be generating any error though, so I'm not sure this is related, but it's clearly a difference on the servers.

Output of: lnetctl route show -v 4 available in oak_oss_lnetctl_route.txt, but it's not easily readable so I like to use old-school lctl route_list:

[root@oak-hn01 ~]# clush -w oak-io[1-6]-s[1-2] -b 'lctl route_list | sort' |& tee -a /tmp/lctlroute
---------------
oak-io[1-6]-s[1-2] (12)
---------------
net o2ib1 hops 4294967295 gw 10.0.2.212@o2ib5 up pri 0
net o2ib1 hops 4294967295 gw 10.0.2.213@o2ib5 up pri 0
net o2ib2 hops 4294967295 gw 10.0.2.214@o2ib5 up pri 0
net o2ib2 hops 4294967295 gw 10.0.2.215@o2ib5 up pri 0
net o2ib3 hops 4294967295 gw 10.0.2.216@o2ib5 up pri 0
net o2ib3 hops 4294967295 gw 10.0.2.217@o2ib5 up pri 0
net tcp1 hops 4294967295 gw 10.0.2.206@o2ib5 up pri 0
net tcp1 hops 4294967295 gw 10.0.2.207@o2ib5 up pri 0

Output of lnetctl global show

From the servers:
---------------
oak-io[1-6]-s[1-2] (12)
---------------
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
From the routers:
---------------
sh03-oak[01-02] (2)
---------------
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
From the clients:
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
Output of lnetctl stats show
Output of lnetctl net show -v 4
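Regarding the MR vs non-MR difference noted above: a possible way to re-check it from oak-io1-s1 (a sketch; discovery is disabled on the routers themselves, so the result of triggering discovery from the server side may differ) would be:

lnetctl discover 10.51.0.132@o2ib3
lnetctl peer show --nid 10.51.0.132@o2ib3 | grep -E 'primary nid|Multi-Rail'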
|
| Comment by Serguei Smirnov [ 02/Dec/21 ] |
|
Stephane,

This appears to be something that we can try addressing by increasing timeouts to allow things to complete. We can set lnet_transaction_timeout closer to the Lustre at_min by running the following on all nodes:

lnetctl set transaction_timeout 70

There's a chance that this can help deal with delays due to high load and bursts that are just a few seconds too long. It would be best if this could be reproduced on some sort of smaller development cluster.

Thanks,
Serguei. |
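A possible way to roll this out, for what it's worth (node groups are the ones from this ticket; persisting the value via /etc/lnet.conf is an assumption about the local setup, where lnet.service imports that file):

clush -b -w oak-io[1-6]-s[1-2],sh03-oak[01-02] 'lnetctl set transaction_timeout 70'
clush -b -w @compute 'lnetctl set transaction_timeout 70'

and, to keep the value across an LNet restart, add it to the global section of /etc/lnet.conf, e.g.:

global:
    transaction_timeout: 70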
| Comment by Stephane Thiell [ 03/Dec/21 ] |
|
Hi Serguei, Thanks! It's done, we have increased transaction_timeout from 50 to 70 on servers, routers and clients. We'll see how it goes and I will report back. |
| Comment by James A Simmons [ 24/Feb/22 ] |
|
Did increasing the transaction_timeout help? ORNL is seeing similar problems; we increased our transaction_timeout and it didn't help. |
| Comment by Stephane Thiell [ 24/Feb/22 ] |
|
It might have helped a bit, but 70 is not necessarily the best value, I guess. We have it set to 49 at the moment. The key seems to be to change transaction_timeout to something other than 50; otherwise, a very low default timeout value of 5 seconds is used, which is easier to reach on a large system. Chris Horn explained this in his comment from 12/5/21 in |
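A quick way to confirm what a node is actually using (a sketch; the module parameter path assumes the lnet module is loaded and exposes lnet_transaction_timeout, as 2.12 does):

lnetctl global show | grep -E 'transaction_timeout|retry_count|health_sensitivity'
cat /sys/module/lnet/parameters/lnet_transaction_timeout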