[LU-14589] LNet routers don't reset peers after they reboot    Created: 06/Apr/21  Updated: 29/Sep/21

| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Severity: | 3 |

| Description |
We are now using 2.14 LNet routers, with 2.12.6 servers and a mix of 2.13 and 2.14 clients, and it looks like the routers are less resilient to client node failures. We had two peers (clients) running Lustre 2.13 that were in a bad state (RDMA timeouts, but no apparent IB fabric problem), and our routers (2.14) held a lot of refs for them. These are peers 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2 below. We rebooted them, but even after that we couldn't mount the filesystem, likely because the routers still had old references. This is after the reboot of 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid refs state last max rtr min tx min queue
10.50.7.38@o2ib2 3 up -1 8 6 -8 8 -39 0
10.50.15.11@o2ib2 17 up -1 8 -8 -8 8 -43 0
10.50.5.55@o2ib2 14 up -1 8 -5 -8 8 -16 0
10.50.14.13@o2ib2 17 up -1 8 -8 -8 8 -19 0
10.50.5.68@o2ib2 17 up -1 8 -8 -8 8 -27 0
10.50.5.47@o2ib2 17 up -1 8 -8 -8 8 -45 0
10.50.5.60@o2ib2 17 up -1 8 -8 -8 8 -61 0
10.50.16.6@o2ib2 4 up -1 8 5 -8 8 -53 0
10.50.4.41@o2ib2 7 up -1 8 2 -8 8 -15 0
10.50.14.10@o2ib2 8 up -1 8 1 -8 8 -21 0
10.50.12.14@o2ib2 29128 up -1 8 -29119 -29119 8 -82 0 <<<<
10.50.1.60@o2ib2 8 up -1 8 1 -8 8 -20 0
10.50.13.4@o2ib2 3 up -1 8 6 -8 8 -20 0
10.50.14.15@o2ib2 7 up -1 8 2 -8 8 -28 0
10.50.1.18@o2ib2 3 up -1 8 6 -8 8 -13 0
10.50.15.5@o2ib2 17 up -1 8 -8 -8 8 -20 0
10.50.7.3@o2ib2 3 up -1 8 6 -8 8 -32 0
10.50.13.14@o2ib2 4 up -1 8 5 -8 8 -37 0
10.50.5.67@o2ib2 4 up -1 8 5 -8 8 -28 0
10.50.10.41@o2ib2 3 up -1 8 6 -8 8 -18 0
10.50.0.64@o2ib2 5 up -1 8 4 -24 8 -75 0
10.50.5.38@o2ib2 3 up -1 8 6 -8 8 -33 0
10.50.13.11@o2ib2 15 up -1 8 -6 -8 8 -28 0
10.50.5.9@o2ib2 4 up -1 8 5 -8 8 -20 0
10.50.12.13@o2ib2 4 up -1 8 5 -8 8 -39 0
10.50.5.43@o2ib2 4 up -1 8 5 -8 8 -24 0
10.50.1.59@o2ib2 6 up -1 8 3 -8 8 -20 0
10.50.12.5@o2ib2 3 up -1 8 6 -40 8 -52 0
10.50.16.2@o2ib2 16 up -1 8 -7 -8 8 -25 0
10.50.4.8@o2ib2 12 up -1 8 -3 -40 8 -37 0
10.50.15.9@o2ib2 8 up -1 8 1 -8 8 -27 0
10.50.5.53@o2ib2 17 up -1 8 -8 -8 8 -26 0
10.50.5.32@o2ib2 3 up -1 8 6 -8 8 -28 0
10.50.13.13@o2ib2 6 up -1 8 3 -8 8 -41 0
10.0.2.114@o2ib5 203 up -1 8 8 -8 -194 -1877 15190572
10.50.12.15@o2ib2 36367 up -1 8 -36358 -36358 8 -76 0 <<<
10.50.10.6@o2ib2 3 up -1 8 6 -8 8 -23 0
10.50.15.1@o2ib2 3 up -1 8 6 -8 8 -19 0
10.50.8.44@o2ib2 3 up -1 8 6 -8 8 -10 0
The routers were shown as up from both peers, but the filesystem couldn't be mounted. We manually deleted both peers from the routers:

[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.15@o2ib2
[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.14@o2ib2

The situation was much better after that in terms of peers:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid refs state last max rtr min tx min queue
[root@sh02-oak02 ~]#
And we were able to mount the filesystem on these two peers without problem. But I noticed that the refs have not been cleaned from the nis file:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis
nid                status alive  refs peer  rtr  max   tx  min
0@lo               up         0     2    0    0    0    0    0
10.50.0.132@o2ib2  up         0 65815    8    0  256  256  195
10.0.2.215@o2ib5   up         0   225    8    0  256  248  179
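For reference, the manual cleanup above lends itself to scripting. The following is a minimal sketch, not a tested tool: it assumes the /sys/kernel/debug/lnet/peers layout shown above (refs in column 2), uses an arbitrary REF_THRESHOLD, and assumes the listed NID is the peer's primary NID as expected by lnetctl peer del:

#!/bin/bash
# Sketch: delete router-side peer entries whose refcount looks stuck.
# Assumes the peers file layout above: column 1 = nid, column 2 = refs.
REF_THRESHOLD=1000   # arbitrary cutoff, tune per site

awk -v t="$REF_THRESHOLD" 'NR > 1 && $2 > t { print $1 }' /sys/kernel/debug/lnet/peers |
while read -r nid; do
    echo "deleting stuck peer $nid"
    lnetctl peer del --prim_nid "$nid"   # same command used manually above
done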
Our config:

Routers:

[root@sh02-oak02 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
router_sensitivity: 100
lnd_timeout: 49
response_tracking: 3
recovery_limit: 0
Clients:

[root@sh02-12n14 ~]# lnetctl global show -v 3
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
router_sensitivity: 100
lnd_timeout: 49
response_tracking: 3
recovery_limit: 0
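These globals can also be applied at runtime with lnetctl set (or persisted via lnet.conf). As a sketch, assuming a lnetctl version that supports these subcommands, the values dumped above would be set with:

lnetctl set transaction_timeout 50
lnetctl set retry_count 0
lnetctl set health_sensitivity 0
lnetctl set router_sensitivity 100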
| Comments |
| Comment by Amir Shehata (Inactive) [ 07/Apr/21 ] |
My current suspicion is that the problem is in the rtr credits value. |
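If rtr credits are indeed the problem, one quick way to watch for it on a router is to filter the peers file for peers whose router credits have gone negative. A sketch, assuming the column layout from the dumps above (rtr is column 6):

# Print peers whose available rtr credits are negative.
awk 'NR > 1 && $6 < 0 { print $1, "refs=" $2, "rtr=" $6 }' /sys/kernel/debug/lnet/peers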
| Comment by Stephane Thiell [ 07/Apr/21 ] |
Thanks Amir for the explanation! Something was definitely going on with these two clients; I'm not sure what. We saw LNet RDMA timeout errors, but the fabric was fine. They were running Lustre 2.13 and Mellanox OFED 5.0 when this happened. We're slowly upgrading nodes to the 2.14 Lustre client in the hope that some of the LNet patches between 2.13 and 2.14 will help. I waited more than a few minutes before dumping /sys/kernel/debug/lnet/peers on the routers. I think this is a new situation, not necessarily a new issue. We have had some weird LNet errors lately, as I said, with 2.13 clients on a specific IB fabric (o2ib2). We also recently increased at_min on this filesystem to 75, which helped significantly (even though at_min belongs to ptlrpc rather than LNet directly, it's interesting to note that it reduced request timeout errors).
> It's also likely that even though the routed messages were not getting through, the client is able to still send the keep alive message to the router; therefore keeping the state as up.

That sounds a bit weird to me. Perhaps if the routed messages are not getting through, the routers could abort after some time, or switch the state to something else like down, which would really mean "congested"? Otherwise, I like your idea of the sequence number! Today things are doing well for these two nodes in particular, but I noticed that the high refs numbers in /sys/kernel/debug/lnet/nis have still not been cleared:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis
nid                status alive  refs peer  rtr  max   tx  min
0@lo               up         0     2    0    0    0    0    0
10.50.0.132@o2ib2  up         0 65647    8    0  256  254  195
10.0.2.215@o2ib5   up         0   127    8    0  256  248  179

This doesn't match the refs in /sys/kernel/debug/lnet/peers; the output of that file is attached as sh02-oak02_peers.txt. Also, please find detailed peer info (from today) for the two client NIDs that were causing problems yesterday and were rebooted:

[root@sh02-oak02 ~]# lnetctl peer show --nid 10.50.12.14@o2ib2 -v 4
peer:
- primary nid: 10.50.12.14@o2ib2
Multi-Rail: True
peer state: 273
peer ni:
- nid: 10.50.12.14@o2ib2
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -33
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: -5
refcount: 1
statistics:
send_count: 429820
recv_count: 424710
drop_count: 0
sent_stats:
put: 427490
get: 2313
reply: 7
ack: 10
hello: 0
received_stats:
put: 422389
get: 2319
reply: 0
ack: 2
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
[root@sh02-oak02 ~]# lnetctl peer show --nid 10.50.12.15@o2ib2 -v 4
peer:
- primary nid: 10.50.12.15@o2ib2
Multi-Rail: True
peer state: 273
peer ni:
- nid: 10.50.12.15@o2ib2
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -35
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: -8
refcount: 1
statistics:
send_count: 431486
recv_count: 428411
drop_count: 0
sent_stats:
put: 429155
get: 2315
reply: 11
ack: 5
hello: 0
received_stats:
put: 426088
get: 2321
reply: 0
ack: 2
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
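To quantify the mismatch between the two files, the per-peer refs can be summed and put next to the per-NI refcounts. A minimal sketch, assuming the two debugfs layouts shown above (peers: refs in column 2; nis: refs in column 4):

# Sum all peer refs and list per-NI refs for comparison.
echo "sum of peer refs:"
awk 'NR > 1 { s += $2 } END { print s }' /sys/kernel/debug/lnet/peers
echo "per-NI refs:"
awk 'NR > 1 { print $1, "refs=" $4 }' /sys/kernel/debug/lnet/nis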
| Comment by Stephane Thiell [ 09/Apr/21 ] |
We've hit this issue with another set of routers that we recently upgraded to Lustre 2.14 (for another storage system, scratch, connected to our cluster Sherlock). To be clear, this is not really the issue of a server rebooting, but the issue of accumulated refs leading to a broken router. Example:

Clients: *@o2ib2
We started to see RDMA timed out errors from all OSS with this router last night, so we took a closer look. From router sh02-fir02, peer 10.0.10.101@o2ib7 is seen with its rtr credits massively overdrawn:

[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7 -v 4
peer:
- primary nid: 10.0.10.101@o2ib7
Multi-Rail: True
peer state: 273
peer ni:
- nid: 10.0.10.101@o2ib7
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -84
tx_q_num_of_buf: 0
available_rtr_credits: -68955 <<<
min_rtr_credits: -68955
refcount: 68964
statistics:
send_count: 260217968
recv_count: 264915739
drop_count: 0
sent_stats:
put: 213920018
get: 9654
reply: 11769852
ack: 34518444
hello: 0
received_stats:
put: 253300757
get: 11614959
reply: 12
ack: 11
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 721
timeout: 0
error: 0
network timeout: 0
available_rtr_credits keeps drifting further negative for this peer:

[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7 -v 4 | grep available_rtr_credits
available_rtr_credits: -69577
[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7 -v 4 | grep available_rtr_credits
available_rtr_credits: -69583
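To measure the leak rate rather than eyeballing repeated greps, the same query can be sampled in a loop (a simple sketch reusing the command above):

# Sample this peer's rtr credits every 5 seconds.
while true; do
    lnetctl peer show --nid 10.0.10.101@o2ib7 -v 4 | grep available_rtr_credits
    sleep 5
done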
On all servers, we see errors with this router:

Apr 09 12:00:44 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8bf5f10bf800
Apr 09 12:00:44 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8bef240de800
Apr 09 12:00:50 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8bef240dd400
Apr 09 12:00:50 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8be89d402c00
Apr 09 12:00:50 fir-io1-s1 kernel: LustreError: 129066:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8c081dba8400
Apr 09 12:00:25 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff897ad4ce8000
Apr 09 12:00:31 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff899ef77ebc00
Apr 09 12:00:31 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89ce2685b400
Apr 09 12:00:38 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89cde6f8d400
Apr 09 12:00:38 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89cde6f8d400
Apr 09 12:00:38 fir-io4-s1 kernel: LustreError: 119829:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff89cde6f8d400
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 95 previous similar messages
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.0.10.225@o2ib7 (0): c: 0, oc: 3, rc: 1
Apr 09 12:00:50 fir-io4-s1 kernel: LNetError: 119829:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 95 previous similar messages

Attaching:
We've seen this with another router too; the problem is the same. We're going to downgrade these routers to Lustre 2.12.6 for now. |
| Comment by Stephane Thiell [ 29/Sep/21 ] |
I'm still seeing these stuck routers, especially after a reboot of the routers (even if they were fine before), with lots of refs and quickly running out of rtr credits. Servers, routers and clients are running 2.12.7 in this case.

[root@sh03-oak02 ~]# lnetctl peer show --nid 10.0.2.51@o2ib5 -v 4
peer:
- primary nid: 10.0.2.51@o2ib5
Multi-Rail: False
peer state: 16
peer ni:
- nid: 10.0.2.51@o2ib5
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -133
tx_q_num_of_buf: 0
available_rtr_credits: -1088 <<<
min_rtr_credits: -1088
refcount: 1097
statistics:
send_count: 938444
recv_count: 893
drop_count: 0
sent_stats:
put: 668717
get: 6
reply: 0
ack: 269721
hello: 0
received_stats:
put: 887
get: 6
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 4
timeout: 0
error: 0
network timeout: 0
$ cat /sys/kernel/debug/lnet/peers | awk '$2 > 20 {print }'
nid refs state last max rtr min tx min queue
10.0.2.51@o2ib5 1097 up 160 8 -1088 -1088 8 -133 0
Rebooting the stuck router doesn't necessarily help.
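Until the root cause is understood, the same manual cleanup described above should apply here too; this peer is non-MR (Multi-Rail: False), so its only NID is also its primary NID:

lnetctl peer del --prim_nid 10.0.2.51@o2ib5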