Details
Bug
Resolution: Unresolved
Major
Lustre 2.14.0
Description
We are now using 2.14 LNet routers, with 2.12.6 servers and a mix of 2.13 and 2.14 clients, and it looks like the routers are less resilient to client node failures.
We had two peers (clients) running Lustre 2.13 that were in a bad state (RDMA timeouts, but no apparent IB fabric problem), and our routers (2.14) held a very large number of refs for them. These are peers 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2 below. We rebooted them, but even after the reboot we couldn't mount the filesystem on them, likely because the routers still held the old references.
This is after reboot of 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2:
[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid refs state last max rtr min tx min queue
10.50.7.38@o2ib2 3 up -1 8 6 -8 8 -39 0
10.50.15.11@o2ib2 17 up -1 8 -8 -8 8 -43 0
10.50.5.55@o2ib2 14 up -1 8 -5 -8 8 -16 0
10.50.14.13@o2ib2 17 up -1 8 -8 -8 8 -19 0
10.50.5.68@o2ib2 17 up -1 8 -8 -8 8 -27 0
10.50.5.47@o2ib2 17 up -1 8 -8 -8 8 -45 0
10.50.5.60@o2ib2 17 up -1 8 -8 -8 8 -61 0
10.50.16.6@o2ib2 4 up -1 8 5 -8 8 -53 0
10.50.4.41@o2ib2 7 up -1 8 2 -8 8 -15 0
10.50.14.10@o2ib2 8 up -1 8 1 -8 8 -21 0
10.50.12.14@o2ib2 29128 up -1 8 -29119 -29119 8 -82 0 <<<<
10.50.1.60@o2ib2 8 up -1 8 1 -8 8 -20 0
10.50.13.4@o2ib2 3 up -1 8 6 -8 8 -20 0
10.50.14.15@o2ib2 7 up -1 8 2 -8 8 -28 0
10.50.1.18@o2ib2 3 up -1 8 6 -8 8 -13 0
10.50.15.5@o2ib2 17 up -1 8 -8 -8 8 -20 0
10.50.7.3@o2ib2 3 up -1 8 6 -8 8 -32 0
10.50.13.14@o2ib2 4 up -1 8 5 -8 8 -37 0
10.50.5.67@o2ib2 4 up -1 8 5 -8 8 -28 0
10.50.10.41@o2ib2 3 up -1 8 6 -8 8 -18 0
10.50.0.64@o2ib2 5 up -1 8 4 -24 8 -75 0
10.50.5.38@o2ib2 3 up -1 8 6 -8 8 -33 0
10.50.13.11@o2ib2 15 up -1 8 -6 -8 8 -28 0
10.50.5.9@o2ib2 4 up -1 8 5 -8 8 -20 0
10.50.12.13@o2ib2 4 up -1 8 5 -8 8 -39 0
10.50.5.43@o2ib2 4 up -1 8 5 -8 8 -24 0
10.50.1.59@o2ib2 6 up -1 8 3 -8 8 -20 0
10.50.12.5@o2ib2 3 up -1 8 6 -40 8 -52 0
10.50.16.2@o2ib2 16 up -1 8 -7 -8 8 -25 0
10.50.4.8@o2ib2 12 up -1 8 -3 -40 8 -37 0
10.50.15.9@o2ib2 8 up -1 8 1 -8 8 -27 0
10.50.5.53@o2ib2 17 up -1 8 -8 -8 8 -26 0
10.50.5.32@o2ib2 3 up -1 8 6 -8 8 -28 0
10.50.13.13@o2ib2 6 up -1 8 3 -8 8 -41 0
10.0.2.114@o2ib5 203 up -1 8 8 -8 -194 -1877 15190572
10.50.12.15@o2ib2 36367 up -1 8 -36358 -36358 8 -76 0 <<<
10.50.10.6@o2ib2 3 up -1 8 6 -8 8 -23 0
10.50.15.1@o2ib2 3 up -1 8 6 -8 8 -19 0
10.50.8.44@o2ib2 3 up -1 8 6 -8 8 -10 0
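To single out peers like the two marked above, the filter can key on an explicit refs threshold instead of `> 2`. This is a minimal sketch, assuming the column layout shown in the output above (refs in column 2, rtr in column 6). The function name `stuck_peers` and the default threshold of 1000 are hypothetical choices for illustration, not part of LNet:

```shell
#!/bin/sh
# Hypothetical helper: print NID, refs and rtr credits for peers whose
# refs count is at or above a threshold (default 1000, an arbitrary value).
# Reads text in the /sys/kernel/debug/lnet/peers layout from stdin.
stuck_peers() {
    awk -v limit="${1:-1000}" '
        NR == 1 && $1 == "nid" { next }          # skip the header line
        $2 + 0 >= limit { print $1, $2, $6 }     # nid, refs, rtr
    '
}

# Usage on a router (requires debugfs to be mounted):
#   stuck_peers 1000 < /sys/kernel/debug/lnet/peers
```

Reading from stdin keeps the helper usable on a saved copy of the peers file as well as on the live one.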
Both peers showed the routers as up, but the filesystem couldn't be mounted.
We manually deleted both peers from the routers:
[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.15@o2ib2
[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.14@o2ib2
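When more than a couple of peers are affected, the delete commands can be generated from a list of NIDs and reviewed before anything is run. This is a minimal sketch: `gen_peer_del` is a hypothetical name, and it only prints the same `lnetctl peer del --prim_nid` command used above without executing it:

```shell
#!/bin/sh
# Hypothetical dry-run helper: read one NID per line from stdin and print
# the matching "lnetctl peer del" command. Nothing is executed here.
gen_peer_del() {
    while read -r nid _rest; do
        [ -n "$nid" ] || continue                 # ignore blank lines
        printf 'lnetctl peer del --prim_nid %s\n' "$nid"
    done
}

# Usage: review the printed commands, then run them by hand on the router.
#   printf '%s\n' 10.50.12.14@o2ib2 10.50.12.15@o2ib2 | gen_peer_del
```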
The situation was much better after that in terms of peers:
[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid refs state last max rtr min tx min queue
[root@sh02-oak02 ~]#
And we were able to mount the filesystem on these two peers without problem.
However, I noticed that the refs count in the nis file was not cleaned up (65815 on 10.50.0.132@o2ib2):
[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis
nid status alive refs peer rtr max tx min
0@lo up 0 2 0 0 0 0 0
10.50.0.132@o2ib2 up 0 65815 8 0 256 256 195
10.0.2.215@o2ib5 up 0 225 8 0 256 248 179
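To watch whether the NI refs ever drain, the refs column can be extracted on its own and sampled over time. This is a minimal sketch, assuming the nis layout shown above (a header line first, refs in column 4). `ni_refs` is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: print "nid refs" for every NI listed in text that
# follows the /sys/kernel/debug/lnet/nis layout (read from stdin).
ni_refs() {
    awk 'NR > 1 && NF >= 4 { print $1, $4 }'      # skip header; nid, refs
}

# Usage on a router (requires debugfs to be mounted):
#   ni_refs < /sys/kernel/debug/lnet/nis
```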
Our config:
Routers:
[root@sh02-oak02 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0
Clients:
[root@sh02-12n14 ~]# lnetctl global show -v 3
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0