Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.14.0
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

We are now running Lustre 2.14 LNet routers with 2.12.6 servers and a mix of 2.13 and 2.14 clients, and the routers appear to be less resilient to client node failures.

We had two peers (clients) running Lustre 2.13 that were in a bad state (RDMA timeout, but no apparent IB fabric problem), and our routers (2.14) were holding a large number of refs for them. These are peers 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2 below. We rebooted both clients, but even then they could not mount the filesystem, possibly because the routers still held stale references to them.

This is the peer list on one router after the reboot of 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2 (the two peers are marked with <<<):

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid                      refs state  last   max   rtr   min    tx   min queue
10.50.7.38@o2ib2            3    up    -1     8     6    -8     8   -39 0
10.50.15.11@o2ib2          17    up    -1     8    -8    -8     8   -43 0
10.50.5.55@o2ib2           14    up    -1     8    -5    -8     8   -16 0
10.50.14.13@o2ib2          17    up    -1     8    -8    -8     8   -19 0
10.50.5.68@o2ib2           17    up    -1     8    -8    -8     8   -27 0
10.50.5.47@o2ib2           17    up    -1     8    -8    -8     8   -45 0
10.50.5.60@o2ib2           17    up    -1     8    -8    -8     8   -61 0
10.50.16.6@o2ib2            4    up    -1     8     5    -8     8   -53 0
10.50.4.41@o2ib2            7    up    -1     8     2    -8     8   -15 0
10.50.14.10@o2ib2           8    up    -1     8     1    -8     8   -21 0
10.50.12.14@o2ib2        29128    up    -1     8 -29119 -29119     8   -82 0    <<<<
10.50.1.60@o2ib2            8    up    -1     8     1    -8     8   -20 0
10.50.13.4@o2ib2            3    up    -1     8     6    -8     8   -20 0
10.50.14.15@o2ib2           7    up    -1     8     2    -8     8   -28 0
10.50.1.18@o2ib2            3    up    -1     8     6    -8     8   -13 0
10.50.15.5@o2ib2           17    up    -1     8    -8    -8     8   -20 0
10.50.7.3@o2ib2             3    up    -1     8     6    -8     8   -32 0
10.50.13.14@o2ib2           4    up    -1     8     5    -8     8   -37 0
10.50.5.67@o2ib2            4    up    -1     8     5    -8     8   -28 0
10.50.10.41@o2ib2           3    up    -1     8     6    -8     8   -18 0
10.50.0.64@o2ib2            5    up    -1     8     4   -24     8   -75 0
10.50.5.38@o2ib2            3    up    -1     8     6    -8     8   -33 0
10.50.13.11@o2ib2          15    up    -1     8    -6    -8     8   -28 0
10.50.5.9@o2ib2             4    up    -1     8     5    -8     8   -20 0
10.50.12.13@o2ib2           4    up    -1     8     5    -8     8   -39 0
10.50.5.43@o2ib2            4    up    -1     8     5    -8     8   -24 0
10.50.1.59@o2ib2            6    up    -1     8     3    -8     8   -20 0
10.50.12.5@o2ib2            3    up    -1     8     6   -40     8   -52 0
10.50.16.2@o2ib2           16    up    -1     8    -7    -8     8   -25 0
10.50.4.8@o2ib2            12    up    -1     8    -3   -40     8   -37 0
10.50.15.9@o2ib2            8    up    -1     8     1    -8     8   -27 0
10.50.5.53@o2ib2           17    up    -1     8    -8    -8     8   -26 0
10.50.5.32@o2ib2            3    up    -1     8     6    -8     8   -28 0
10.50.13.13@o2ib2           6    up    -1     8     3    -8     8   -41 0
10.0.2.114@o2ib5          203    up    -1     8     8    -8  -194 -1877 15190572
10.50.12.15@o2ib2        36367    up    -1     8 -36358 -36358     8   -76 0      <<<
10.50.10.6@o2ib2            3    up    -1     8     6    -8     8   -23 0
10.50.15.1@o2ib2            3    up    -1     8     6    -8     8   -19 0
10.50.8.44@o2ib2            3    up    -1     8     6    -8     8   -10 0
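
To make peers like the two marked above easier to spot, the awk filter can be tightened. This is only a sketch: the function name `lnet_high_ref_peers`, the default threshold of 1000, and the optional file argument are assumptions for illustration, not part of LNet. It prints the nid, refs and rtr columns and compares refs numerically, so the header line is skipped.

```shell
# Print "nid refs rtr" for peers whose refs exceed a threshold.
# Usage: lnet_high_ref_peers [threshold] [peers_file]
# The threshold default (1000) is arbitrary; tune it for your site.
lnet_high_ref_peers() {
    threshold="${1:-1000}"
    peers_file="${2:-/sys/kernel/debug/lnet/peers}"
    # NR > 1 skips the header line; $2 + 0 forces a numeric comparison.
    # Columns: $1 = nid, $2 = refs, $6 = rtr (router credits).
    awk -v t="$threshold" 'NR > 1 && ($2 + 0) > t { print $1, $2, $6 }' "$peers_file"
}
```

Run against the listing above, a threshold of 1000 would select only 10.50.12.14@o2ib2 and 10.50.12.15@o2ib2.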

The routers were shown as up from both peers, but the filesystem couldn't be mounted.

We manually deleted both peers from the routers:

[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.15@o2ib2
[root@sh02-oak02 ~]# lnetctl peer del --prim_nid 10.50.12.14@o2ib2
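
With many stuck peers, the delete commands could be generated instead of typed by hand. The sketch below only prints the commands (a dry run) so they can be reviewed before anything is executed. The function name `lnet_peer_del_cmds` and the threshold are hypothetical, and it assumes the nid shown in the peers file is the peer's primary NID, which may not hold for multi-rail peers.

```shell
# Print (do not execute) one "lnetctl peer del" command per peer whose
# refs exceed the threshold.
# Usage: lnet_peer_del_cmds [threshold] [peers_file]
lnet_peer_del_cmds() {
    threshold="${1:-1000}"
    peers_file="${2:-/sys/kernel/debug/lnet/peers}"
    # Assumption: column 1 is usable as --prim_nid for this peer.
    awk -v t="$threshold" \
        'NR > 1 && ($2 + 0) > t { print "lnetctl peer del --prim_nid " $1 }' \
        "$peers_file"
}
```

Review the printed lines first, then run them manually on the router if they look right.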

After that, no peer on the router had more than 2 refs:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/peers | awk '{ if ($2 > 2) print $0 }'
nid                      refs state  last   max   rtr   min    tx   min queue
[root@sh02-oak02 ~]#

And we were able to mount the filesystem on these two peers without problem.

However, the refs do not appear to have been cleaned up in the nis file, where the o2ib2 interface still shows 65815 refs:

[root@sh02-oak02 ~]# cat /sys/kernel/debug/lnet/nis 
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
10.50.0.132@o2ib2            up     0 65815    8    0   256   256   195
10.0.2.215@o2ib5             up     0  225    8    0   256   248   179
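
To check whether the interface refs ever drop, the refs column can be extracted and recorded before and after a peer deletion. This is a minimal sketch; the function name `lnet_ni_refs` and the optional file argument are assumptions for illustration.

```shell
# Print "nid refs" for every local network interface.
# Usage: lnet_ni_refs [nis_file]
lnet_ni_refs() {
    nis_file="${1:-/sys/kernel/debug/lnet/nis}"
    # NR > 1 skips the header line. Columns: $1 = nid, $4 = refs.
    awk 'NR > 1 { print $1, $4 }' "$nis_file"
}
```

Saving the output before and after `lnetctl peer del` and comparing the two would show whether the interface refs are released along with the peer.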

Our config:

Routers:

[root@sh02-oak02 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0

Clients:

[root@sh02-12n14 ~]# lnetctl global show -v 3
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 49
    response_tracking: 3
    recovery_limit: 0

Attachments:
sh02-oak02_peers.txt (63 kB, 07/Apr/21 5:57 PM)
sh02-fir02.peers.txt (63 kB, 09/Apr/21 7:08 PM)
sh02-fir02.netshow.txt (4 kB, 09/Apr/21 7:08 PM)
sh02-fir02.peershow.all.txt (906 kB, 09/Apr/21 7:09 PM)
sh02-fir02.dk-lnet.log.gz (4.27 MB, 09/Apr/21 7:09 PM)

Assignee: Amir Shehata (Inactive)
Reporter: Stephane Thiell
Votes: 0
Watchers: 5
Created: 06/Apr/21 10:48 PM
Updated: 29/Sep/21 11:46 PM