Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
This issue affects all Lustre versions that include LNet health.
The routing algorithm does not ensure that recovery pings are sent to the correct NI. In fact, the NI being recovered is actively avoided because, by definition, its health is sub-optimal, so the pings never reach it and its health value does not recover. Here's a simple test showing the problem (the test requires LU-14939):
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show --nid 192.168.2.35@tcp1
peer:
    - primary nid: 192.168.2.34@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.34@tcp1
          state: NA
        - nid: 192.168.2.35@tcp1
          state: NA
sles15c01:/home/hornc/lustre-filesystem # lctl list_nids
192.168.2.38@tcp2
192.168.2.39@tcp2
sles15c01:/home/hornc/lustre-filesystem # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.33@tcp2 up pri 0
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 1000
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer set --health 0 --nid 192.168.2.35@tcp1
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 0
sles15c01:/home/hornc/lustre-filesystem # sleep 30
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 0
sles15c01:/home/hornc/lustre-filesystem # lctl dk > /tmp/dk.log
sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep -c TRACE
513
sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.35@tcp1'
sles15c01:/home/hornc/lustre-filesystem #
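The last two greps show the problem: 513 sends were traced, but none were addressed to 192.168.2.35@tcp1, the NI whose health was set to 0. To repeat the check, the same greps can be wrapped in a small helper. This is only a sketch: the function name count_sends_to_nid is hypothetical, and it assumes, as the session above does, that lnet_handle_send TRACE lines in the debug log contain the destination as "-> <nid>".

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): count lnet_handle_send TRACE
# lines in a saved debug log that are addressed to a given NID.
# It uses the same patterns as the greps in the session above.
count_sends_to_nid() {
    log=$1
    nid=$2
    # -F matches the NID as a fixed string, so its dots are not treated
    # as regex wildcards. "|| true" keeps the exit status at 0 when the
    # count is 0 (grep -c still prints 0 in that case).
    grep lnet_handle_send "$log" | grep TRACE | grep -cF -- "-> $nid" || true
}

# Example usage on a node where the debug log was dumped beforehand:
#   lctl dk > /tmp/dk.log
#   count_sends_to_nid /tmp/dk.log 192.168.2.35@tcp1
```

A count of 0 for the unhealthy NI, as in the session above, means no traced send was addressed to it during the window covered by the log. Because the match is a plain substring, a NID that is a prefix of another (for example one ending in tcp1 against one ending in tcp10) would also be counted.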