Details
Bug
Resolution: Fixed
Minor
None
None
3
Description
This issue affects all Lustre versions that include LNet health.
The routing algorithm does not ensure that recovery pings are sent to the correct NI. In fact, the NI under recovery is actively avoided because, by definition, its health is sub-optimal. Here is a simple test showing the problem (the test requires LU-14939):
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show --nid 192.168.2.35@tcp1
peer:
    - primary nid: 192.168.2.34@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.34@tcp1
          state: NA
        - nid: 192.168.2.35@tcp1
          state: NA
sles15c01:/home/hornc/lustre-filesystem # lctl list_nids
192.168.2.38@tcp2
192.168.2.39@tcp2
sles15c01:/home/hornc/lustre-filesystem # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.33@tcp2 up pri 0
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 1000
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer set --health 0 --nid 192.168.2.35@tcp1
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 0
sles15c01:/home/hornc/lustre-filesystem # sleep 30
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 0
sles15c01:/home/hornc/lustre-filesystem # lctl dk > /tmp/dk.log
sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep -c TRACE
513
sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.35@tcp1'
sles15c01:/home/hornc/lustre-filesystem #
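The behaviour seen in the test can be illustrated with a toy model. The sketch below is not LNet code: the names `PeerNI`, `select_healthiest`, and `select_destination` are hypothetical, routing through the gateway is ignored, and the "pinned" path only shows one possible way a recovery ping could be forced to its intended NI. It assumes that normal selection simply prefers the highest health value, which is enough to show why the NI with health 0 never receives a send.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_HEALTH = 1000  # same value as "health value: 1000" in the test output


@dataclass
class PeerNI:
    """Hypothetical stand-in for a peer NI and its health value."""
    nid: str
    health: int = MAX_HEALTH


def select_healthiest(nis: List[PeerNI]) -> PeerNI:
    """Toy health-based selection: always prefer the highest health value."""
    return max(nis, key=lambda ni: ni.health)


def select_destination(nis: List[PeerNI],
                       recovery_target: Optional[str] = None) -> PeerNI:
    """Assumed behaviour for illustration: pin a recovery ping to its target NI.

    Without a recovery target, fall back to normal health-based selection.
    """
    if recovery_target is not None:
        for ni in nis:
            if ni.nid == recovery_target:
                return ni
        raise ValueError(f"unknown NID: {recovery_target}")
    return select_healthiest(nis)


# The peer from the test, after "lnetctl peer set --health 0".
peer = [PeerNI("192.168.2.34@tcp1"), PeerNI("192.168.2.35@tcp1", health=0)]

# Health-based selection avoids the NI that needs the recovery ping.
unpinned = select_healthiest(peer).nid
# Pinning the destination reaches the NI under recovery.
pinned = select_destination(peer, recovery_target="192.168.2.35@tcp1").nid

print(unpinned)
print(pinned)
```

In this model the unpinned selection always returns 192.168.2.34@tcp1, which matches the empty grep result for sends to 192.168.2.35@tcp1 and the health value that stays at 0 after 30 seconds.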
Landed for 2.15