[LU-15446] Local recovery pings on MR nodes may not exercise all available paths - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.15.0
Affects Version/s: None
Labels: None

Description

Typically, LNet peers do not perform discovery on themselves, so each local interface often ends up with its own non-MR (non-Multi-Rail) peer entry. For example:

[root@kjcf01n05 ~]# lctl list_nids
10.253.100.9@o2ib
10.253.100.10@o2ib
[root@kjcf01n05 ~]# lnetctl peer show --nid 10.253.100.9@o2ib
peer:
    - primary nid: 10.253.100.9@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 10.253.100.9@o2ib
          state: NA
[root@kjcf01n05 ~]# lnetctl peer show --nid 10.253.100.10@o2ib
peer:
    - primary nid: 10.253.100.10@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 10.253.100.10@o2ib
          state: NA
[root@kjcf01n05 ~]#

Because of this, LNet sets a "preferred" local NI to use when sending traffic to these non-MR peers, which prevents LNet recovery pings from exercising other paths. For example, consider a peer with two local interfaces, heth0 and heth1. There are two possible paths for sending to heth0:

 heth0 -> heth0
 heth1 -> heth0

And two possible paths for sending to heth1:

 heth0 -> heth1
 heth1 -> heth1

Because of the preferred-NI logic for non-MR peers, whichever path is chosen first is then used for every subsequent send to that NI (unless the peer entry is deleted, in which case a new path may be chosen). It is not clear whether these local recovery pings are particularly useful for ascertaining the health of local interfaces, but if they are, LNet ought to be allowed to exercise all possible paths.
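
To illustrate the effect described above, here is a toy model in Python. It is not LNet code: the `simulate` function, the round-robin source selection, and the `pin_preferred` switch are all assumptions made for illustration. It only shows how pinning the first-chosen source NI per destination limits the set of paths that local pings ever exercise.

```python
from itertools import product

# Interface names taken from the example above.
LOCAL_NIS = ["heth0", "heth1"]

def all_paths(nis):
    """Every (source NI, destination NI) pair a local ping could take."""
    return list(product(nis, repeat=2))

def simulate(nis, rounds, pin_preferred):
    """Toy model (hypothetical, not LNet's algorithm).

    Each round sends one ping to every local NI. The source NI is picked
    round-robin; if pin_preferred is True, the first source chosen for a
    destination is remembered and reused for all later sends to it.
    Returns the set of (source, destination) paths that were exercised.
    """
    preferred = {}  # destination NI -> pinned source NI
    used = set()
    for i in range(rounds):
        for dst in nis:
            candidate = nis[i % len(nis)]
            if pin_preferred:
                src = preferred.setdefault(dst, candidate)
            else:
                src = candidate
            used.add((src, dst))
    return used

pinned = simulate(LOCAL_NIS, rounds=4, pin_preferred=True)
unpinned = simulate(LOCAL_NIS, rounds=4, pin_preferred=False)

print("possible paths:      ", len(all_paths(LOCAL_NIS)))
print("exercised (pinned):  ", sorted(pinned))
print("exercised (unpinned):", sorted(unpinned))
```

In this model the pinned run only ever covers one path per destination NI, while the unpinned run eventually covers all four. How the actual fix selects source NIs is not described in this ticket, so the unpinned branch should be read only as a contrast case.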

People

Assignee: Chris Horn

Reporter: Chris Horn

Votes: 0

Watchers: 3

Dates

Created: 12/Jan/22 8:51 PM

Updated: 26/Aug/22 4:31 PM

Resolved: 07/Feb/22 2:55 PM