Lustre / LU-15446

Local recovery pings on MR nodes may not exercise all available paths

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0

    Description

      Typically, LNet peers do not perform discovery on themselves, so there is often a non-MR (non-Multi-Rail) peer entry for each local interface. For example:

      [root@kjcf01n05 ~]# lctl list_nids
      10.253.100.9@o2ib
      10.253.100.10@o2ib
      [root@kjcf01n05 ~]# lnetctl peer show --nid 10.253.100.9@o2ib
      peer:
          - primary nid: 10.253.100.9@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 10.253.100.9@o2ib
                state: NA
      [root@kjcf01n05 ~]# lnetctl peer show --nid 10.253.100.10@o2ib
      peer:
          - primary nid: 10.253.100.10@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 10.253.100.10@o2ib
                state: NA
      [root@kjcf01n05 ~]#
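
The check above can be sketched in Python for nodes with many interfaces. This is a minimal sketch, not part of any Lustre tooling: it assumes the text of `lnetctl peer show` has already been captured into a string in the layout shown above, and the function names `non_mr_primary_nids` and `local_nids_with_non_mr_entries` are hypothetical.

```python
import re

# Sample text copied from the `lnetctl peer show` output above.
SAMPLE = """\
peer:
    - primary nid: 10.253.100.9@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 10.253.100.9@o2ib
          state: NA
peer:
    - primary nid: 10.253.100.10@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 10.253.100.10@o2ib
          state: NA
"""


def non_mr_primary_nids(peer_show_text):
    """Return primary NIDs whose peer entry reports Multi-Rail: False."""
    result = []
    current = None  # primary NID of the entry currently being read
    for line in peer_show_text.splitlines():
        m = re.match(r"\s*-\s*primary nid:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*Multi-Rail:\s*(\S+)", line)
        if m and current is not None:
            if m.group(1).lower() == "false":
                result.append(current)
            current = None
    return result


def local_nids_with_non_mr_entries(local_nids, peer_show_text):
    """Filter local NIDs (e.g. from `lctl list_nids`) down to non-MR peers."""
    non_mr = set(non_mr_primary_nids(peer_show_text))
    return [nid for nid in local_nids if nid in non_mr]


if __name__ == "__main__":
    local = ["10.253.100.9@o2ib", "10.253.100.10@o2ib"]
    print(local_nids_with_non_mr_entries(local, SAMPLE))
```

A line-based match is used here only to avoid a third-party YAML dependency; a YAML parser would be more robust if the output layout differs.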
      

      Because of this, LNet sets a "preferred" local NI to use when sending traffic to these non-MR peers, which prevents LNet recovery pings from exercising other paths. For example, consider a peer with two local interfaces, heth0 and heth1. The following paths are available for sending to heth0:

       heth0 -> heth0
       heth1 -> heth0

      And paths for sending to heth1:

       heth0 -> heth1
       heth1 -> heth1

      Because of the preferred-NI logic for non-MR peers, whichever path is chosen first is then used for every future send to that NI (unless the peer entry is deleted, in which case a new path may be chosen). It is not clear whether these local recovery pings are particularly useful for ascertaining the health of local interfaces, but if they are, LNet ought to be allowed to exercise all possible paths.
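
The effect can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not LNet's actual selection code: `sticky_paths` models only the "first choice is reused" behavior described above, the initial choice (`pick_first`) is a hypothetical stand-in, and `rotating_paths` is one hypothetical alternative shown purely for comparison.

```python
from itertools import product


def all_paths(local_nis):
    """Every (source, destination) pair between the local interfaces."""
    return list(product(local_nis, repeat=2))


def sticky_paths(local_nis, num_pings, pick_first=min):
    """Paths exercised when the first chosen source NI is reused per destination."""
    preferred = {}  # destination NI -> source NI chosen on first send
    used = set()
    for _ in range(num_pings):
        for dst in local_nis:
            if dst not in preferred:
                # Hypothetical initial selection; the real choice is LNet's.
                preferred[dst] = pick_first(local_nis)
            used.add((preferred[dst], dst))
    return used


def rotating_paths(local_nis, num_pings):
    """Paths exercised if the source NI rotates on every ping round."""
    used = set()
    for i in range(num_pings):
        src = local_nis[i % len(local_nis)]
        for dst in local_nis:
            used.add((src, dst))
    return used


if __name__ == "__main__":
    nis = ["heth0", "heth1"]
    print("possible:", len(all_paths(nis)))
    print("sticky:  ", sorted(sticky_paths(nis, num_pings=10)))
    print("rotating:", sorted(rotating_paths(nis, num_pings=10)))
```

In this model, the sticky variant never covers more than one path per destination regardless of how many pings are sent, while any scheme that varies the source NI eventually covers all four.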

          People

            hornc Chris Horn