Lustre / LU-14941

Remote NI recovery pings do not work in routed environment

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Severity: 3

    Description

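      This issue impacts all Lustre versions that include the LNet health feature.

      The routing algorithm does not ensure that recovery pings are sent to the correct NI. In fact, the NI under recovery is actively avoided because, by definition, its health is sub-optimal. Here is a simple test showing the problem (the test requires LU-14939). After the health of 192.168.2.35@tcp1 is set to 0, its health value is still 0 after 30 seconds, and none of the 513 traced sends in the debug log is addressed to that NID: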

      sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show --nid 192.168.2.35@tcp1
      peer:
          - primary nid: 192.168.2.34@tcp1
            Multi-Rail: True
            peer ni:
              - nid: 192.168.2.34@tcp1
                state: NA
              - nid: 192.168.2.35@tcp1
                state: NA
      sles15c01:/home/hornc/lustre-filesystem # lctl list_nids
      192.168.2.38@tcp2
      192.168.2.39@tcp2
      sles15c01:/home/hornc/lustre-filesystem # lctl show_route
      net               tcp1 hops 4294967295 gw                192.168.2.33@tcp2 up pri 0
      sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
          - primary nid: 192.168.2.34@tcp1
              - nid: 192.168.2.34@tcp1
                health stats:
                    health value: 1000
              - nid: 192.168.2.35@tcp1
                health stats:
                    health value: 1000
      sles15c01:/home/hornc/lustre-filesystem # lnetctl peer set --health 0 --nid 192.168.2.35@tcp1
      sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
          - primary nid: 192.168.2.34@tcp1
              - nid: 192.168.2.34@tcp1
                health stats:
                    health value: 1000
              - nid: 192.168.2.35@tcp1
                health stats:
                    health value: 0
      sles15c01:/home/hornc/lustre-filesystem # sleep 30
      sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
          - primary nid: 192.168.2.34@tcp1
              - nid: 192.168.2.34@tcp1
                health stats:
                    health value: 1000
              - nid: 192.168.2.35@tcp1
                health stats:
                    health value: 0
      sles15c01:/home/hornc/lustre-filesystem # lctl dk > /tmp/dk.log
      sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep -c TRACE
      513
      sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.35@tcp1'
      sles15c01:/home/hornc/lustre-filesystem #
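
The behavior in the transcript can be illustrated with a toy model. This is only a sketch of the selection logic described above, not the LNet implementation: the names `PeerNI`, `select_peer_ni`, and `pinned_nid` are hypothetical, and the assumption is that an unpinned send simply prefers the peer NI with the highest health value. Under that assumption, a recovery ping meant for the unhealthy NI never reaches it unless the destination is pinned.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_HEALTH = 1000  # full health value, as shown in the lnetctl output


@dataclass
class PeerNI:
    """Hypothetical stand-in for a peer network interface."""
    nid: str
    health: int = MAX_HEALTH


def select_peer_ni(peer_nis: List[PeerNI],
                   pinned_nid: Optional[str] = None) -> PeerNI:
    """Pick the destination NI for a send (toy model, an assumption).

    Without pinned_nid, the healthiest NI wins, so an NI in recovery
    (low health) is avoided. With pinned_nid, that exact NI is used,
    which is what a recovery ping needs.
    """
    if pinned_nid is not None:
        for ni in peer_nis:
            if ni.nid == pinned_nid:
                return ni
        raise ValueError(f"unknown NID: {pinned_nid}")
    return max(peer_nis, key=lambda ni: ni.health)


# Same peer as in the test: .35 has had its health set to 0.
peer = [
    PeerNI("192.168.2.34@tcp1"),
    PeerNI("192.168.2.35@tcp1", health=0),
]

unpinned = select_peer_ni(peer)
pinned = select_peer_ni(peer, pinned_nid="192.168.2.35@tcp1")

print("unpinned send goes to:", unpinned.nid)
print("pinned recovery ping goes to:", pinned.nid)
```

In this model the unpinned send always lands on 192.168.2.34@tcp1, which matches the empty grep result for 192.168.2.35@tcp1 in the debug log. Whether the actual fix pins the destination in this way is not stated in this report.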
      

          People

            Assignee: Chris Horn (hornc)
            Reporter: Chris Horn (hornc)