Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0

    Description

      While testing the fix for LU-13708, we found that communication was still severely disrupted.

      I ran some ping tests between a client and a server. From the logs, we could see the router was attempting to forward a message using an interface that had been disabled:

      00000400:00000200:0.0:1593005930.168691:0:7991:0:(lib-move.c:4323:lnet_parse()) TRACE: 10.16.100.56@o2ib10(605@gni) <- 610@gni : GET - routed
      00000800:00000200:0.0:1593005930.168698:0:7991:0:(gnilnd_cb.c:2450:kgnilnd_recv()) $$ conn ffff88082b37c800, rxmsg ffffc900201321c8, lntmsg ffff880716ef6040 niov=0 kiov=          (null) iov=          (null) offset=0 mlen=0 rlen=0 from 610@gni  msg@0xffffc900201321c8 m/v/ty/ck/pck/pl b00fbabe/8/2/0/0/0 x2156869:GNILND_MSG_IMMEDIATE
      00000800:00000200:0.0:1593005930.168707:0:7991:0:(gnilnd_cb.c:2092:kgnilnd_consume_rx()) $$ rx ffff880829491d80 processed from 610@gni  msg@0xffffc900201321c8 m/v/ty/ck/pck/pl b00fbabe/8/2/0/0/0 x2156869:GNILND_MSG_IMMEDIATE
      00000800:00000200:0.0:1593005930.168711:0:7991:0:(gnilnd_cb.c:2058:kgnilnd_release_msg()) consuming ffff88082b37c800
      00000400:00000200:0.0:1593005930.168718:0:7991:0:(lib-msg.c:996:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
      00000400:00000200:0.0:1593005930.168722:0:7991:0:(lib-msg.c:825:lnet_health_check()) health check: 605@gni->610@gni: GET: OK
      00000400:00000200:0.0:1593005930.168727:0:7991:0:(lib-move.c:2624:lnet_handle_send_case_locked()) Source ANY to NMR:  10.16.100.56@o2ib10 local destination
      00000400:00000200:0.0:1593005930.168737:0:7991:0:(lib-move.c:1853:lnet_handle_send()) TRACE: 610@gni(10.16.100.14@o2ib10:<?>) -> 10.16.100.56@o2ib10(10.16.100.56@o2ib10:10.16.100.56@o2ib10) <?> : GET try# 0 <<<< 10.16.100.14@o2ib10 is disabled interface
      

      There is a flaw in the logic used to forward the message. The path selection code treats a routed message like a local send to a non-multi-rail peer, because we don't want the router to modify the destination interface. However, this code path also sets a "preferred NI" that is reused for future sends. In this case, the first time the router forwarded a message to 10.16.100.56@o2ib10, it set 10.16.100.14@o2ib10 as the preferred NI for communicating with that node. Now, even though that interface is down, the router still selects it to forward messages because of its preferred status.
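
      The sticky behaviour described above can be illustrated with a small, self-contained model. This is only a sketch: the `sketch_*` structures and functions are hypothetical and are not the real LNet data structures or the code in lib-move.c. The health-checked variant shows one possible guard and is not necessarily what the merged patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; NOT the real LNet structures. */
struct sketch_ni {
	const char *nid;	/* e.g. "10.16.100.14@o2ib10" */
	bool up;		/* false once the interface is disabled */
};

struct sketch_peer {
	const char *nid;			/* e.g. "10.16.100.56@o2ib10" */
	const struct sketch_ni *preferred;	/* NULL until the first send */
};

/* Return the first interface that is up, or NULL if none is. */
static const struct sketch_ni *
sketch_first_up(const struct sketch_ni *nis, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (nis[i].up)
			return &nis[i];
	return NULL;
}

/*
 * Models the flaw: the preferred NI is recorded on the first send and
 * then returned on every later send without checking whether it is up.
 */
static const struct sketch_ni *
sketch_select_sticky(struct sketch_peer *peer,
		     const struct sketch_ni *nis, size_t n)
{
	assert(peer != NULL && nis != NULL);
	if (peer->preferred == NULL)
		peer->preferred = sketch_first_up(nis, n);
	return peer->preferred;
}

/*
 * One possible guard (an assumption, not the merged fix): re-select
 * when the recorded preferred NI is no longer up.
 */
static const struct sketch_ni *
sketch_select_checked(struct sketch_peer *peer,
		      const struct sketch_ni *nis, size_t n)
{
	assert(peer != NULL && nis != NULL);
	if (peer->preferred == NULL || !peer->preferred->up)
		peer->preferred = sketch_first_up(nis, n);
	return peer->preferred;
}
```

      With two interfaces that are both up, each function records the first one as preferred. After that interface is marked down, `sketch_select_sticky()` keeps returning it, which matches the behaviour seen in the log, while `sketch_select_checked()` moves to the remaining interface.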

        Activity

          [LU-13712] Flaw in MR Routing Algorithm
          pjones Peter Jones added a comment -

          Landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39168/
          Subject: LU-13712 lnet: Preferred NI logic breaks MR routing
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: ef6c35877b96c11a83a6cb823bf66e44bf355ed3

          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39168
          Subject: LU-13712 lnet: Preferred NI logic breaks MR routing
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 19efba0448bb3244200b24b92debc6fe1eac26a2

          People

            hornc Chris Horn
            hornc Chris Horn