ARF doesn't detect lack of interface on a router

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0
    • Lustre 2.5.3
    • None
    • 3
    • 16875

    Description

      With asymmetric router failure (ARF) detection enabled, clients do not detect that an expected interface is missing on a router. A defined but non-functional interface is detected, but clients do not notice when they have a route to a network via a router that has no interface on that network and therefore cannot forward the traffic.

      Take for example a few nodes, login1, rtr5, rtr6, and mgs. This was demonstrated on live hardware, although the following example is abstracted/has changed addresses and names.

      Host: interfaces (routes)
      login1: 30@gni1 (o2ib1 via 27@gni1, o2ib1 via 31@gni1)
      rtr5: 27@gni1 10.1.1.5@o2ib1 ()
      rtr6: 31@gni1 10.1.1.6@o2ib1 ()
      mgs: 10.1.1.1@o2ib1 (gni1 via 10.1.1.5@o2ib1 and gni1 via 10.1.1.6@o2ib1)

      In other words, two routers, each with two interfaces, sit between o2ib1 and gni1.
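
      To make the failure mode concrete, here is a minimal sketch that models the example topology and checks which of login1's routes have a gateway that actually owns a NID on the remote network. The data structures and function names are hypothetical and exist only for illustration; they are not part of LNet.

```python
# Hypothetical model of the example topology above; not LNet code.
ROUTER_NIDS = {
    "rtr5": ["27@gni1", "10.1.1.5@o2ib1"],
    "rtr6": ["31@gni1", "10.1.1.6@o2ib1"],
}

# Routes configured on login1: (remote net, gateway NID).
LOGIN1_ROUTES = [("o2ib1", "27@gni1"), ("o2ib1", "31@gni1")]


def net_of(nid):
    """Return the network part of a NID, e.g. '27@gni1' -> 'gni1'."""
    return nid.split("@", 1)[1]


def viable_routes(routes, router_nids):
    """Keep only routes whose gateway also has a NID on the remote net."""
    result = []
    for remote_net, gateway in routes:
        # Find the router that owns this gateway NID, if any.
        owner = next((nids for nids in router_nids.values() if gateway in nids), None)
        if owner and any(net_of(nid) == remote_net for nid in owner):
            result.append((remote_net, gateway))
    return result


# rtr5 after the reboot in the reproduction steps: ib0 never came up.
BROKEN = dict(ROUTER_NIDS, rtr5=["27@gni1"])
print(viable_routes(LOGIN1_ROUTES, ROUTER_NIDS))
print(viable_routes(LOGIN1_ROUTES, BROKEN))
```

      With the healthy topology both routes survive the check; with rtr5 reduced to its gni1 NID only the route via 31@gni1 remains, which is the condition the client fails to notice in this report.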

      Reproduction steps:
      Enable ARF via configs and ensure it is running.
      Configure interface ib0 on rtr5 to not start on boot.
      Reboot rtr5 (ifconfig ib0 then shows ib0 down with no IP).
      Start LNet (lctl net up).

      show missing interface on rtr5 via lctl list_nids
      rtr5:~ # lctl list_nids
      27@gni1
      rtr5:~ #

      on login1 ping mgs
      lctl ping 10.1.1.1@o2ib1 (result is 50% success, 50% I/O error)

      show routes
      login1:~ # lctl show_route
      net o2ib1 hops 1 gw 27@gni1 up pri 0
      net o2ib1 hops 1 gw 31@gni1 up pri 0

      look for down_ni
      login1:~ # cat /proc/sys/lnet/routers
      ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
      4 1 1 up 28 1 NA 0 27@gni1
      4 1 1 up 28 1 NA 0 31@gni1
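
      For checking this output on many clients, a small parser can help. The sketch below assumes the whitespace-separated layout shown above (header row, then one row per gateway); the function names are hypothetical.

```python
def parse_routers(text):
    """Parse the routers table into dicts keyed by the header names."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def suspect_gateways(text):
    """Return gateways that are not 'up' or that report a nonzero down_ni."""
    return [
        row["router"]
        for row in parse_routers(text)
        if row["state"] != "up" or int(row["down_ni"]) > 0
    ]


# Output captured on login1 in this report.
SAMPLE = """\
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
"""

print(suspect_gateways(SAMPLE))
```

      Run against the output above, the list is empty: both gateways are reported up with down_ni 0, even though rtr5 (27@gni1) has no interface on o2ib1.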

      In other words, there is no way to reach o2ib1 via rtr5, but ARF does not detect this. Presumably, at least in a non-multihop configuration, clients should be concerned not with whether the router has defined routes that aren't working, but with whether the client has a defined route that the router cannot serve because the interface is down or missing.

          Activity

            [LU-6060] ARF doesn't detect lack of interface on a router
            pjones Peter Jones added a comment -

            Landed for 2.7

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13417/
            Subject: LU-6060 lnet: set downis to 1 if there's no NI for remote net
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 749dc54622b2c3267c6c97eb770702b437a7897d

            gerrit Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13417
            Subject: LU-6060 lnet: set downis to 1 if there's no NI for remote net
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 826353849a9a51a4d4c53accd449e9427386f57c

            simmonsja James A Simmons added a comment -

            Can you make a patch for master as well? Testing looks good for the patch you provided.

            liang Liang Zhen (Inactive) added a comment -

            James, I think the issue here is that we do not record downis if there is no NI for the target network; the above patch should fix this problem. Also, I'm wondering if this is the same problem as LU-5758. Could you please comment on 5758?
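
            The rule in the patch subject ("set downis to 1 if there's no NI for remote net") can be illustrated with a simplified sketch. This is not the patch itself: the function name, the (nid, is_up) input shape, and the handling of the other cases are assumptions made for illustration only.

```python
def net_of(nid):
    """Return the network part of a NID, e.g. '10.1.1.5@o2ib1' -> 'o2ib1'."""
    return nid.split("@", 1)[1]


def down_ni_for_route(reported_nis, remote_net):
    """Illustrative down-NI count for one route (hypothetical helper).

    reported_nis: list of (nid, is_up) pairs the router reports about itself.
    remote_net:   the network this route is supposed to reach.
    """
    on_remote = [is_up for nid, is_up in reported_nis if net_of(nid) == remote_net]
    if any(on_remote):
        return 0  # at least one working NI on the remote net
    if not on_remote:
        return 1  # no NI on the remote net at all: count it as down
    return len(on_remote)  # NIs exist on the remote net, but all are down


# rtr5 as seen in this report: only the gni1 NID is present.
print(down_ni_for_route([("27@gni1", True)], "o2ib1"))
```

            Under these assumptions, a router that reports no NI on the remote network yields a nonzero count for that route, so the client has a reason to stop treating it as usable.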

            gerrit Gerrit Updater added a comment -

            Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/13162
            Subject: LU-6060 lnet: set downis to 1 if there's no NI for remote net
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 5ed16f284871b6b898591815cb7d5468ae2c3fca

            simmonsja James A Simmons added a comment - edited

            Yes, the patch for LU-5485 is included. Without the patch we can't mount a file system with ARF enabled.

            liang Liang Zhen (Inactive) added a comment -

            Hi John, do you have the patch for LU-5485 in your environment?

            People

              wc-triage WC Triage
              lewisj John Lewis (Inactive)