Lustre / LU-6060

ARF doesn't detect lack of interface on a router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16875

    Description

      When using asymmetric router failure (ARF) detection, the system appears unable to detect that an expected interface is missing. A defined but non-functional interface is detected, but clients do not seem to notice when they have a route to a network via a router that has no means of getting the traffic there.

      Take, for example, four nodes: login1, rtr5, rtr6, and mgs. This was demonstrated on live hardware, although the addresses and names in the following example have been changed.

      Host: interfaces (routes)
      login1: 30@gni1 (o2ib1 via 27@gni1, o2ib1 via 31@gni1)
      rtr5: 27@gni1 10.1.1.5@o2ib1 ()
      rtr6: 31@gni1 10.1.1.6@o2ib1 ()
      mgs: 10.1.1.1@o2ib1 (gni1 via 10.1.1.5@o2ib1 and gni1 via 10.1.1.6@o2ib1)

      In other words, we have two routers, each with two interfaces, sitting between o2ib1 and gni1.

      Reproduction steps:
      Enable ARF via configs and ensure it is running.
      Configure interface ib0 on rtr5 to not start on boot.
      Reboot rtr5 (ifconfig ib0 shows no ib0 down / no IP).
      Start LNet (lctl net up).

      show missing interface on rtr5 via lctl list_nids
      rtr5:~ # lctl list_nids
      27@gni1
      rtr5:~ #

      on login1 ping mgs
      lctl ping 10.1.1.1@o2ib1 (result is 50% success, 50% I/O error)
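
      The 50% figure is what one would expect if login1 alternates between its two equal routes and only one gateway can actually reach o2ib1. The following is a minimal sketch of that reasoning, not LNet code: strict alternation between gateways is an assumption, and ROUTER_NIDS, can_forward, and simulate_pings are hypothetical names introduced here.

```python
from itertools import cycle

# NIDs each router actually has up. rtr5's list comes from the
# `lctl list_nids` output in this report; rtr6's from the topology table.
ROUTER_NIDS = {
    "27@gni1": ["27@gni1"],                    # rtr5: ib0 never came up
    "31@gni1": ["31@gni1", "10.1.1.6@o2ib1"],  # rtr6: both interfaces
}


def net_of(nid):
    """Return the network part of a NID, e.g. '27@gni1' -> 'gni1'."""
    return nid.split("@", 1)[1]


def can_forward(gateway, target_net):
    """True if the gateway has at least one NID on the target network."""
    return any(net_of(nid) == target_net for nid in ROUTER_NIDS[gateway])


def simulate_pings(gateways, target_net, count):
    """Assumption: the client strictly alternates between its gateways."""
    picker = cycle(gateways)
    return [can_forward(next(picker), target_net) for _ in range(count)]


results = simulate_pings(["27@gni1", "31@gni1"], "o2ib1", 10)
success_rate = sum(results) / len(results)
print(success_rate)
```

      Under these assumptions, every ping sent via 27@gni1 fails and every ping sent via 31@gni1 succeeds.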

      show routes
      login1:~ # lctl show_route
      net o2ib1 hops 1 gw 27@gni1 up pri 0
      net o2ib1 hops 1 gw 31@gni1 up pri 0

      look for down_ni
      login1:~ # cat /proc/sys/lnet/routers
      ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
      4 1 1 up 28 1 NA 0 27@gni1
      4 1 1 up 28 1 NA 0 31@gni1
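
      To make the check above repeatable, the routers table can be parsed and scanned for a non-zero down_ni. This is a minimal sketch that works on the sample text from this report. The helper names are hypothetical, and treating any value other than 0 as "flagged" is an assumption about how the column is reported.

```python
# Sample copied from this report; on a live client the same text would be
# read from /proc/sys/lnet/routers.
SAMPLE = """\
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
"""


def parse_routers(text):
    """Return one dict per router row, keyed by the header column names."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def routers_with_down_ni(text):
    """List routers whose down_ni column is anything other than 0."""
    return [r["router"] for r in parse_routers(text) if r["down_ni"] != "0"]


routers = parse_routers(SAMPLE)
flagged = routers_with_down_ni(SAMPLE)
print(flagged)
```

      For the output shown in this report, both routers are listed as up and neither is flagged, which is the symptom being reported.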

      In other words, there is no way to get to o2ib1 via rtr5, but ARF does not detect this. Presumably, at least in a non-multihop configuration, clients should be concerned not with whether the router has defined routes that aren't working, but with whether the client has a defined route that a router can't handle because an interface is down or missing.
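
      The check suggested here can be sketched as follows: for each route the client has defined, confirm that the gateway has an interface on the target network. This illustrates the idea only and is not the fix that was applied. It assumes the client can obtain each router's list of NIDs, and unserviceable_routes, LOGIN1_ROUTES, and ROUTER_NIDS are hypothetical names.

```python
def net_of(nid):
    """Return the network part of a NID, e.g. '10.1.1.6@o2ib1' -> 'o2ib1'."""
    return nid.split("@", 1)[1]


def unserviceable_routes(client_routes, router_nids):
    """Return (target_net, gateway) pairs the gateway cannot serve.

    client_routes: list of (target_net, gateway_nid) defined on the client.
    router_nids:   dict mapping gateway_nid -> NIDs that router has up.
    """
    bad = []
    for target_net, gateway in client_routes:
        nets = {net_of(nid) for nid in router_nids.get(gateway, [])}
        if target_net not in nets:
            bad.append((target_net, gateway))
    return bad


# login1's routes and the routers' NIDs, taken from this report.
LOGIN1_ROUTES = [("o2ib1", "27@gni1"), ("o2ib1", "31@gni1")]
ROUTER_NIDS = {
    "27@gni1": ["27@gni1"],                    # rtr5 after reboot, no ib0
    "31@gni1": ["31@gni1", "10.1.1.6@o2ib1"],  # rtr6
}

bad_routes = unserviceable_routes(LOGIN1_ROUTES, ROUTER_NIDS)
print(bad_routes)
```

      With the data from this report, only the route to o2ib1 via 27@gni1 is reported as unserviceable.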

            People

              wc-triage WC Triage
              lewisj John Lewis (Inactive)