Lustre / LU-13785

router ib interface was not configured on boot. gni clients mis-classified the router as multi-hop leading to evictions

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • None
    • None
    • 3

    Description

      An LNet router hit an odd problem. The router completed a reboot at 20:41:12

      [Mon Jul 13 20:41:12 2020] Sending ec_node_info with boot code 8 (NODE_INFO_OS_BOOT_SUCCEEDED) for nid 602
      

      but its ib0 interface didn't come up until 21:11:59

      [Mon Jul 13 20:39:17 2020] ib0: enabling connected mode will cause multicast packet drops
      [Mon Jul 13 20:39:17 2020] ib0: mtu > 4092 will cause multicast packet drops.
      [Mon Jul 13 20:39:17 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
      ...
      [Mon Jul 13 21:11:59 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
      

      Because of this change:

      commit 28324781942780cc149555ccfd3dcf9a8d2ffdfb
      Author: Amir Shehata <ashehata@whamcloud.com>
      Date:   Thu Nov 28 15:44:27 2019 -0800
      
          LU-13029 lnet: fix asym routing with multi-hop
      

      the gni clients classified the router as "multi-hop" and continued to use it. It should have been considered "down" (because of avoid_asym_router_failure). This led to a bunch of evictions.

      We can keep the detection code, because it is useful to spot when things go awry, but when we actually determine route aliveness we should use the configured hop count.
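
      A minimal sketch of that idea, not the actual LNet code: the struct, its field names, and the three helper functions below are hypothetical, and treating an unset hop count (-1) as single-hop is an assumption. Only avoid_asym_router_failure comes from the ticket. The aliveness check reads the configured hop count, while the detected topology is only compared against it so a mismatch can be reported.

```c
#include <stdbool.h>

/* Hypothetical, simplified route record; not the real LNet structure. */
struct route_sketch {
	int  configured_hops;     /* hop count from the route config; -1 if unset */
	bool detected_single_hop; /* what runtime topology detection concluded */
	bool gateway_alive;       /* gateway is reachable on the local network */
	bool remote_net_up;       /* gateway reports its remote-net interface as up */
};

/* Assumption: an unset hop count (-1) is treated the same as one hop. */
static bool route_is_single_hop(const struct route_sketch *r)
{
	return r->configured_hops == 1 || r->configured_hops == -1;
}

/*
 * Aliveness is decided from the configured hop count only. With
 * avoid_asym_router_failure set, a single-hop route also needs the
 * gateway's interface on the remote network to be up.
 */
static bool route_is_alive(const struct route_sketch *r,
			   bool avoid_asym_router_failure)
{
	if (!r->gateway_alive)
		return false;
	if (avoid_asym_router_failure && route_is_single_hop(r))
		return r->remote_net_up;
	return true;
}

/* Detection is kept purely as a diagnostic: flag config/detection mismatches. */
static bool route_hops_mismatch(const struct route_sketch *r)
{
	return route_is_single_hop(r) != r->detected_single_hop;
}
```

      With a route configured as one hop whose gateway has no working interface on the remote network (the ib0 case above), route_is_alive() returns false when avoid_asym_router_failure is set, even if detection concluded "multi-hop", and route_hops_mismatch() flags the disagreement.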

          People

            hornc Chris Horn