  1. Lustre
  2. LU-5485

First mount always fails with avoid_asym_router_failure

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.7.0, Lustre 2.5.4
    • None
    • None
    • 15306

    Description

      We hit this on lola. The environment is quite simple: all clients are on o2ib1 and all servers are on o2ib0, the two networks are connected via two routers, and there are no other nodes in the cluster.

      We found that when we unload and reload the client modules, the first mount always fails and the second try succeeds. After digging into the source code, I think the scenario is like this:

      • LNet is shut down on all client nodes, so there are no incoming or outgoing messages on network o2ib1, and the Router Checker (RC) on each router changes the status of its o2ib1 NI to "DOWN" after a couple of minutes.
      • The RC on the servers pings the routers and learns that NI(o2ib1) is DOWN on all of them.
      • Before the servers' RC pings the routers again, a user tries to mount the Lustre client on the client nodes; the server (MGS) handles the connect request and replies.
      • While sending this reply, LNet searches for a router and finds that all routers are DOWN for o2ib1. This information is out of date: the NI status on the routers is actually UP by now, because the routers have received requests from clients on o2ib1 and have changed NI(o2ib1) back to UP.
      • Mounts keep failing until the servers' RC next pings the routers and gets up-to-date information from them.
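
      The sequence above can be sketched as a toy timeline model. This is only an illustration of the stale-status race, not LNet code: the class names, the `try_mount` helper, and both timing constants are assumptions made up for this sketch and do not correspond to real LNet defaults.

```python
from dataclasses import dataclass

# Hypothetical timings, chosen only to make the race visible.
NI_DOWN_AFTER = 100       # seconds of silence before a router marks NI(o2ib1) DOWN
SERVER_RC_INTERVAL = 60   # seconds between router-checker pings from the server


@dataclass
class Router:
    last_o2ib1_traffic: float = 0.0   # clients went quiet at t=0

    def ni_up(self, now: float) -> bool:
        # The router reports NI(o2ib1) UP only if it saw traffic recently.
        return now - self.last_o2ib1_traffic <= NI_DOWN_AFTER

    def receive_from_client(self, now: float) -> None:
        self.last_o2ib1_traffic = now


@dataclass
class Server:
    cached_ni_up: bool = True   # the server's (possibly stale) view of the router
    last_ping: float = 0.0

    def run_router_checker(self, router: Router, now: float) -> None:
        # Replay every RC ping that came due up to 'now'; each ping
        # refreshes the cached NI status with what the router reported then.
        while self.last_ping + SERVER_RC_INTERVAL <= now:
            self.last_ping += SERVER_RC_INTERVAL
            self.cached_ni_up = router.ni_up(self.last_ping)


def try_mount(router: Router, server: Server, now: float) -> bool:
    server.run_router_checker(router, now)  # pings that happened before the mount
    router.receive_from_client(now)         # connect request: the router's NI is UP again
    return server.cached_ni_up              # the reply is routed using the cached view


def simulate(mount_times):
    router, server = Router(), Server()
    return [try_mount(router, server, t) for t in mount_times]


print(simulate([130, 190]))  # [False, True]
```

      In this model the first mount at t=130 fails because the server still holds the DOWN status it cached at t=120, and the retry at t=190 succeeds because the ping at t=180 refreshed the cache. A retry before the next ping (for example at t=150) would still fail.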

      I think users have not hit this because they normally upgrade clients in a few batches, or check the network status (lctl ping etc.) before mounting the client, so the routers keep receiving traffic from the client network and keep the NI status alive.

      I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac.

            People

              liang Liang Zhen (Inactive)