
first mount always fails with avoid_asym_router_failure

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Components: None
    • Labels: None
    • 15306

    Description

      We hit this on lola. The environment is quite simple: all clients are on o2ib1 and all servers are on o2ib0, the two networks are connected via two routers, and there are no other nodes in this cluster.

      We found that when we unload/reload the client modules, the first mount always fails and the second try succeeds. After digging into the source code, I think the scenario is as follows:

      • LNet is shut down on all client nodes, so there are no incoming/outgoing messages on network o2ib1, and the Router Checker (RC) on each router changes the status of the NI to "DOWN" after a couple of minutes.
      • The RC on the servers pings the routers and learns that NI(o2ib1) on all of these routers is DOWN.
      • Before the next RC ping by the server's router checker, the user tries to mount the Lustre client on the client nodes; the server (MGS) handles the connect request and replies.
      • While sending this reply, LNet searches the routers and finds that all routers are DOWN for o2ib1 (out-of-date information), although the NI status on the routers is actually UP by now (the routers have received requests from clients on o2ib1, so they have changed NI(o2ib1) back to UP).
      • The mount fails until the next time the RC pings the routers and gets up-to-date information from them.

      I think users haven't hit this because they normally upgrade clients in a few batches, or check network status (lctl ping, etc.) before mounting the client, so the routers receive traffic from the client network and keep the NI status alive.

      I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac.

      Attachments

        Issue Links

          Activity


            Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12435/
            Subject: LU-5485 lnet: peer aliveness status and NI status
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 58c4cd80e197bd6e70d1638df796ae878baf844c

            James A Simmons added a comment -
            Mounting now works with ARF. Now ARF just doesn't work for us. That work can be completed under LU-5758.

            Jodi Levi (Inactive) added a comment -
            Patch landed to master. If there is more work to be done in this ticket, please reopen.

            Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12453/
            Subject: LU-5485 lnet: peer aliveness status and NI status
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fb259fe85813e0f28ac7f7410689e3856ef26316

            James A Simmons added a comment -
            Liang, does this patch need to be applied on both clients and servers?

            Liang Zhen (Inactive) added a comment -
            I think we should have a dedicated patch for this issue, instead of putting everything in http://review.whamcloud.com/11748
            Here is the patch; Isaac, could you take a look?
            http://review.whamcloud.com/#/c/12453/

            James A Simmons added a comment -
            When we attempted to upgrade to 2.4 we had to turn off asym_router_failure in order to bring up our file system. Recently we upgraded to 2.5.3 and again hit the issue of asym_router_failure breaking our systems. We currently have it turned off on our system.
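For reference, disabling ARF as described above is done through the lnet module parameter named in this ticket's title. Assuming the standard Lustre module-configuration mechanism, the option would live in a modprobe configuration file (e.g. under /etc/modprobe.d/) along the lines of:

```
options lnet avoid_asym_router_failure=0
```

The exact file location is site-specific; the parameter takes effect on the next lnet module load.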

            Liang Zhen (Inactive) added a comment -
            Following Isaac's suggestion, I am also trying to address this issue in http://review.whamcloud.com/11748
            It's not ready for production yet; for now it's only for testing and discussion.
            I may have a follow-on patch to reduce pings when the router has recent aliveness information.

            Isaac Huang (Inactive) added a comment -
            There used to be a similar problem with conventional router pingers (i.e. without the asymmetrical pinger) at ORNL. ORNL often boots a whole client cluster (including the routers that connect to the server cluster) all together, so when a client's request arrives at a server there is a chance that all routers to the client cluster are still considered dead by the server; the server will then drop the reply because no route to the client is available.

            A possible solution: when a message arrives (in lnet_parse()) from a router, this is a good indication that the router is available. Check whether our router status is up to date, in case the pinger hasn't been able to update it yet:

            • If the router is down, mark it as up.
            • If the router's corresponding far-side NI is down, mark it as up too.

            Liang Zhen (Inactive) added a comment -
            Isaac, could you please comment?

            People

              Assignee: Liang Zhen (Inactive)
              Reporter: Liang Zhen (Inactive)
              Votes: 0
              Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: