
first mount always fails with avoid_asym_router_failure

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.7.0, Lustre 2.5.4
    • None
    • None
    • 15306

    Description

      We hit this on lola. The environment is quite simple: all clients are on o2ib1 and all servers are on o2ib0. The two networks are connected via two routers, and there are no other nodes in the cluster.

      We found that when we unload and reload the client modules, the first mount always fails and the second attempt succeeds. After digging into the source code, I think the scenario is as follows:

      • LNet is shut down on all client nodes, so there are no incoming or outgoing messages on network o2ib1, and the Router Checker (RC) on each router changes the status of the o2ib1 NI to "DOWN" after a couple of minutes.
      • The RC on the servers pings the routers and learns that NI(o2ib1) is DOWN on all of them.
      • Before the servers' RC pings the routers again, a user tries to mount Lustre on the client nodes; the server (MGS) handles the connect request and replies.
      • While sending this reply, LNet searches for a router and finds that all routers are marked DOWN for o2ib1. This information is out of date: the NI status on the routers is actually UP by now, because the routers have received requests from clients on o2ib1 and have changed NI(o2ib1) back to UP.
      • The mount keeps failing until the next RC ping from the servers fetches up-to-date status from the routers.
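The sequence above can be sketched as a toy timeline. This is only an illustrative model, not LNet code: the class names, the single router (the cluster in this ticket has two), the 120-second idle timeout, and the 60-second RC ping interval are all assumptions chosen to make the race visible.

```python
from dataclasses import dataclass

NI_IDLE_TIMEOUT = 120   # assumed: seconds of silence before a router marks an NI DOWN
RC_PING_INTERVAL = 60   # assumed: seconds between router-checker pings from the server


@dataclass
class Router:
    last_rx: int = 0    # last time traffic arrived from the client network (o2ib1)

    def ni_up(self, now: int) -> bool:
        # The router's live view: the NI is UP if it saw traffic recently.
        return now - self.last_rx <= NI_IDLE_TIMEOUT


@dataclass
class Server:
    cached_ni_up: bool = True   # what the most recent RC ping reported

    def rc_ping(self, router: Router, now: int) -> None:
        self.cached_ni_up = router.ni_up(now)

    def can_reply(self) -> bool:
        # The routing decision uses the cached status, not the router's live state.
        return self.cached_ni_up


def simulate(mount_times):
    """Clients go silent at t=0; return {mount_time: reply_could_be_routed}."""
    router, server = Router(last_rx=0), Server()
    mounts = sorted(mount_times)
    results = {}
    for now in range(max(mounts) + 1):
        if now % RC_PING_INTERVAL == 0:
            server.rc_ping(router, now)
        if now in mounts:
            router.last_rx = now             # the request crosses the router: NI is UP again
            results[now] = server.can_reply()
    return results


print(simulate([200, 250]))  # {200: False, 250: True}
```

In this model the first mount at t=200 fails because the server last pinged at t=180 and still believes the NI is DOWN. The ping at t=240 refreshes the cached status, so the retry at t=250 succeeds. A retry before t=240 would fail as well.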

      I think users have not hit this because they normally upgrade clients in a few batches, or check the network status (lctl ping, etc.) before mounting clients, so the routers keep receiving traffic from the client network and keep the NI status alive.
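A rough way to see why rolling upgrades mask the problem is to compute when a router would consider the client-side NI idle for a given traffic pattern. The function name, the 120-second idle timeout, and the traffic timestamps below are hypothetical and only serve the illustration.

```python
def ni_down_intervals(rx_times, idle_timeout, end_time):
    """Return (start, end) spans in which a router would report the NI as DOWN.

    rx_times: times at which the router saw traffic from the client network.
    idle_timeout: assumed seconds of silence before the NI is marked DOWN.
    end_time: end of the observation window (must be >= every rx time).
    """
    spans = []
    times = sorted(rx_times)
    # Compare each arrival with the next one (or with the end of the window).
    for prev, nxt in zip(times, times[1:] + [end_time]):
        if nxt - prev > idle_timeout:
            spans.append((prev + idle_timeout, nxt))
    return spans


IDLE_TIMEOUT = 120                      # assumed value, "a couple of minutes"
full_restart = [0, 600]                 # all clients stop at t=0, first mount at t=600
rolling = list(range(0, 601, 60))       # some client always sends within a minute

print(ni_down_intervals(full_restart, IDLE_TIMEOUT, 700))  # [(120, 600)]
print(ni_down_intervals(rolling, IDLE_TIMEOUT, 700))       # []
```

With the full restart the NI is reported DOWN from t=120 until the first mount request arrives, which is the window in which a server RC ping can pick up the stale DOWN status. With rolling traffic that window never opens.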

      I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac.

          Activity

            [LU-5485] first mount always fails with avoid_asym_router_failure
            pjones Peter Jones made changes -
            Labels Original: lu_st
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.7.0 [ 10631 ]
            Fix Version/s New: Lustre 2.5.4 [ 11190 ]
            Labels Original: lu_st mq414 New: lu_st

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12435/
            Subject: LU-5485 lnet: peer aliveness status and NI status
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 58c4cd80e197bd6e70d1638df796ae878baf844c

            simmonsja James A Simmons added a comment -
            Mounting now works with ARF. Now ARF just doesn't work for us. That work can be completed under LU-5758.
            jlevi Jodi Levi (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]

            jlevi Jodi Levi (Inactive) added a comment -
            Patch landed to Master. If there is more work to be done in this ticket, please reopen.
            simmonsja James A Simmons made changes -
            Link New: This issue is related to LU-6060 [ LU-6060 ]

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12453/
            Subject: LU-5485 lnet: peer aliveness status and NI status
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fb259fe85813e0f28ac7f7410689e3856ef26316
            pjones Peter Jones made changes -
            Labels Original: lu_st New: lu_st mq414
            pjones Peter Jones made changes -
            Resolution Original: Duplicate [ 3 ]
            Status Original: Closed [ 6 ] New: Reopened [ 4 ]

            People

              liang Liang Zhen (Inactive)
              liang Liang Zhen (Inactive)
              Votes: 0
              Watchers: 13
