[LU-5485] first mount always fail with avoid_asym_router_failure Created: 14/Aug/14  Updated: 27/Apr/15  Resolved: 08/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Improvement Priority: Minor
Reporter: Liang Zhen (Inactive) Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-5785 recovery-mds-scale test_failover_ost:... Resolved
Related
is related to LU-6060 ARF doesn't detect lack of interface ... Resolved
is related to LU-5758 enabling avoid_asym_router_failure pr... Resolved
Rank (Obsolete): 15306

 Description   

We hit this on lola. The environment is quite simple: all clients are in o2ib1 and all servers are in o2ib0, the two networks are connected via two routers, and there are no other nodes in the cluster.

We found that when we unload and reload the client modules, the first mount always fails and the second attempt succeeds. After digging into the source code, I think the scenario is as follows:

  • LNet is shut down on all client nodes, so there are no incoming or outgoing messages on network o2ib1, and the Router Checker (RC) on each router changes the status of that NI to "DOWN" after a couple of minutes.
  • The RC on the servers pings the routers and learns that NI(o2ib1) is DOWN on all of them.
  • Before the servers' next RC ping, a user tries to mount the Lustre client on the client nodes; the server (MGS) handles the connect request and sends a reply.
  • While sending this reply, LNet searches for a router and finds that all routers are DOWN for o2ib1. This information is out of date: the NI status on the routers is actually UP by now, because the routers have received requests from clients on o2ib1 and have changed NI(o2ib1) to UP.
  • The mount fails until the next time the RC pings the routers and gets up-to-date information from them.
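
The sequence above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not LNet code: the Router and Server classes, rc_ping() and can_reply() are hypothetical names invented for illustration, and each router is reduced to a single flag for its o2ib1-side NI.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router: tracks whether its o2ib1-side NI is currently UP."""
    name: str
    ni_up: bool = True


@dataclass
class Server:
    """Toy server: caches the NI status learnt at its last RC ping."""
    cached_ni_up: dict = field(default_factory=dict)

    def rc_ping(self, routers):
        # Router checker ping: refresh the cached far-side NI status.
        for r in routers:
            self.cached_ni_up[r.name] = r.ni_up

    def can_reply(self):
        # A reply can be routed only if some router is believed to be UP.
        return any(self.cached_ni_up.values())


routers = [Router("rtr0"), Router("rtr1")]
server = Server()

# 1. Clients shut down LNet; routers see no traffic and mark NI(o2ib1) DOWN.
for r in routers:
    r.ni_up = False

# 2. The server's RC pings the routers and caches the DOWN status.
server.rc_ping(routers)

# 3. A client mounts; its request reaches the routers, which mark the NI UP.
for r in routers:
    r.ni_up = True

# 4. The server still believes every router is DOWN, so the reply is dropped.
first_mount_ok = server.can_reply()

# 5. The next RC ping refreshes the cache, so a second mount can succeed.
server.rc_ping(routers)
second_mount_ok = server.can_reply()

print(first_mount_ok, second_mount_ok)
```

In this model the first attempt fails only because the server acts on its cached view; nothing is wrong with the routers themselves at that point.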

I think users have not hit this because they normally upgrade clients in a few batches, or check network status (lctl ping etc.) before mounting clients, so the routers receive some traffic from the client network and keep the NI status alive.

I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac.



 Comments   
Comment by Liang Zhen (Inactive) [ 15/Aug/14 ]

Isaac, could you please comment?

Comment by Isaac Huang (Inactive) [ 27/Aug/14 ]

There used to be a similar problem with conventional router pingers (i.e. without the asymmetrical pinger) at ORNL. ORNL often boots a whole client cluster (including the routers that connect to the server cluster) all together, so when a client's request arrives at a server there's a chance that all routers to the client cluster are still considered dead by the server. The server will then drop the reply because no route to the client is available.

A possible solution is:
When a message arrives (in lnet_parse()) from a router, this is a good indication that the router is available. Check if our router status is up-to-date, in case the pinger hasn't been able to update it yet:

  • If the router is down, mark it as up.
  • If the router's corresponding far-side NI is down, mark it as up too.
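
As a rough illustration of the two rules above, here is a minimal sketch. It is not the LNet implementation: RouterView, on_message_from_router() and usable() are hypothetical names, and the per-router state is reduced to two booleans standing in for whatever lnet_parse() would actually consult.

```python
from dataclasses import dataclass


@dataclass
class RouterView:
    """Hypothetical record of what this node believes about one router."""
    alive: bool = False      # is the router itself considered up?
    far_ni_up: bool = False  # is its far-side NI considered up?


def on_message_from_router(view: RouterView) -> RouterView:
    """Treat a message received via the router as proof that it works."""
    if not view.alive:
        # Rule 1: the router was marked down, so mark it up.
        view.alive = True
    if not view.far_ni_up:
        # Rule 2: the far-side NI was marked down, so mark it up too.
        view.far_ni_up = True
    return view


def usable(view: RouterView) -> bool:
    """A router can carry a reply only if both parts are believed up."""
    return view.alive and view.far_ni_up


# Stale state left over from the last pinger run: everything looks down.
view = RouterView(alive=False, far_ni_up=False)
usable_before = usable(view)

# A request arrives through this router, so the stale state is corrected
# immediately instead of waiting for the next ping.
on_message_from_router(view)
usable_after = usable(view)

print(usable_before, usable_after)
```

The point of the design is that incoming traffic is used as an extra freshness signal, so the reply path does not depend solely on the pinger interval.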

Comment by Liang Zhen (Inactive) [ 11/Sep/14 ]

Following Isaac's suggestion, I am also trying to address this issue in http://review.whamcloud.com/11748
It is not ready for production yet; for now it is only for testing and discussion.
I may have a follow-on patch to reduce pings when a router has recent aliveness information.

Comment by James A Simmons [ 10/Oct/14 ]

When we attempted to upgrade to 2.4 we had to turn off asym_router_failure in order to bring up our file system. Recently we upgraded to 2.5.3 and again we hit the issue of asym_router_failure breaking our systems. Currently we have it turned off in our system.

Comment by Liang Zhen (Inactive) [ 30/Oct/14 ]

I think we should have a dedicated patch for this issue instead of putting everything in http://review.whamcloud.com/11748
Here is the patch. Isaac, could you take a look?
http://review.whamcloud.com/#/c/12453/

Comment by James A Simmons [ 03/Nov/14 ]

Liang, does this patch need to be applied on both clients and servers?

Comment by Gerrit Updater [ 09/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12453/
Subject: LU-5485 lnet: peer aliveness status and NI status
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fb259fe85813e0f28ac7f7410689e3856ef26316

Comment by Jodi Levi (Inactive) [ 08/Jan/15 ]

Patch landed to Master. If there is more work to be done in this ticket, please reopen.

Comment by James A Simmons [ 08/Jan/15 ]

Mounting now works with ARF. Now ARF just doesn't work for us. That work can be completed under LU-5758.

Comment by Gerrit Updater [ 27/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12435/
Subject: LU-5485 lnet: peer aliveness status and NI status
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 58c4cd80e197bd6e70d1638df796ae878baf844c
