
first mount always fails with avoid_asym_router_failure

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Components: None
    • Labels: None
    • 15306

    Description

      We hit this on lola. The environment is quite simple: all clients are on o2ib1 and all servers are on o2ib0, the two networks are connected via two routers, and there are no other nodes in this cluster.

      We found that when we unload/reload the client modules, the first mount always fails and the second try succeeds. After digging into the source code, I think the scenario is as follows:

      • LNet is shut down on all client nodes, so there are no incoming/outgoing messages on network o2ib1, and the Router Checker (RC) on each router changes the status of the NI to "DOWN" after a couple of minutes.
      • The RC on the servers pings the routers and learns that NI(o2ib1) on all of these routers is DOWN.
      • Before the next RC ping by the server's router checker, the user tries to mount the Lustre client on the client nodes; the server (MGS) handles the connect request and replies.
      • While sending this reply, LNet searches the routers and finds that all routers are DOWN for o2ib1 (out-of-date information), although the NI status on the routers is actually UP by now (the routers have received requests from clients on o2ib1, so they have changed NI(o2ib1) back to UP).
      • The mount fails until the next time the RC pings the routers and gets up-to-date information from them.

      I think users haven't hit this because they normally upgrade clients in a few batches, or check network status (lctl ping, etc.) before mounting the client, so the routers receive traffic from the client network and keep the NI status alive.

      I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac.

      Attachments

        Issue Links

          Activity


            Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12435/
            Subject: LU-5485 lnet: peer aliveness status and NI status
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 58c4cd80e197bd6e70d1638df796ae878baf844c

            James A Simmons added a comment -
            Mounting now works with ARF. Now ARF just doesn't work for us. That work can be completed under LU-5758.

            Jodi Levi (Inactive) added a comment -
            Patch landed to master. If there is more work to be done in this ticket, please reopen.

            Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12453/
            Subject: LU-5485 lnet: peer aliveness status and NI status
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fb259fe85813e0f28ac7f7410689e3856ef26316

            James A Simmons added a comment -
            Liang, does this patch need to be applied on both clients and servers?

            Liang Zhen (Inactive) added a comment -
            I think we should have a dedicated patch for this issue, instead of putting everything in http://review.whamcloud.com/11748
            Here is the patch; Isaac, could you take a look?
            http://review.whamcloud.com/#/c/12453/

            James A Simmons added a comment -
            When we attempted to upgrade to 2.4 we had to turn off asym_router_failure in order to bring up our file system. Recently we upgraded to 2.5.3 and again hit the issue of asym_router_failure breaking our systems. We currently have it turned off on our system.
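For reference, disabling ARF as described above is done through the lnet module parameter named in this ticket's title. Assuming the standard Lustre module-configuration mechanism, the option would live in a modprobe configuration file (e.g. under /etc/modprobe.d/) along the lines of:

```
options lnet avoid_asym_router_failure=0
```

The exact file location is site-specific; the parameter takes effect on the next lnet module load.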

            Liang Zhen (Inactive) added a comment -
            Following Isaac's suggestion, I am also trying to address this issue in http://review.whamcloud.com/11748
            It's not ready for production yet; for now it's only for testing and discussion.
            I may have a follow-on patch to reduce pings when the router has recent aliveness information.

            Isaac Huang (Inactive) added a comment -
            There used to be a similar problem with conventional router pingers (i.e. without the asymmetrical pinger) at ORNL. ORNL often boots a whole client cluster (including the routers that connect to the server cluster) all together, so when a client's request arrives at a server there is a chance that all routers to the client cluster are still considered dead by the server; the server will then drop the reply because no route to the client is available.

            A possible solution: when a message arrives (in lnet_parse()) from a router, this is a good indication that the router is available. Check whether our router status is up to date, in case the pinger hasn't been able to update it yet:

            • If the router is down, mark it as up.
            • If the router's corresponding far-side NI is down, mark it as up too.

            Liang Zhen (Inactive) added a comment -
            Isaac, could you please comment?

            People

              Assignee: Liang Zhen (Inactive)
              Reporter: Liang Zhen (Inactive)
              Votes: 0
              Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: