Details
- Improvement
- Resolution: Fixed
- Minor
- None
- None
- 15306
Description
We hit this on lola. The environment is quite simple: all clients are in o2ib1 and all servers are in o2ib0, the two networks are connected via two routers, and there are no other nodes in this cluster.
We found that when we unload and reload the client modules, the first mount always fails, while the second try succeeds. After digging into the source code, I think the scenario is as follows:
- LNet is shut down on all client nodes, so there is no incoming/outgoing traffic on network o2ib1; after a couple of minutes the routers therefore change the status of NI(o2ib1) to DOWN.
- The Router Checker (RC) on the servers pings the routers and learns that NI(o2ib1) on all of these routers is DOWN.
- Before the server's next RC ping, a user tries to mount the Lustre client on a client node; the server (MGS) handles the connect request and sends a reply.
- While sending this reply, LNet searches for routers and finds that all routers are DOWN for o2ib1. This information is out of date: the NI status on the routers is actually UP by now, because the routers have received the request from clients on o2ib1 and have changed NI(o2ib1) back to UP.
- The mount will keep failing until the next RC ping reaches the routers and fetches up-to-date information from them.
I think users haven't hit this because they normally upgrade clients in a few batches, or check network status (lctl ping etc.) before mounting the client, so the routers see some traffic from the client network and keep the NI status alive.
I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac.
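The race above can be sketched as a tiny simulation. This is a hypothetical model, not LNet code: the class and method names (`Router`, `Server`, `rc_ping`, `send_reply`) are invented for illustration. The point it shows is that the server routes on a cached view of the router's NI status, which lags reality by up to one RC ping interval.

```python
# Hypothetical simulation of the stale router-status race described above.
# Not actual LNet code; names are illustrative only.

class Router:
    def __init__(self):
        self.ni_up = True          # real status of NI(o2ib1) on the router


class Server:
    def __init__(self, router):
        self.router = router
        self.cached_ni_up = True   # server's last-pinged view of the router

    def rc_ping(self):
        # An RC ping refreshes the cached status from the router.
        self.cached_ni_up = self.router.ni_up

    def send_reply(self):
        # Route selection consults the cached status, not the live one.
        return self.cached_ni_up


router = Router()
server = Server(router)

# Clients unload modules: no traffic on o2ib1, so the router marks the NI
# down, and the server's next RC ping learns that.
router.ni_up = False
server.rc_ping()

# A client now reconnects: the router sees traffic on o2ib1 and marks the
# NI up again...
router.ni_up = True

# ...but the server still routes on stale information, so the reply to the
# first mount attempt fails.
first_mount_ok = server.send_reply()

# After the next RC ping the cache is refreshed and the retry succeeds.
server.rc_ping()
second_mount_ok = server.send_reply()

print(first_mount_ok, second_mount_ok)  # False True
```

The window between the router flipping back to UP and the server's next RC ping is exactly the window in which the first mount fails, which matches the observed "first mount fails, second succeeds" pattern.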
Attachments
Issue Links
- duplicates: LU-5785 recovery-mds-scale test_failover_ost: test_failover_ost returned 1 (Resolved)
- is related to: LU-5758 enabling avoid_asym_router_failure prevents the bring up of ORNL production systems (Resolved)
- is related to: LU-6060 ARF doesn't detect lack of interface on a router (Resolved)