[LU-5485] first mount always fail with avoid_asym_router_failure Created: 14/Aug/14 Updated: 27/Apr/15 Resolved: 08/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Liang Zhen (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Rank (Obsolete): | 15306 |
| Description |
|
We hit this on lola. The environment is quite simple: all clients are in o2ib1 and all servers are in o2ib0, the two networks are connected via two routers, and there are no other nodes in the cluster. We found that when we unload and reload the client modules, the first mount always fails and the second attempt succeeds. After digging into the source code, I think the scenario is like this:
I think users haven't hit this because they normally upgrade clients in a few batches, or check network status (lctl ping etc.) before mounting the client, so the routers receive something from the client network and keep the NI status as alive. I don't have a good solution yet; I need more time to think about it and to discuss it with Isaac. |
| Comments |
| Comment by Liang Zhen (Inactive) [ 15/Aug/14 ] |
|
Isaac, could you please comment? |
| Comment by Isaac Huang (Inactive) [ 27/Aug/14 ] |
|
There used to be a similar problem with conventional router pingers (i.e. without the asymmetrical pinger) at ORNL. ORNL often boots a whole client cluster (including the routers that connect to the server cluster) all together, so when a client's request arrives at a server there's a chance that the server still considers all routers to the client cluster dead. The server then drops the reply because no route to the client is available. A possible solution is:
|
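For readers unfamiliar with the failure mode described in the comment above, here is a toy model of it. It is only a sketch, not LNet code: `ServerRouteTable`, `add_router`, `mark_alive`, `send_reply` and the router names are hypothetical, and the rule that a router counts as dead until it has been checked is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServerRouteTable:
    """Toy stand-in for a server's view of the routers to the client network."""

    # Router name -> whether the server currently considers it alive.
    router_alive: dict = field(default_factory=dict)

    def add_router(self, name: str) -> None:
        # Assumption of this sketch: a router counts as dead until checked.
        self.router_alive[name] = False

    def mark_alive(self, name: str) -> None:
        # Stands in for the server learning that the router is up.
        self.router_alive[name] = True

    def send_reply(self) -> bool:
        # True if some router can carry the reply, False if it is dropped.
        return any(self.router_alive.values())


table = ServerRouteTable()
table.add_router("router-a")
table.add_router("router-b")

first = table.send_reply()    # every router is still considered dead
table.mark_alive("router-a")  # the server now sees one router as alive
second = table.send_reply()

print(first, second)
```

In this model the first reply is dropped and the second goes through, which mirrors the symptom reported in the description (first mount fails, second succeeds) without claiming to reproduce the actual LNet logic.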
| Comment by Liang Zhen (Inactive) [ 11/Sep/14 ] |
|
Following Isaac's suggestion, I am also trying to address this issue in http://review.whamcloud.com/11748 |
| Comment by James A Simmons [ 10/Oct/14 ] |
|
When we attempted to upgrade to 2.4 we had to turn off asym_router_failure in order to bring up our file system. Recently we upgraded to 2.5.3 and again we hit the issue of asym_router_failure breaking our systems. Currently we have it turned off in our system. |
| Comment by Liang Zhen (Inactive) [ 30/Oct/14 ] |
|
I think we should have a dedicated patch for this issue, instead of putting everything in http://review.whamcloud.com/11748 |
| Comment by James A Simmons [ 03/Nov/14 ] |
|
Liang, does this patch need to be applied to both clients and servers? |
| Comment by Gerrit Updater [ 09/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12453/ |
| Comment by Jodi Levi (Inactive) [ 08/Jan/15 ] |
|
Patch landed to Master. If there is more work to be done in this ticket, please reopen. |
| Comment by James A Simmons [ 08/Jan/15 ] |
|
Mounting now works with ARF. Now ARF just doesn't work for us. That work can be completed under |
| Comment by Gerrit Updater [ 27/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12435/ |