[LU-630] mount failure after MGS connection lost and file system is unmounted Created: 24/Aug/11 Updated: 07/May/15 Resolved: 04/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2, Lustre 1.8.9 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jeremy Filizetti | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Various |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4651 |
| Description |
|
I have had a long-standing issue with the 1.8 series that causes mount failures when we lose the connection between clients and servers. The environment is typically WAN clients accessing remote servers. During extended or unexpected WAN outages we often unmount the remote file system so that scripts/programs using df/statfs don't hang until the link is back up. When we attempt to remount that file system after the link is restored, the first mount attempt always fails with -108 (Cannot send after transport endpoint shutdown). If you immediately retry the mount you typically get -17 (File exists). After waiting a while (usually about 60 seconds) the mount succeeds.

I've looked at this problem and I think I now understand why it is happening, but I'm not sure of the correct fix. Since the peer no longer exists, kiblnd_launch_tx is called with a NULL tx to attempt to re-establish the connection. This calls kiblnd_connect_peer, which connects asynchronously and returns as soon as rdma_resolve_addr is called. kiblnd_cm_callback then handles the various connection states through the RDMA CM, going through rdma_resolve_route and kiblnd_active_connect before finally establishing the connection. However, this happens after kiblnd_launch_tx, lnd_query, and lnet_ni_peer_alive have already returned, while lnet_peer_alive_locked is checking whether the peer is alive. Because the RDMA CM hasn't established the connection yet, kiblnd_peer_alive (which updates the LND aliveness stamp) hasn't run, so lnet_post_send_locked and lnet_send end up returning -EHOSTUNREACH.

I haven't looked at ksocklnd closely, but I think the issue exists there as well. |
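A rough sketch of the sequence described above, written as a standalone userspace model. The function names echo the Lustre/o2iblnd ones, but the bodies, the aliveness window, and the timestamps are illustrative assumptions, not the actual kernel code:

```c
/* Simplified model of the race: the connect is only *launched* asynchronously,
 * so the aliveness stamp consulted by the send path is still the stale,
 * pre-outage one and the first send fails. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct peer {
    bool connected;    /* would be set once kiblnd_cm_callback() completes */
    long last_alive;   /* stamp normally refreshed by kiblnd_peer_alive()  */
};

/* Stand-in for kiblnd_launch_tx(ni, NULL, nid): it only starts an async
 * connect via kiblnd_connect_peer()/rdma_resolve_addr() and returns at once;
 * the connection is finished later by kiblnd_cm_callback(). */
static void start_async_connect(struct peer *p)
{
    (void)p;   /* connect attempt is now in flight, nothing is updated yet */
}

/* Stand-in for the lnd_query path (kiblnd_query). */
static long query_aliveness(struct peer *p)
{
    if (!p->connected)
        start_async_connect(p);  /* does not wait for the CM callback */
    return p->last_alive;        /* still the stale, pre-outage stamp */
}

/* Stand-in for lnet_peer_alive_locked(): stale stamp means "dead". */
static bool peer_alive(struct peer *p, long now, long window)
{
    return now - query_aliveness(p) <= window;
}

/* Stand-in for lnet_post_send_locked()/lnet_send(). */
static int post_send(struct peer *p, long now)
{
    if (!peer_alive(p, now, 60 /* illustrative window, not a real tunable */))
        return -EHOSTUNREACH;
    return 0;
}

int main(void)
{
    /* Peer last heard from at t=100; it is now t=1000 and the CM callback
     * for the freshly launched connect has not run yet. */
    struct peer mgs = { .connected = false, .last_alive = 100 };
    printf("first send after outage -> %d\n", post_send(&mgs, 1000));
    return 0;
}
```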
| Comments |
| Comment by Peter Jones [ 03/Nov/11 ] |
|
Lai Could you please treat this issue as a priority? Thanks Peter |
| Comment by Lai Siyao [ 22/Nov/11 ] |
|
Isaac may have described this situation in https://bugzilla.lustre.org/show_bug.cgi?id=16186#c40 : "Also, lnd_query needs to be smarter - e.g. if a peer has a very old aliveness stamp and I haven't tried to connect to him for a while, it might be wise to assume it to be alive (and of course initiate new connections at the same time)." But it looks like this is not implemented in the code; Liang will communicate with him about it. |
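A minimal sketch of that heuristic, assuming made-up names (lnd_query_sketch) and thresholds (STALE_AFTER, RETRY_BACKOFF); this is only an illustration of the quoted idea, not the patch that was pushed to gerrit:

```c
#include <stdbool.h>
#include <stdio.h>

struct peer_state {
    long last_alive;         /* last aliveness stamp reported by the LND */
    long last_connect_try;   /* last time we initiated a connection      */
};

/* Hypothetical tunables (seconds), purely for illustration. */
#define STALE_AFTER   300    /* "very old" aliveness stamp               */
#define RETRY_BACKOFF  60    /* "haven't tried to connect for a while"   */

/* Returns the aliveness stamp this query should report. */
static long lnd_query_sketch(struct peer_state *p, long now)
{
    bool stamp_very_old  = now - p->last_alive > STALE_AFTER;
    bool no_recent_retry = now - p->last_connect_try > RETRY_BACKOFF;

    if (stamp_very_old && no_recent_retry) {
        p->last_connect_try = now;  /* also kick off a new (async) connection here */
        return now;                 /* optimistically report the peer as alive     */
    }
    return p->last_alive;           /* otherwise report whatever stamp we have     */
}

int main(void)
{
    struct peer_state p = { .last_alive = 0, .last_connect_try = 0 };
    /* At t=1000 the stamp is ancient and nothing was tried recently,
     * so the peer is optimistically reported alive (stamp == now). */
    printf("reported stamp: %ld\n", lnd_query_sketch(&p, 1000));
    return 0;
}
```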
| Comment by Lai Siyao [ 05/Dec/11 ] |
|
review is on http://review.whamcloud.com/#change,1797 |
| Comment by Isaac Huang (Inactive) [ 22/Dec/11 ] |
|
The peer_health code was designed for routers to reclaim buffers eagerly. Despite its name, it was NOT intended as a generic mechanism to detect peer health. It does not need to handle the situation described above, because when a server or client comes back to life the first thing it does is ping the routers (i.e. check_routers_before_use). So if a router has a message for a peer that appears dead, it just calls lnd_query and drops the message if the peer is still dead (if the peer were back to life, it would have pinged me and I would have a new aliveness timestamp). I think the solution would be simple. If WC intends to improve peer_health into a generic mechanism, it has to be redesigned, and it should only be enabled after intensive testing. |
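For illustration, a small standalone sketch of the router behaviour Isaac describes, with assumed names and thresholds (query_peer, router_should_forward, DEAD_AFTER); the real lnet router code is structured differently:

```c
#include <stdbool.h>
#include <stdio.h>

struct rtr_peer {
    long last_alive;   /* refreshed whenever the peer pings this router */
};

#define DEAD_AFTER 180 /* illustrative staleness threshold (seconds)    */

/* Stand-in for lnd_query on the router: report the current stamp. */
static long query_peer(const struct rtr_peer *p)
{
    return p->last_alive;
}

/* Decide whether to forward a queued message or drop it and reclaim the
 * router buffer.  A rebooted peer would already have pinged the router
 * (check_routers_before_use), so its last_alive stamp would be fresh. */
static bool router_should_forward(const struct rtr_peer *p, long now)
{
    return now - query_peer(p) <= DEAD_AFTER;
}

int main(void)
{
    struct rtr_peer stale = { .last_alive = 100 };
    struct rtr_peer fresh = { .last_alive = 990 };

    printf("stale peer: %s\n", router_should_forward(&stale, 1000) ? "forward" : "drop");
    printf("fresh peer: %s\n", router_should_forward(&fresh, 1000) ? "forward" : "drop");
    return 0;
}
```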
| Comment by Build Master (Inactive) [ 06/Apr/12 - 02/May/12 ] |
|
Integrated in ... Result = SUCCESS (a series of identical automated build notifications) |
| Comment by Bob Glossman (Inactive) [ 03/May/12 ] |
|
http://review.whamcloud.com/#change,2646 |
| Comment by Peter Jones [ 04/Jun/12 ] |
|
Landed for 2.1.2 and 2.3 |