[LU-630] mount failure after MGS connection lost and file system is unmounted Created: 24/Aug/11  Updated: 07/May/15  Resolved: 04/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.3.0, Lustre 2.1.2, Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Jeremy Filizetti Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

Various


Issue Links:
Related
is related to LU-441 ll_fill_super()) Unable to process lo... Resolved
Severity: 3
Rank (Obsolete): 4651

 Description   

I have had a long-standing issue with the 1.8 series that causes mount failures when we lose the connection between the clients and servers. The environment is typically WAN clients accessing servers. During extended or unexpected WAN outages we often unmount the remote file system so that scripts and programs using df/statfs don't hang until the link is back up. When we attempt to remount that file system after the link is back, the first mount attempt always fails with error -108 (cannot send after transport endpoint shutdown). If you immediately try the mount command again you typically get -17 (file exists). After waiting for a while (I usually wait 60 seconds), the mount command succeeds.

I've looked at this problem and I think I now understand why it is happening, but I'm not really sure of a correct fix for it.
The call stack leading to the problem looks like:
LNetPut
lnet_send
lnet_post_send_locked
lnet_peer_alive_locked
lnet_ni_peer_alive
lnd_query
kiblnd_query
kiblnd_launch_tx

Since the peer doesn't exist, kiblnd_launch_tx is called with a NULL tx to attempt to re-establish the connection. This calls kiblnd_connect_peer, which connects asynchronously and returns as soon as rdma_resolve_addr has been called. The kiblnd_cm_callback function then handles the successive connection states through the RDMA CM, going through rdma_resolve_route and kiblnd_active_connect before finally making the connection. However, this happens after kiblnd_launch_tx, lnd_query, and lnet_ni_peer_alive have returned, while lnet_peer_alive_locked is checking whether the peer is alive:

Since the RDMA CM hasn't established the connection yet, the aliveness timestamp that kiblnd_peer_alive sets still hasn't been updated, and lnet_post_send_locked and lnet_send end up returning -EHOSTUNREACH. I haven't looked at ksocklnd closely, but I think the issue exists there as well.
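
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name in it (toy_peer, toy_query, toy_connect_done, toy_send) is hypothetical and stands in for the real LNet and o2iblnd functions, and the asynchronous connection manager is reduced to a callback that the caller invokes later.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are the real LNet/o2iblnd API. */
struct toy_peer {
    bool connect_pending;   /* an asynchronous connect has been started */
    long last_alive;        /* 0 means "never seen alive" */
};

/* Models the query step: it starts a connect and returns without waiting. */
static void toy_query(struct toy_peer *p)
{
    if (p->last_alive == 0)
        p->connect_pending = true;
}

/* Models the connection-manager callback, which runs some time later. */
static void toy_connect_done(struct toy_peer *p, long now)
{
    if (p->connect_pending) {
        p->connect_pending = false;
        p->last_alive = now;    /* only now does the peer look alive */
    }
}

/* Models the send path: query first, then immediately test aliveness. */
static int toy_send(struct toy_peer *p)
{
    toy_query(p);
    return p->last_alive != 0 ? 0 : -EHOSTUNREACH;
}
```

In this model the first toy_send on a fresh peer fails with -EHOSTUNREACH even though a connect is in flight, and a second toy_send succeeds only once toy_connect_done has run, which mirrors the "first attempt fails, later attempt works" behaviour reported here.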



 Comments   
Comment by Peter Jones [ 03/Nov/11 ]

Lai

Could you please treat this issue as a priority?

Thanks

Peter

Comment by Lai Siyao [ 22/Nov/11 ]

Isaac may have described this situation in https://bugzilla.lustre.org/show_bug.cgi?id=16186#c40 :

Also, lnd_query needs to be smarter
- e.g. if a peer has a very old aliveness stamp and I haven't tried to connect to him for a while,
it might be wise to assume it to be alive (and of course initiate new connections at the same
time).

But it looks like this is not implemented in the code; Liang will communicate with him to make
certain of this first.

Comment by Lai Siyao [ 05/Dec/11 ]

review is on http://review.whamcloud.com/#change,1797

Comment by Isaac Huang (Inactive) [ 22/Dec/11 ]

The peer_health code was designed for routers to reclaim buffers eagerly. It was NOT intended as a generic mechanism to detect peer health, despite its name.

It does not need to handle the situation described above, because when a server or client comes back to life the first thing it does is ping the routers (i.e. check_routers_before_use). So if a router has a message for a peer that appears dead, it just calls lnd_query and drops the message if the peer is still dead (if it had come back to life, it would have pinged me and I would have a new aliveness timestamp).

I think the solution would be simple:
If I_am_not_a_router
return all_peer_is_alive;
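
As a rough illustration of the pseudocode above, the check might look like the following in plain C. This is a sketch, not the code that landed: toy_node, toy_peer_health, and toy_peer_alive are hypothetical names, and the LND query result is reduced to a single boolean.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; not the real LNet structures. */
struct toy_node {
    bool routing_enabled;    /* true only when this node forwards for others */
};

struct toy_peer_health {
    bool lnd_reports_alive;  /* last answer from the (modelled) LND query */
};

/* Only a router consults peer health; every other node assumes "alive". */
static bool toy_peer_alive(const struct toy_node *self,
                           const struct toy_peer_health *peer)
{
    if (!self->routing_enabled)
        return true;
    return peer->lnd_reports_alive;
}
```

With this shape, a client or server never refuses to send because of a stale aliveness stamp, while a router keeps dropping messages for peers it believes are dead.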

If WC intends to improve peer_health into a generic mechanism, it has to be redesigned:

  • at least add a new peer state, "unknown". "unknown" should be treated much like the "live" state, except that it should transition to the "dead" state more eagerly. "dead" should transition to "unknown" if the last timestamp is too old and no attempts have since been made to connect to the peer. Just some food for thought; lots of design work to be done here.
  • think about how to cooperate with the planned "peer health network" feature.

And, only enable it after intensive testing.
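
The three-state idea in the list above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names (tri_state, tri_peer, and the tri_* functions), the failure thresholds, and the 60-second staleness window are all hypothetical, and "no attempts have since been made" is read as "no connection attempt within the staleness window".

```c
#include <stdbool.h>

enum tri_state { TRI_LIVE, TRI_UNKNOWN, TRI_DEAD };

struct tri_peer {
    enum tri_state state;
    long last_alive;     /* time the peer was last known alive */
    long last_attempt;   /* time of the last connection attempt */
    int  failures;       /* consecutive failed sends */
};

/* Assumed thresholds, chosen only for illustration. */
#define TRI_LIVE_FAILS    3   /* failures before LIVE -> DEAD */
#define TRI_UNKNOWN_FAILS 1   /* UNKNOWN gives up more eagerly */
#define TRI_STALE_AFTER   60  /* seconds before a stamp counts as too old */

/* UNKNOWN is treated like LIVE for sending purposes. */
static bool tri_may_send(const struct tri_peer *p)
{
    return p->state != TRI_DEAD;
}

static void tri_on_failure(struct tri_peer *p, long now)
{
    int limit = (p->state == TRI_UNKNOWN) ? TRI_UNKNOWN_FAILS : TRI_LIVE_FAILS;

    p->last_attempt = now;
    if (p->state != TRI_DEAD && ++p->failures >= limit)
        p->state = TRI_DEAD;
}

static void tri_on_success(struct tri_peer *p, long now)
{
    p->state = TRI_LIVE;
    p->last_alive = now;
    p->last_attempt = now;
    p->failures = 0;
}

/* DEAD -> UNKNOWN once the stamp is stale and nothing was tried recently. */
static void tri_age(struct tri_peer *p, long now)
{
    if (p->state == TRI_DEAD &&
        now - p->last_alive > TRI_STALE_AFTER &&
        now - p->last_attempt > TRI_STALE_AFTER) {
        p->state = TRI_UNKNOWN;
        p->failures = 0;
    }
}
```

A caller would run tri_age before each send decision, so that a peer marked dead long ago gets one more chance instead of being rejected on a stale verdict.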

Comment by Build Master (Inactive) [ 06/Apr/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #479
LU-630 lnet: only router checks peer health (Revision 51cef6191ae8722adb1045f66b5de4504dab83d4)

Result = SUCCESS
Oleg Drokin : 51cef6191ae8722adb1045f66b5de4504dab83d4
Files :

  • lnet/lnet/lib-move.c

Comment by Build Master (Inactive) [ 06/Apr/12 ]

The same revision (51cef6191ae8722adb1045f66b5de4504dab83d4, Oleg Drokin, lnet/lnet/lib-move.c) was also integrated in lustre-master #479 with Result = SUCCESS for these configurations:

  • x86_64,server,el5,ofa
  • i686,server,el5,ofa
  • x86_64,client,el5,ofa
  • x86_64,server,el6,ofa
  • i686,server,el6,inkernel
  • i686,client,el5,inkernel
  • x86_64,server,el5,inkernel
  • x86_64,client,el6,ofa
  • x86_64,client,el5,inkernel
  • i686,client,el5,ofa
  • x86_64,client,el6,inkernel
  • i686,server,el6,ofa

Comment by Build Master (Inactive) [ 07/Apr/12 ]

The same revision was integrated in lustre-master #479 with Result = SUCCESS for these configurations:

  • i686,server,el5,inkernel
  • i686,client,el6,ofa
  • x86_64,server,el6,inkernel
  • i686,client,el6,inkernel

Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el5,inkernel #340
LU-630 lnet: only router checks peer health (Revision 51cef6191ae8722adb1045f66b5de4504dab83d4)

Result = SUCCESS
Oleg Drokin : 51cef6191ae8722adb1045f66b5de4504dab83d4
Files :

  • lnet/lnet/lib-move.c

Comment by Build Master (Inactive) [ 02/May/12 ]

The same revision (51cef6191ae8722adb1045f66b5de4504dab83d4, Oleg Drokin, lnet/lnet/lib-move.c) was also integrated in lustre-dev #340 with Result = SUCCESS for these configurations:

  • i686,client,el6,inkernel
  • i686,server,el5,inkernel
  • x86_64,server,el6,inkernel
  • i686,client,el5,inkernel
  • x86_64,server,el5,inkernel
  • x86_64,client,el6,inkernel

Comment by Bob Glossman (Inactive) [ 03/May/12 ]

http://review.whamcloud.com/#change,2646
back port to b2_1

Comment by Peter Jones [ 04/Jun/12 ]

Landed for 2.1.2 and 2.3

Generated at Sat Feb 10 01:08:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.