Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 1.8.6
- None
- Various
- 3
- 4651
Description
I have had a long-standing issue with the 1.8 series that causes mount failures when we lose the connection between the clients and servers. The environment is typically WAN clients accessing servers. During extended or unexpected WAN outages we often unmount the remote file system so that scripts and programs using df/statfs don't hang until the link is back up. When we attempt to remount that file system after the link is back, the first mount attempt always fails with error -108 (cannot send after transport endpoint shutdown). If you immediately try the mount command again, you typically get -17 (file exists). After waiting for a while (I usually wait 60 seconds), the mount command succeeds.
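For readers who want to map the numeric codes above back to their symbolic names, here is a minimal sketch, assuming a Linux system where the C library's errno values match the kernel's (108 is ESHUTDOWN, 17 is EEXIST). The helper name describe_rc is hypothetical and is not part of Lustre.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper (not a Lustre function): turn a negative
 * kernel-style return code such as -108 into the message text the
 * C library associates with that errno value. */
const char *describe_rc(int rc)
{
    return strerror(rc < 0 ? -rc : rc);
}
```

Calling describe_rc(-108) and describe_rc(-17) on such a system should give the two messages quoted in the mount errors; the exact wording depends on the C library.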
I've looked at this problem and I think I now understand why it is happening, but I'm not really sure of a correct fix for it.
The call stack leading to the problem looks like:
LNetPut
lnet_send
lnet_post_send_locked
lnet_peer_alive_locked
lnet_ni_peer_alive
lnd_query
kiblnd_query
kiblnd_launch_tx
Since the peer doesn't exist, kiblnd_launch_tx is called with a NULL tx to attempt to reestablish the connection. This calls kiblnd_connect_peer, which does the connect asynchronously, returning as soon as rdma_resolve_addr has been called. The kiblnd_cm_callback function handles the various connection states through the RDMA CM, going through rdma_resolve_route and kiblnd_active_connect and finally making the connection. However, this happens after kiblnd_launch_tx, lnd_query, and lnet_ni_peer_alive have returned, when lnet_peer_alive_locked is already checking whether the peer is alive.
Since the RDMA CM hasn't established the connection yet, the LND alive state that kiblnd_peer_alive sets still hasn't been updated, so lnet_post_send_locked and lnet_send end up returning -EHOSTUNREACH. I haven't looked at ksocklnd closely, but I think the issue exists there as well.
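The ordering problem described above can be illustrated with a small user-space model. This is only a sketch of the sequence as I understand it, not Lustre code: the struct and every toy_ function are hypothetical stand-ins, and the asynchronous RDMA CM callback is reduced to an explicit function call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for an LND peer; this is NOT the real o2iblnd peer struct. */
struct toy_peer {
    bool connecting; /* an asynchronous connect has been started */
    bool alive;      /* set only once the connection has completed */
};

/* Models the query path (lnd_query -> kiblnd_query -> kiblnd_launch_tx
 * with a NULL tx): it only starts the connect and returns at once. */
void toy_query(struct toy_peer *p)
{
    if (!p->alive && !p->connecting)
        p->connecting = true; /* connect requested, not yet established */
}

/* Models the connection-manager callback completing some time later;
 * only at this point is the peer marked alive. */
void toy_connect_done(struct toy_peer *p)
{
    assert(p->connecting);
    p->connecting = false;
    p->alive = true;
}

/* Models the send path: query the peer, then check aliveness right
 * away, before the asynchronous connect has had a chance to finish. */
int toy_send(struct toy_peer *p)
{
    toy_query(p);
    if (!p->alive)
        return -EHOSTUNREACH;
    return 0;
}
```

In this model the first toy_send on a fresh peer returns -EHOSTUNREACH even though it has just started the connect, and a later toy_send succeeds once toy_connect_done has run. That has the same shape as the first mount attempt failing and a later one succeeding.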
Issue Links
- is related to LU-441 ll_fill_super()) Unable to process log: -108 (Resolved)