Lustre / LU-630

mount failure after MGS connection lost and file system is unmounted

Details

    Description

      I have had a long-standing issue with the 1.8 series that causes mount failures when we lose the connection between the clients and servers. The environment is typically WAN clients accessing servers. During extended or unexpected WAN outages we often unmount the remote file system so that scripts and programs using df/statfs don't hang until the link is back up. When we attempt to remount that file system after the link is back, the first mount attempt always fails with error -108 (cannot send after transport endpoint shutdown). If you immediately try the mount command again, you typically get -17 (file exists). After waiting for a while (I usually wait 60 seconds), the mount command succeeds.
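
      The wait-and-retry workaround described above can be sketched as a small helper. This is only an illustration, not part of Lustre: `mount_with_retry` and the `try_mount` callable are hypothetical names, and `try_mount` is assumed to return 0 on success or a negative errno such as -108 or -17, mirroring the codes quoted in the report.

```python
import errno
import time

# Errors the report describes as transient right after the link returns.
TRANSIENT = {errno.ESHUTDOWN, errno.EEXIST}  # 108 and 17 on Linux


def mount_with_retry(try_mount, attempts=5, delay=60.0, sleep=time.sleep):
    """Call try_mount() until it returns 0 or a non-transient error.

    try_mount is a hypothetical callable returning 0 on success or a
    negative errno. Returns the list of return codes seen, in order.
    """
    seen = []
    for attempt in range(attempts):
        rc = try_mount()
        seen.append(rc)
        # Stop on success, or on an error that waiting is not expected to fix.
        if rc == 0 or -rc not in TRANSIENT:
            break
        # Wait before the next attempt, but not after the last one.
        if attempt + 1 < attempts:
            sleep(delay)
    return seen
```

      The 60-second default follows the wait mentioned in the report. Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.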

      I've looked at this problem and I think I now understand why it is happening, but I'm not really sure of a correct fix for it.
      The call stack leading to the problem looks like:
      LNetPut
      lnet_send
      lnet_post_send_locked
      lnet_peer_alive_locked
      lnet_ni_peer_alive
      lnd_query
      kiblnd_query
      kiblnd_launch_tx

      Since the peer doesn't exist, kiblnd_launch_tx is called with a NULL tx to attempt to reestablish the connection. This calls kiblnd_connect_peer, which connects asynchronously and returns as soon as rdma_resolve_addr has been called. The kiblnd_cm_callback function handles the various connection states through the RDMA CM, going from rdma_resolve_route to kiblnd_active_connect and finally establishing the connection. However, this happens after kiblnd_launch_tx, lnd_query, and lnet_ni_peer_alive have returned and lnet_peer_alive_locked is already checking whether the peer is alive.

      Since the RDMA CM hasn't established the connection yet, the liveness state that kiblnd_peer_alive sets still hasn't been updated, so lnet_post_send_locked and lnet_send end up returning -EHOSTUNREACH. I haven't looked at ksocklnd closely, but I think the issue exists there as well.
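
      The ordering problem described above can be illustrated with a toy model. This is not Lustre code: `Peer`, `ToyLnd`, and `send` are hypothetical stand-ins, and the pending-callback list only imitates a connection that completes later in an event callback. It assumes, as the report suggests, that the query starts the connect and returns before the peer is marked alive.

```python
import errno


class Peer:
    def __init__(self):
        self.alive = False       # set only once the connection completes
        self.connecting = False


class ToyLnd:
    """Toy LND whose connect finishes in a callback that runs later."""

    def __init__(self):
        self.pending = []  # callbacks an event thread would run later

    def query(self, peer):
        # Start an asynchronous connect and return immediately,
        # without waiting for the connection to be established.
        if not peer.alive and not peer.connecting:
            peer.connecting = True
            self.pending.append(lambda: self._connected(peer))

    def _connected(self, peer):
        peer.connecting = False
        peer.alive = True

    def run_callbacks(self):
        # Stand-in for the connection manager delivering its events.
        while self.pending:
            self.pending.pop(0)()


def send(lnd, peer):
    # Query the LND, then check liveness right away: the check runs
    # before the asynchronous connect has had a chance to finish.
    lnd.query(peer)
    if not peer.alive:
        return -errno.EHOSTUNREACH
    return 0
```

      In this model the first `send` fails because the liveness check races ahead of the connect, and a later `send` succeeds once `run_callbacks` has run, which matches the pattern of an initial failure followed by success after a wait.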

            People

              laisiyao Lai Siyao
              jfilizetti Jeremy Filizetti