Lustre failover fails when SRPC enabled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      When SRPC rules are enabled, for either Kerberos or SSK, Lustre HA failover is broken: clients might not be able to reconnect to targets that have been stopped and then restarted, whether on the original node or on its failover partner.

        Activity

          [LU-13474] Lustre failover fails when SRPC enabled

          Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40995
          Subject: LU-13474 gss: do not return -ERESTART when gss rpc times out
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: a963bddeacb6faec17e0aa4cfae8924a2771f512

          pjones Peter Jones added a comment -

          Landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38310/
          Subject: LU-13474 gss: do not return -ERESTART when gss rpc times out
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 79c8abecdac052e3e00251547cc500f2cba742ab

          sebastien Sebastien Buisson added a comment -

          The first version of the patch did not work in all situations, for instance when a server has no running target: in that case, the server cannot return an error to the client.

          The second version of the patch implements a different solution. The client is now prevented from restarting GSS negotiation immediately if the RPC to the server timed out, which lets the HA failover mechanism try different service nodes.
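
          For readers who want to see the effect of the second approach, here is a minimal toy model in Python. It is only a sketch and not the Lustre code: the names `Outcome`, `negotiate` and `connect`, the node names, and the round-robin failover are all assumptions made for illustration. `Outcome.RESTART` stands in for returning -ERESTART on a GSS RPC timeout, and `Outcome.TIMEDOUT` stands in for returning a plain timeout so that failover can move on.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    RESTART = "restart"    # stands in for -ERESTART: negotiation restarts on the same node
    TIMEDOUT = "timedout"  # stands in for a timeout error: failover may pick another node


def negotiate(node, alive, restart_on_timeout):
    """Model one GSS negotiation RPC sent to `node` (hypothetical helper)."""
    if node in alive:
        return Outcome.OK
    # Nothing on `node` answered, so the RPC timed out.
    return Outcome.RESTART if restart_on_timeout else Outcome.TIMEDOUT


def connect(nodes, alive, restart_on_timeout, max_attempts=6):
    """Return (connected_node_or_None, list_of_nodes_tried)."""
    idx = 0
    tried = []
    for _ in range(max_attempts):
        node = nodes[idx]
        tried.append(node)
        outcome = negotiate(node, alive, restart_on_timeout)
        if outcome is Outcome.OK:
            return node, tried
        if outcome is Outcome.TIMEDOUT:
            # Failover: move on to the next service node.
            idx = (idx + 1) % len(nodes)
        # Outcome.RESTART leaves idx unchanged, so the same node is retried.
    return None, tried


nodes = ["oss1", "oss2"]  # hypothetical failover pair
alive = {"oss2"}          # the target was restarted on the partner node

stuck = connect(nodes, alive, restart_on_timeout=True)
fixed = connect(nodes, alive, restart_on_timeout=False)
```

          In this model, restarting on timeout keeps the client on the dead node for every attempt, while propagating the timeout lets it reach the partner node on the second try.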

          Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38310
          Subject: LU-13474 gss: return gss error when target not available
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ac17f865417d923f5ba5926f720ed7ddc95bb662

          sebastien Sebastien Buisson added a comment -

          The problem is that clients cannot try other service nodes for authentication requests. A server that receives an authentication request for a target that is not available for connection simply drops the request instead of returning an error to the client.
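
          The difference between dropping the request and replying with an error can be illustrated with a small Python model. This is a hedged sketch, not Lustre code: `server_reply`, `client_try` and the node names are hypothetical, and the client is assumed to move to the next node only when it receives an explicit error.

```python
def server_reply(target_available, reply_with_error):
    """Return the server's answer to an authentication request.

    None models a request that is silently dropped.
    """
    if target_available:
        return "ok"
    return "error" if reply_with_error else None


def client_try(nodes, available, reply_with_error):
    """Return (connected_node_or_None, list_of_nodes_tried)."""
    tried = []
    for node in nodes:
        tried.append(node)
        reply = server_reply(node in available, reply_with_error)
        if reply == "ok":
            return node, tried
        if reply is None:
            # Dropped request: the client gets no signal to go elsewhere.
            return None, tried
        # Explicit error: carry on with the next service node.
    return None, tried


nodes = ["mds1", "mds2"]  # hypothetical failover pair
available = {"mds2"}      # the target now runs on the partner node

dropped = client_try(nodes, available, reply_with_error=False)
answered = client_try(nodes, available, reply_with_error=True)
```

          With the request dropped, the modelled client never gets past the first node; with an error reply, it reaches the node where the target is available.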

          People

            Assignee: sebastien Sebastien Buisson
            Reporter: sebastien Sebastien Buisson