[LU-13474] Lustre failover fails when SRPC enabled Created: 22/Apr/20  Updated: 16/Dec/20  Resolved: 03/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.4
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Sebastien Buisson Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: gss, patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When SRPC rules are enabled, either for Kerberos or SSK, Lustre HA failover is broken. That is to say, clients might not be able to reconnect to targets that have been stopped and then restarted, whether on the initial node or on its failover pair node.



 Comments   
Comment by Sebastien Buisson [ 22/Apr/20 ]

The problem stems from the fact that clients are unable to try other service nodes for authentication requests. This is because a server that receives an authentication request for a target that is not available for connection simply drops the request instead of returning an error to the client.
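
To make this failure mode concrete, here is a toy Python model. It is not Lustre code: `Server`, `handle_auth`, `client_view`, and the reply strings are hypothetical names chosen for illustration. It contrasts a server that silently drops an authentication request for an unavailable target with one that answers with an error. In the first case, all the client ever observes is a timeout.

```python
from typing import Optional, Set


class Server:
    """Toy service node; names and behaviour are illustrative assumptions."""

    def __init__(self, name: str, targets: Set[str], drop_unavailable: bool):
        self.name = name
        self.targets = targets                  # targets currently available for connection
        self.drop_unavailable = drop_unavailable

    def handle_auth(self, target: str) -> Optional[str]:
        """Return a reply string, or None to model a dropped request."""
        if target in self.targets:
            return "ok"
        if self.drop_unavailable:
            return None                         # request dropped: the client gets no reply
        return "error: target not available"


def client_view(reply: Optional[str]) -> str:
    """What the client can conclude; a missing reply looks like an RPC timeout."""
    if reply is None:
        return "timeout"
    return reply
```

In this model, a node constructed with `drop_unavailable=True` and no available targets gives the client nothing but `"timeout"`, so the client has no explicit signal that the target should be looked for elsewhere.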

Comment by Gerrit Updater [ 22/Apr/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38310
Subject: LU-13474 gss: return gss error when target not available
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ac17f865417d923f5ba5926f720ed7ddc95bb662

Comment by Sebastien Buisson [ 27/Apr/20 ]

The first version of the patch did not work in all situations, for instance when a server has no running target: in that case, the server is not able to return an error to the client.

The second version of the patch implements a different solution. The client no longer restarts GSS negotiation immediately when the RPC to the server times out. Instead, it lets the HA failover mechanism try other service nodes.
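
The behavioural difference described here can be sketched as a small Python simulation. This is a simplified model under stated assumptions, not the actual Lustre client logic: `connect_with_failover`, the `negotiate` callback, the `restart_on_timeout` flag, and the `"ok"`/`"timeout"` strings are all hypothetical. The flag switches between restarting negotiation on the same node after a timeout and moving on to the next service node.

```python
from typing import Callable, List, Optional


def connect_with_failover(
    nodes: List[str],
    negotiate: Callable[[str], str],
    restart_on_timeout: bool,
    max_attempts: int = 6,
) -> Optional[str]:
    """Return the node that accepted the negotiation, or None if none did.

    negotiate(node) is a hypothetical callback returning "ok" or "timeout".
    restart_on_timeout=True models retrying the same node immediately;
    restart_on_timeout=False models letting failover pick the next node.
    """
    index = 0
    for _ in range(max_attempts):
        node = nodes[index]
        result = negotiate(node)
        if result == "ok":
            return node
        if result == "timeout" and restart_on_timeout:
            continue                            # retry the same node right away
        index = (index + 1) % len(nodes)        # move on to the next service node
    return None


def negotiate(node: str) -> str:
    """Assumed scenario: the target has failed over from node1 to node2."""
    return "ok" if node == "node2" else "timeout"


stuck = connect_with_failover(["node1", "node2"], negotiate, restart_on_timeout=True)
recovered = connect_with_failover(["node1", "node2"], negotiate, restart_on_timeout=False)
```

In this scenario `stuck` is `None`, because every attempt goes to `node1`, while `recovered` is `"node2"`, because the timeout lets the loop advance to the other service node.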

Comment by Gerrit Updater [ 03/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38310/
Subject: LU-13474 gss: do not return -ERESTART when gss rpc times out
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 79c8abecdac052e3e00251547cc500f2cba742ab

Comment by Peter Jones [ 03/Dec/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 16/Dec/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40995
Subject: LU-13474 gss: do not return -ERESTART when gss rpc times out
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a963bddeacb6faec17e0aa4cfae8924a2771f512
