[LU-10531] GSS, Shared Key and Kerberos support broken in master and lustre 2.10 Created: 18/Jan/18  Updated: 09/Feb/18  Resolved: 06/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Sebastien Buisson (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: gss, kerberos, patch

Issue Links:
Related
is related to LU-9795 SSK test failures in many suites when... Reopened
is related to LU-7854 sanity-gss test_1 fails with 'chmod /... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

GSS, Shared Key and Kerberos support is currently broken in the master branch. It is impossible to set any flavor for sptlrpc, whether it is gssnull, ski or krb5{n,a,i,p}.

For instance, when doing 'lctl conf_param lustre.srpc.flavor.default=krb5n' or 'lctl set_param -P lustre.srpc.flavor.default=krb5n', the command returns no error, but the value is never applied.
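
As a rough illustration of the commands above, the following Python sketch enumerates the flavor names implied by the krb5{n,a,i,p} shorthand and builds the corresponding 'lctl conf_param' command strings. The helper names (expand_krb5_flavors, conf_param_command) are hypothetical, and only the command syntax quoted in this ticket is assumed. Nothing is executed against a file system.

```python
# Hypothetical helpers: they only build strings and never invoke lctl.
KRB5_SUFFIXES = ("n", "a", "i", "p")  # taken from the krb5{n,a,i,p} shorthand


def expand_krb5_flavors():
    """Return the krb5 flavor names implied by krb5{n,a,i,p}."""
    return ["krb5" + suffix for suffix in KRB5_SUFFIXES]


def conf_param_command(fsname, flavor):
    """Build the 'lctl conf_param' command line quoted in the description."""
    return f"lctl conf_param {fsname}.srpc.flavor.default={flavor}"


if __name__ == "__main__":
    # Print one command per krb5 flavor for a file system named "lustre".
    for flavor in expand_krb5_flavors():
        print(conf_param_command("lustre", flavor))
```

Printing the commands instead of running them keeps the sketch safe to try on a machine without Lustre installed.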

The commit introducing this regression is the following, and aims at making 'lctl set_param -P' functional:
https://review.whamcloud.com/28590

As mentioned in this patch's comment, "currently virtual attributes failover.nid, sptlrpc, and quota are not fully supported. They will be addressed in later patches".

I understand that 'lctl set_param -P' needs more work before it can handle sptlrpc, but the patch should not break the existing 'lctl conf_param' functionality for sptlrpc.



 Comments   
Comment by Peter Jones [ 18/Jan/18 ]

James

It looks like we should revert this change

Peter

Comment by James A Simmons [ 18/Jan/18 ]

Reverting will not help, since in my own testing sptlrpc was already broken before this patch landed. I just haven't had the cycles to track down why it was broken. Also, in my testing sptlrpc in Lustre 2.10 is broken. Just to let you know.

Comment by Peter Jones [ 18/Jan/18 ]

ok - thanks James. We'll hold off on the revert then

Comment by Sebastien Buisson (Inactive) [ 19/Jan/18 ]

Hi James, thanks for looking into this.

What do you mean by "sptlrpc was broken before patch https://review.whamcloud.com/28590, and in 2.10"? While trying to narrow down the problem hit with 'lctl conf_param lustre.srpc.flavor.default=krb5n', I found it worked for every code version I was able to compile, except when patch https://review.whamcloud.com/28590 was included. By working I mean issuing the command and then seeing that the given value had been taken into account under /proc/fs/lustre///srpc_info on the clients and servers.

Comment by Gerrit Updater [ 19/Jan/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/30937
Subject: LU-10531 obd: handle case tgt equals fsname for obdname2fsname
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 95c5be01ec49795e36acae14223b7bdf24091c7a
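
The ticket does not show the patch itself. Going only by its subject line, the idea can be sketched as follows: when a name carries a target suffix, the file system name is the part before it, and when it carries no suffix, the name is taken to be the file system name itself. This is a hypothetical Python illustration, not the actual C code from the Lustre tree. The function name obdname_to_fsname and the assumed suffix pattern ("-MDT" or "-OST" followed by four hex digits, as in lustre-MDT0000) are assumptions.

```python
import re

# Assumed target-name pattern: "<fsname>-MDTxxxx" or "<fsname>-OSTxxxx".
_TARGET_SUFFIX = re.compile(r"-(MDT|OST)[0-9a-fA-F]{4}$")


def obdname_to_fsname(tgt):
    """Return the file system name part of a target name.

    If tgt has no target suffix, it is treated as the fsname itself,
    which is the "tgt equals fsname" case named in the patch subject.
    """
    match = _TARGET_SUFFIX.search(tgt)
    if match is None:
        return tgt
    return tgt[:match.start()]
```

Under these assumptions, obdname_to_fsname("lustre-MDT0000") and obdname_to_fsname("lustre") both yield "lustre", which is the behaviour a parameter such as lustre.srpc.flavor.default would rely on.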

Comment by James A Simmons [ 19/Jan/18 ]

Back to the normal failures 

[ 445.426749] LustreError: 14736:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err 0, gss err d0000
[ 445.428655] Lustre: 14730:0:(sec_gss.c:315:cli_ctx_expire()) ctx ffff8808453fa840(0->lustre-MDT0000_UUID) get expired: 1516348553(+300s)
[ 445.428657] Lustre: 14730:0:(sec_gss.c:315:cli_ctx_expire()) Skipped 2 previous similar messages
[ 445.428757] Lustre: 4861:0:(sec_gss.c:1225:gss_cli_ctx_fini_common()) gss.keyring@ffff88102ffd3800: destroy ctx ffff8808453fa840(0->lustre-MDT0000_UUID)
[ 445.428759] Lustre: 4861:0:(sec_gss.c:1225:gss_cli

Comment by James A Simmons [ 19/Jan/18 ]

Sebastien, I posted the errors I'm seeing in the previous comment. For some reason the client can mount, but no one can access the file system.

The patch is ready for review

Comment by Andreas Dilger [ 19/Jan/18 ]

It seems that we are still not getting regular enough testing of the Kerberos and SSK functionality to avoid regressions, and playing catch-up with regressions added a long time ago is a lot more work than finding recent regressions or preventing them in the first place.

Nathan, Chris, Sebastien, is it possible for you guys to start running automated regression tests against master with SSK/Kerberos configured on a regular basis (e.g. daily against master, and as often as possible against new patches as time permits)? That would prevent these kinds of problems from being introduced in the first place.

Comment by Gerrit Updater [ 23/Jan/18 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/30984
Subject: LU-10531 gss: fix GSS support for DNE
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0781aab3dabd84f9ad65c3e29613f8c29d4a8aa1

Comment by Sebastien Buisson (Inactive) [ 23/Jan/18 ]

Thanks to the patch https://review.whamcloud.com/30937 from James, I am again able to set sptlrpc flavor with 'lctl conf_param' commands (however it does not work with 'lctl set_param -P').
Then I managed to get a working kerberized Lustre on my test system. I can access the FS from Lustre clients, and do not see the error messages shown by James here.

However, for DNE setups the new patch I just pushed in https://review.whamcloud.com/30984 is mandatory. This is indeed another regression in Kerberos support, inadvertently introduced by patch https://review.whamcloud.com/27823.
This new patch needs to be landed for 2.11.

I agree we need more regular testing of GSS/SSK/Kerberos functionality. Manual testing of single patches cannot cover all cases all the time. At DDN we already have some resources for Lustre non-regression tests. I will see if it is possible to dedicate part of them to continuous Kerberos testing. I should get back to you on this matter in a couple of weeks.

Comment by Gerrit Updater [ 31/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30937/
Subject: LU-10531 obd: handle case tgt equals fsname for obdname2fsname
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ac01abc2db2e82f87061eb0e6b2c03e28dad6a5b

Comment by Gerrit Updater [ 06/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30984/
Subject: LU-10531 gss: fix GSS support for DNE
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7327f66c2ca1d9762f6ea722f1433e4435f0a5b5

Comment by Peter Jones [ 06/Feb/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 07/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31208
Subject: LU-10531 obd: handle case tgt equals fsname for obdname2fsname
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 107076812f7ec31ce8c01fabc582282320965ea8

Comment by Gerrit Updater [ 07/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31209
Subject: LU-10531 gss: fix GSS support for DNE
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: d577c60fdcaf59cd407c7e4e84074f04f1ae1466

Comment by Gerrit Updater [ 09/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31208/
Subject: LU-10531 obd: handle case tgt equals fsname for obdname2fsname
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: e23318986fa839997c504c1d87f73e937f7e9a7b

Comment by Gerrit Updater [ 09/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31209/
Subject: LU-10531 gss: fix GSS support for DNE
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 3d270d3a5a9ffec79aa7d6ab4a7f131afbfb06d2
