[LU-7854] sanity-gss test_1 fails with 'chmod /lustre/scratch failed' Created: 03/Mar/16  Updated: 15/Mar/18  Resolved: 15/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Sebastien Buisson (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Combined MGS/MDS with one MDT, two OSSs with two OSTs, and a single client. Kerberos is set up.


Issue Links:
Blocker
is blocked by LU-9567 sptlrpc rules are not being updated Resolved
Related
is related to LU-10531 GSS, Shared Key and Kerberos support ... Resolved
Severity: 3

Description

As part of the Kerberos test plan, I set up a Kerberos environment and ran the test-group/regression suite of tests, including sanity-gss. sanity-gss tests 1 and 2 fail in the same way, with a connection refused error:

== sanity-gss test 1: create file == 10:46:36 (1457001996)
chmod: cannot access `/lustre/scratch': Connection refused
 sanity-gss test_1: @@@@@@ FAIL: chmod /lustre/scratch failed 

From the MDS dmesg for test 2, we see:

Lustre: DEBUG MARKER: Setting sptlrpc rule: scratch.srpc.flavor.default=gssnull
Lustre: 9164:0:(gss_mech_switch.c:72:lgss_mech_register()) Register gssnull mechanism
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) failed request key: -126
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) req ffff880036c03c80: fail to get context
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 1423 previous similar messages
LustreError: 29799:0:(pinger.c:105:ptlrpc_ping()) OOM trying to ping scratch-MDT0000-lwp-MDT0000_UUID->scratch-MDT0000_UUID
LustreError: 7734:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) failed request key: -126
LustreError: 7734:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 203419 previous similar messages
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) req ffff880036c03c80: fail to get context
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) Skipped 209084 previous similar messages
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) failed request key: -126
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 461885 previous similar messages
LustreError: 7730:0:(sec.c:440:sptlrpc_req_get_ctx()) req ffff880036c03980: fail to get context
LustreError: 7730:0:(sec.c:440:sptlrpc_req_get_ctx()) Skipped 467259 previous similar messages
Lustre: DEBUG MARKER: == sanity-gss test 1: create file == 10:46:36 (1457001996)
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) failed request key: -126
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 950299 previous similar messages
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) req ffff880036c03c80: fail to get context
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) Skipped 958900 previous similar messages
Lustre: DEBUG MARKER: sanity-gss test_1: @@@@@@ FAIL: chmod /lustre/scratch failed
LustreError: 29799:0:(pinger.c:105:ptlrpc_ping()) OOM trying to ping scratch-MDT0000-mdtlov_UUID->scratch-OST0000_UUID
Lustre: DEBUG MARKER: == sanity-gss test 2: lfs flushctx == 10:46:40 (1457002000)
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) failed request key: -126
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 1953286 previous similar messages
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) req ffff880036c03c80: fail to get context
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) Skipped 1963518 previous similar messages
Lustre: DEBUG MARKER: sanity-gss test_2: @@@@@@ FAIL: chmod /lustre/scratch failed
LustreError: 29799:0:(pinger.c:105:ptlrpc_ping()) OOM trying to ping scratch-MDT0000-mdtlov_UUID->scratch-OST0000_UUID
LustreError: 29799:0:(pinger.c:105:ptlrpc_ping()) Skipped 4 previous similar messages
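
To get a rough sense of how large the error flood in an excerpt like the one above is, the rate-limiter lines ("Skipped N previous similar messages") can be totalled per source location. The following is a minimal sketch, not part of the original report. The function name `summarize_skipped` is hypothetical, and summing assumes each "Skipped N" line reports a separate batch of suppressed messages, which has not been verified against the Lustre logging code.

```python
import re
from collections import defaultdict

# Matches rate-limiter lines such as:
# LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 1423 previous similar messages
SKIPPED_RE = re.compile(
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"Skipped (?P<count>\d+) previous similar messages"
)


def summarize_skipped(log_text):
    """Return {(source_file, function): total suppressed count} for a dmesg excerpt."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        match = SKIPPED_RE.search(line)
        if match:
            key = (match.group("file"), match.group("func"))
            totals[key] += int(match.group("count"))
    return dict(totals)


# Three lines copied from the excerpt above.
sample = """\
LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) Skipped 1423 previous similar messages
LustreError: 7736:0:(sec.c:440:sptlrpc_req_get_ctx()) Skipped 209084 previous similar messages
LustreError: 29799:0:(pinger.c:105:ptlrpc_ping()) Skipped 4 previous similar messages
"""

for (src, func), total in sorted(summarize_skipped(sample).items()):
    print(f"{src}:{func}: {total} suppressed messages")
```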

Only three of 13 tests passed on this and previous runs of sanity-gss. The logs for the sanity-gss test suite are at https://testing.hpdd.intel.com/test_sets/09108032-e16b-11e5-8edf-5254006e85c2



Comments
Comment by Andreas Dilger [ 04/Mar/16 ]

Is this a regression, or has this test not been run/passed recently?

Comment by James Nunez (Inactive) [ 04/Mar/16 ]

This is not a regression; the test suite appears to be rarely executed. According to the Maloo results for sanity-gss over the past year and a half, it has not passed testing in that time frame.

Comment by Andreas Dilger [ 07/Mar/16 ]

Jeremy, Sebastien, have you ever run the sanity-gss.sh script successfully? Are there any patches needed to get this working in our test environment?

Sebastien, this is something that should be run as part of your patch http://review.whamcloud.com/18781 "LU-7845 gss: support namespace in lgss_keyring" to ensure there are no regressions.

Comment by Jeremy Filizetti [ 07/Mar/16 ]

I believe I've had it run through once, but so far I haven't added patches for SK testing. sanity-gss uses gssnull, which requires lsvcgssd to be running, but I don't see start_gss_daemons called anywhere in it. The -126 is -ENOKEY, which makes sense given that gssnull requires lsvcgssd to be running with the -z flag (from my last patch set for shared key).
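
For readers decoding similar log lines, here is a minimal sketch of the -126 to -ENOKEY mapping mentioned above. The numeric values are the Linux key-management errno values from the generic errno header. The helper name `decode_request_key_error` is hypothetical, and the table is hardcoded so the example does not depend on the platform it runs on.

```python
import re

# Linux key-management errno values (include/uapi/asm-generic/errno.h).
KEY_ERRNOS = {
    126: ("ENOKEY", "Required key not available"),
    127: ("EKEYEXPIRED", "Key has expired"),
    128: ("EKEYREVOKED", "Key has been revoked"),
    129: ("EKEYREJECTED", "Key was rejected by service"),
}

REQUEST_KEY_RE = re.compile(r"failed request key: (-?\d+)")


def decode_request_key_error(line):
    """Translate a 'failed request key: -N' log line into a symbolic errno.

    Returns None when the line does not contain that message.
    """
    match = REQUEST_KEY_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    name, text = KEY_ERRNOS.get(code, ("UNKNOWN", "not in the table above"))
    return f"-{code} = -{name} ({text})"


line = ("LustreError: 7732:0:(gss_keyring.c:791:gss_sec_lookup_ctx_kr()) "
        "failed request key: -126")
print(decode_request_key_error(line))
```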

Comment by James Nunez (Inactive) [ 07/Mar/16 ]

Jeremy - Thanks for the comment. I have lsvcgssd running on all servers, but I didn't give it any flags.
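
Whether a running lsvcgssd was started with `-z` can be checked from its command line. The sketch below is an illustration only. It assumes a Linux `/proc` filesystem, the helper names are hypothetical, and the flag test is a heuristic that accepts either a standalone `-z` or a clustered short option such as `-vz` (clustering is an assumption, not something confirmed in this ticket).

```python
import os


def has_short_flag(argv, flag):
    """Heuristic: True if argv has -<flag> alone or inside a short-option cluster."""
    for arg in argv[1:]:
        if arg.startswith("-") and not arg.startswith("--") and flag in arg[1:]:
            return True
    return False


def find_daemon_argvs(name, proc_root="/proc"):
    """Yield the argv list of each process whose executable basename equals `name`."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable
        argv = [part.decode(errors="replace") for part in raw.split(b"\0") if part]
        if argv and os.path.basename(argv[0]) == name:
            yield argv


if __name__ == "__main__" and os.path.isdir("/proc"):
    for argv in find_daemon_argvs("lsvcgssd"):
        state = "with" if has_short_flag(argv, "z") else "WITHOUT"
        print(f"lsvcgssd running {state} -z: {' '.join(argv)}")
```

Run on each server, this prints one line per lsvcgssd process, and prints nothing if the daemon is not running.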

Comment by Sebastien Buisson (Inactive) [ 08/Mar/16 ]

Thanks Andreas.
I have never run sanity-gss in my environment, but I think Jeremy's suggestion could do it. When it is fixed, I will add sanity-gss to the Test-Parameters of my patch at http://review.whamcloud.com/18781.

Comment by Peter Jones [ 18/Apr/17 ]

Is this still a live issue or was it fixed under LU-7845?

Comment by Sebastien Buisson (Inactive) [ 26/Apr/17 ]

Hi,

This issue is not directly related to LU-7845, although I once proposed running sanity-gss while the review of http://review.whamcloud.com/18781 was in progress. It is still not fixed now that the patch from LU-7845 has been merged.

I think the best way to tackle this is to push a new patch to modify sanity-gss according to Jeremy’s suggestion (i.e. starting lsvcgssd) and add sanity-gss to the Test-Parameters of the patch.

Sebastien.

Comment by Gerrit Updater [ 01/Jun/17 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/27383
Subject: LU-7854 tests: start gss daemons in sanity-gss
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 81d2e2f4f15b0cb71d44cafe95e7e0c17380ce29

Comment by Sebastien Buisson (Inactive) [ 01/Jun/17 ]

Hi,

Following Jeremy's suggestion, I uploaded patch https://review.whamcloud.com/27383 in order to start the lsvcgssd daemons with the '-z' flag for the gssnull flavor.
This enables test_1 to pass, but I found some instabilities with the gssnull flavor while working on sanity-gss. See LU-9582. However, I think patch https://review.whamcloud.com/27383 should be landed independently of those instabilities, which could be addressed separately.

Sebastien.

Comment by Sebastien Buisson (Inactive) [ 07/Jun/17 ]

The tests cannot pass because of the problem described in LU-9567.

Comment by Peter Jones [ 15/Jan/18 ]

sbuisson, just to confirm: you are now able to move forward with this fix, right?

Comment by Sebastien Buisson (Inactive) [ 16/Jan/18 ]

Sure, I have just refreshed the patch at https://review.whamcloud.com/27383.

Comment by Sebastien Buisson (Inactive) [ 18/Jan/18 ]

FYI, work on GSS (or Shared Key or Kerberos) is currently blocked by the issue described in LU-10531, which is caused by the following patch that recently landed on the master branch:
https://review.whamcloud.com/28590

Comment by Peter Jones [ 06/Feb/18 ]

sbuisson, is work now able to proceed since the patches for LU-10531 have landed?

Comment by Gerrit Updater [ 15/Feb/18 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/31317
Subject: LU-7854 gss: install lgssc.conf under /etc/request-key.d/
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ef1a4773d457ca9be475c77bf1f8632da777e634
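
The patch subject above refers to a configuration file under /etc/request-key.d/. As a rough illustration of how such a file could be inspected, the sketch below parses text in the standard request-key.conf column layout (OP, TYPE, DESCRIPTION, CALLOUT-INFO, PROGRAM, arguments) and lists the handlers for a given key type. The key type name `lgssc` is inferred from the file name, and the sample rule is an assumption for illustration, not content taken from the patch. Only exact type matches and a bare `*` wildcard are handled.

```python
def find_key_handlers(conf_text, key_type):
    """Return (op, program) pairs for rules whose TYPE column matches key_type.

    Assumes the request-key.conf layout:
    OP TYPE DESCRIPTION CALLOUT-INFO PROGRAM [ARG ...]
    """
    handlers = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete rule
        op, ktype, _description, _callout, program = fields[:5]
        if ktype in (key_type, "*"):
            handlers.append((op, program))
    return handlers


# Hypothetical file content, for illustration only.
sample_conf = """\
# OP    TYPE   DESCRIPTION  CALLOUT-INFO  PROGRAM ARG1 ARG2 ...
create  lgssc  *            *             /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S
"""

print(find_key_handlers(sample_conf, "lgssc"))
```

An empty result for the key type would be consistent with the kernel finding no upcall handler, which is one way a request-key failure such as -ENOKEY can arise.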

Comment by Gerrit Updater [ 15/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31317/
Subject: LU-7854 gss: install lgssc.conf under /etc/request-key.d/
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e299df1e9eeae5d20f3b8a544a5f4be4fd30872c

Comment by Gerrit Updater [ 15/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27383/
Subject: LU-7854 tests: start gss daemons in sanity-gss
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 69b2712e2ffe485ee21408b849040d42b5a9aa2a

Comment by Peter Jones [ 15/Mar/18 ]

Landed for 2.11
