[LU-17317] sanity-sec test_16: test all_off:60001:c0:60003:003, wanted 1 1, got 0 0 Created: 28/Nov/23  Updated: 09/Jan/24  Resolved: 09/Jan/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17228 sanity test_36i: expect 200 got 16981... Resolved
is related to LU-17286 recovery-small test_66 timeout Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d7c41dec-3599-4738-b89d-03f240498d8c

test_16 failed with the following error:

test all_off:60001:c0:60003:003, wanted 1 1, got 0 0

Test session details:
clients: https://build.whamcloud.com/job/lustre-b_es-reviews/15743 - 4.18.0-477.27.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-b_es-reviews/15743 - 4.18.0-477.27.1.el8_lustre.ddn17.x86_64


Comments
Comment by Sebastien Buisson [ 04/Dec/23 ]

Take for instance this failure in sanity-sec test_17:
https://testing.whamcloud.com/test_sets/1037004b-dbe8-44ef-a668-4eed6e6114eb

We have the following nodes:

  • MDS 1, MDS 3 (trevis-94vm6)
    10.240.43.205@tcp
  • Client 1 (trevis-94vm1)
    10.240.43.200@tcp

The MDS wants a client to release a lock, so it sends an LDLM_BL_CALLBACK request (opcode 104). But it gets an error back from the client (which acts as a 'reverse server' in the GSS exchanges).

00000100:00100000:1.0:1701460059.716146:0:10439:0:(client.c:1758:ptlrpc_send_new_req()) Sending RPC req@0000000014e4d538 pname:cluuid:pid:xid:nid:opc:job mdt00_003:lustre-MDT0002_UUID:10439:1784106942685952:10.240.43.200@tcp:104:
00000100:00100000:1.0:1701460059.716168:0:10439:0:(client.c:2533:ptlrpc_set_wait()) set 00000000e75f5696 going to sleep for 11 seconds
02000000:00000400:1.0:1701460059.716391:0:10439:0:(sec_gss.c:685:gss_cli_ctx_handle_err_notify()) req x1784106942685952/t0, ctx 0000000082de400c idx 0x544590679e7e5f1c(0->c): reverse server respond (00080000/00000000)
00000100:00020000:1.0:1701460059.716395:0:10439:0:(client.c:1479:after_reply()) @@@ unwrap reply failed: rc = -22  req@0000000014e4d538 x1784106942685952/t0(0) o104->lustre-MDT0002@10.240.43.200@tcp:15/16 lens 328/224 e 0 to 0 dl 1701460070 ref 1 fl Rpc:RQU/0/ffffffff rc 0/-1 job:''
00000100:00100000:1.0:1701460059.716403:0:10439:0:(client.c:2239:ptlrpc_check_set()) Completed RPC req@0000000014e4d538 pname:cluuid:pid:xid:nid:opc:job mdt00_003:lustre-MDT0002_UUID:10439:1784106942685952:10.240.43.200@tcp:104:

And indeed, on the client, the GSS context idx 0x544590679e7e5f1c is considered invalid.

00000100:00100000:0.0:1701460059.716218:0:8154:0:(events.c:373:request_in_callback()) peer: 12345-10.240.43.205@tcp (source: 12345-10.240.43.205@tcp)
02000000:00000400:0.0:1701460059.716236:0:207145:0:(gss_svc_upcall.c:1619:gss_svc_upcall_get_ctx()) Invalid gss ctx idx 0x544590679e7e5f1c from 10.240.43.205@tcp
02000000:08000000:0.0:1701460059.716239:0:207145:0:(sec_gss.c:1969:gss_pack_err_notify()) prepare gss error notify(0x80000/0x0) to 10.240.43.205@tcp

So this client ends up not refreshing its lock. This is a problem because file/dir access rights were changed from a different client.
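
To make the failing sequence easier to follow, below is a minimal, self-contained C model of the exchange. It is a sketch, not the actual ptlrpc/gss code: gss_ctx_cache_lookup, reverse_server_handle_req and ldlm_bl_callback are hypothetical names, while GSS_S_NO_CONTEXT (GSS-API routine error 8 shifted by 16, i.e. 0x80000, matching the 00080000 status in the logs above) and the rc = -22 (-EINVAL) come from the traces.

/* Hypothetical model of the failing LDLM_BL_CALLBACK exchange, not Lustre code. */
#include <stdio.h>
#include <stdbool.h>

#define GSS_S_NO_CONTEXT (8u << 16)      /* routine error 8 at offset 16 -> 0x80000 */

/* Client side: the upcall cache no longer holds the reverse context. */
static bool gss_ctx_cache_lookup(unsigned long long idx)
{
        printf("Invalid gss ctx idx 0x%llx\n", idx);
        return false;                    /* expired entry was already removed */
}

/* Client side ('reverse server'): handle the incoming callback. */
static unsigned int reverse_server_handle_req(unsigned long long ctx_idx)
{
        if (!gss_ctx_cache_lookup(ctx_idx))
                return GSS_S_NO_CONTEXT; /* gss_pack_err_notify(0x80000/0x0) */
        return 0;                        /* would unwrap the request and act on it */
}

/* Server side: send the blocking callback under an outdated context. */
static int ldlm_bl_callback(unsigned long long ctx_idx)
{
        if (reverse_server_handle_req(ctx_idx) == GSS_S_NO_CONTEXT)
                return -22;              /* after_reply(): unwrap reply failed, -EINVAL */
        return 0;
}

int main(void)
{
        int rc = ldlm_bl_callback(0x544590679e7e5f1cULL);
        printf("LDLM_BL_CALLBACK rc = %d -> the client never acts on the callback\n", rc);
        return 0;
}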

Comment by Gerrit Updater [ 08/Dec/23 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53377
Subject: LU-17317 gss: no cache flush for rsi and rsc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a785f7c4c2a71064cb61023e479ded0347d16b72

Comment by Sebastien Buisson [ 08/Dec/23 ]

The problem described here stems from the fact that the server side can use outdated gss contexts in ldlm callbacks. Apparently this was fine with the previous gss code based on the sunrpc cache, because cache entries were removed (very) asynchronously. With the new implementation based on the upcall cache, entries are removed as soon as they are found to be expired. This explains why, with the new code, the server gets GSS_S_NO_CONTEXT from an evicted client when it sends a request under an outdated gss context.
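
To illustrate the difference in expiry semantics, here is a hedged C sketch; lookup_lazy, lookup_eager and the ctx_entry layout are simplified stand-ins, not the real sunrpc cache or Lustre upcall cache structures.

#include <stddef.h>
#include <time.h>

struct ctx_entry {
        unsigned long long idx;
        time_t expiry;
        struct ctx_entry *next;
};

/* Old sunrpc-cache-like behavior: an expired entry is still returned to
 * the caller; a separate, (very) asynchronous flush reclaims it later. */
static struct ctx_entry *lookup_lazy(struct ctx_entry *head, unsigned long long idx)
{
        for (struct ctx_entry *e = head; e; e = e->next)
                if (e->idx == idx)
                        return e;       /* possibly past e->expiry */
        return NULL;
}

/* New upcall-cache-like behavior: an entry found expired is unlinked on
 * the spot (freeing omitted for brevity), so the caller immediately sees
 * no context and answers GSS_S_NO_CONTEXT. */
static struct ctx_entry *lookup_eager(struct ctx_entry **phead, unsigned long long idx)
{
        for (struct ctx_entry **pe = phead; *pe; pe = &(*pe)->next) {
                if ((*pe)->idx != idx)
                        continue;
                if (time(NULL) >= (*pe)->expiry) {
                        *pe = (*pe)->next;      /* remove expired entry now */
                        return NULL;
                }
                return *pe;
        }
        return NULL;
}

With lookup_lazy, a reverse context keeps working for a while after its nominal expiry; with lookup_eager, the very first use after expiry fails.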

So patch "LU-17317 gss: do not continue using expired reverse context" https://review.whamcloud.com/53375 aims at surviving this situation: the server is still allowed to try an outdated gss context (this is important for inflight communications), but if it gets GSS_S_NO_CONTEXT back from the client, it marks this gss context as dead and replaces it with a new one.
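
A rough sketch of that recovery logic follows; struct cli_ctx, the dead flag and handle_err_notify are illustrative names only, the actual change being in patch 53375.

#include <stdbool.h>

#define GSS_S_NO_CONTEXT (8u << 16)     /* 0x80000, as carried by the error notify */

struct cli_ctx {
        bool dead;                      /* must not be used for new requests */
};

/* Hypothetical handling of a GSS error notify from the reverse server:
 * inflight requests may still have tried the old context, but once the
 * peer answers GSS_S_NO_CONTEXT the context is retired and replaced. */
static void handle_err_notify(struct cli_ctx *ctx, unsigned int major)
{
        if (major == GSS_S_NO_CONTEXT) {
                ctx->dead = true;       /* mark the outdated context dead */
                /* ...then trigger a context refresh and resend the failed
                 * request under the newly negotiated reverse context. */
        }
}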

Comment by Gerrit Updater [ 11/Dec/23 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53405
Subject: LU-17317 dbg: investigate test failures - 1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4b9b24ab89f4992ba40272cb0fddda0636a0152e

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53377/
Subject: LU-17317 gss: no cache flush for rsi and rsc
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3615fa4a86be793652d53c94818c5aeb81e2257e

Comment by Gerrit Updater [ 03/Jan/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53375/
Subject: LU-17317 gss: do not continue using expired reverse context
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 67acf6047e343a0e35f077c6aed4483a14d2015c
