[LU-17317] sanity-sec test_16: test all_off:60001:c0:60003:003, wanted 1 1, got 0 0 Created: 28/Nov/23 Updated: 09/Jan/24 Resolved: 09/Jan/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d7c41dec-3599-4738-b89d-03f240498d8c
test_16 failed with the following error:
test all_off:60001:c0:60003:003, wanted 1 1, got 0 0
Test session details:
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Sebastien Buisson [ 04/Dec/23 ] |
|
For instance, for this failure in sanity-sec test_17:
We have the following nodes:
The MDS wants a client to release a lock, by sending an LDLM_BL_CALLBACK request (104). But it gets an error from the client (called a 'reverse server' in the GSS exchanges).
00000100:00100000:1.0:1701460059.716146:0:10439:0:(client.c:1758:ptlrpc_send_new_req()) Sending RPC req@0000000014e4d538 pname:cluuid:pid:xid:nid:opc:job mdt00_003:lustre-MDT0002_UUID:10439:1784106942685952:10.240.43.200@tcp:104:
00000100:00100000:1.0:1701460059.716168:0:10439:0:(client.c:2533:ptlrpc_set_wait()) set 00000000e75f5696 going to sleep for 11 seconds
02000000:00000400:1.0:1701460059.716391:0:10439:0:(sec_gss.c:685:gss_cli_ctx_handle_err_notify()) req x1784106942685952/t0, ctx 0000000082de400c idx 0x544590679e7e5f1c(0->c): reverse server respond (00080000/00000000)
00000100:00020000:1.0:1701460059.716395:0:10439:0:(client.c:1479:after_reply()) @@@ unwrap reply failed: rc = -22 req@0000000014e4d538 x1784106942685952/t0(0) o104->lustre-MDT0002@10.240.43.200@tcp:15/16 lens 328/224 e 0 to 0 dl 1701460070 ref 1 fl Rpc:RQU/0/ffffffff rc 0/-1 job:''
00000100:00100000:1.0:1701460059.716403:0:10439:0:(client.c:2239:ptlrpc_check_set()) Completed RPC req@0000000014e4d538 pname:cluuid:pid:xid:nid:opc:job mdt00_003:lustre-MDT0002_UUID:10439:1784106942685952:10.240.43.200@tcp:104:
And indeed on the client, the GSS context id (544590679e7e5f1c) is considered invalid.
00000100:00100000:0.0:1701460059.716218:0:8154:0:(events.c:373:request_in_callback()) peer: 12345-10.240.43.205@tcp (source: 12345-10.240.43.205@tcp)
02000000:00000400:0.0:1701460059.716236:0:207145:0:(gss_svc_upcall.c:1619:gss_svc_upcall_get_ctx()) Invalid gss ctx idx 0x544590679e7e5f1c from 10.240.43.205@tcp
02000000:08000000:0.0:1701460059.716239:0:207145:0:(sec_gss.c:1969:gss_pack_err_notify()) prepare gss error notify(0x80000/0x0) to 10.240.43.205@tcp
So this client ends up not refreshing its lock. This is a problem as file/dir access rights were changed from a different client. |
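For readers who want to interpret the `(00080000/00000000)` pair in the log above, here is a minimal sketch that decodes a GSS-API major status word, assuming the first value is the major status and the second the minor status. The field offsets and the routine error numbers follow the GSS-API C bindings (RFC 2744); the helper names `decode_major_status` and `routine_error_name` are hypothetical and are not part of Lustre.

```python
# Field layout of a GSS-API major status word (RFC 2744 C bindings).
GSS_C_CALLING_ERROR_OFFSET = 24
GSS_C_ROUTINE_ERROR_OFFSET = 16
GSS_C_CALLING_ERROR_MASK = 0xFF
GSS_C_ROUTINE_ERROR_MASK = 0xFF
GSS_C_SUPPLEMENTARY_MASK = 0xFFFF

# Subset of routine error numbers, enough for this ticket.
ROUTINE_ERRORS = {
    7: "GSS_S_NO_CRED",
    8: "GSS_S_NO_CONTEXT",
    9: "GSS_S_DEFECTIVE_TOKEN",
    11: "GSS_S_CREDENTIALS_EXPIRED",
    12: "GSS_S_CONTEXT_EXPIRED",
    13: "GSS_S_FAILURE",
}


def decode_major_status(major):
    """Split a major status word into its three bit fields."""
    return {
        "calling": (major >> GSS_C_CALLING_ERROR_OFFSET) & GSS_C_CALLING_ERROR_MASK,
        "routine": (major >> GSS_C_ROUTINE_ERROR_OFFSET) & GSS_C_ROUTINE_ERROR_MASK,
        "supplementary": major & GSS_C_SUPPLEMENTARY_MASK,
    }


def routine_error_name(major):
    """Return a symbolic name for the routine error field (hypothetical helper)."""
    routine = decode_major_status(major)["routine"]
    if routine == 0:
        return "GSS_S_COMPLETE"
    return ROUTINE_ERRORS.get(routine, "routine error %d" % routine)


if __name__ == "__main__":
    # Major status as printed by gss_cli_ctx_handle_err_notify() in the log.
    print(routine_error_name(int("00080000", 16)))
```

Under that assumption, the value reported by the reverse server has routine error 8, which is the "no context" error: the client no longer knows the context index the MDS used.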
| Comment by Gerrit Updater [ 08/Dec/23 ] |
|
"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53377 |
| Comment by Sebastien Buisson [ 08/Dec/23 ] |
|
The problem described here stems from the fact that the server side can use outdated GSS contexts in LDLM callbacks. Apparently this was fine with the previous GSS code based on the sunrpc cache, because cache entries were removed (very) asynchronously. With the new implementation based on the upcall cache, cache entries are removed as soon as they are found to be expired. This explains why, with the new code, the server gets GSS_S_NO_CONTEXT from an evicted client if the server has sent an outdated GSS context. So patch " |
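The difference between the two cache behaviours described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ContextCache`, `purge_on_lookup` and `background_purge` are hypothetical names, and the code does not reproduce the sunrpc cache or the Lustre upcall cache. It only shows why an expired context index can still be accepted when removal is asynchronous, and is rejected when removal happens at lookup time.

```python
class ContextCache:
    """Toy model of a GSS context cache keyed by context index.

    purge_on_lookup=False mimics asynchronous removal: an expired entry
    stays visible until background_purge() runs.
    purge_on_lookup=True mimics removal as soon as an entry is found expired.
    """

    def __init__(self, purge_on_lookup):
        self.purge_on_lookup = purge_on_lookup
        self._entries = {}  # context index -> expiry time (arbitrary units)

    def add(self, idx, expiry):
        self._entries[idx] = expiry

    def lookup(self, idx, now):
        """Return True if the context index is accepted at time `now`."""
        expiry = self._entries.get(idx)
        if expiry is None:
            return False
        if self.purge_on_lookup and now >= expiry:
            # Entry found expired: drop it immediately and reject.
            del self._entries[idx]
            return False
        return True

    def background_purge(self, now):
        """Asynchronous cleanup pass that removes all expired entries."""
        expired = [i for i, exp in self._entries.items() if now >= exp]
        for idx in expired:
            del self._entries[idx]


if __name__ == "__main__":
    idx = 0x544590679E7E5F1C  # context index taken from the logs above

    lazy = ContextCache(purge_on_lookup=False)
    lazy.add(idx, expiry=100)
    print("async removal, after expiry:", lazy.lookup(idx, now=150))

    eager = ContextCache(purge_on_lookup=True)
    eager.add(idx, expiry=100)
    print("removal at lookup, after expiry:", eager.lookup(idx, now=150))
```

In this model the peer holding an outdated index only sees a failure once the entry is actually gone, which with removal at lookup happens on the first use after expiry.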
| Comment by Gerrit Updater [ 11/Dec/23 ] |
|
"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53405 |
| Comment by Gerrit Updater [ 20/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53377/ |
| Comment by Gerrit Updater [ 03/Jan/24 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53375/ |