Details
Type: Bug
Priority: Minor
Resolution: Fixed
Fix Version/s: Lustre 2.16.0
Description
In case a GSS context for Lustre services is marked as error, try to replace it with an already existing context. Such a context can already be up-to-date thanks to authentication requests sent in parallel to failover NIDs.
Attachments
Activity
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53859/
Subject: LU-17483 gss: refresh req context with already existing one
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c76f7288fa772b48cf81050663e2124b25ab3994
I could imagine sending several SEC_CTX_INIT, but delaying MDS_CONNECT until a context is successfully created, no?
If I understand correctly what Mike said, the problem stems from the fact that requests are entirely prepared before being put in the delay list. So even if we delay MDS_CONNECT requests, they would point to the initial sec context, the one that is marked as ERROR/DEAD when the SEC_CTX_INIT request sent to the wrong NID fails.
Also, could we ignore the context from the delayed RPC or pick whatever context is valid when the RPC is popped from the delayed queue?
This is precisely what the patch "LU-17483 gss: refresh req context with already existing one" https://review.whamcloud.com/53859 does. Thanks to the patch, we are able to find another valid sec context via the struct ptlrpc_sec: the one used by the successful MDS_CONNECT request to the correct failover NID.
Why aren't we waiting for SEC_CTX to be established in the first place? This async handling of context init and MDS_CONNECT, plus the delayed requests, creates a complex situation. I could imagine sending several SEC_CTX_INIT, but delaying MDS_CONNECT until a context is successfully created, no?
Also, could we ignore the context from the delayed RPC or pick whatever context is valid when the RPC is popped from the delayed queue? Basically, ignore SEC_CTX while an RPC is in the delayed queue?
Sorry, getting back to this with some delay.
Yes Mike exactly, the problem is that different requests may point to different sec contexts. And indeed, thanks to your explanations about delayed requests, I understand now why the MDS_STATFS points to the wrong context: when the request is prepared, no connection is established yet, so it is put in the delay list. At this point, it references what we can call 'the initial' sec context. This request remains in the delay list as long as no connection is established.
In the meantime, because the requests to the primary NID failed/timed out, another MDS_CONNECT request is sent to the failover NID. For this request a new sec context is created, as the initial one is marked ERROR/DEAD. This MDS_CONNECT request succeeds, and the associated sec context is set as valid.
Now that the connection to the target is established, it is time for the MDS_STATFS request to be pulled out of the delay list. Unfortunately its associated sec context is the initial one, which is in error. Without patch "LU-17483 gss: refresh req context with already existing one", we have no choice but to return an error for this request. And this makes the client mount fail. With the patch, we are able to find another valid sec context via the struct ptlrpc_sec: the one used by the successful MDS_CONNECT request to the failover NID.
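To illustrate the mechanism, here is a standalone model only: the structs and the helper refresh_req_ctx() are simplified stand-ins for ptlrpc_cli_ctx/ptlrpc_sec/ptlrpc_request invented for this sketch, and the flag names are made up; this is not the code of https://review.whamcloud.com/53859.

/* Standalone model of the idea in "LU-17483 gss: refresh req context with
 * already existing one": if a request's context is in error, reuse another
 * context of the same sec that is already valid. Not the actual patch. */
#include <stdbool.h>
#include <stdio.h>

#define CTX_UPTODATE  0x1   /* context successfully negotiated */
#define CTX_ERROR     0x2   /* SEC_CTX_INIT failed for this context */

struct cli_ctx {
	unsigned int flags;
	int id;
};

struct sec {			/* shared per-import security state */
	struct cli_ctx *ctxs;	/* all contexts created for this import */
	int nr_ctxs;
};

struct request {
	struct sec *sec;	/* shared sec, like req->rq_cli_ctx->cc_sec */
	struct cli_ctx *ctx;	/* context chosen at prepare time */
};

/* If the request's context is in error, look for another context of the
 * same sec that is already up to date (e.g. created by a parallel
 * MDS_CONNECT to a failover NID) and switch the request to it. */
static bool refresh_req_ctx(struct request *req)
{
	int i;

	if (!(req->ctx->flags & CTX_ERROR))
		return true;		/* nothing to do */

	for (i = 0; i < req->sec->nr_ctxs; i++) {
		struct cli_ctx *ctx = &req->sec->ctxs[i];

		if ((ctx->flags & CTX_UPTODATE) && !(ctx->flags & CTX_ERROR)) {
			req->ctx = ctx;	/* reuse the valid context */
			return true;
		}
	}
	return false;			/* no valid context: fail as before */
}

int main(void)
{
	struct cli_ctx ctxs[2] = {
		{ .flags = CTX_ERROR,    .id = 0 },  /* initial ctx, failed on primary NID */
		{ .flags = CTX_UPTODATE, .id = 1 },  /* ctx refreshed via failover NID */
	};
	struct sec sec = { .ctxs = ctxs, .nr_ctxs = 2 };
	struct request statfs_req = { .sec = &sec, .ctx = &ctxs[0] };

	printf("refresh %s, now using ctx %d\n",
	       refresh_req_ctx(&statfs_req) ? "ok" : "failed",
	       statfs_req.ctx->id);
	return 0;
}

The point is only that the delayed request does not have to fail just because the context it was prepared with is in error, as long as the shared sec already holds a context that became valid thanks to the failover NID.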
Let me put comments/questions in https://review.whamcloud.com/53859, I would like to understand your point about import generation.
I may misunderstand how security contexts are organized; it seems there is not a single context per import as I thought. If that is true and each connection attempt sets its own sec ctx, then it can be that MDS_STATFS is tied to an older context, as it was sent/delayed while the import was trying to establish a connection to the invalid node, so when MDS_STATFS is taken from the delayed list to proceed, it finds its sec context is bad and fails. At the same time there is already a new context related to the just-established connection. Again, I am not familiar with how sec contexts are organized across all import/connection attempts, but if that is somehow true, then the patch definitely makes sense. It could probably be improved by using the import generation, e.g. not replacing a sec context with an older one.
Do you know why, after the successful SEC_CTX_INIT request ("client refreshed ctx"), these MDS_CONNECT and MDS_STATFS requests fail with -1 (EPERM)? This looks related to the security context, but what leads to errors if the context was refreshed successfully? So far I see that the two failed CTX_INITs completed before the successful one, but that contradicts the patch statement that an erroneous context might overwrite the valid one.
The -EPERM is likely from sptlrpc_req_refresh_ctx():
if (unlikely(test_bit(PTLRPC_CTX_ERROR_BIT, &ctx->cc_flags))) {
	spin_lock(&req->rq_lock);
	req->rq_err = 1;
	spin_unlock(&req->rq_lock);
	req_off_ctx_list(req, ctx);
	RETURN(-EPERM);
}
I suppose the bit is set due to previous failed SEC_CTX_INIT attempts, but I don't know exactly how and at what moment the bit is set; it looks like lgss_keyring does that on its own. On the other hand we have gss_cli_ctx_uptodate(), and it doesn't clear the ERROR bit if one is set. Can that be the reason? It doesn't make sense to set the UPTODATE bit while the ERROR bit is also set; we could just check it there and clear it. Maybe that is not enough if lgss_keyring sets the ERROR bit in a racy manner, e.g.:
- lgss: starts SEC_CTX_INIT, gets a timeout and an error back to lgss
- ptlrpc: the import is reconnected and gss_cli_ctx_uptodate() sets the UPTODATE flag (no ERROR bit is set yet)
- lgss: calls gss_kt_update() and sets the ERROR bit over the just-initialized good context
That looks not quite possible, as gss_cli_ctx_uptodate() is managed from lgss itself, but I am not familiar with that. Anyway, if that is not the case, then I think we can just check the ERROR bit and drop it when the UPTODATE bit is set.
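As a rough sketch of that idea (illustration only, not code from the tree; it only assumes the existing cc_flags bit field, the PTLRPC_CTX_*_BIT flags and the standard kernel bit helpers, while the placement is hypothetical):

/* Sketch: at the point where the context is declared up to date
 * (e.g. in gss_cli_ctx_uptodate()), drop a stale ERROR bit left over
 * from earlier failed SEC_CTX_INIT attempts before setting UPTODATE. */
if (test_and_clear_bit(PTLRPC_CTX_ERROR_BIT, &ctx->cc_flags))
	CDEBUG(D_SEC, "ctx %p: cleared stale ERROR bit\n", ctx);
set_bit(PTLRPC_CTX_UPTODATE_BIT, &ctx->cc_flags);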
You can add a couple of debug messages where the ERROR bit is set and checked, and enable at least the D_RPCTRACE+D_HA+D_SEC debug levels, so we will see in the debug log how it all goes along with all other events.
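For illustration, the instrumentation could look something like the lines below; the location and message text are made up, only CDEBUG/D_SEC and the cc_flags bits are existing Lustre identifiers.

/* Illustrative only: log the moment the ERROR bit is set on a context,
 * so it can be correlated with connect/refresh events in the debug log. */
CDEBUG(D_SEC, "ctx %p (flags %#lx): setting PTLRPC_CTX_ERROR_BIT\n",
       ctx, ctx->cc_flags);
set_bit(PTLRPC_CTX_ERROR_BIT, &ctx->cc_flags);

The matching debug levels can then be enabled on the client before reproducing, e.g. with something like lctl set_param debug="+rpctrace +ha +sec".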
As for ptlrpc parts:
- note that the import reconnects to the new NID not because SEC_CTX_INIT failed twice, but because the MDS_CONNECT request expired. That triggers a new reconnect round
- MDS_STATFS was never sent, and you don't see that in the log, because it is put in the delay list until the import is ready (connection established). The code path is ptlrpc_send_new_req() -> ptlrpc_import_delay_req(). It stays in the delayed list and is processed when the connection is set up or the import fails. That is usually true for all requests sent over a not-yet-ready import; only CONNECT and SEC_CTX_INIT/FINI are not added, as they are needed to establish the connection (see the sketch after this list)
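A standalone model of that delay decision follows; the names (import_delay_req, OPC_STATFS, etc.) are simplified stand-ins invented for this sketch, not the real ptlrpc_import_delay_req() logic.

/* Model of the behaviour described above: requests prepared while the
 * import is not FULL are parked on a delayed list (keeping whatever sec
 * context they were prepared with); only CONNECT and SEC_CTX_INIT/FINI
 * bypass the delay, as they are what brings the import up. */
#include <stdbool.h>
#include <stdio.h>

enum imp_state { IMP_CONNECTING, IMP_FULL };
enum opcode { OPC_CONNECT, OPC_SEC_CTX_INIT, OPC_SEC_CTX_FINI, OPC_STATFS };

struct import { enum imp_state state; };
struct req { enum opcode opc; };

/* Rough equivalent of "should this request be put in the delay list?" */
static bool import_delay_req(const struct import *imp, const struct req *req)
{
	if (req->opc == OPC_CONNECT || req->opc == OPC_SEC_CTX_INIT ||
	    req->opc == OPC_SEC_CTX_FINI)
		return false;		/* needed to establish the connection */
	return imp->state != IMP_FULL;	/* everything else waits */
}

int main(void)
{
	struct import imp = { .state = IMP_CONNECTING };
	struct req statfs = { .opc = OPC_STATFS };
	struct req connect = { .opc = OPC_CONNECT };

	printf("STATFS delayed: %d, CONNECT delayed: %d\n",
	       import_delay_req(&imp, &statfs),
	       import_delay_req(&imp, &connect));
	return 0;
}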
Here are some details on an issue I am hitting, which is fixed by patch "LU-17483 gss: refresh req context with already existing one". For my test file system, a Kerberos flavor is enforced for cli2mdt connections. I have one MDS with just one MDT. I unmount this MDT from this node, and mount it on its secondary node. Then I try to mount a client, and get the following request sequence:
00000100:00100000:1.0:1708114041.787229:0:146185:0:(import.c:790:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5) req@00000000489dfa89 x1791086957695936/t0(0) o38->testfs-MDT0000-mdc-ffff9e6953c2b000@192.168.56.203@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:NQU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
00000100:00100000:0.0:1708114041.791224:0:146212:0:(client.c:1773:ptlrpc_send_new_req()) Sending RPC req@000000001f5e9c6e pname:cluuid:pid:xid:nid:opc:job lgss_keyring:c56fc177-1e3f-443b-884e-5affc12b99a5:146212:1791086957696128:192.168.56.203@tcp:801:
00000100:00000400:0.0:1708114057.019458:0:146212:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708114041/real 1708114041] req@000000001f5e9c6e x1791086957696128/t0(0) o801->testfs-MDT0000-mdc-ffff9e6953c2b000@192.168.56.203@tcp:12/10 lens 224/224 e 0 to 1 dl 1708114057 ref 2 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
00000100:00100000:0.0:1708114057.019554:0:146212:0:(client.c:2368:ptlrpc_expire_one_request()) @@@ err -110, sent_state=FULL (now=CONNECTING) req@000000001f5e9c6e x1791086957696128/t0(0) o801->testfs-MDT0000-mdc-ffff9e6953c2b000@192.168.56.203@tcp:12/10 lens 224/224 e 0 to 1 dl 1708114057 ref 2 fl Rpc:XNQU/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
00000100:00100000:0.0:1708114057.019575:0:146212:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@000000001f5e9c6e pname:cluuid:pid:xid:nid:opc:job lgss_keyring:c56fc177-1e3f-443b-884e-5affc12b99a5:146212:1791086957696128:192.168.56.203@tcp:801:
00000100:00100000:0.0:1708114057.023445:0:146212:0:(client.c:1773:ptlrpc_send_new_req()) Sending RPC req@000000001f5e9c6e pname:cluuid:pid:xid:nid:opc:job lgss_keyring:c56fc177-1e3f-443b-884e-5affc12b99a5:146212:1791086957696192:192.168.56.203@tcp:801:
00000100:00100000:0.0:1708114067.258781:0:145808:0:(client.c:1773:ptlrpc_send_new_req()) Sending RPC req@00000000aaeb6526 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_00:1f42f805-acf6-42f9-9d5e-db08b810d17b:145808:1791086957696256:192.168.56.202@tcp:400:
00000100:00100000:0.0:1708114067.261183:0:145808:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@00000000aaeb6526 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_00:1f42f805-acf6-42f9-9d5e-db08b810d17b:145808:1791086957696256:192.168.56.202@tcp:400:
00000100:00000400:1.0:1708114073.403360:0:146212:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1708114057/real 1708114057] req@000000001f5e9c6e x1791086957696192/t0(0) o801->testfs-MDT0000-mdc-ffff9e6953c2b000@192.168.56.203@tcp:12/10 lens 224/224 e 0 to 1 dl 1708114073 ref 2 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
00000100:00100000:1.0:1708114073.403388:0:146212:0:(client.c:2368:ptlrpc_expire_one_request()) @@@ err -110, sent_state=FULL (now=CONNECTING) req@000000001f5e9c6e x1791086957696192/t0(0) o801->testfs-MDT0000-mdc-ffff9e6953c2b000@192.168.56.203@tcp:12/10 lens 224/224 e 0 to 1 dl 1708114073 ref 2 fl Rpc:XNQU/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
00000100:00100000:1.0:1708114073.403401:0:146212:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@000000001f5e9c6e pname:cluuid:pid:xid:nid:opc:job lgss_keyring:c56fc177-1e3f-443b-884e-5affc12b99a5:146212:1791086957696192:192.168.56.203@tcp:801:
00000100:00100000:0.0:1708114073.416432:0:145807:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@00000000489dfa89 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:c56fc177-1e3f-443b-884e-5affc12b99a5:145807:1791086957695936:192.168.56.203@tcp:38:
00000100:00100000:0.0:1708114073.431404:0:145222:0:(import.c:790:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5) req@00000000aaeb6526 x1791086957696320/t0(0) o38->testfs-MDT0000-mdc-ffff9e6953c2b000@192.168.56.202@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:NQU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
00000100:00100000:0.0:1708114073.436118:0:146218:0:(client.c:1773:ptlrpc_send_new_req()) Sending RPC req@00000000d17a0c88 pname:cluuid:pid:xid:nid:opc:job lgss_keyring:c56fc177-1e3f-443b-884e-5affc12b99a5:146218:1791086957696384:192.168.56.202@tcp:801:
00000100:00100000:0.0:1708114073.448383:0:146218:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@00000000d17a0c88 pname:cluuid:pid:xid:nid:opc:job lgss_keyring:c56fc177-1e3f-443b-884e-5affc12b99a5:146218:1791086957696384:192.168.56.202@tcp:801:
02000000:08000000:0.0:1708114073.449114:0:146218:0:(sec_gss.c:375:gss_cli_ctx_uptodate()) client refreshed ctx 00000000920a6618 idx 0xe15a0943f309dd1b (0->testfs-MDT0000_UUID), expiry 1708117538(+3465s)
00000100:00100000:0.0:1708114073.458212:0:145807:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@00000000aaeb6526 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_rcv:c56fc177-1e3f-443b-884e-5affc12b99a5:145807:1791086957696320:192.168.56.202@tcp:38:
00000100:00100000:1.0:1708114073.458213:0:145809:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@00000000d609bc2a pname:cluuid:pid:xid:nid:opc:job ptlrpcd_00_01:c56fc177-1e3f-443b-884e-5affc12b99a5:0:1791086957696000:192.168.56.202@tcp:41:
00000100:00100000:1.0:1708114073.458226:0:146185:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@000000009d08d677 pname:cluuid:pid:xid:nid:opc:job mount.lustre:c56fc177-1e3f-443b-884e-5affc12b99a5:146185:1791086957696064:192.168.56.202@tcp:41:
00800000:00020000:1.0:1708114073.458275:0:146185:0:(lmv_obd.c:1337:lmv_statfs()) testfs-MDT0000-mdc-ffff9e6953c2b000: can't stat MDS #0: rc = -1
00000100:00100000:1.0:1708114073.465787:0:146185:0:(client.c:1773:ptlrpc_send_new_req()) Sending RPC req@000000009d08d677 pname:cluuid:pid:xid:nid:opc:job mount.lustre:c56fc177-1e3f-443b-884e-5affc12b99a5:146185:1791086957696448:192.168.56.202@tcp:39:
00000100:00100000:1.0:1708114073.467680:0:146185:0:(client.c:2255:ptlrpc_check_set()) Completed RPC req@000000009d08d677 pname:cluuid:pid:xid:nid:opc:job mount.lustre:c56fc177-1e3f-443b-884e-5affc12b99a5:146185:1791086957696448:192.168.56.202@tcp:39:
The client tries the primary NID first (192.168.56.203@tcp): we have an MDS_CONNECT RPC (opc 38) that triggers two SEC_CTX_INIT requests (opc 801), two because sptlrpc retries once in case of failure. These requests fail, so a new MDS_CONNECT RPC is sent to the secondary NID (192.168.56.202@tcp). It triggers a SEC_CTX_INIT request that succeeds ("client refreshed ctx"), but MDS_CONNECT and MDS_STATFS are considered failed ("can't stat MDS #0: rc = -1").
And the client mount fails:
mount.lustre: mount 192.168.56.202@tcp:/testfs at /mnt/testfs failed: Operation not permitted
In this case, it does not look like a request/reply ordering issue, right? Unless there is a problem with the MDS_STATFS requests, because in my debug logs I can only see when they complete (xid 1791086957696000 and 1791086957696064), but not when they are sent (also note they refer to the same NID 192.168.56.202@tcp, but the first one was most likely sent to 192.168.56.203@tcp).
"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53859
Subject: LU-17483 gss: refresh req context with already existing one
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 774cbe839eb7afba4427fc913348d300ff891847
Merged for 2.16