[LU-17446] BL AST should stop resending if lock is cancelled - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
None

Severity:
3

Description

It was recently observed on at least two sites that the AST resend logic keeps resending ASTs until the timeout is hit, even when the cancel for that lock was received long ago, so there is no point in insisting on an AST reply.

ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895372/real 1704895372]
    req@000000004fc3550d x1785656878161152/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
    10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895483 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895399/real 1704895399]
    req@000000001ee9fdbf x1785656878165248/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
    10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895510 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
ptlrpc_expire_one_request()) Skipped 1 previous similar message
lnet_handle_recovery_reply()) peer NI (10.0.1.101@tcp) recovery failed with -110
lnet_handle_recovery_reply()) Skipped 1837 previous similar messages
ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895426/real 1704895426]
    req@00000000674a17cf x1785656878167232/t0(0) o103->fs00-MDT0005-mdc-ff4c19380fe88000
    10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895537 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1704895426/real 1704895426]
    req@00000000eef404a4 x1785656878167168/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000
    10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895537 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
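
To put a number on how long each request waits before it expires, the `sent` and `dl` (deadline) timestamps in the messages above can be subtracted. The following is a minimal sketch, not part of any Lustre tooling: the helper name `expiry_windows` and the regular expression are assumptions made for illustration, and the two sample entries are the first two log entries above joined onto single lines.

```python
import re

# First two expiry messages from the log above, each joined onto one line.
LOG = (
    "ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: "
    "[sent 1704895372/real 1704895372] req@000000004fc3550d "
    "x1785656878161152/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000 "
    "10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895483 ref 1 "
    "fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''\n"
    "ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: "
    "[sent 1704895399/real 1704895399] req@000000001ee9fdbf "
    "x1785656878165248/t0(0) o103->fs00-MDT0004-mdc-ff4c19380fe88000 "
    "10.0.1.102@tcp:12/10 lens 224/224 e 0 to 1 dl 1704895510 ref 1 "
    "fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''\n"
)

# Hypothetical pattern: capture the send time, the request xid and the deadline.
PATTERN = re.compile(r"\[sent (\d+)/real \d+\].*?x(\d+)/t\d+.*? dl (\d+) ")


def expiry_windows(text):
    """Return (xid, deadline - sent) pairs in seconds for each expired request."""
    return [
        (int(m.group(2)), int(m.group(3)) - int(m.group(1)))
        for m in PATTERN.finditer(text)
    ]


for xid, window in expiry_windows(LOG):
    print(f"x{xid}: expired {window}s after send")
```

For these entries the window is 111 seconds per request; the same subtraction gives 111 seconds for the remaining two entries in the log.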

Eventually, once the timeout is hit, the AST error message itself reports that the lock was already cancelled, so the client is not even evicted as a result:

(client.c:1273:ptlrpc_import_delay_req()) @@@ send limit expired  req@00000000063d38a1
   x1788029434094272/t0(0) lustre-MDT0001@1.1.1.1@tcp:15/16 lens 328/224 e 0 to 1
   dl 1705530737 ref 1 fl Rpc:XQU/2/ffffffff rc 0/-1 job:''


(ldlm_lockd.c:739:ldlm_handle_ast_error()) ### blocking AST (req@00000000063d38a1 x1788029434094272)
   timeout from nid 1.1.1.1@tcp, but cancel was received (AST reply lost?)
   ns: lustre-MDT0001_UUID lock: 000000008783e209/0x4937295654f3a72a lrc: 1/0,0 mode: --/PR
   res: [0x24001144f:0x8c35:0x0].0x0 bits 0x13/0x0 rrc: 6 type: IBT gid 0 flags: 0x44a01400000020
   nid: 1.1.1.2@tcp remote: 0x6196afd4e206854a expref: 7 pid: 376559 timeout: 337171 lvb_type: 0
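
The behaviour requested in the title can be modelled with a toy resend loop that checks the lock state before every resend. This is only an illustrative sketch under stated assumptions: `Lock`, `resend_blocking_ast` and the `send` callback are hypothetical names, not Lustre functions, and the real logic lives in the kernel-side C code.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    """Toy stand-in for a server-side lock; only tracks whether it was cancelled."""
    cancelled: bool = False


def resend_blocking_ast(lock, send, max_attempts):
    """Resend until a reply arrives, the attempts run out, or the lock is cancelled.

    `send(lock)` is a hypothetical callback returning True when the AST reply
    was received. Returns (outcome, number_of_sends_performed).
    """
    for attempt in range(1, max_attempts + 1):
        # The check this ticket asks for: a cancelled lock needs no AST reply.
        if lock.cancelled:
            return ("stopped: lock cancelled", attempt - 1)
        if send(lock):
            return ("replied", attempt)
    return ("timed out", max_attempts)


# Usage: the reply is always lost, and the cancel arrives during the second send.
sends = []


def lossy_send(lock):
    sends.append(1)
    if len(sends) == 2:
        lock.cancelled = True  # cancel received while the AST is still in flight
    return False               # AST reply lost


result = resend_blocking_ast(Lock(), lossy_send, max_attempts=10)
print(result)
```

Without the `lock.cancelled` check, the loop in this model would run all ten attempts and end in "timed out", which mirrors the behaviour reported above.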

Issue Links

is related to

LU-17493 restore LDLM cancel on blocking callback

Open

LU-16004 Blocking callback for already cancelling lock that would have no IO

Open

is related to

LU-16357 a mechanism to inform other nodes to dump debug log

Open

LU-17426 parallel cross-directory rename of regular files on single MDT

Resolved

Activity

Andreas Dilger added a comment - 02/Apr/24 3:26 PM

vitaly_fertman suggested a better approach on patch 53739 to resolve this issue:

Once BL AST RPCs are sent out, we wait for the RPC set to complete in full, whereas the key part for us is to get the new lock granted. I think the right way to go here would be to delegate the RPC set to a separate thread (probably just let ptlrpcd handle it asynchronously) and wait for the wanted lock itself. It would unblock the MDT thread even if some blocking ASTs are still in progress.

I am not sure we want to introduce new complexity by linking BL ASTs to locks. We may let BL ASTs hang around until the next resend, as that would no longer block new enqueues (IIUC, only a limited number of BL ASTs have this problem, so memory consumption is small and it does not seem to be a big problem), so only ldlm_update_resend() of the current patch would be needed.
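
The idea in the suggestion above (hand the ASTs to background workers and wait on the lock grant rather than on the whole RPC set) can be modelled in user space as follows. This is a rough sketch, not the proposed patch: `enqueue_with_async_asts`, the job callables and the use of `threading.Event` as the "lock granted" signal are all assumptions made for illustration, with plain threads standing in for ptlrpcd.

```python
import threading


def enqueue_with_async_asts(ast_jobs, lock_granted, timeout):
    """Start every AST job in the background, then wait only for the grant.

    Returns (granted, still_running): whether the grant event fired within
    `timeout` seconds, and how many AST jobs were still in progress then.
    """
    workers = [threading.Thread(target=job, daemon=True) for job in ast_jobs]
    for worker in workers:
        worker.start()
    # Wait on the wanted lock itself, not on completion of all ASTs.
    granted = lock_granted.wait(timeout)
    still_running = sum(worker.is_alive() for worker in workers)
    return granted, still_running


# Usage: one AST is answered at once (which lets the lock be granted),
# another is stuck waiting for a reply that does not arrive.
granted_event = threading.Event()
release_stuck = threading.Event()


def answered_ast():
    granted_event.set()      # conflicting lock cancelled, new lock can be granted


def stuck_ast():
    release_stuck.wait(5)    # models an AST whose reply was lost


granted, running = enqueue_with_async_asts(
    [answered_ast, stuck_ast], granted_event, timeout=2
)
print(granted, running >= 1)
release_stuck.set()          # let the stuck worker finish
```

In this model the caller returns as soon as the grant is signalled, while the stuck AST is left to finish or expire on its own, which is the trade-off the comment describes.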

Gerrit Updater added a comment - 19/Jan/24 5:28 AM

"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53739
Subject: LU-17446 ldlm: Do not wait for AST RPC completion on lock cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 578d772b3481fab2fb5307cb34be8ffffb9f292e
