[LU-17493] restore LDLM cancel on blocking callback - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.16.0, Lustre 2.17.0
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

In the old days of Catamount on ASCI Red, liblustre ran in the Catamount OS, which did not have any CPU interrupts. That meant any server-to-client requests (such as DLM lock cancellations) had to be handled asynchronously on the client, when the application yielded the processor to filesystem administrative tasks.

In that environment, the server would immediately assume that a DLM lock was cancelled as soon as the AST was sent on a lock with LDLM_FL_CANCEL_ON_BLOCK set, rather than waiting for the client to reply to the AST and actually cancel the lock. This avoided potentially significant delays for servers granting locks.

In large clusters, some locks are invariably highly contended (e.g. ROOT/, /home/, or /project directories), either because many clients are holding a read lock and some client wants to modify the directory, or because of conflicting workloads (e.g. "ls -l" or "rm" in a directory (tree) that is actively in use by other clients). If any client holding a contended lock has a problem, for example LU-17453/LU-17476, then other nodes accessing that lock may block for tens or hundreds of seconds until it is cancelled or the client is evicted.

It would be useful if LDLM_FL_CANCEL_ON_BLOCK were used for such highly-contended resources when they are requested with LCK_PR mode, so that the server can send asynchronous ASTs to all clients, cancel the DLM locks rapidly, and perform the required operation without getting blocked by unresponsive clients. Any responsive client will receive the AST and not even need to send the cancel RPC, while unresponsive clients are already unlikely to know or care whether the server sent the AST, so they will have an inconsistent local state until they again contact the server (as they already do today).
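
To illustrate the difference in server-side wait time, here is a toy model in Python. It is only a sketch under stated assumptions and is not Lustre code: `Lock`, `grant_delay()`, the flag value, and the `rtt`/`evict_timeout` numbers are all hypothetical. It assumes blocking ASTs go out to all holders in parallel, so the delay before a conflicting lock can be granted is the slowest holder the server still has to wait for.

```python
from dataclasses import dataclass

# Stand-in for LDLM_FL_CANCEL_ON_BLOCK; NOT the real flag value.
CANCEL_ON_BLOCK = 0x1


@dataclass
class Lock:
    client: str
    flags: int = 0


def grant_delay(holders, unresponsive, rtt=0.001, evict_timeout=100.0):
    """Worst-case seconds before a conflicting lock can be granted.

    holders:       locks currently granted on the resource
    unresponsive:  set of client names that will never reply to the AST
    rtt:           hypothetical AST round-trip time for a healthy client
    evict_timeout: hypothetical time until a silent client is evicted
    """
    delay = 0.0
    for lock in holders:
        if lock.flags & CANCEL_ON_BLOCK:
            # The server treats the lock as cancelled once the AST is sent,
            # so this holder contributes no wait at all.
            continue
        # Otherwise the server waits for the cancel reply, or for eviction.
        wait = evict_timeout if lock.client in unresponsive else rtt
        delay = max(delay, wait)
    return delay


if __name__ == "__main__":
    plain = [Lock("c1"), Lock("c2")]
    cob = [Lock("c1", CANCEL_ON_BLOCK), Lock("c2", CANCEL_ON_BLOCK)]
    print("without flag:", grant_delay(plain, {"c2"}))
    print("with flag:   ", grant_delay(cob, {"c2"}))
```

In this model a single unresponsive holder dominates the delay unless its lock carries the flag, which is the situation described above for contended directories.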

This could potentially also be tied into "ls" (readdir()) being able to run with "LDLM_FL_CANCEL_ON_BLOCK" locks, or no DLM locks at all on the directory or inodes. Per comments in LU-3308, POSIX does not require readdir() to be fully cache coherent even among processes on the same node, only that the readdir cache is reset with rewinddir() and close().

Issue Links

is related to

LU-16564 Remove FL_CANCEL_ON_BLOCK

Open

LU-18759 LNET avoid initiating server to client connection

Open

is related to

LU-17446 BL AST should stop resending if lock is cancelled

Open

LU-3308 large readdir chunk size slows unlink/"rm -r" performance

Reopened

LU-11509 LDLM: replace client lock LRU with improved cache algorithm

Open

Activity

Andreas Dilger added a comment - 24/Jun/25 4:56 PM

Yes, getting the directory locks with cancel-on-block would avoid the MDS blocking access to the whole filesystem if e.g. a client holding a lock on the root directory suddenly becomes unresponsive. I think the ROOT/ (or subdirectory mount) directory should always be a candidate for CANCEL_ON_BLOCK, as would other directories that have a large number of lock holders (decided by the MDS).

We might consider also applying this to all IBITS LCK_PR locks held by the client if it has been evicted more than once within some time period (e.g. 15 minutes)?
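
As a rough sketch of the "evicted more than once within some time period" rule, the bookkeeping could look something like the Python below. This is an illustration only: `EvictionTracker` and its methods are hypothetical names, not existing Lustre code, and the 15-minute window and threshold of two evictions are just the example values from this comment. Timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import defaultdict, deque


class EvictionTracker:
    """Track recent evictions per client in a sliding time window."""

    def __init__(self, window=15 * 60, threshold=2):
        self.window = window        # seconds; hypothetical default
        self.threshold = threshold  # "more than once" means at least 2
        self._events = defaultdict(deque)

    def record_eviction(self, client, now):
        self._events[client].append(now)

    def use_cancel_on_block(self, client, now):
        """True if the client's PR locks should get cancel-on-block."""
        events = self._events[client]
        # Drop evictions that have aged out of the window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


if __name__ == "__main__":
    tracker = EvictionTracker()
    tracker.record_eviction("client-a", now=0)
    tracker.record_eviction("client-a", now=300)
    print(tracker.use_cancel_on_block("client-a", now=310))
    print(tracker.use_cancel_on_block("client-a", now=1000))
```

A client drops back to normal lock handling once its older evictions fall outside the window.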

Keguang Xu added a comment - 23/Jun/25 4:42 AM - edited

Hi adilger, green, how about expanding the concept of a “contended directory” here?

1. A large directory containing hundreds of thousands of entries, where an LCK_PR lock held by an "ls" operation could block rm/rename/create operations. In this scenario, the directory isn’t necessarily "hot", and the directory inode.i_size could serve as a useful indicator?
2. A hot directory under heavy access, with tens of concurrent operations. For a small directory, the "ls" shouldn’t take much time; however, "a problematic client holding an LCK_PR lock may cause other nodes to be blocked for tens or even hundreds of seconds until the lock is canceled or the client is evicted". In this case, adjusted LDLM contention criteria might be applicable. We're not aiming to address a less contended directory with a problematic client here, as the impact would be limited to fewer clients.

A follow-up question: the discussion in LU-3308 says “keeping cached readdir() data after cancellation would allow better performance.” Should this be addressed separately, in another patch targeting large directories?
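
The two cases above could be expressed as a simple classifier, sketched here in Python. Everything in it is an assumption made for illustration: `DirStats`, `contention_reason()`, and both thresholds are hypothetical, and real criteria would come from the MDS and the existing LDLM contention logic rather than fixed constants.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only to make the example concrete.
LARGE_DIR_BYTES = 16 * 1024 * 1024  # stand-in for an inode.i_size cutoff
HOT_DIR_HOLDERS = 32                # "tens of concurrent operations"


@dataclass
class DirStats:
    size_bytes: int    # directory size, as a proxy for the entry count
    lock_holders: int  # clients currently holding a lock on the directory


def contention_reason(stats: DirStats) -> Optional[str]:
    """Return why a directory counts as contended, or None if it does not."""
    if stats.size_bytes >= LARGE_DIR_BYTES:
        # Case 1: a long-running "ls" can block rm/rename/create.
        return "large"
    if stats.lock_holders >= HOT_DIR_HOLDERS:
        # Case 2: many clients stall if one holder becomes unresponsive.
        return "hot"
    return None


if __name__ == "__main__":
    print(contention_reason(DirStats(size_bytes=64 * 1024 * 1024, lock_holders=1)))
    print(contention_reason(DirStats(size_bytes=4096, lock_holders=50)))
    print(contention_reason(DirStats(size_bytes=4096, lock_holders=2)))
```

A small directory with few holders returns None, matching the point that a less contended directory with a problematic client is out of scope.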

Peter Jones made changes - 20/Jun/25 12:24 PM

Assignee

Original: WC Triage [ wc-triage ]

New: Keguang Xu [ squalfof ]

Gerrit Updater added a comment - 20/Jun/25 6:42 AM

"kg.xu <squalfof@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59862
Subject: LU-17493 mdc: restore LDLM cancel on blocking callback
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 44bcc5a5c51f01fc71880361de2545fa6dec5dae

Andreas Dilger added a comment - 18/Jun/25 10:30 AM

Hi squalfof, it would be prudent to limit this behavior to readdir locks to start, since they have specific "weak" semantics under POSIX, so pre-emptively cancelling them from the server does not introduce any consistency issues.

It would also be good to get input from green on this topic, since he is the expert in this area.

Keguang Xu added a comment - 18/Jun/25 9:48 AM

Hi @Andreas, I have some questions that need your help to clarify:
1. Is the scope of this issue "ls" only?
2. The definition of "contended" can be found in LDLM. Should we apply the cancel logic to ANY directory that qualifies as "contended"? Or do we just cancel ANY lock tagged with CANCEL_ON_BLOCK?

Thanks.

Andreas Dilger made changes - 28/Feb/25 1:12 AM

Link

New: This issue is related to LU-18759 [ LU-18759 ]

Andreas Dilger made changes - 25/Feb/25 5:07 PM

Link

New: This issue is related to LU-17446 [ LU-17446 ]

Andreas Dilger made changes - 25/Feb/25 4:53 PM

Link

New: This issue is related to LU-16564 [ LU-16564 ]