Lustre / LU-17502

distribute flock locking across multiple MDS nodes


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.16.0, Lustre 2.17.0
    • Severity: 3

    Description

      Lustre currently implements all of the flock locking on MDT0000 only, with its MDS managing every flock in the filesystem. This can become a performance bottleneck and cause high memory usage on that MDS when many clients are locking large numbers of files.

      It would be desirable for the flock management to scale across multiple MDS nodes, for improved performance and reduced load on MDT0000. This would be fairly straightforward if clients only ever locked one file at a time (e.g. manage the flocks for FID NNN on the MDT where that FID is located, as sketched below). It gets slightly more complex if a file is migrated to another MDT, which may cause imbalanced lock traffic (though it can't ever be worse than today, where 100% of the flock locking is done on a single node).
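
      As a rough illustration of the per-FID placement, the sketch below routes a flock request to the MDT that owns the file's FID. The struct and the fid_to_mdt_index() helper are hypothetical; a real implementation would look the location up through the FLD rather than computing it with a modulo:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified FID; the real struct lu_fid carries
 * f_seq/f_oid/f_ver. */
struct flock_fid {
        uint64_t f_seq;
        uint32_t f_oid;
};

/* Sketch: pick the MDT owning the FID's sequence range, so the flock
 * on a file is managed by the same MDS that serves the file itself.
 * The modulo is only a stand-in for an FLD lookup. */
static uint32_t fid_to_mdt_index(const struct flock_fid *fid,
                                 uint32_t mdt_count)
{
        return (uint32_t)(fid->f_seq % mdt_count);
}

int main(void)
{
        struct flock_fid fid = { .f_seq = 0x200000401ULL, .f_oid = 42 };

        printf("flock for [0x%llx:0x%x] goes to MDT%04u\n",
               (unsigned long long)fid.f_seq, fid.f_oid,
               fid_to_mdt_index(&fid, 4));
        return 0;
}
```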

      The more serious issue is lock ordering when clients lock multiple FIDs at the same time (e.g. AB/BA deadlocks, both for extents within a single file and between different files, plus more complex chain variants of the same). Is it possible/practical/efficient to distribute the flock deadlock/dependency checking across multiple servers? Lustre's internal (non-flock) DLM file consistency locking avoids this distributed ordering issue (most of the time) by avoiding holding multiple locks at the same time; when that is strictly necessary, the locks are taken in a pre-determined order to avoid deadlocks, or use "trylock and undo/restart" for efficiency if lock(s) are already held and the next lock is not in the correct order, as in the sketch below.
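
      For comparison, that internal-locking convention can be shown with a small sketch: always take multiple resources in a globally fixed order (ascending identifier here, standing in for FID order) so an AB/BA inversion cannot occur, and fall back to "trylock and undo/restart" when a lock is wanted out of order. The types and helpers are illustrative pthreads code, not actual Lustre DLM code:

```c
#include <pthread.h>
#include <stdint.h>

/* Illustrative resource whose id defines the global locking order. */
struct res {
        uint64_t        id;     /* stand-in for a FID */
        pthread_mutex_t lock;
};

/* Always lock the lower-id resource first: two threads locking the
 * same pair {A, B} then cannot deadlock in an AB/BA pattern. */
static void lock_pair_ordered(struct res *a, struct res *b)
{
        struct res *first  = a->id < b->id ? a : b;
        struct res *second = a->id < b->id ? b : a;

        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
}

/* "trylock and undo/restart": with "held" already locked and "next"
 * wanted out of order, try it; on failure drop everything and retake
 * both locks in the correct order instead of blocking.  A non-zero
 * return tells the caller its earlier state must be revalidated. */
static int trylock_or_restart(struct res *held, struct res *next)
{
        if (pthread_mutex_trylock(&next->lock) == 0)
                return 0;
        pthread_mutex_unlock(&held->lock);
        lock_pair_ordered(held, next);
        return 1;
}

int main(void)
{
        struct res a = { .id = 1, .lock = PTHREAD_MUTEX_INITIALIZER };
        struct res b = { .id = 2, .lock = PTHREAD_MUTEX_INITIALIZER };

        pthread_mutex_lock(&b.lock);            /* hold B first... */
        (void)trylock_or_restart(&b, &a);       /* ...then want A out of order */
        pthread_mutex_unlock(&a.lock);
        pthread_mutex_unlock(&b.lock);
        return 0;
}
```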

      For flock locking the lock ordering is provided by the userspace application (i.e. it is outside of Lustre's control), so the code must determine whether granting a lock would result in a possible deadlock and return an error to the application in that case.
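
      A minimal sketch of that check, assuming each blocked lock owner records which owner it waits on: before queueing a waiter, walk the wait-for chain from the current holder, and if it leads back to the waiter, fail with EDEADLK instead of blocking (the Linux kernel's POSIX lock code does a similarly bounded walk). The types are hypothetical:

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative lock owner: blocked_on points at the owner holding the
 * lock this owner waits for, or NULL if it is not waiting. */
struct lock_owner {
        int                      id;
        struct lock_owner       *blocked_on;
};

/* Would queueing "waiter" behind "holder" close a wait-for cycle?
 * Walk the chain with a depth bound so a long chain stays cheap. */
static int flock_would_deadlock(struct lock_owner *waiter,
                                struct lock_owner *holder)
{
        int depth = 0;

        for (; holder != NULL; holder = holder->blocked_on) {
                if (holder == waiter)
                        return -EDEADLK;
                if (++depth > 1024)
                        break;
        }
        return 0;
}

int main(void)
{
        struct lock_owner a = { .id = 1 }, b = { .id = 2 };

        b.blocked_on = &a;      /* B already waits on A... */
        /* ...so A blocking on B would close an A->B->A cycle. */
        return flock_would_deadlock(&a, &b) == -EDEADLK ? 0 : 1;
}
```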

      It would be necessary to determine whether there are efficient algorithms for distributed locking with deadlock detection, where the common case of independent flock locks on individual files is distributed across MDTs while cross-MDS communication is minimized.
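
      One candidate from the distributed-locking literature is an edge-chasing algorithm such as Chandy-Misra-Haas: a server that is about to block a request sends a probe along the wait-for edges, and a probe that arrives back at its initiator proves a cycle. Independent flocks on individual files never generate probes, so only lock chains that cross MDT boundaries cost any cross-server messages. The toy model below simulates the idea in a single process with hypothetical tables; a real version would forward each probe as an RPC to the next owner's home MDS:

```c
#include <stdbool.h>
#include <stdio.h>

#define NOWNERS 4

/* Hypothetical placement: each lock owner has a home MDT. */
static const int home_mdt[NOWNERS]  = { 0, 1, 2, 0 };
/* waits_for[i] = j: owner i is blocked on a flock held by owner j;
 * -1 means not blocked.  Owners 0 -> 1 -> 2 -> 0 form a cycle that
 * spans MDT0000, MDT0001, and MDT0002. */
static const int waits_for[NOWNERS] = { 1, 2, 0, -1 };

/* Deliver probe(initiator -> to).  Returns true if following the
 * wait-for edges from "to" reaches the initiator again, i.e. letting
 * the initiator block would create a distributed deadlock.  Real
 * forwarding would be an RPC; here it is recursion. */
static bool probe(int initiator, int to, int hops)
{
        if (to < 0 || hops > NOWNERS)
                return false;           /* chain ended without a cycle */
        if (to == initiator)
                return true;            /* probe came back: cycle */
        if (waits_for[to] < 0)
                return false;           /* "to" holds but is not waiting */

        printf("MDT%d forwards probe(initiator=%d) to owner %d\n",
               home_mdt[to], initiator, waits_for[to]);
        return probe(initiator, waits_for[to], hops + 1);
}

int main(void)
{
        /* Owner 0 is about to block on owner 1: chase the edges first. */
        if (probe(0, waits_for[0], 0))
                printf("deadlock: return EDEADLK to owner 0\n");
        return 0;
}
```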

      Another option to distribute the locking might be to determine the primary MDT for the lock management based on the JobID used by the application. That should put most or all of the flocks for a single job on a single MDT, even if the job is distributed across many client nodes, and should reduce or eliminate cross-MDS communication for that job. However, if multiple jobs are locking the same files, or if the JobID is structured to contain the client hostname, then the locks from one job would map to different MDTs and this would likely only increase complexity.
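
      A minimal sketch of the JobID-based placement, with djb2 standing in purely as a placeholder for whatever hash would actually be chosen. The point is only that every client computing the same JobID picks the same MDT, so one job's flock traffic lands on one server, while a hostname-qualified JobID defeats the scheme as noted above:

```c
#include <stdint.h>
#include <stdio.h>

/* Map a JobID string to the MDT that manages the job's flocks.
 * djb2 is a placeholder; any hash all clients agree on works. */
static uint32_t jobid_to_mdt(const char *jobid, uint32_t mdt_count)
{
        uint32_t hash = 5381;

        while (*jobid)
                hash = hash * 33 + (unsigned char)*jobid++;
        return hash % mdt_count;
}

int main(void)
{
        /* Every rank of the job computes the same mapping... */
        printf("'lammps.12345'       -> MDT%04u\n",
               jobid_to_mdt("lammps.12345", 4));
        /* ...but a hostname-qualified JobID splits across MDTs. */
        printf("'lammps.12345.node1' -> MDT%04u\n",
               jobid_to_mdt("lammps.12345.node1", 4));
        printf("'lammps.12345.node2' -> MDT%04u\n",
               jobid_to_mdt("lammps.12345.node2", 4));
        return 0;
}
```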

People

    Assignee: WC Triage (wc-triage)
    Reporter: Andreas Dilger (adilger)
    Votes: 0
    Watchers: 3
