LU-17502: distribute flock locking across multiple MDS nodes


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0, Lustre 2.17.0
    • Severity: 3

    Description

      Lustre currently implements all of the flock locking only on MDT0000, with its MDS managing all of the flocks in the filesystem. This can lead to performance bottlenecks and high memory usage on that MDS when large numbers of files are locked from many clients.

      It would be desirable for the flock management to scale across multiple MDS nodes for improved performance and reduced load on MDS0. This would be fairly straightforward if clients only ever locked one file at a time (e.g. manage the flocks for FID NNN on the MDT where that FID is located, as sketched below). It gets slightly more complex if there is MDT migration to another server, which may cause imbalanced lock traffic (though it can never be worse than today, where 100% of the flock locking is done on a single node).
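
      As a rough illustration of that per-FID placement idea, the sketch below routes a flock request to the MDT that serves the file's FID. The struct layout and the fid_seq_to_mdt() mapping are hypothetical stand-ins, not Lustre interfaces; a real implementation would consult the FID Location Database for the sequence-to-MDT mapping.

{code:c}
#include <stdint.h>
#include <stdio.h>

struct sketch_fid {                     /* simplified stand-in for struct lu_fid */
        uint64_t f_seq;                 /* sequence: determines the owning MDT */
        uint32_t f_oid;
        uint32_t f_ver;
};

/* Placeholder mapping: pretend FID sequences are spread round-robin. */
static uint32_t fid_seq_to_mdt(uint64_t seq, uint32_t mdt_count)
{
        return (uint32_t)(seq % mdt_count);
}

int main(void)
{
        struct sketch_fid fid = { .f_seq = 0x200000401ULL, .f_oid = 7, .f_ver = 0 };
        uint32_t mdt = fid_seq_to_mdt(fid.f_seq, 4);

        /* All flock requests for this FID would be sent to this MDT. */
        printf("flock on [0x%jx:0x%x:0x%x] -> MDT%04x\n",
               (uintmax_t)fid.f_seq, fid.f_oid, fid.f_ver, mdt);
        return 0;
}
{code}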

      The more serious problem is lock ordering when clients lock multiple FIDs at the same time (e.g. AB/BA deadlocks, both for extents within a single file and between different files, and more complex chain variants of the same). Is it possible/practical/efficient to distribute the flock deadlock/dependency checking across multiple servers? Lustre's internal (non-flock) DLM file consistency locking avoids this distributed ordering issue (most of the time) by avoiding holding multiple locks at the same time; when that is strictly necessary, the locks are taken in a predetermined order to avoid deadlocks (or with "trylock and undo/restart" for efficiency if lock(s) are already held and the next lock is not in the correct order), as sketched below.
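
      For reference, that ordering discipline looks roughly like the following, with plain pthread mutexes standing in for DLM resources (an illustration of the pattern only, not the actual ldlm code):

{code:c}
#include <pthread.h>
#include <stdint.h>

/* Take two locks in a globally consistent (address) order. */
static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
        if ((uintptr_t)a > (uintptr_t)b) {
                pthread_mutex_t *tmp = a;

                a = b;
                b = tmp;
        }
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
}

/*
 * Already holding 'held' when 'next' is wanted out of order: try it,
 * and on contention undo and retake both in the canonical order
 * (the "trylock and undo/restart" pattern).
 */
static void lock_out_of_order(pthread_mutex_t *held, pthread_mutex_t *next)
{
        if (pthread_mutex_trylock(next) != 0) {
                pthread_mutex_unlock(held);
                lock_pair_ordered(held, next);
        }
}
{code}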

      For flock locking, the lock ordering is provided by the userspace application (i.e. outside of Lustre's control), so the code must determine whether granting a lock would result in a possible deadlock, and return an error to the application in that case.
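
      A single-server version of that check can be a simple walk of the wait-for chain, along the lines of this hypothetical sketch (the real flock code tracks more state than one blocked-on pointer per owner):

{code:c}
#include <errno.h>
#include <stddef.h>

struct flock_owner {
        struct flock_owner *fo_blocked_on;      /* owner being waited on, or NULL */
};

/*
 * Before blocking 'requester' behind 'conflict_owner', walk the chain
 * of waiters.  Existing grants are deadlock-free, so the chain is
 * acyclic and the walk terminates; if it leads back to the requester,
 * granting would close a cycle, so fail with -EDEADLK instead.
 */
static int flock_would_deadlock(struct flock_owner *requester,
                                struct flock_owner *conflict_owner)
{
        struct flock_owner *o;

        for (o = conflict_owner; o != NULL; o = o->fo_blocked_on)
                if (o == requester)
                        return -EDEADLK;
        return 0;
}
{code}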

      It would be necessary to determine whether there are efficient algorithms for distributed locking with deadlock detection, where the "common" case of independent flock locks on individual files is distributed across MDTs while cross-MDS communication is minimized.
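
      One family of candidates is "edge-chasing" detection in the style of Chandy-Misra-Haas: probe messages cross servers only when a request actually blocks on a remote holder, and a probe returning to its initiator proves a cycle. The message layout and function names below are purely illustrative, not Lustre RPCs:

{code:c}
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flock_probe {
        uint64_t p_initiator;   /* owner whose request started the probe */
        uint64_t p_waiter;      /* owner that is blocked */
        uint64_t p_holder;      /* owner it is blocked on */
};

/* Stub transport: would be an RPC to the MDT serving 'p_holder'. */
static void probe_send(const struct flock_probe *p)
{
        printf("probe: %ju waits on %ju (initiator %ju)\n",
               (uintmax_t)p->p_waiter, (uintmax_t)p->p_holder,
               (uintmax_t)p->p_initiator);
}

/*
 * Run on the MDT serving p->p_holder: a probe that reaches its own
 * initiator proves a wait-for cycle; otherwise forward it along every
 * edge the holder is itself waiting on.
 */
static bool probe_receive(const struct flock_probe *p,
                          const uint64_t *blocked_on, int nr)
{
        if (p->p_holder == p->p_initiator)
                return true;            /* cycle: granting would deadlock */

        for (int i = 0; i < nr; i++) {
                struct flock_probe fwd = {
                        .p_initiator = p->p_initiator,
                        .p_waiter    = p->p_holder,
                        .p_holder    = blocked_on[i],
                };
                probe_send(&fwd);
        }
        return false;
}

int main(void)
{
        /* Owner 1 blocks on owner 2, which is already blocked on owner 1. */
        uint64_t two_waits_on[] = { 1 };
        struct flock_probe p = { .p_initiator = 1, .p_waiter = 1, .p_holder = 2 };

        probe_receive(&p, two_waits_on, 1);     /* forwards probe 2 -> 1 */

        /* The forwarded probe arrives back at owner 1's MDT: cycle found. */
        struct flock_probe fwd = { .p_initiator = 1, .p_waiter = 2, .p_holder = 1 };
        printf("deadlock: %s\n", probe_receive(&fwd, NULL, 0) ? "yes" : "no");
        return 0;
}
{code}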

      Another option to distribute the locking might be to select the primary MDT for lock management based on the JobID used by the application. That should put most/all of the flocks for a single job on a single MDT, even if the job is distributed across many client nodes, and reduce/eliminate cross-MDS communication for that job. However, if multiple jobs are locking the same files, or if the JobID is structured to contain the client hostname (so the same job maps to different MDTs from different clients), then this would likely only increase complexity.
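
      As a sketch of that placement policy, assuming a plain FNV-1a hash of the JobID string (the hash choice and the jobid_to_mdt() name are illustrative only):

{code:c}
#include <stdint.h>
#include <stdio.h>

/* Hash the JobID string (FNV-1a, 64-bit) to pick a coordinating MDT. */
static uint32_t jobid_to_mdt(const char *jobid, uint32_t mdt_count)
{
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */

        for (; *jobid != '\0'; jobid++) {
                h ^= (unsigned char)*jobid;
                h *= 0x100000001b3ULL;          /* FNV-1a prime */
        }
        return (uint32_t)(h % mdt_count);
}

int main(void)
{
        /* Every client of one job computes the same MDT, with no RPCs. */
        printf("job 'lammps.12345' -> MDT%04x\n",
               jobid_to_mdt("lammps.12345", 8));
        return 0;
}
{code}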


            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 4
