  1. Lustre
  2. LU-18843

allow parallel rename inside a single directory

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0, Lustre 2.17.0, Lustre 2.16.1

    Description

      Currently, the mdt_reint_rename() implementation allows regular files to be renamed within a single directory (LU-12125), or between different directories on a single MDT (LU-17426). The MDS locks the parent directory FIDs as well as the source and target file FIDs for serialization, which allows different clients to rename in different directories in parallel.

      This is already a significant improvement over local filesystem rename operations, which are serialized at the VFS level for the entire filesystem, though there are efforts underway to relax the VFS locking for some directory operations (LU-17776).

      However, some workloads (e.g. parallel SPARK or data retrieval workloads) perform many concurrent file renames within a single directory, because they create temporary dot-files while writing and then rename them to the final name afterward.
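The write-then-rename pattern described above can be sketched as follows. This is only an illustration of the client-side workload, not code from any particular application: the function name `write_then_rename` is hypothetical, and the 6-character random suffix is an assumption chosen to mimic the mktemp-style ".final_filename.XXXXXX" template.

```python
import os
import secrets
import string


def write_then_rename(directory, final_name, data):
    """Write data to a hidden temporary file, then rename it into place."""
    alphabet = string.ascii_letters + string.digits
    # Assumption: a 6-character suffix, mimicking a ".name.XXXXXX" template.
    suffix = "".join(secrets.choice(alphabet) for _ in range(6))
    tmp_path = os.path.join(directory, f".{final_name}.{suffix}")
    final_path = os.path.join(directory, final_name)

    # O_EXCL makes creation fail instead of reusing an existing temporary name.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # Source and target share one parent directory, so every writer in the
    # job issues a same-directory rename against that parent.
    os.rename(tmp_path, final_path)
    return final_path
```

With many writers running this against one output directory, every rename targets the same parent, which is the contention case this ticket is concerned with.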

      It would be useful to allow regular files to be renamed in parallel within a single MDT directory. This would need the locking in mdt_reint_rename() to be more fine-grained for regular file renames, by taking only the "hash lock" on the parent directories, which is a lock on the parent FID + filename hash for each of the source and target filenames. This would avoid contention on the parent FID lock, and allow rename operations to be more concurrent.
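A toy model of the difference between the two locking granularities might look like this. It is a sketch under stated assumptions and not Lustre code: `HashLock`, `name_hash`, `rename_locks_coarse`, `rename_locks_fine` and `conflicts` are hypothetical names, CRC32 stands in for whatever filename hash the MDS actually uses, and the child file FID locks are left out.

```python
import zlib
from typing import NamedTuple


class HashLock(NamedTuple):
    """Hypothetical lock resource: parent FID plus filename hash."""
    parent_fid: str
    name_hash: int


def name_hash(name):
    # Stand-in hash for illustration only, not the hash Lustre uses.
    return zlib.crc32(name.encode())


def rename_locks_coarse(parent_fid, src, dst):
    # Coarse model: one lock on the whole parent directory.
    return {parent_fid}


def rename_locks_fine(parent_fid, src, dst):
    # Fine-grained model: one lock per (parent FID, filename hash).
    return {HashLock(parent_fid, name_hash(src)),
            HashLock(parent_fid, name_hash(dst))}


def conflicts(locks_a, locks_b):
    # Two renames serialize if they need any lock resource in common.
    return bool(locks_a & locks_b)
```

In this model, two renames of unrelated names in the same directory share a resource under the coarse scheme but not under the fine-grained one, unless their names happen to hash to the same value.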

      Since this parallel locking would only apply to renames within the same MDT, if a striped directory is in use then it is important that the dot-files created on one MDT remain on the same MDT. Otherwise, a cross-MDT rename with a distributed transaction is required, and the parallel rename optimizations are not possible. There are some optimizations possible when creating the temporary dot-files in striped directories using the CRUSH2 hash algorithm (LU-15720). This will automatically detect temporary dot-file names created with the very common mktemp(".final_filename.XXXXXX") command, using a template with a leading ".", the filename, a trailing ".", and 6 or 10 "X" characters that are replaced. The CRUSH2 hash will only hash the "final_filename" part of the name, so that when the rename is done to "final_filename" it will hash to the same MDT, and the rename can happen in parallel.
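The effect of hashing only the "final_filename" part can be illustrated with a simplified sketch. This is not the CRUSH2 implementation: `hashed_part` and `mdt_index` are hypothetical helpers, CRC32 modulo the stripe count stands in for the real hash, and the pattern below accepts any 6 or 10 alphanumeric suffix characters, which is an assumption and is looser than a production heuristic would need to be.

```python
import re
import zlib

# Assumed template: leading ".", the final name, ".", then 6 or 10
# replaced characters. A real detector needs stricter checks, since this
# pattern would also match an ordinary name such as ".config.backup".
_TEMP_NAME = re.compile(r"\.(.+)\.(?:[A-Za-z0-9]{6}|[A-Za-z0-9]{10})")


def hashed_part(name):
    """Return the portion of the name that is fed to the hash."""
    match = _TEMP_NAME.fullmatch(name)
    return match.group(1) if match else name


def mdt_index(name, stripe_count):
    # Stand-in for the directory hash: CRC32 modulo the stripe count.
    return zlib.crc32(hashed_part(name).encode()) % stripe_count
```

Because the temporary name and the final name reduce to the same hashed string, both map to the same stripe index in this model, so the rename stays on one MDT.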

      There is a small possibility that two filenames have the same hash and so map to the same LDLM lock resource, but this is very unlikely to impact workloads. Care must be taken in mdt_reint_rename() to order the LDLM FID+hash locks, so that there is no AB/BA deadlock between two concurrent renames within the same directory (which may involve different filenames that have the same hash).
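One common way to avoid an AB/BA deadlock is to acquire the locks in a fixed global order and to request a colliding resource only once. The sketch below shows that general technique with ordinary thread locks; it is not the LDLM API, and `ordered_resources`, `locked_rename` and the in-memory lock table are hypothetical names used only for illustration.

```python
import threading
from collections import defaultdict

_table_guard = threading.Lock()
_locks = defaultdict(threading.Lock)  # one lock per (parent FID, name hash)


def ordered_resources(parent_fid, src_hash, dst_hash):
    # The set drops the duplicate when both names hash to the same value,
    # so the same resource is never requested twice by one rename.
    return sorted({(parent_fid, src_hash), (parent_fid, dst_hash)})


def locked_rename(parent_fid, src_hash, dst_hash, do_rename):
    resources = ordered_resources(parent_fid, src_hash, dst_hash)
    with _table_guard:
        held = [_locks[res] for res in resources]
    # Every caller acquires in the same sorted order, so two renames can
    # never each hold one lock while waiting for the other.
    for lock in held:
        lock.acquire()
    try:
        return do_rename()
    finally:
        for lock in reversed(held):
            lock.release()
```

A rename from hash 1 to hash 2 and a concurrent rename from hash 2 to hash 1 both take the lock for hash 1 first, so one simply waits for the other to finish.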

      The renames would still be subject to serialization in the directory htree locking at the ldiskfs level. The htree locking is based on locking the source and destination leaf blocks in the directory, and possibly intermediate index blocks if they need to be split during the rename. The htree locking scales with the size of the directory, so larger directories will have finer-grained locking, which matches the desired performance behavior.

            People

              wc-triage WC Triage
              adilger Andreas Dilger