Lustre / LU-11912

reduce number of OST objects created per MDS Sequence



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Fix Version/s: Lustre 2.16.0


      One issue with very large OSTs (e.g. > 256TiB or so) is that there can be too many objects in each OST object subdirectory. With e.g. a 384TiB OST and 1MiB objects there would be 384M objects on the OST, or 12.6M objects in each of the 32 OST object subdirectories, which exceeds the standard ldiskfs limit of 10M entries per directory. Although this limit of 320M objects per OST (32 subdirectories × 10M entries) increases proportionately with each DNE MDT, not all filesystems have multiple MDTs. While the large_dir support allows this to "work" (i.e. the object directories can become arbitrarily large), performance will be poor because each new object create or unlink is likely hashed to a different directory leaf block, and object create performance degrades to a random 4KB IOPS workload.
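The arithmetic above can be sketched as follows; the 384TiB/1MiB sizes and the 32 d{0..31}/ subdirectories are from the description, while the helper function itself is just illustrative:

```python
# Sketch of the per-subdirectory object count for a large OST.
# Sizes and the 32-subdirectory layout come from the issue text;
# the function name is hypothetical.

TIB = 2 ** 40
MIB = 2 ** 20

def objects_per_subdir(ost_bytes, object_bytes, subdirs=32):
    """Objects landing in each OST object subdirectory if the
    whole OST is filled with equal-sized objects."""
    total_objects = ost_bytes // object_bytes
    return total_objects, total_objects // subdirs

total, per_dir = objects_per_subdir(384 * TIB, 1 * MIB)
print(total)    # 402653184 (~384M objects on the OST)
print(per_dir)  # 12582912  (~12.6M per subdirectory, above the ~10M ldiskfs limit)
```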

      There are performance inflection points as each level of htree directory is added, at approximately 100k, 1M, and 10M entries (about 3MB, 30MB, and 300MB of directory blocks respectively), so it makes sense to stay below 10M entries per directory if possible. This is balanced against some aggregation of IO, since leaf blocks are shared for some time, so we don't necessarily want to continually roll over to new directories.
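A back-of-the-envelope check of those directory sizes; the ~30 bytes/entry figure is an assumption consistent with the 3MB/30MB/300MB numbers above, not an exact ldiskfs constant:

```python
# Rough directory leaf-block footprint per entry count.
# BYTES_PER_ENTRY is an assumption (name length + dirent overhead),
# chosen only to match the estimates quoted in the issue.

BYTES_PER_ENTRY = 30

def dir_size_mb(entries):
    return entries * BYTES_PER_ENTRY / 2 ** 20

for entries in (100_000, 1_000_000, 10_000_000):
    print(f"{entries:>10} entries ~ {dir_size_mb(entries):6.1f} MB of directory blocks")
```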

      For such a large OST, it would be better to reduce the number of objects created in each object directory, since the on-disk layout is transparent to the clients and MDS. One easy way to achieve this is to change the MDS object allocation code to reduce LUSTRE_DATA_SEQ_MAX_WIDTH from the current maximum of 4B objects per SEQ to something like 32M objects/SEQ (up to 1M objects per directory if none were ever deleted). After every e.g. 32M objects created on that OST, it would roll over to a new SEQ with its own d{0..31}/ object subdirectories, which would start empty and provide the best create/unlink performance. This has the benefit of keeping the object directories smaller, and also of aging old objects out of the working set rather than having random insertions across a large number of directory blocks.
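A minimal sketch of the proposed rollover, assuming a reduced SEQ width of 32M and the usual objid-modulo-32 subdirectory mapping; apart from LUSTRE_DATA_SEQ_MAX_WIDTH being the tunable in question, the names here are hypothetical:

```python
# Illustrative model of per-SEQ object allocation with a reduced
# SEQ width; the function and constants are hypothetical, not the
# actual Lustre identifiers.

SEQ_WIDTH = 32 * 2 ** 20   # proposed 32M objects per SEQ
SUBDIRS = 32               # d{0..31}/ object subdirectories per SEQ

def place_object(nth_object, base_seq=0):
    """Map the n-th object created on an OST to (SEQ, subdir, objid)."""
    seq = base_seq + nth_object // SEQ_WIDTH  # roll to a fresh SEQ every 32M objects
    objid = nth_object % SEQ_WIDTH
    subdir = objid % SUBDIRS                  # objid -> d{0..31} hashing
    return seq, subdir, objid

# With 32M objects per SEQ and 32 subdirectories, each subdirectory
# sees at most 1M objects before the SEQ rolls over:
print(SEQ_WIDTH // SUBDIRS)      # 1048576 (1M objects per directory)
print(place_object(33_554_432))  # (1, 0, 0): first object of the next SEQ
```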

      To optimize disk space usage, this would also need "lazy" online directory shrinking (e.g. releasing directory leaf blocks at the end of the directory when they are unused) so that the old directories consume less space as objects are deleted from them. There is already a preliminary patch for this ("ext4: shrink directory when last block is empty") but it needs a bit more work to be ready for production. As a workaround, adding "e2fsck -fD" to e2fsck runs would clean up the old object directory blocks once they are empty. We'd also want to remove the empty object directories themselves when they are no longer used, but with directory shrinking that would only be about 100 blocks/SEQ ~= 400KB/SEQ, so it is not critical.
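The workaround above would look something like this on an unmounted OST (the device path is hypothetical; -f and -D are standard e2fsck options):

```shell
# Hypothetical OST device path; run only while the OST is unmounted.
# -f forces a full check even if the filesystem appears clean,
# -D optimizes (compacts/reindexes) directories, releasing the empty
# leaf blocks left behind after old objects were unlinked.
e2fsck -fD /dev/mapper/ost0
```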


              Assignee: Dongyang Li (dongyang)
              Reporter: Andreas Dilger (adilger)
              Votes: 0
              Watchers: 12