Details
- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
Description
One issue with very large OSTs (e.g. > 256TiB or so) is that there can be too many objects in each OST object subdirectory. With e.g. a 384TiB OST and 1MiB objects there would need to be 384M objects on the OST, about 12.6M objects in each of the 32 OST object subdirectories, which exceeds the standard ldiskfs limit of about 10M entries per directory. Although this limit of 320M objects per OST (32 subdirectories * 10M entries each) increases proportionately with each DNE MDT, not all filesystems have multiple MDTs. While large_dir support allows this to "work" (i.e. the object directories can become arbitrarily large), performance will be poor because each new object create or unlink is likely hashed to a different directory leaf block, so object create performance degrades to a random 4KB IOPS workload.
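For reference, the arithmetic above works out as follows (a standalone sketch, not Lustre code; the 32-subdirectory layout and ~10M-entry limit are the figures from this description):
{code:c}
#include <stdio.h>

#define OST_SUBDIRS		32ULL		/* d{0..31}/ object subdirectories */
#define LDISKFS_DIR_LIMIT	10000000ULL	/* ~10M entries per directory */

int main(void)
{
	unsigned long long ost_bytes = 384ULL << 40;	/* 384 TiB OST */
	unsigned long long obj_bytes = 1ULL << 20;	/* 1 MiB average objects */
	unsigned long long objects  = ost_bytes / obj_bytes;
	unsigned long long per_dir  = objects / OST_SUBDIRS;

	/* 402653184 objects total ("384M"), 12582912 (~12.6M) per subdirectory */
	printf("objects on OST:     %llu\n", objects);
	printf("objects per subdir: %llu (%s the ~10M ldiskfs limit)\n",
	       per_dir, per_dir > LDISKFS_DIR_LIMIT ? "exceeds" : "within");
	return 0;
}
{code}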
There are performance inflection points as each level of htree directory is added, at approximately 100K, 1M, and 10M entries (about 3MB, 30MB, and 300MB of directory blocks, respectively), so it makes sense to stay below 10M entries per directory if possible. This is balanced by some aggregation of IO, since leaf blocks are shared for some time, so we don't necessarily want to continually roll over to new directories.
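The entry-count-to-size mapping can be sanity-checked the same way (a sketch assuming ~30 bytes per directory entry, which is what the 100K entries ≈ 3MB figure implies; real ldiskfs dirent sizes depend on object name length):
{code:c}
#include <stdio.h>

int main(void)
{
	/* ~30 bytes/entry is an assumption inferred from 100K entries ~ 3MB */
	const unsigned long long bytes_per_entry = 30;
	const unsigned long long levels[] = { 100000, 1000000, 10000000 };

	for (int i = 0; i < 3; i++)	/* prints ~3, ~30, ~300 MB */
		printf("%8llu entries -> ~%llu MB of directory blocks\n",
		       levels[i], levels[i] * bytes_per_entry / 1000000);
	return 0;
}
{code}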
For such a large OST, it would be better to reduce the number of objects created in each object directory, since the on-disk layout is transparent to the clients and MDS. One easy way to achieve this is to change the MDS object allocation code to reduce LUSTRE_DATA_SEQ_MAX_WIDTH from the current maximum of ~4B objects per SEQ to something like 32M objects/SEQ (at most 1M objects per subdirectory, even if none of them were ever deleted). After e.g. every 32M objects created on that OST, it would roll over to a new SEQ with its own d{0..31}/ object subdirectories, which start out empty and provide the best create/unlink performance. This has the benefit of keeping the object directories smaller, and also of aging old objects out of the working set rather than leaving random insertions spread across a large number of directory blocks.
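A minimal sketch of the proposed rollover policy (illustrative only: the struct and helper names here are hypothetical, not the actual osp precreate code, and 32M is the proposed width, not the current LUSTRE_DATA_SEQ_MAX_WIDTH value):
{code:c}
#include <stdint.h>
#include <stdio.h>

#define DATA_SEQ_WIDTH	(32ULL << 20)	/* ~32M objects per SEQ (proposed) */

struct seq_alloc {
	uint64_t sa_seq;	/* current FID sequence number */
	uint64_t sa_next_oid;	/* next object ID within sa_seq */
};

/* hypothetical allocator: hand out the SEQ/OID pair for a new object */
static void seq_alloc_next(struct seq_alloc *sa, uint64_t *seq, uint64_t *oid)
{
	if (sa->sa_next_oid >= DATA_SEQ_WIDTH) {
		/* SEQ width exhausted: roll over to a fresh SEQ whose
		 * d{0..31}/ subdirectories start empty, keeping
		 * create/unlink performance high */
		sa->sa_seq++;
		sa->sa_next_oid = 1;
	}
	*seq = sa->sa_seq;
	*oid = sa->sa_next_oid++;
}

int main(void)
{
	/* SEQ value borrowed from the LU-16720 log, just for flavour */
	struct seq_alloc sa = { .sa_seq = 0x240000bd0ULL, .sa_next_oid = 1 };
	uint64_t seq, oid;

	seq_alloc_next(&sa, &seq, &oid);
	printf("new object FID [%#jx:%#jx:0x0]\n",
	       (uintmax_t)seq, (uintmax_t)oid);
	return 0;
}
{code}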
To optimize disk space usage, this would also need "lazy" online directory shrinking (e.g. releasing directory leaf blocks at the end of the directory when they are unused) so that as objects are deleted from the old directories they use less space. There is already a preliminary patch for this, "ext4: shrink directory when last block is empty", but it needs a bit more work to be ready for production. As a workaround, adding "e2fsck -fD" to e2fsck runs would clean up the old object directory blocks once they are empty. We'd also want to remove the empty object directories themselves once they are no longer used, but with directory shrinking in place they would only occupy about 100 blocks/SEQ ~= 400KB/SEQ (at 4KB/block), so this is not critical.
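To illustrate the "lazy" shrink idea with a toy in-memory model (hypothetical code, far simpler than the real ext4 patch, which also has to handle htree index blocks):
{code:c}
#include <stdio.h>

#define DIR_BLOCKS 8

/* toy model: entries[] counts live dirents in each leaf block */
struct dir {
	unsigned entries[DIR_BLOCKS];
	unsigned nblocks;
};

/*
 * Lazy shrink, run opportunistically after an unlink: release trailing
 * leaf blocks while they are empty, so a directory whose objects have
 * been deleted returns its space over time instead of holding it forever.
 */
static void dir_lazy_shrink(struct dir *d)
{
	while (d->nblocks > 1 && d->entries[d->nblocks - 1] == 0)
		d->nblocks--;	/* truncate the empty tail block */
}

int main(void)
{
	struct dir d = { .entries = { 12, 7, 0, 0, 0 }, .nblocks = 5 };

	dir_lazy_shrink(&d);
	printf("directory shrunk to %u block(s)\n", d.nblocks);	/* 2 */
	return 0;
}
{code}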
Issue Links
- is related to:
  - LU-17538 lov_objseq file contains 0x0BD0 contstant in low bytes (Open)
  - LU-14023 sanity test_56oc: @@@@@@ FAIL: '/usr/bin/lfs find /mnt/lustre/d56oc.sanity ! -neweram /mnt/lustre/f56oc.sanity.negnewer.am' wrong: found 12, expected 16 (Open)
  - LU-16682 sanity-pfl test_1c: comp4 stripe count != 2000 (Open)
  - LU-16863 sanity-pfl test_0b, replay-dual: open/close 8739 timeout (Open)
  - LU-17747 interop: sanity test_130g: filefrag printed 175 < 700 extents (Open)
  - LU-16692 replay-single: test_70c osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) (Resolved)
  - LU-9547 LBUG osp_dev.c:755:osp_statfs()) ASSERTION( sfs->os_fprecreated <= OST_MAX_PRECREATE * 2 ) failed (Resolved)
  - LU-14345 e2fsck of very large directories is broken (Resolved)
  - LU-11546 enable large_dir support for MDTs (Resolved)
  - LU-16057 OBD_MD_FLGROUP not set for ladvise rpc (Resolved)
  - LU-16720 large-scale test_3a osp_precreate_rollover_new_seq()) ASSERTION( fid_seq(fid) != fid_seq(last_fid) ) failed: fid [0x240000bd0:0x1:0x0], last_fid [0x240000bd0:0x3fff:0x0] (Resolved)
  - LU-12051 ldiskfs directory shrink (Open)
  - LU-17658 sanity check when ofd assign a new sequence to osp (Open)
  - LU-14692 deprecate use of OST FID SEQ 0 for MDT0000 (Resolved)