LU-11912: reduce number of OST objects created per MDS Sequence


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.6

    Description

      One issue with very large OSTs (e.g. > 256TiB or so) is that there can be too many objects in each OST object subdirectory. With e.g. a 384TiB OST and 1MiB objects there would need to be 384M objects on the OST, or 12.6M objects in each of the 32 OST object subdirectories, which exceeds the standard ldiskfs limit of about 10M entries per directory. Although this limit of 320M objects per OST (32 subdirectories x 10M entries) grows proportionately with each DNE MDT, since each MDT allocates its OST objects under its own SEQ with its own set of object subdirectories, not all filesystems have multiple MDTs. While large_dir support allows this to "work" (i.e. the object directories can become arbitrarily large), performance will be poor, because each new object create or unlink is likely to hash into a different directory leaf block, and object create performance degrades to a random 4KB IOPS workload.
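
      As a back-of-the-envelope check of the numbers above, a trivial standalone C calculation (illustrative only, none of this is Lustre code; OST_SUBDIRS simply mirrors the 32 d{0..31}/ object subdirectories):

          #include <stdio.h>

          /* Directory-fill arithmetic for the 384TiB example above: how many
           * objects land in each of the 32 object subdirectories if a single
           * SEQ covers the whole OST. */
          #define OST_SUBDIRS 32ULL

          int main(void)
          {
                  unsigned long long ost_bytes = 384ULL << 40; /* 384 TiB OST */
                  unsigned long long obj_bytes = 1ULL << 20;   /* 1 MiB objects */
                  unsigned long long objects = ost_bytes / obj_bytes;
                  unsigned long long per_dir = objects / OST_SUBDIRS;

                  printf("objects on OST:   %lluM (%llu)\n",
                         objects >> 20, objects);
                  printf("per subdirectory: %.1fM (%llu) vs. ~10M ldiskfs limit\n",
                         per_dir / 1e6, per_dir);
                  return 0;
          }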

      There are performance inflections as each level of htree directory is added, at approximately 100k, 1M, and 10M entries (about 3MB, 30MB, and 300MB of directory blocks, respectively), so it makes sense to stay below 10M entries per directory if possible. This is balanced by some aggregation of IO, since leaf blocks are shared for some time, so we don't necessarily want to continually roll over to new directories.

      For such a large OST it would be better to reduce the number of objects created in each object directory, since the on-disk layout is transparent to the clients and MDS. One easy way to achieve this is to change the MDS object allocation code to reduce LUSTRE_DATA_SEQ_MAX_WIDTH from the current maximum of 4B objects per SEQ to something like 32M objects/SEQ (at most 1M objects per directory, even if none of them were ever deleted). After every 32M objects created on that OST, it would roll over to a new SEQ with its own d{0..31}/ object subdirectories, which would be empty and provide the best create/unlink performance. This has the benefit of keeping the object directories smaller, and it also ages old objects out of the working set rather than scattering random insertions across a large number of directory blocks.
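
      A minimal C sketch of that rollover logic (the identifiers struct seq_state, seq_alloc_next() and SEQ_WIDTH_DEFAULT are hypothetical, not the actual Lustre symbols or the real LUSTRE_DATA_SEQ_MAX_WIDTH plumbing):

          #include <stdio.h>
          #include <stdint.h>

          /* Hypothetical sketch only, not Lustre code. */
          #define OST_SUBDIRS       32
          #define SEQ_WIDTH_DEFAULT (32ULL << 20)  /* 32M objects per SEQ */

          struct seq_state {
                  uint64_t seq;        /* current sequence number */
                  uint64_t allocated;  /* objects handed out in this SEQ */
                  uint64_t width;      /* objects per SEQ before rollover */
          };

          /* Hand out the next object index, rolling over to a fresh SEQ
           * (with its own empty d{0..31}/ subdirectories on the OST) once
           * 'width' objects have been allocated from the current one. */
          static uint64_t seq_alloc_next(struct seq_state *s)
          {
                  if (s->allocated >= s->width) {
                          s->seq++;  /* real code would request a new SEQ
                                      * from the sequence controller */
                          s->allocated = 0;
                  }
                  return s->allocated++;
          }

          int main(void)
          {
                  struct seq_state s = { 0, 0, SEQ_WIDTH_DEFAULT };
                  uint64_t i;

                  for (i = 0; i < 3 * SEQ_WIDTH_DEFAULT; i++)
                          seq_alloc_next(&s);

                  /* With width 32M and 32 subdirectories, no subdirectory in
                   * any one SEQ can exceed 32M / 32 = 1M entries. */
                  printf("ended on SEQ %llu, max %llu entries/subdir\n",
                         (unsigned long long)s.seq,
                         (unsigned long long)(s.width / OST_SUBDIRS));
                  return 0;
          }

      Each rollover starts with empty subdirectories, so creates and unlinks stay in fresh, small leaf blocks instead of degrading to random 4KB updates across a huge htree.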

      To optimize disk space usage, this would also need "lazy" online directory shrinking (e.g. releasing directory leaf blocks at the end of the directory when they are unused), so that as objects are deleted from the old directories they use less space. There is already a preliminary patch for this ("ext4: shrink directory when last block is empty"), but it needs a bit more work to be ready for production. As a workaround, adding "-fD" to e2fsck runs would compact the old object directories once their blocks are empty. We'd also want to remove the empty object directories themselves when they are no longer used, but with directory shrinking that would only be about 100 blocks/SEQ ~= 400KB/SEQ, so it is not critical.


            People

              Assignee: Dongyang Li (dongyang)
              Reporter: Andreas Dilger (adilger)
              Votes: 0
              Watchers: 17
