[LU-11912] reduce number of OST objects created per MDS Sequence | Created: 31/Jan/19 | Updated: 03/Nov/23 | Resolved: 29/Mar/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.10.6 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ldiskfs, performance | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
One issue with very large OSTs (e.g. > 256TiB or so) is that there can be too many objects in each OST object subdirectory. With e.g. a 384TiB OST and 1MiB objects there would need to be 384M objects on the OST, or 12.6M objects in each of the 32 OST object subdirectories, which exceeds the standard ldiskfs limit of 10M entries per directory. Although this limit of 320M objects per OST increases proportionately with each DNE MDT, not all filesystems have multiple MDTs.

While large_dir support allows this to "work" (i.e. the object directories can become arbitrarily large), performance will be poor because each new object create or unlink is likely hashed to a different directory leaf block, and object create performance degrades to a random 4KB IOPS workload. There are performance inflections as each level of htree directory is added, at approximately 100k, 1M, and 10M entries (about 3MB, 30MB, and 300MB of directory blocks respectively), so it makes sense to stay below 10M entries per directory if possible. This is balanced by some aggregation of IO as the leaf blocks are shared for some time, so we don't necessarily want to continually roll over to new directories.

For such a large OST, it would be better to reduce the number of objects created in each object directory, since the on-disk layout is transparent to the clients and MDS. One easy way to achieve this is to change the MDS object allocation code to reduce LUSTRE_DATA_SEQ_MAX_WIDTH from the current maximum of 4B objects per SEQ to something like 32M objects/SEQ (up to 1M objects per directory if none of them were ever deleted). After every e.g. 32M objects created on that OST it would roll over to a new SEQ with its own d{0..31}/ object subdirectories, which would be empty and provide the best create/unlink performance. This has the benefit of keeping the object directories smaller, and also of aging out old objects from the working set rather than having random insertions into a large number of directory blocks.

To optimize disk space usage, this would also need "lazy" online directory shrinking (e.g. release directory leaf blocks at the end of the directory if they are unused) so that as objects are deleted from the old directories they use less space. There is already a preliminary patch for this ("ext4: shrink directory when last block is empty") but it needs a bit more work to be ready for production. As a workaround, adding "e2fsck -fD" to e2fsck runs would clean up the old object directory blocks when they are empty. We'd also want to remove the empty object directories themselves when they are no longer used, but with directory shrinking that would only be about 100 blocks/SEQ ~= 400KB/SEQ, so not critical. |
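For reference, the sizing above works out as follows, and the e2fsck workaround would be run against the unmounted OST device (a sketch only; the device name is a placeholder):

# 384TiB of 1MiB objects, hashed across the 32 d{0..31} subdirectories:
echo $((384 * 1024 * 1024))        # 402653184 ~= 384M objects on the OST
echo $((384 * 1024 * 1024 / 32))   # 12582912 ~= 12.6M entries per subdirectory
# Workaround from the description: force a full check and repack/compress
# directory blocks on the unmounted OST (device name is a placeholder):
e2fsck -f -D /dev/sdX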
| Comments |
| Comment by Andreas Dilger [ 08/Apr/20 ] |
|
Per comments in I had also tried setting "seq.cli-cli-*OST*.width=4096" on an unmodified system, but it doesn't appear that the values under "seq.*" are properly tied into the code. Changing the "width" did not seem to push object creation to a non-zero sequence for MDT0000. The seq.* parameters also showed "space=[0x0-0x0]:0:mdt" and "fid=[0x0:0x0:0x0]" for all MDT->OST connections. |
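For reference, the attempt described here amounts to something like the following on the MDS (a sketch; the parameter glob is the one quoted above, and whether the setting actually takes effect is exactly what is in question):

# Attempted tuning and verification (run on the MDS):
lctl set_param 'seq.cli-cli-*OST*.width=4096'
lctl get_param 'seq.cli-cli-*OST*.width' 'seq.cli-cli-*OST*.space' 'seq.cli-cli-*OST*.fid'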
| Comment by Dongyang Li [ 17/Apr/20 ] |
|
Andreas, after playing with LUSTRE_DATA_SEQ_MAX_WIDTH here is what I found, correct me if I'm wrong: on the OST we place the objects under /O/{seq}/d[0-31], where seq is per MDT; for MDT0 the seq is FID_SEQ_OST_MDT0, which is 0. It looks like all the objects from one MDT end up under the same /O/{seq}, and even with a reduced LUSTRE_DATA_SEQ_MAX_WIDTH the seq is still fixed for a single MDT. I think one workaround would be creating a number of MDTs so that each MDT would have at most 32M objects on the OST, limiting the number of objects under /O/{seq}/. Another way could be adding a new level of directory under /O/{seq} based on object allocation progress, and limiting each level to 32M objects before rolling over. I'm looking at the related code to figure out how to do this. |
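To make the layout concrete, the object tree being described can be inspected by mounting the OST as ldiskfs directly (a sketch; the device and mountpoint names are placeholders, and the modulo-32 hashing of object IDs into d0..d31 is assumed here rather than quoted from the ticket):

# Inspect the OST object tree directly (OST mounted as type ldiskfs):
mount -t ldiskfs /dev/sdX /mnt/ost0
ls /mnt/ost0/O/0/            # d0 d1 ... d31 for MDT0000 (seq FID_SEQ_OST_MDT0 = 0)
ls /mnt/ost0/O/0/d5 | head   # objects whose IDs fall into d5 (object ID % 32 == 5)
umount /mnt/ost0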
| Comment by Andreas Dilger [ 17/Apr/20 ] |
|
Dongyang, currently each OST is assigned a range of SEQ values that it sub-assigns to each MDT, and each MDT uses its own sequence on the OST. MDT0000 is assigned SEQ=0, but ends up using the IDIF SEQ, which is 0x1<ost_idx><OID> instead of the more normal FID SEQ 0x20000xxx range that is assigned later. What should happen is that when the MDT runs out of objects in the SEQ range (currently 4B objects), it gets a new SEQ for that OST and starts using it. This should happen rapidly if the SEQ range is small. It is possible that there are bugs in this mechanism, as it is exercised very rarely. Rather than work around the bugs, the code should be fixed to work properly. It would be good to get a bit more explanation of what is going wrong with the SEQ values/assignments on the MDT or OST. |
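To illustrate the IDIF layout mentioned above, the OST index is packed into the sequence number itself; a rough shell sketch of the arithmetic, not the actual Lustre code:

# IDIF SEQ ~= 0x100000000 | (ost_idx << 16) | (high bits of the object ID);
# for ost_idx=1 and a small object ID this yields SEQ 0x100010000:
ost_idx=1; objid=5730
printf 'seq=0x%x oid=0x%x\n' $((0x100000000 | (ost_idx << 16) | (objid >> 32))) $((objid & 0xffffffff))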
| Comment by Gerrit Updater [ 30/Apr/20 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/38424 |
| Comment by Andreas Dilger [ 30/Apr/20 ] |
|
There are already tunable parameters for the SEQ server on the MDS and OSS that could be used to manage the width of the range of FID sequences that were assigned by the OST to this "client" MDT (which is in turn a "client" of the SEQ controller for the testfs filesystem), shown below:

seq.ctl-testfs-MDT0000.space=[0x380000400 - 0xffffffffffffffff]:0:mdt
seq.ctl-testfs-MDT0000.width=1073741824
seq.cli-testfs-OST0001-super.fid=[0x0:0x0:0x0] *missing*
seq.cli-testfs-OST0001-super.server=testfs-MDT0000_UUID
seq.cli-testfs-OST0001-super.space=[0x0 - 0x0]:1:mdt *missing*
seq.cli-testfs-OST0001-super.width=4294967295
seq.srv-testfs-OST0001.server=testfs-MDT0000_UUID
seq.srv-testfs-OST0001.space=[0x2c0000bd0 - 0x300000400]:1:ost
seq.cli-cli-testfs-OST0001-osc-MDT0001.fid=[0x0:0x0:0x0] *missing*
seq.cli-cli-testfs-OST0001-osc-MDT0001.server=testfs-OST0001_UUID
seq.cli-cli-testfs-OST0001-osc-MDT0001.space=[0x0 - 0x0]:0:mdt *missing*
seq.cli-cli-testfs-OST0001-osc-MDT0001.width=4294967295 *ignored*

The .width field is the correct IDIF_FID_SEQ_MAX_WIDTH value (without your patch), but it appears that the rest of the fields are not properly filled in, nor does changing ".width" on this MDT appear to affect the number of objects created on the OST, so that needs to be fixed. This should be checking ".width" instead of using a hard-coded value, and showing the most recent OST object FID that was allocated by this MDT as well as the range of remaining SEQ numbers. The last preallocated OST object FIDs on MDT0001 can be seen via:

osp.testfs-OST0001-osc-MDT0001.prealloc_last_id=5761
osp.testfs-OST0001-osc-MDT0001.prealloc_last_seq=0x2c0000400
osp.testfs-OST0001-osc-MDT0001.prealloc_next_id=5730
osp.testfs-OST0001-osc-MDT0001.prealloc_next_seq=0x2c0000400

but they should be linked to the seq.*.fid entries. On the regular Lustre client (which is itself a SEQ client of an MDT that is in turn a client of the master SEQ controller), this appears to be connected up and working properly, so it may be useful to look at the client MDC code:

seq.srv-testfs-MDT0001.server=testfs-MDT0000_UUID
seq.srv-testfs-MDT0001.space=[0x240004a50 - 0x280000400]:1:mdt
seq.srv-testfs-MDT0001.width=1
seq.cli-testfs-MDT0001.server=srv-testfs-MDT0001
seq.cli-testfs-MDT0001.space=[0x240004282 - 0x240004282]:0:mdt
seq.cli-testfs-MDT0001.width=131072
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.fid=[0x240004282:0x7:0x0]
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.server=testfs-MDT0001_UUID
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.space=[0x240004283 - 0x240004283]:0:mdt
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.width=131072
# lfs setstripe -i 1 /mnt/testfs/dir1/bar
# lfs getstripe -vy /mnt/testfs/dir1/bar
lmm_fid: 0x240004282:0x7:0x0
lmm_objects:
- l_ost_idx: 1
l_fid: 0x2c0000400:0x1662:0x0
This shows the client created a file on testfs-MDT0001 (because of dirstriping on dir1), using a FID that it was assigned from testfs-MDT0001 for the inode, and the OST object that testfs-MDT0001 used was assigned to it by testfs-OST0001 (0x1662 = 5730).

# lctl set_param seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.width=8 seq.cli-cli-testfs-MDT0000-mdc-ffff9f3ad3991800.width=8
# lfs setstripe -i 1 /mnt/testfs/dir1/bar /mnt/testfs/dir1/bag
[root@centos7 lustre-copy]# lfs getstripe -vy /mnt/testfs/dir1/ba*
/mnt/testfs/dir1/bag: [0x240004283:0x1:0x0]
/mnt/testfs/dir1/baz: [0x240004282:0x8:0x0]
/mnt/testfs/dir1/bar: [0x240004282:0x7:0x0]
# lctl get_param seq.cli-cli-testfs-MDT0000-mdc-ff*.*
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.fid=[0x240004283:0x1:0x0]
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.server=testfs-MDT0001_UUID
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.space=[0x240004284 - 0x24000428]:0:mdt
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.width=8

This shows that on the regular MDC client, if I reduce the width and create new objects, the FID SEQ is rolled over to the next sequence, and the MDC client prefetches a new SEQ from its server (which happens to be the next SEQ number, but would not normally happen when there are many clients connected). |
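For completeness, the server-side values quoted in this comment can be gathered on the MDS with something like the following (the parameter names are the ones shown above; the MDT/OST indices need to match the local setup):

# SEQ controller/client state for the OST connection, as seen from the MDS:
lctl get_param 'seq.*testfs-OST0001*.*'
# Last/next preallocated OST object FIDs for the MDT0001->OST0001 connection:
lctl get_param 'osp.testfs-OST0001-osc-MDT0001.prealloc_*'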
| Comment by Gerrit Updater [ 25/Nov/21 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45659 |
| Comment by Gerrit Updater [ 17/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45659/ |
| Comment by Gerrit Updater [ 28/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/38424/ |
| Comment by Peter Jones [ 29/Mar/23 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 30/Mar/23 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50478 |
| Comment by Oleg Drokin [ 07/Apr/23 ] |
|
Looks like the landings for this ticket cause a new crash in full testing; filed LU-16720 for it. |
| Comment by Gerrit Updater [ 22/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50478/ |
| Comment by Gerrit Updater [ 13/Jun/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51292 |
| Comment by Gerrit Updater [ 28/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51292/ |
| Comment by Gerrit Updater [ 18/Oct/23 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52741 |
| Comment by Gerrit Updater [ 23/Oct/23 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52801 |
| Comment by Gerrit Updater [ 03/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52741/ |
| Comment by Gerrit Updater [ 03/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52801/ |