[LU-11912] reduce number of OST objects created per MDS Sequence Created: 31/Jan/19  Updated: 03/Nov/23  Resolved: 29/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.6
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: ldiskfs, performance

Issue Links:
Blocker
Related
is related to LU-16863 sanity-pfl test_0b, replay-dual: open... Open
is related to LU-14023 sanity test_56oc: @@@@@@ FAIL: '/usr/... Open
is related to LU-16682 sanity-pfl test_1c: comp4 stripe coun... Open
is related to LU-16692 replay-single: test_70c osp_fid_diff(... Open
is related to LU-16720 large-scale test_3a osp_precreate_rol... Open
is related to LU-9547 LBUG osp_dev.c:755:osp_statfs()) ASSE... Resolved
is related to LU-14345 e2fsck of very large directories is b... Resolved
is related to LU-11546 enable large_dir support for MDTs Resolved
is related to LU-16057 OBD_MD_FLGROUP not set for ladvise rpc Resolved
is related to LU-12051 ldiskfs directory shrink Open
is related to LU-14692 deprecate use of OST FID SEQ 0 for MD... Resolved

 Description   

One issue with very large OSTs (e.g. > 256TiB or so) is that there can be too many objects in each OST object subdirectory. With e.g. a 384TiB OST and 1MiB objects there would need to be 384M objects on the OST, or 12.6M objects in each of the 32 OST object subdirectories, which exceeds the standard ldiskfs limit of 10M entries per directory. Although this limit of 320M objects per OST increases proportionately with each DNE MDT, not all filesystems have multiple MDTs. While large_dir support allows this to "work" (i.e. the object directories can become arbitrarily large), performance will be poor because each new object create or unlink is likely hashed to a different directory leaf block, so object create performance degrades to a random 4KB IOPS workload.

There are performance inflections as each level of htree directory is added, at approximately 100k, 1M, and 10M entries (about 3MB, 30MB, and 300MB of directory blocks respectively), so it makes sense to stay below 10M entries per directory if possible. This is balanced by some aggregation of I/O while the leaf blocks stay in the working set for some time, so we don't necessarily want to continually roll over to new directories.

For such a large OST, it would be better to reduce the number of objects created in each object directory, since the on-disk layout is transparent to the clients and MDS. One easy way to achieve this is to change the MDS object allocation code to reduce LUSTRE_DATA_SEQ_MAX_WIDTH from the current maximum of 4B objects per SEQ to something like 32M objects/SEQ (at most 1M objects per directory, even if none of them were ever deleted).  After every 32M objects created on that OST, it would roll over to a new SEQ with its own d{0..31}/ object subdirectories, which would be empty and provide the best create/unlink performance.  This has the benefit of keeping the object directories smaller, and also of aging old objects out of the working set rather than having random insertions across a large number of directory blocks.
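
For illustration, the on-disk layout after one such rollover would look roughly like the sketch below (the SEQ numbers are invented for the example; the real values come from the range the OST assigns to each MDT):

O/0x280000400/d0 ... d31   <- first SEQ, fills up to ~32M objects total
O/0x280000401/d0 ... d31   <- next SEQ, starts out empty

After each rollover, new creates and unlinks land in small, mostly-empty directories, while the previous SEQ's directories gradually drain as their objects are deleted.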

To optimize disk space usage, this would also need "lazy" online directory shrinking (e.g. releasing directory leaf blocks at the end of the directory if they are unused) so that as objects are deleted from the old directories they will use less space.  There is already a preliminary patch for this ("ext4: shrink directory when last block is empty") but it needs a bit more work to be ready for production. As a workaround, adding "-fD" to e2fsck runs would clean up the old object directory blocks when they are empty. We'd also want to remove the empty object directories themselves when they are no longer used, but with directory shrinking that would only amount to about 100 blocks/SEQ ~= 400KB/SEQ, so it is not critical.
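
For reference, the workaround pass would be something like the following, run against the unmounted OST device (the device path here is only a placeholder):

e2fsck -f -D /dev/ost0000

where "-D" tells e2fsck to optimize and re-pack directories, releasing the leaf blocks that have become empty.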



 Comments   
Comment by Andreas Dilger [ 08/Apr/20 ]

Per my comments in LU-9547 last year, I hit that LASSERT while running a build with LUSTRE_DATA_SEQ_MAX_WIDTH reduced to 0x1000 to exercise the SEQ changeover handling. It looks like the problem is that when the sequence rolls over and osp_precreate_rollover_new_seq() sets opd_pre_last_created_fid equal to opd_pre_used_fid, the opd_pre_reserved value may be non-zero and is not changed.

I had also tried setting "seq.cli-cli-*OST*.width=4096" on an unmodified system, but it doesn't appear that the values under "seq.*" are properly tied into the code. Changing the "width" did not seem to push object creation to a non-zero sequence for MDT0000. The parameters also showed "space=[0x0-0x0]:0:mdt" and "fid=[0x0:0x0:0x0]" for all MDT->OST connections.
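
For reference, the attempt was along these lines, run on the MDS (parameter pattern as above; the width value is just an example):

lctl set_param seq.cli-cli-*OST*.width=4096
lctl get_param seq.cli-cli-*OST*.*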

Comment by Dongyang Li [ 17/Apr/20 ]

Andreas, after playing with LUSTRE_DATA_SEQ_MAX_WIDTH, here is what I found; correct me if I'm wrong:

On the OST we place the objects under /O/{seq}/d[0-31], where seq is per MDT; for MDT0 the seq is FID_SEQ_OST_MDT0, which is 0. It looks like all the objects from one MDT end up under the same /O/{seq}; even with a reduced LUSTRE_DATA_SEQ_MAX_WIDTH the seq is still fixed for a single MDT.

I think a workaround would be creating a number of MDTs so that each MDT would only have ~32M objects on the OST, limiting the number of objects under /O/{seq}/.
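
As a rough sizing example, using the 384TiB OST with 1MiB objects from the description (~384M objects per OST), keeping each MDT's share at ~32M objects would take on the order of 384M / 32M = 12 MDTs.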

Another way could be adding a new level of directories under /O/{seq} that tracks object allocation progress, and limiting each level to 32M objects before rolling over. I'm looking at the related code to figure out how to do this.

Comment by Andreas Dilger [ 17/Apr/20 ]

Dongyang, presently each OST is assigned a range of SEQ values that it sub-assigns to each MDT, and each MDT uses its own sequence on the OST. MDT0000 is assigned SEQ=0, but ends up using the IDIF SEQ, which is 0x1<ost_idx><OID> instead of the more normal FID SEQ 0x20000xxx range that is assigned later.
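
For example (OST index 1 and an object ID taken from the listings below, so the values are only illustrative), object 0x1662 on testfs-OST0001 is addressed from MDT0000 via an IDIF FID such as [0x100010000:0x1662:0x0], with the OST index encoded in the SEQ, whereas the same object allocated from a normally-assigned OST SEQ would look like [0x2c0000400:0x1662:0x0].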

What should happen is that when the MDT runs out of objects in the SEQ range (currently 4B objects), it gets a new SEQ for that OST and starts using it. This should happen rapidly if the SEQ range is small.

It is possible that there are bugs in this mechanism, as it is used very rarely. Rather than work around the bugs, the code should be fixed to work properly. It would be good to get a bit more explanation of what is going wrong with the SEQ values/assignments on the MDT or OST.

Comment by Gerrit Updater [ 30/Apr/20 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/38424
Subject: LU-11912 ofd: reduce LUSTRE_DATA_SEQ_MAX_WIDTH
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aeb814066448024cb5cd619d0f25bd98c4182868

Comment by Andreas Dilger [ 30/Apr/20 ]

There are already tunable parameters for the SEQ server on the MDS and OSS that could be used to manage the width for the range of FID sequences that were assigned by the OST to this "client" MDT (which is in turn a "client" of the SEQ controller for the testfs filesystem), shown below:

seq.ctl-testfs-MDT0000.space=[0x380000400 - 0xffffffffffffffff]:0:mdt
seq.ctl-testfs-MDT0000.width=1073741824
seq.cli-testfs-OST0001-super.fid=[0x0:0x0:0x0] *missing*
seq.cli-testfs-OST0001-super.server=testfs-MDT0000_UUID
seq.cli-testfs-OST0001-super.space=[0x0 - 0x0]:1:mdt *missing*
seq.cli-testfs-OST0001-super.width=4294967295
seq.srv-testfs-OST0001.server=testfs-MDT0000_UUID
seq.srv-testfs-OST0001.space=[0x2c0000bd0 - 0x300000400]:1:ost
seq.cli-cli-testfs-OST0001-osc-MDT0001.fid=[0x0:0x0:0x0] *missing*
seq.cli-cli-testfs-OST0001-osc-MDT0001.server=testfs-OST0001_UUID
seq.cli-cli-testfs-OST0001-osc-MDT0001.space=[0x0 - 0x0]:0:mdt *missing*
seq.cli-cli-testfs-OST0001-osc-MDT0001.width=4294967295 *ignored*

The .width field is the correct IDIF_FID_SEQ_MAX_WIDTH value (without your patch), but it appears that the rest of the fields are not properly filled in, nor does changing ".width" on this MDT appear to affect the number of objects created on the OST, so that needs to be fixed. The code should be checking ".width" instead of using a hard-coded value, and showing the most recent OST object FID that was allocated by this MDT as well as the range of remaining SEQ numbers. The last preallocated OST object FIDs on MDT0001 can be seen via:

osp.testfs-OST0001-osc-MDT0001.prealloc_last_id=5761
osp.testfs-OST0001-osc-MDT0001.prealloc_last_seq=0x2c0000400
osp.testfs-OST0001-osc-MDT0001.prealloc_next_id=5730
osp.testfs-OST0001-osc-MDT0001.prealloc_next_seq=0x2c0000400

but they should be linked to the seq.*.fid entries.
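
For reference, these OSP preallocation parameters can be dumped on the MDS with a wildcard (target name copied from the listing above), e.g.:

lctl get_param osp.testfs-OST0001-osc-MDT0001.prealloc_*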

On the regular Lustre client (which is itself a SEQ client of MDT0001, which is in turn a client of the master SEQ controller on MDT0000), this appears to be connected up and working properly, so it may be useful to look at the client MDC code:

seq.srv-testfs-MDT0001.server=testfs-MDT0000_UUID
seq.srv-testfs-MDT0001.space=[0x240004a50 - 0x280000400]:1:mdt
seq.srv-testfs-MDT0001.width=1
seq.cli-testfs-MDT0001.server=srv-testfs-MDT0001
seq.cli-testfs-MDT0001.space=[0x240004282 - 0x240004282]:0:mdt
seq.cli-testfs-MDT0001.width=131072
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.fid=[0x240004282:0x7:0x0]
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.server=testfs-MDT0001_UUID
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.space=[0x240004283 - 0x240004283]:0:mdt
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.width=131072
# lfs setstripe -i 1 /mnt/testfs/dir1/bar
# lfs getstripe -vy /mnt/testfs/dir1/bar
lmm_fid:           0x240004282:0x7:0x0
lmm_objects:
      - l_ost_idx: 1
        l_fid:     0x2c0000400:0x1662:0x0

This shows the client created a file on testfs-MDT0001 (because of dirstriping on dir1), using a FID for the inode that it was assigned by testfs-MDT0001, and the OST object that testfs-MDT0001 used was assigned to it by testfs-OST0001 (0x1662 = 5730).

# lctl set_param seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.width=8
seq.cli-cli-testfs-MDT0000-mdc-ffff9f3ad3991800.width=8
# lfs setstripe -i 1 /mnt/testfs/dir1/bar /mnt/testfs/dir1/bag
[root@centos7 lustre-copy]# lfs getstripe -vy /mnt/testfs/dir1/ba*
/mnt/testfs/dir1/bag: [0x240004283:0x1:0x0]
/mnt/testfs/dir1/baz: [0x240004282:0x8:0x0]
/mnt/testfs/dir1/bar: [0x240004282:0x7:0x0]
# lctl get_param seq.cli-cli-testfs-MDT0000-mdc-ff*.*
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.fid=[0x240004283:0x1:0x0]
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.server=testfs-MDT0001_UUID
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.space=[0x240004284 - 0x24000428]:0:mdt
seq.cli-cli-testfs-MDT0001-mdc-ffff9f3ad3991800.width=8

This shows that on the regular MDC client, if I reduce the width and create new objects, the FID SEQ rolls over to the next sequence, and the MDC client prefetches a new SEQ from its server (which happens to be the next SEQ number here, though that would not normally be the case when there are many clients connected).

Comment by Gerrit Updater [ 25/Nov/21 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45659
Subject: LU-11912 fid: clean up OBIF_MAX_OID and IDIF_MAX_OID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 09ea80fb3fb6d9c82eded23e0e9088e90dd30901

Comment by Gerrit Updater [ 17/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45659/
Subject: LU-11912 fid: clean up OBIF_MAX_OID and IDIF_MAX_OID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bb2f0dac868cf1321277bc3d7d6fc71f016d921b

Comment by Gerrit Updater [ 28/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/38424/
Subject: LU-11912 ofd: reduce LUSTRE_DATA_SEQ_MAX_WIDTH
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0ecb2a167c56ffff8e4fcb5cf576fb8c5d9e64fe

Comment by Peter Jones [ 29/Mar/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 30/Mar/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50478
Subject: LU-11912 tests: Adjust SEQ width according to OST count
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 04d876e6309b3c43b07f8b51c481977a417bf55d

Comment by Oleg Drokin [ 07/Apr/23 ]

Looks like the patches landed from this ticket cause a new crash in full testing; filed LU-16720 for it.

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50478/
Subject: LU-11912 tests: SEQ rollover fixes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2fdb1f8d01b9f55f8270b48edc0e105e40d42f55

Comment by Gerrit Updater [ 13/Jun/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51292
Subject: LU-11912 tests: consume precreated objects in parallel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fff18a8adbd47d6e9e798ebbb68b93ce531b08dc

Comment by Gerrit Updater [ 28/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51292/
Subject: LU-11912 tests: consume precreated objects in parallel
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 656fc937cfd3fc3b65cb21a7f93a6bd4cc07fc0e

Comment by Gerrit Updater [ 18/Oct/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52741
Subject: LU-11912 tests: force new seq in runtests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f1b7df40ab532db2472ed06784f6d4189e91b005

Comment by Gerrit Updater [ 23/Oct/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52801
Subject: LU-11912 tests: fix racing in force_new_seq_all
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: af6dcd597d7f5134de553349c05091e51e0f3dd6

Comment by Gerrit Updater [ 03/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52741/
Subject: LU-11912 tests: force new seq in runtests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7d62340b3cbc49bae49dd2fa8d5e7a0a8e3c1b2e

Comment by Gerrit Updater [ 03/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52801/
Subject: LU-11912 tests: fix racing in force_new_seq_all
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3f8318e983f1925c7b9f367c270593233b956dff
