[LU-16750] optimize ldiskfs internal metadata allocation for hybrid storage LUNs Created: 20/Apr/23  Updated: 18/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: ldiskfs

Issue Links:
Duplicate
Related
is related to LU-14438 backport ldiskfs mballoc patches Open
is related to LU-14712 improve ldiskfs "-o discard" performance Open
is related to LU-15002 Allocate metagroup descriptors contin... Resolved

 Description   

With hybrid storage LUNs (combined HDD + SSD, or QLC + TLC flash) it is desirable to be able to separate ldiskfs metadata allocations (that need small random IOs) from data allocations (that are better suited for large sequential IOs) depending on the type of underlying storage. With LVM it is possible to create an LV with SSD storage at the beginning of the LV, and HDD storage at the end of the LV. Roughly 0.5% to 1% of the OST capacity would need to be high-IOPS storage in order to hold all of the internal ldiskfs metadata.

This would improve performance for inode and other metadata access, such as ls -l, (lfs) find, and e2fsck, and more generally reduce latency for file access, modification, truncate, unlink, transaction commit, etc.

For mke2fs, the following options look interesting for hybrid storage, so that all of the static ldiskfs metadata (group descriptors, block/inode bitmaps, inode tables, journal) is located at the start of the device in the (fast) flash region:

mkfs.lustre --mgsname testfs --ost --index=0 --mkfsoptions="-O sparse_super2 -E num_backup_sb=2,packed_meta_blocks=1" ... /dev/vgost0/lvost0

sparse_super2
      This feature indicates that there will only be at most two
      backup superblocks and block group descriptors.   The block
      groups used to store the backup superblock(s) and blockgroup
      descriptor(s) are stored in the superblock, but typically, one
      will be located at the beginning of block group #1, and one in
      the last block group in the file system.  This feature is essentially
      a more extreme version of sparse_super and is designed to
      allow a much larger percentage of the disk to have contiguous
      blocks available for data files.

num_backup_sb=<0,1,2>
      If the sparse_super2 file system feature is enabled
      this option controls whether there will be 0, 1, or
      2 backup superblocks created in the file system.

packed_meta_blocks=<0,1>
      Place  the allocation bitmaps and the inode table at
      the beginning of the disk.  This option requires
      that the flex_bg file system feature be enabled
      in order for it to have effect, and will also create
      the journal at the beginning of the file system.
      This option is useful for flash devices that use SLC
      flash at the beginning of the disk.  It also maximizes
      the range of contiguous data blocks, which can be
      useful for certain specialized use cases, such as
      supported Shingled Drives.

Unfortunately, there is not (yet) any mechanism to force dynamic metadata (directory blocks, indirect/index blocks, xattr blocks) to be allocated in the fast region at the start of the device. It makes sense for mke2fs and/or tune2fs to be able to mark "fast" groups in the group descriptor with a flag, like:

#define EXT4_BG_IOPS     0x0010

(note that EXT4_BG_WAS_TRIMMED = 0x0008 is tentatively reserved).

This could be set at format time (e.g. "-E iops=0-1024G,4096-8192G" or similar to indicate where the "IOPS" storage lived), but since it is a per-group field, it could also be used at a 128MB granularity for more arbitrary separation of "IOPS" vs. "slow" storage (e.g. add "IOPS" storage at the end of the device, or interleaved in smaller or larger chunks in case of filesystem resize after creation).
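As an illustration of how a range specification such as "-E iops=0-1024G,4096-8192G" might map onto block groups, here is a minimal C sketch. It is not taken from the e2fsprogs patch: the names grp_range and parse_iops_ranges are hypothetical, the group size is assumed to be 128MB, a start offset without a unit suffix is assumed to inherit the suffix of the end offset, and only groups lying entirely inside a range are reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GROUP_BYTES (128ULL << 20)	/* assumed: 32768 blocks of 4KB */

/* Hypothetical result type: groups [first, last) would get the IOPS flag */
struct grp_range {
	uint64_t first;
	uint64_t last;		/* exclusive */
};

/* Map a unit suffix to a shift count, or -1 if c is not a suffix */
static int unit_shift(char c)
{
	switch (c) {
	case 'M': return 20;
	case 'G': return 30;
	case 'T': return 40;
	default:  return -1;
	}
}

/* Parse one "START-END" entry at *p and advance *p past it; 0 on success */
static int parse_one(const char **p, struct grp_range *r)
{
	char *end;
	unsigned long long start, stop;
	int s_shift, e_shift;

	start = strtoull(*p, &end, 10);
	if (end == *p)
		return -1;
	s_shift = unit_shift(*end);
	if (s_shift >= 0)
		end++;
	if (*end != '-')
		return -1;
	*p = end + 1;

	stop = strtoull(*p, &end, 10);
	if (end == *p)
		return -1;
	e_shift = unit_shift(*end);
	if (e_shift >= 0)
		end++;
	else
		e_shift = 0;		/* no suffix: plain bytes */
	if (s_shift < 0)
		s_shift = e_shift;	/* assumption: inherit the end's unit */

	start <<= s_shift;		/* overflow is not checked in this sketch */
	stop <<= e_shift;

	/* Round inward so that only fully covered groups are included */
	r->first = (start + GROUP_BYTES - 1) / GROUP_BYTES;
	r->last = stop / GROUP_BYTES;
	if (r->last <= r->first)
		return -1;
	*p = end;
	return 0;
}

/* Parse a comma-separated list; returns the number of ranges, or -1 */
int parse_iops_ranges(const char *spec, struct grp_range *out, int max)
{
	int n = 0;

	assert(spec != NULL && out != NULL);
	while (*spec != '\0') {
		if (n == max || parse_one(&spec, &out[n]) != 0)
			return -1;
		n++;
		if (*spec == ',')
			spec++;
		else if (*spec != '\0')
			return -1;
	}
	return n;
}
```

With these assumptions, "0-1024G" covers groups 0 through 8191, since 1024GB / 128MB = 8192 groups.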

The mballoc code could then use the IOPS flag in the group descriptor to decide in which groups to allocate dynamic filesystem metadata, which prefers high-IOPS storage. Since the block allocator knows that the storage is IOPS-oriented, it can pack these (mostly single-block) allocations densely rather than trying to align them the way it would large allocations.
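The group selection described here could look roughly like the following C sketch. It is a simplified user-space model, not the mballoc code: struct grp_info and the pick_* helpers are hypothetical, only the EXT4_BG_IOPS value comes from this ticket, and a plain linear scan stands in for the real allocator's search. Metadata falls back to non-IOPS groups when the IOPS groups are full, while data stays out of the IOPS groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_BG_IOPS 0x0010	/* flag value proposed in this ticket */

/* Hypothetical, simplified in-memory view of one block group */
struct grp_info {
	uint16_t bg_flags;
	uint32_t free_blocks;
};

static int grp_is_iops(const struct grp_info *g)
{
	return (g->bg_flags & EXT4_BG_IOPS) != 0;
}

/* First group of the wanted kind with at least nblocks free, or -1 */
static long pick_group(const struct grp_info *grp, size_t ngroups,
		       int want_iops, uint32_t nblocks)
{
	size_t i;

	assert(grp != NULL || ngroups == 0);
	for (i = 0; i < ngroups; i++) {
		if (grp_is_iops(&grp[i]) == !!want_iops &&
		    grp[i].free_blocks >= nblocks)
			return (long)i;
	}
	return -1;
}

/* Metadata (single block): prefer IOPS groups, fall back to the rest */
long pick_metadata_group(const struct grp_info *grp, size_t ngroups)
{
	long g = pick_group(grp, ngroups, 1, 1);

	if (g < 0)
		g = pick_group(grp, ngroups, 0, 1);
	return g;
}

/* Data: only non-IOPS groups are considered in this sketch */
long pick_data_group(const struct grp_info *grp, size_t ngroups,
		     uint32_t nblocks)
{
	return pick_group(grp, ngroups, 0, nblocks);
}
```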

Having separate block groups for IOPS allocations will also keep such allocations out of the non-IOPS groups, leaving those groups better able to serve large streaming read/write operations. This is similar to the benefit seen with DoM + HDD OSTs at the Lustre file level, but without the runtime/Lustre layout complexity.

For the new mballoc list-based allocator (LU-12970) the presence of groups marked IOPS would be best handled by creating a second size-array of list_heads sorting the IOPS groups by free blocks size. Then, when doing a block allocation for a directory, or an indirect/index block, or an xattr block, mballoc can look into the IOPS array instead of the regular array. The fact that these metadata blocks are not close to the referencing inodes is mostly irrelevant, since they are on a different block device, and (by nature of the underlying storage) have no seek latency.
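The two-array idea can be sketched in C as below. This is a user-space model only, assuming 32768 blocks per group and bucketing by floor(log2(free blocks)); the struct and function names are hypothetical, and simple singly linked lists stand in for the kernel's list_head. Metadata lookups consult only the IOPS array and data lookups only the regular one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_BG_IOPS 0x0010	/* flag value proposed in this ticket */
#define NR_ORDERS 16		/* assumed 32768 blocks/group: orders 0..15 */

/* Hypothetical per-group record, chained into one bucket at a time */
struct grp {
	uint32_t group_no;
	uint16_t bg_flags;
	uint32_t free_blocks;
	struct grp *next;
};

/* Two parallel size-indexed arrays: regular groups and IOPS groups */
struct size_lists {
	struct grp *regular[NR_ORDERS];
	struct grp *iops[NR_ORDERS];
};

/* floor(log2(n)) for n > 0, clamped to the last bucket */
static int order_of(uint32_t n)
{
	int o = 0;

	while (n >>= 1)
		o++;
	return o < NR_ORDERS ? o : NR_ORDERS - 1;
}

/* File a group under the bucket matching its free block count */
void lists_add(struct size_lists *sl, struct grp *g)
{
	struct grp **head;

	if (g->free_blocks == 0)
		return;		/* full groups are kept off every list */
	head = (g->bg_flags & EXT4_BG_IOPS) ? sl->iops : sl->regular;
	head += order_of(g->free_blocks);
	g->next = *head;
	*head = g;
}

/* Find a group with at least nblocks free in the array for this type */
struct grp *lists_find(struct size_lists *sl, int metadata, uint32_t nblocks)
{
	struct grp **arr = metadata ? sl->iops : sl->regular;
	struct grp *g;
	int o;

	assert(nblocks > 0);
	for (o = order_of(nblocks); o < NR_ORDERS; o++) {
		/* the starting bucket may hold groups that are too small */
		for (g = arr[o]; g != NULL; g = g->next) {
			if (g->free_blocks >= nblocks)
				return g;
		}
	}
	return NULL;
}
```

Starting the search at the smallest adequate bucket tends to fill nearly-full IOPS groups first, which matches the densely-packed placement wanted for single-block metadata allocations.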



 Comments   
Comment by Andreas Dilger [ 13/Jun/23 ]

It is expected that the size of the IOPS storage is relatively small compared to the non-IOPS storage. About 0.5% is enough to hold the static metadata (inode tables, bitmaps, etc.) plus enough extra space for dynamically allocated metadata (directory blocks, indirect/index blocks, xattr blocks unless there are many large xattrs). As such, it makes sense to reserve the IOPS storage exclusively for metadata usage, and the non-IOPS storage should be preferred for data unless there is no free IOPS space.

The IOPS storage should not normally be used for data, but it makes sense to have a tunable parameter (e.g. /sys/fs/ext4/sdX/iops_free_threshold or similar) that controls at what percentage of free space the IOPS groups could be used for data allocations. Normally this would be =0, meaning the IOPS space should never be used for data, but it could be set to e.g. 1% or 5% free (i.e. data may use IOPS groups once the filesystem is above 99% or 95% full) if there is a lot of IOPS space and the administrator really wants to use it for data.
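A minimal C sketch of the threshold check follows. The function name and its parameters are hypothetical, and which free-space counters the real tunable would compare is an assumption here; the caller passes whatever free and total block counts the policy is based on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether data may be allocated from IOPS groups.
 * threshold_pct == 0 means "never", as in the default described above.
 * Otherwise data may spill into IOPS groups once the free space has
 * dropped to threshold_pct percent of the total or below.
 */
bool iops_usable_for_data(uint64_t free_blocks, uint64_t total_blocks,
			  unsigned int threshold_pct)
{
	assert(threshold_pct <= 100);
	if (threshold_pct == 0 || total_blocks == 0)
		return false;
	/* free/total <= pct/100, written without division or floats */
	return free_blocks * 100 <= total_blocks * (uint64_t)threshold_pct;
}
```

For example, with a threshold of 5 and 1000 total blocks, IOPS groups would become eligible for data once 50 or fewer blocks remain free.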

Comment by Gerrit Updater [ 11/Jul/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51625
Subject: LU-16750 ldiskfs: optimize metadata allocation for hybrid LUNs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9e31dd0512bbb10bea5bf093ed607222b84f782

Comment by Gerrit Updater [ 21/Jul/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/51735
Subject: LU-16750 mke2fs: add "-E iops" to set IOPS storage group
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 177299183e81be66b1c8ead4755357452e87f8a2

Comment by Gerrit Updater [ 07/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/51735/
Subject: LU-16750 mke2fs: add "-E iops" to set IOPS storage group
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 7ac1b50954cb02d2db18ce462b83ef4ba653b0dc

Comment by Gerrit Updater [ 25/Aug/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52091
Subject: LU-16750 tune2fs: add "-E iops" to set/clear IOPS storage group
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 3f5b37336ca396128352512878de87f65cd07193

Comment by Gerrit Updater [ 31/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52091/
Subject: LU-16750 tune2fs: add "-E iops" to set/clear IOPS groups
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: a59ac3441448d61d66880e2e5329585191c98716

Generated at Sat Feb 10 03:29:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.