Lustre / LU-16750

optimize ldiskfs internal metadata allocation for hybrid storage LUNs

Details

    • Improvement
    • Resolution: Unresolved
    • Major

    Description

      With hybrid storage LUNs (combined HDD + SSD, or QLC + TLC flash) it is desirable to separate ldiskfs metadata allocations (which need small random IOs) from data allocations (which are better suited to large sequential IOs) according to the type of underlying storage. With LVM it is possible to create an LV with SSD storage at the beginning of the LV and HDD storage at the end. Roughly 0.5-1% of the OST capacity would need to be high-IOPS storage in order to hold all of the internal ldiskfs metadata.
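
As a rough illustration of the 0.5-1% figure, the sketch below computes how much fast storage that fraction implies for a given OST size, rounded up to whole 128 MiB block groups. The helper name `fast_region_bytes` and the per-mille parameter are made up for this example and are not part of ldiskfs or e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>

/* One ext4/ldiskfs block group with 4 KiB blocks covers 128 MiB. */
#define DEMO_GROUP_BYTES (128ULL << 20)

/*
 * Hypothetical helper: bytes of high-IOPS storage needed to cover
 * `permille` thousandths of an OST, rounded up to whole block groups.
 */
static uint64_t fast_region_bytes(uint64_t ost_bytes, unsigned permille)
{
    assert(permille <= 1000);

    /* Divide first so very large OST sizes cannot overflow 64 bits. */
    uint64_t raw = ost_bytes / 1000 * permille;
    uint64_t groups = (raw + DEMO_GROUP_BYTES - 1) / DEMO_GROUP_BYTES;

    return groups * DEMO_GROUP_BYTES;
}
```

For a 100 TiB OST this gives 512 GiB at 0.5% and 1 TiB at 1%.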

      This would improve performance for inode and other metadata access (ls -l, (lfs) find, e2fsck) and, more generally, the latency of file access, modification, truncate, unlink, transaction commit, etc.

      For mke2fs, the following options look interesting for hybrid storage, so that all of the static ldiskfs metadata (group descriptors, block/inode bitmaps, inode tables, journal) is located at the start of the device in the (fast) flash region:

      mkfs.lustre --fsname testfs --ost --index=0 --mkfsoptions="-O sparse_super2 -E num_backup_sb=2,packed_meta_blocks=1" ... /dev/vgost0/lvost0
      
      sparse_super2
            This feature indicates that there will only be at most two
            backup superblocks and block group descriptors.   The block
            groups used to store the backup superblock(s) and blockgroup
            descriptor(s) are stored in the superblock, but typically, one
            will be located at the beginning of block group #1, and one in
            the last block group in the file system.  This feature is essentially
            a more extreme version of sparse_super and is designed to
            allow a much larger percentage of the disk to have contiguous
            blocks available for data files.
      
      num_backup_sb=<0,1,2>
            If the sparse_super2 file system feature is enabled
            this option controls whether there will be 0, 1, or
            2 backup superblocks created in the file system.
      
      packed_meta_blocks=<0,1>
            Place the allocation bitmaps and the inode table at
            the beginning of the disk.  This option requires
            the flex_bg file system feature to be enabled
            in order for it to have effect, and will also create
            the journal at the beginning of the file system.
            This option is useful for flash devices that use SLC
            flash at the beginning of the disk.  It also maximizes
            the range of contiguous data blocks, which can
            be useful for certain specialized use cases, such as
            supported Shingled Drives.
      

      Unfortunately, there is not (yet) any mechanism to force dynamic metadata (directory blocks, indirect/index blocks, xattr blocks) to be allocated in the fast region at the start of the device. It makes sense for mke2fs and/or tune2fs to be able to mark "fast" groups in the group descriptor with a flag, like:

      #define EXT4_BG_IOPS     0x0010
      

      (note that EXT4_BG_WAS_TRIMMED = 0x0008 is tentatively reserved).
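
A minimal sketch of how the proposed flag would sit next to the existing group descriptor flags. The three low `EXT4_BG_*` values are the existing ext4 ones; `EXT4_BG_IOPS` is only the value proposed in this ticket, and `struct demo_group_desc` with its helpers is a hypothetical stand-in for the real descriptor (whose `bg_flags` field is little-endian on disk, which is ignored here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing ext4 group descriptor flags (bg_flags). */
#define EXT4_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* block bitmap not in use */
#define EXT4_BG_INODE_ZEROED 0x0004 /* on-disk inode table is zeroed */
/* 0x0008 is tentatively reserved for EXT4_BG_WAS_TRIMMED. */
#define EXT4_BG_IOPS         0x0010 /* proposed: group is on high-IOPS storage */

/* Hypothetical in-memory stand-in for the group descriptor. */
struct demo_group_desc {
    uint16_t bg_flags;
};

static bool demo_group_is_iops(const struct demo_group_desc *gd)
{
    assert(gd);
    return (gd->bg_flags & EXT4_BG_IOPS) != 0;
}

static void demo_group_set_iops(struct demo_group_desc *gd, bool on)
{
    assert(gd);
    if (on)
        gd->bg_flags |= EXT4_BG_IOPS;
    else
        gd->bg_flags &= (uint16_t)~EXT4_BG_IOPS;
}
```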

      This could be set at format time (e.g. "-E iops=0-1024G,4096-8192G" or similar to indicate where the "IOPS" storage lives), but since it is a per-group field, it could also be used at 128MB granularity (one block group with 4KB blocks) for more arbitrary separation of "IOPS" vs. "slow" storage (e.g. adding "IOPS" storage at the end of the device, or interleaving it in smaller or larger chunks if the filesystem is resized after creation).
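
The mapping from such a range list to per-group flags could look roughly like the sketch below. The "iops=" option does not exist in mke2fs today. The parser assumes that the trailing "G" applies to both ends of a range, that the end is exclusive, and that only the "G" suffix is supported. All `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_GROUP_BYTES (128ULL << 20) /* one group with 4 KiB blocks */
#define EXT4_BG_IOPS     0x0010         /* proposed flag from this ticket */

/* Flag every group inside [start_gib, end_gib); 1 GiB is exactly 8 groups. */
static void demo_mark_range(uint16_t *bg_flags, uint64_t ngroups,
                            uint64_t start_gib, uint64_t end_gib)
{
    uint64_t first = (start_gib << 30) / DEMO_GROUP_BYTES;
    uint64_t end = (end_gib << 30) / DEMO_GROUP_BYTES; /* exclusive */

    for (uint64_t g = first; g < end && g < ngroups; g++)
        bg_flags[g] |= EXT4_BG_IOPS;
}

/*
 * Parse a spec such as "0-1024G,4096-8192G" and flag the matching groups.
 * Returns the number of ranges applied, or -1 on a malformed spec (ranges
 * parsed before the error remain applied in this simple sketch).
 */
static int demo_parse_iops(const char *spec, uint16_t *bg_flags,
                           uint64_t ngroups)
{
    int count = 0;

    assert(spec && bg_flags);
    while (*spec) {
        unsigned long long start, end;
        int used = 0;

        /* %n is only stored if the literal 'G' suffix matched. */
        if (sscanf(spec, "%llu-%lluG%n", &start, &end, &used) != 2 ||
            used == 0 || start >= end)
            return -1;
        demo_mark_range(bg_flags, ngroups, start, end);
        count++;
        spec += used;
        if (*spec == ',')
            spec++;
        else if (*spec)
            return -1;
    }
    return count;
}
```

For an 8 TiB device (65536 groups), the example spec flags groups 0-8191 and 32768-65535 and leaves the groups in between for data.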

      The mballoc code could then use the IOPS flag in the group descriptor to decide in which groups to allocate dynamic filesystem metadata, which benefits from high-IOPS storage. Since the block allocator knows that these groups are IOPS-oriented, it can pack these (mostly single-block) allocations densely rather than trying to align large allocations.
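
The difference between dense and aligned placement can be sketched with a toy bitmap search. This is not mballoc code: the request classification and the `demo_*` helpers are hypothetical, and the search is a plain first-fit where the alignment is 1 for IOPS groups and some larger value (for example a RAID stripe) for data groups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of block allocation requests. */
enum demo_alloc_kind {
    DEMO_DATA,
    DEMO_DIR_BLOCK,
    DEMO_INDEX_BLOCK,
    DEMO_XATTR_BLOCK,
};

/* Dynamic metadata prefers groups flagged IOPS; file data does not. */
static bool demo_wants_iops(enum demo_alloc_kind kind)
{
    return kind != DEMO_DATA;
}

static bool demo_bit_set(const uint8_t *bitmap, unsigned bit)
{
    return (bitmap[bit / 8] & (1u << (bit % 8))) != 0;
}

/*
 * First-fit search for `len` free blocks starting at a multiple of `align`.
 * With align == 1 this packs allocations densely (IOPS groups); a larger
 * align keeps extents aligned (data groups). Returns the start block, or
 * -1 if no suitable run exists.
 */
static int demo_find_free(const uint8_t *bitmap, unsigned nbits,
                          unsigned len, unsigned align)
{
    assert(len > 0 && align > 0);
    for (unsigned start = 0; start + len <= nbits; start += align) {
        unsigned i = 0;

        while (i < len && !demo_bit_set(bitmap, start + i))
            i++;
        if (i == len)
            return (int)start;
    }
    return -1;
}
```

With blocks 0 and 1 in use, a single metadata block lands at block 2, while a 4-block data extent aligned to 8 blocks skips ahead to block 8.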

      Having separate block groups for IOPS allocations will also keep such allocations out of the non-IOPS groups, so that those groups can better sustain large streaming read/write operations. This is similar to the benefit seen with DoM + HDD OSTs at the Lustre file level, but without the runtime/Lustre layout complexity.

      For the new mballoc list-based allocator (LU-12970) the presence of groups marked IOPS would be best handled by creating a second size-array of list_heads sorting the IOPS groups by free blocks size. Then, when doing a block allocation for a directory, or an indirect/index block, or an xattr block, mballoc can look into the IOPS array instead of the regular array. The fact that these metadata blocks are not close to the referencing inodes is mostly irrelevant, since they are on a different block device, and (by nature of the underlying storage) have no seek latency.
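
The two-array idea can be sketched as follows. This is only loosely modeled on the description above and is not the LU-12970 code: groups are bucketed by floor(log2(free blocks)), singly linked lists stand in for kernel list_heads, fragmentation inside a group is ignored, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ORDERS 16 /* a group holds at most 2^15 = 32768 blocks */

struct demo_group {
    unsigned id;
    unsigned free;           /* free blocks in this group */
    bool iops;               /* group carries the proposed IOPS flag */
    struct demo_group *next; /* stand-in for a kernel list_head */
};

/* lists[0][o]: regular groups, lists[1][o]: IOPS groups, by free order o. */
struct demo_mb {
    struct demo_group *lists[2][DEMO_ORDERS];
};

/* floor(log2(n)) for n >= 1 */
static unsigned demo_order(unsigned n)
{
    unsigned order = 0;

    assert(n > 0);
    while (n >>= 1)
        order++;
    return order;
}

static void demo_insert(struct demo_mb *mb, struct demo_group *g)
{
    if (g->free == 0)
        return; /* full groups are not listed */

    unsigned o = demo_order(g->free);

    assert(o < DEMO_ORDERS);
    g->next = mb->lists[g->iops ? 1 : 0][o];
    mb->lists[g->iops ? 1 : 0][o] = g;
}

/* Smallest bucket guaranteed to hold `len` blocks, searching upward. */
static struct demo_group *demo_pick(const struct demo_mb *mb, bool metadata,
                                    unsigned len)
{
    unsigned o = demo_order(len);

    if ((1u << o) < len)
        o++; /* round up to ceil(log2(len)) */
    for (; o < DEMO_ORDERS; o++)
        if (mb->lists[metadata ? 1 : 0][o])
            return mb->lists[metadata ? 1 : 0][o];
    return NULL;
}
```

Searching upward from the smallest sufficient bucket means metadata requests fill the emptiest IOPS groups last, which keeps the fast region densely used. Whether a metadata request should fall back to the regular array when the IOPS array is exhausted is a policy question this sketch leaves open.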

            People

              bobijam Zhenyu Xu
              adilger Andreas Dilger