[LU-15002] Allocate metagroup descriptors continuously if possible Created: 11/Sep/21  Updated: 24/Oct/23  Resolved: 16/Oct/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Major
Reporter: Artem Blagodarenko (Inactive) Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16750 optimize ldiskfs internal metadata al... Open
Epic/Theme: ldiskfs
Rank (Obsolete): 9223372036854775807

 Description   

If an ldiskfs target is formatted with the meta_bg option, the group descriptors are spread across the target. The pre-read optimization does not work for metadata laid out this way.

With the current cluster size (no bigalloc option), a partition larger than 256TB cannot be created without meta_bg, but there is a workaround.

Filesystems can either be created using this new block group descriptor layout, or existing filesystems can be resized online, and the field s_first_meta_bg in the superblock will indicate the first block group using this new layout.

The following steps create contiguous group descriptors for the first 256TB and use meta_bg for all remaining groups.

      1. Create < 256 TB partition without the meta_bg flag

      2. Extend the partition to the whole disk

These steps can be done manually or mkfs can be modified.



 Comments   
Comment by Andreas Dilger [ 10/May/23 ]

With the sparse_super2 option it is possible to avoid having multiple superblock and group descriptor backups, limiting the number of backups to 0, 1, or 2.

However, even with this option, it is not possible to avoid enabling the meta_bg feature for filesystems larger than 256TiB, but I think it would be possible to fix this.

The reason for this is that for filesystems above 256TiB, the primary group descriptor table in block group #0 (a 64-byte struct for each 128MiB group which stores the locations of the bitmaps and inode tables) itself consumes more than 128MiB (more than 2M block groups), and fills the whole first block group. At this point, any larger group descriptor would collide with the backup group descriptor, which is normally located in block group #1.
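The arithmetic behind the 256TiB limit can be checked with a short sketch. It assumes 4KiB blocks (so 32768 blocks, or 128MiB, per group) and 64-byte descriptors, as described above; the constant and function names are illustrative and are not taken from e2fsprogs.

```python
# Sketch of the group descriptor table (GDT) size arithmetic.
# Assumptions: 4 KiB blocks, 64-byte descriptors, no bigalloc.
DESC_SIZE = 64                               # bytes per group descriptor
BLOCK_SIZE = 4096                            # assumed block size in bytes
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE            # one bitmap block maps 32768 blocks
GROUP_BYTES = BLOCK_SIZE * BLOCKS_PER_GROUP  # 128 MiB per block group

TIB = 2**40
MIB = 2**20


def gdt_bytes(fs_bytes):
    """Size of the primary GDT for a filesystem of fs_bytes."""
    groups = -(-fs_bytes // GROUP_BYTES)     # ceiling division
    return groups * DESC_SIZE


# Number of descriptors that fit into a single 128 MiB block group,
# and the filesystem size that many groups can describe.
max_groups = GROUP_BYTES // DESC_SIZE
limit_bytes = max_groups * GROUP_BYTES

print(max_groups, limit_bytes // TIB, gdt_bytes(256 * TIB) // MIB)
```

Under these assumptions a 256TiB filesystem needs a GDT of exactly one full block group, which is why anything larger runs into the backup in group #1.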

I also tested with num_backup_sb=0 so that there would not be a backup superblock or backup group descriptor table, but this doesn't stop the enabling of meta_bg and its spread of the group descriptor blocks across the filesystem. It looks possible with relatively little effort to have mke2fs store the backup superblock(s) in later block groups (instead of group #1) that would otherwise have a backup under normal conditions (e.g. one of group #3, #5, #7, #9, #25, #27, #49, #81, #125, ...).

While it is possible to arbitrarily locate the backup superblock and group descriptors in any group (there is the s_backup_bgs[2] field for storing their location), this would not be helpful if the primary superblock itself is corrupted. Sticking with "traditional" block groups for the superblock backups (powers of 3, 5, 7) makes it much easier for e2fsck and other tools to locate the backup in case the primary superblock is corrupted.
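As an illustration of the "traditional" sparse_super layout, the backup groups can be enumerated as group #1 plus the powers of 3, 5, and 7. The helper below is a hypothetical sketch, not code from e2fsprogs.

```python
def traditional_backup_groups(group_count):
    """Return the sparse_super backup group numbers below group_count.

    Group #0 holds the primary copy, so it is not listed here.
    """
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        n = base
        while n < group_count:
            groups.add(n)
            n *= base
    return sorted(groups)


print(traditional_backup_groups(126))
```

A tool that cannot read the primary superblock only needs to probe this short, predictable list, which is the advantage over an arbitrary location recorded in s_backup_bgs[2].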

Once sparse_super2 moves the group descriptor backup out of group #1, then it should be possible for the group descriptor table to exceed 128MiB in size. Using group #5 for the backup would allow group descriptors for up to 1.25 PiB, while using group #9 would be good for up to 2.25 PiB, but this could be adjusted arbitrarily as the filesystem grows.
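The 1.25 PiB and 2.25 PiB figures follow from the same arithmetic: every 128MiB of descriptor space ahead of the backup group covers another 256TiB. A rough sketch, assuming 4KiB blocks and 64-byte descriptors and ignoring the few blocks taken by the superblock itself (names are illustrative):

```python
DESC_SIZE = 64                # bytes per group descriptor
GROUP_BYTES = 128 * 2**20     # 128 MiB per block group (4 KiB blocks assumed)
PIB = 2**50


def max_fs_bytes(backup_group):
    """Largest filesystem whose primary GDT ends before backup_group."""
    gdt_room = backup_group * GROUP_BYTES    # bytes available before the backup
    return (gdt_room // DESC_SIZE) * GROUP_BYTES


for group in (1, 5, 9):
    print(group, max_fs_bytes(group) / PIB)
```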

Comment by Andreas Dilger [ 13/Jun/23 ]

The second sparse_super2 backup is normally stored in the last full block group. For very large OSTs (> 256TiB) the backup would not fit into that group, so it should be located in the highest-numbered "traditional" sparse_super backup group (there should be several choices given the large device size).
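One way to picture this selection is sketched below. It assumes 4KiB blocks and 64-byte descriptors, and it assumes the rule "pick the largest power of 3, 5, or 7 that still leaves room for a full GDT copy before the end of the device". That rule is an interpretation of the comment above, not the logic of the actual mke2fs patch.

```python
DESC_SIZE = 64                # bytes per group descriptor
GROUP_BYTES = 128 * 2**20     # 128 MiB per block group (4 KiB blocks assumed)


def highest_traditional_backup_group(group_count):
    """Pick the highest power of 3, 5 or 7 where a GDT copy still fits."""
    # Number of whole block groups one copy of the GDT occupies.
    gdt_groups = -(-(group_count * DESC_SIZE) // GROUP_BYTES)
    best = None
    for base in (3, 5, 7):
        n = base
        while n + gdt_groups <= group_count:
            if best is None or n > best:
                best = n
            n *= base
    return best


# Roughly the group count of a 977TB device (hypothetical figure).
print(highest_traditional_backup_group(7_279_217))
```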

Comment by Gerrit Updater [ 13/Jun/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/51295
Subject: LU-15002 mke2fs: allow selecting sparse_super2 backup
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: b0ede71742d759e6ec90964c50fa1a19f5835d23

Comment by Andreas Dilger [ 28/Jun/23 ]

Two additional things need to be checked with the sparse_super2 option:

  • do we need to use the "-E packed_meta_blocks" option to get a good layout? This should be the default for sparse_super2, but I'm not positive.
  • is the journal located at the start of the device or somewhere else? I vaguely recall that the journal was moved to the middle of the device to minimize average seek time on HDDs.
Comment by Li Xi [ 13/Jul/23 ]

dongyang, hongchao.zhang: Would you please post the e2fsprogs patches for this feature, even if they are not complete?

Comment by Dongyang Li [ 13/Jul/23 ]

Li Xi, the patch is tracked under https://review.whamcloud.com/c/tools/e2fsprogs/+/51295

Comment by Andreas Dilger [ 16/Jul/23 ]

There still needs to be a second patch that updates e2fsck to automatically find the backup superblock and group descriptors from a sparse_super group number. See my previous notes on this. 

Comment by Dongyang Li [ 17/Jul/23 ]

Andreas, support for finding the backup was added in patch set 4 of 51295.

Comment by Gerrit Updater [ 20/Jul/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51723
Subject: LU-15002 utils: disable meta_bg and enable packed_meta_blocks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8f6c0f1af16390ff357d57c092aaacee043f1921

Comment by Andreas Dilger [ 17/Aug/23 ]

It looks like filesystem resize is totally disallowed with the sparse_super2 feature due to:

commit b1489186cc8391e0c1e342f9fbc3eedf6b944c61
Author:     Josh Triplett <josh@joshtriplett.org>
AuthorDate: Mon Jun 7 12:15:24 2021 -0700
Commit:     Theodore Ts'o <tytso@mit.edu>
CommitDate: Thu Jun 24 10:22:36 2021 -0400

    ext4: add check to prevent attempting to resize an fs with sparse_super2
    
    The in-kernel ext4 resize code doesn't support filesystem with the
    sparse_super2 feature. It fails with errors like this and doesn't finish
    the resize:
    EXT4-fs (loop0): resizing filesystem from 16640 to 7864320 blocks
    EXT4-fs warning (device loop0): verify_reserved_gdb:760: reserved GDT 2 missing grp 1 (32770)
    EXT4-fs warning (device loop0): ext4_resize_fs:2111: error (-22) occurred during file system resize
    EXT4-fs (loop0): resized filesystem to 2097152
    
    To reproduce:
    mkfs.ext4 -b 4096 -I 256 -J size=32 -E resize=$((256*1024*1024)) -O sparse_super2 ext4.img 65M
    truncate -s 30G ext4.img
    mount ext4.img /mnt
    python3 -c 'import fcntl, os, struct ; fd = os.open("/mnt", os.O_RDONLY | os.O_DIRECTORY) ; fcntl.ioctl(fd, 0x40086610, struct.pack("Q", 30 * 1024 * 1024 * 1024 // 4096), False) ; os.close(fd)'
    dmesg | tail
    e2fsck ext4.img
    
    The userspace resize2fs tool has a check for this case: it checks if the
    filesystem has sparse_super2 set and if the kernel provides
    /sys/fs/ext4/features/sparse_super2. However, the former check requires
    manually reading and parsing the filesystem superblock.
    
    Detect this case in ext4_resize_begin and error out early with a clear
    error message.
    
    Signed-off-by: Josh Triplett <josh@joshtriplett.org>
    Link: https://lore.kernel.org/r/74b8ae78405270211943cd7393e65586c5faeed1.1623093259.git.josh@joshtriplett.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

I don't think this will be a huge problem for Lustre, since it is very rare to resize OSTs, but it would be good to backport that patch to ldiskfs so that the filesystem is not accidentally corrupted once sparse_super2 is enabled by default. This patch is already included in el7.9 and later.

Comment by Andreas Dilger [ 17/Aug/23 ]

It looks like it is not actually necessary to totally disable online resize with sparse_super2. This appears to be a problem only when resize_inode is being used to reserve GDT blocks, because sparse_super2 stores the GDT backups in different places. The resize_inode feature is not enabled for filesystems > 16TiB, so this should not be a problem under normal usage.

It also looks possible to fix the online resize code to handle sparse_super2 better by allowing resize_inode to store the GDT blocks after the current primary and two backup GDT copies for future resizing. That would ensure the resize kept all of the GDT blocks contiguous on disk, without having to e.g. move the block and inode bitmaps.

One of the reasons that resize_inode is not used above 16TiB is because it is stored in block-mapped format, and that couldn't reserve blocks > 2^32, but with sparse_super2 the primary GDT and backup GDT #1 would always be near the start of the device with block numbers < 2^32, and backup #2 GDT could be similarly constrained. The other reason that resize_inode wasn't used for huge filesystems is because the number of backup GDT blocks grows exponentially large as more backup groups are added to the filesystem, but this is also not a problem for sparse_super2 since it only has one backup.
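The 16TiB figure is simply the reach of a 32-bit block number with 4KiB blocks. The sketch below (assumed block size and illustrative names) also shows that the first block of a low-numbered backup group such as #9 is far below 2^32:

```python
BLOCK_SIZE = 4096                   # assumed block size in bytes
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE   # 32768 blocks per group
MAX_BLOCK_MAPPED = 2**32            # block-mapped inodes use 32-bit block numbers
TIB = 2**40

# Largest byte offset reachable through a block-mapped inode.
block_mapped_limit = MAX_BLOCK_MAPPED * BLOCK_SIZE


def group_first_block(group):
    """First block number of a block group (first data block 0 assumed)."""
    return group * BLOCKS_PER_GROUP


print(block_mapped_limit // TIB, group_first_block(9) < MAX_BLOCK_MAPPED)
```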

Comment by Emoly Liu [ 31/Aug/23 ]

I ran some simple tests on the 18k-03 system to verify whether this patch improves mke2fs performance as expected. The results are as follows:

 

| mke2fs on OST0008 (977TB)                                 | time(s) |
| original (meta_bg,lazy_itable_init=1)                     | 528 |
| new (sparse_super2,packed_meta_blocks,lazy_itable_init=1) | 37 (almost no difference with or without "nodiscard") |
| new (sparse_super2,packed_meta_blocks,lazy_itable_init=0) | too slow, about 10MB/s, unfinished --> 3.8TB/s, 260(s), finished with the patch at https://review.whamcloud.com/c/tools/e2fsprogs/+/52215 |

 

 

Comment by Gerrit Updater [ 01/Sep/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52215
Subject: LU-15002 mke2fs: batch zeroing inode table
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 37ac8287d28796446442e9c79f7b6827dceb08e9

Comment by Gerrit Updater [ 01/Sep/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52219
Subject: LU-15002 e2fsck: check all sparse_super backups
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: fd0dd693eaf524fb50ed31288214e9a7a8a8648f

Comment by Gerrit Updater [ 04/Sep/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52219/
Subject: LU-15002 e2fsck: check all sparse_super backups
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: eb67ae2ec450eb7ea87d533bd24cc67340597fe6

Comment by Gerrit Updater [ 05/Sep/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52273
Subject: LU-15002 mke2fs: set free blocks accurately for groups has GDT
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: afd1ed663877910ae8c63392830a3958b95e0afd

Comment by Gerrit Updater [ 05/Sep/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52274
Subject: LU-15002 mke2fs: do not set the BLOCK_UNINIT on groups has GDT
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 0fcb849941969f9553e43872800bd388689fced9

Comment by Gerrit Updater [ 22/Sep/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52215/
Subject: LU-15002 mke2fs: batch zeroing inode table
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: d7e9c047d4090abedc88e707da21b600640c70c3

Comment by Gerrit Updater [ 25/Sep/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52273/
Subject: LU-15002 mke2fs: set free blocks accurately for groups has GDT
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 6e18f9e2c1ae7300aade5f3b2d8b65b9f5e64fc3

Comment by Gerrit Updater [ 25/Sep/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52274/
Subject: LU-15002 mke2fs: do not set the BLOCK_UNINIT on groups has GDT
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 03a6a831ce5d78e024c03a21f7de88b14519ef99

Comment by Gerrit Updater [ 25/Sep/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/51295/
Subject: LU-15002 mke2fs: try to pack the GDT blocks together
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 27d5daaad86a70a54e579131b55b637c7e952cf5

Comment by Gerrit Updater [ 16/Oct/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51723/
Subject: LU-15002 utils: disable meta_bg and enable packed_meta_blocks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7cce9f2d1c0911ee8501f08da6b6573735dee70e

Comment by Peter Jones [ 16/Oct/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:14:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.