[LU-15002] Allocate metagroup descriptors continuously if possible Created: 11/Sep/21 Updated: 24/Oct/23 Resolved: 16/Oct/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Artem Blagodarenko (Inactive) | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Epic/Theme: | ldiskfs | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
If an LDISKFS target is formatted with the meta_bg option, the group descriptors are spread across the target, and the pre-reading optimization doesn't work for such metadata. With the current cluster size (no bigalloc option), a partition >256TB cannot be created without meta_bg, but there is a workaround. Filesystems can either be created using this new block group descriptor layout, or existing filesystems can be resized online, and the field s_first_meta_bg in the superblock will indicate the first block group using this new layout. The following steps allow the creation of contiguous group descriptors for the first 256TB, with meta_bg used for all other groups:
1. Create a < 256 TB partition without the meta_bg flag
2. Extend the partition to the whole disk
These steps can be done manually, or mkfs can be modified. |
| Comments |
| Comment by Andreas Dilger [ 10/May/23 ] | ||||||||
|
With the sparse_super2 option it is possible to avoid having multiple superblock and group descriptor backups, limiting the number of backups to 0, 1, or 2. However, even with this option, it is not possible to avoid enabling the meta_bg feature for filesystems larger than 256TiB, but I think it would be possible to fix this.
The reason for this is that for filesystems above 256TiB, the primary group descriptor table in block group #0 (a 64-byte struct for each 128MiB group which stores the locations of the bitmaps and inode tables) itself consumes more than 128MiB (more than 2M block groups), and fills the whole first block group. At this point, any larger group descriptor would collide with the backup group descriptor, which is normally located in block group #1. I also tested with num_backup_sb=0 so that there would not be a backup superblock or backup group descriptor table, but this doesn't stop the enabling of meta_bg and its spread of the group descriptor blocks across the filesystem.
It looks possible with relatively little effort to have mke2fs store the backup superblock(s) in later block groups (instead of group #1) that would otherwise have a backup under normal conditions (e.g. one of group #3, #5, #7, #9, #25, #27, #49, #81, #125, ...). While it is possible to arbitrarily locate the backup superblock and group descriptors in any group (there is the s_backup_bgs[2] field for storing their location), this would not be helpful if the primary superblock itself is corrupted. Sticking with "traditional" block groups for the superblock backups (powers of 3, 5, 7) makes it much easier for e2fsck and other tools to locate the backup in case the primary superblock is corrupted.
Once sparse_super2 moves the group descriptor backup out of group #1, then it should be possible for the group descriptor table to exceed 128MiB in size. Using group #5 for the backup would allow group descriptors for up to 1.25 PiB, while using group #9 would be good for up to 2.25 PiB, but this could be adjusted arbitrarily as the filesystem grows. | ||||||||
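The size limits quoted in the comment above can be checked with a little arithmetic. The following is only a sketch, assuming 4 KiB blocks, 128 MiB block groups, and 64-byte group descriptors as stated in the comment; the function names are made up for illustration, and the result is an upper bound because it ignores the superblock and any other metadata stored ahead of the backup group.

```python
# Assumed geometry (from the comment above): 4 KiB blocks, 128 MiB groups,
# 64-byte group descriptors. Names are illustrative, not from e2fsprogs.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE            # one bitmap block tracks 32768 blocks
GROUP_BYTES = BLOCK_SIZE * BLOCKS_PER_GROUP  # 128 MiB
DESC_SIZE = 64

TIB = 1 << 40
PIB = 1 << 50


def descs_per_group():
    """Number of group descriptors that fit in one entire block group."""
    return GROUP_BYTES // DESC_SIZE


def max_fs_bytes(first_backup_group):
    """Upper bound on filesystem size when the primary group descriptor
    table may fill every group located before the first backup group."""
    return first_backup_group * descs_per_group() * GROUP_BYTES


if __name__ == "__main__":
    for group in (1, 5, 9):
        print(f"backup in group #{group}: up to {max_fs_bytes(group) / PIB} PiB")
```

With the backup in group #1 the bound is 256 TiB, which matches the point at which meta_bg is currently forced on.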
| Comment by Andreas Dilger [ 13/Jun/23 ] | ||||||||
|
The second sparse_super2 backup is normally stored in the last full block group. For very large OSTs (> 256TiB) the backup would not fit into that group, so it should instead be located in the highest-numbered "traditional" sparse_super backup group (there should be several choices given the large device size). | ||||||||
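One way to picture the selection described above is sketched below. It is not the mke2fs implementation: the helper names are hypothetical, and the rule that the backup must fit in the `groups_needed` groups before the end of the filesystem is an assumption made for illustration.

```python
def sparse_super_groups(group_count):
    """Traditional sparse_super backup groups below group_count:
    group 1 plus the powers of 3, 5, and 7."""
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        n = base
        while n < group_count:
            groups.add(n)
            n *= base
    return sorted(groups)


def highest_backup_group(group_count, groups_needed=1):
    """Highest traditional backup group that still leaves groups_needed
    groups of room before the end of the filesystem (assumed rule)."""
    fitting = [g for g in sparse_super_groups(group_count)
               if g + groups_needed <= group_count]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    # A 512 TiB device with 128 MiB groups has 4M block groups.
    print(highest_backup_group(4 * 1024 * 1024, groups_needed=2))
```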
| Comment by Gerrit Updater [ 13/Jun/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/51295 | ||||||||
| Comment by Andreas Dilger [ 28/Jun/23 ] | ||||||||
|
Two additional things need to be checked with the sparse_super2 option:
| ||||||||
| Comment by Li Xi [ 13/Jul/23 ] | ||||||||
|
dongyang, hongchao.zhang: would you please post the e2fsprogs patches for this feature, even if they are not complete? | ||||||||
| Comment by Dongyang Li [ 13/Jul/23 ] | ||||||||
|
Li Xi, the patch is tracked under https://review.whamcloud.com/c/tools/e2fsprogs/+/51295 | ||||||||
| Comment by Andreas Dilger [ 16/Jul/23 ] | ||||||||
|
There still needs to be a second patch that updates e2fsck to automatically find the backup superblock and group descriptors from a sparse_super group number. See my previous notes on this. | ||||||||
| Comment by Dongyang Li [ 17/Jul/23 ] | ||||||||
|
Andreas, finding the backup is added in patchset 4 of 51295. | ||||||||
| Comment by Gerrit Updater [ 20/Jul/23 ] | ||||||||
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51723 | ||||||||
| Comment by Andreas Dilger [ 17/Aug/23 ] | ||||||||
|
It looks like filesystem resize is totally disallowed with the sparse_super2 feature due to:
commit b1489186cc8391e0c1e342f9fbc3eedf6b944c61
Author: Josh Triplett <josh@joshtriplett.org>
AuthorDate: Mon Jun 7 12:15:24 2021 -0700
Commit: Theodore Ts'o <tytso@mit.edu>
CommitDate: Thu Jun 24 10:22:36 2021 -0400
ext4: add check to prevent attempting to resize an fs with sparse_super2
The in-kernel ext4 resize code doesn't support filesystem with the
sparse_super2 feature. It fails with errors like this and doesn't finish
the resize:
EXT4-fs (loop0): resizing filesystem from 16640 to 7864320 blocks
EXT4-fs warning (device loop0): verify_reserved_gdb:760: reserved GDT 2 missing grp 1 (32770)
EXT4-fs warning (device loop0): ext4_resize_fs:2111: error (-22) occurred during file system resize
EXT4-fs (loop0): resized filesystem to 2097152
To reproduce:
mkfs.ext4 -b 4096 -I 256 -J size=32 -E resize=$((256*1024*1024)) -O sparse_super2 ext4.img 65M
truncate -s 30G ext4.img
mount ext4.img /mnt
python3 -c 'import fcntl, os, struct ; fd = os.open("/mnt", os.O_RDONLY | os.O_DIRECTORY) ; fcntl.ioctl(fd, 0x40086610, struct.pack("Q", 30 * 1024 * 1024 * 1024 // 4096), False) ; os.close(fd)'
dmesg | tail
e2fsck ext4.img
The userspace resize2fs tool has a check for this case: it checks if the
filesystem has sparse_super2 set and if the kernel provides
/sys/fs/ext4/features/sparse_super2. However, the former check requires
manually reading and parsing the filesystem superblock.
Detect this case in ext4_resize_begin and error out early with a clear
error message.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/74b8ae78405270211943cd7393e65586c5faeed1.1623093259.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
I don't think this will be a huge problem for Lustre, since it is very rare to resize OSTs, but it would be good to backport that patch to ldiskfs so that the filesystem is not accidentally corrupted once sparse_super2 is enabled by default. | ||||||||
| Comment by Andreas Dilger [ 17/Aug/23 ] | ||||||||
|
It looks like it is not actually necessary to totally disable online resize with sparse_super2. This appears to be a problem only in the case where resize_inode is being used to reserve GDT blocks, because sparse_super2 will store GDT backups in various different places. The resize_inode will not be enabled for filesystems > 16TiB, so this should not actually be a problem under normal usage.
It also looks possible to fix the online resize code to handle sparse_super2 better by allowing resize_inode to store the GDT blocks after the current primary and two backup GDT copies for future resizing. That would ensure the resize kept all of the GDT blocks contiguous on disk, without having to e.g. move the block and inode bitmaps.
One of the reasons that resize_inode is not used above 16TiB is that it is stored in block-mapped format, which couldn't reserve blocks > 2^32, but with sparse_super2 the primary GDT and backup GDT #1 would always be near the start of the device with block numbers < 2^32, and backup #2 GDT could be similarly constrained. The other reason that resize_inode wasn't used for huge filesystems is that the number of backup GDT blocks grows exponentially large as more backup groups are added to the filesystem, but this is also not a problem for sparse_super2 since it only has one backup. | ||||||||
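The 16TiB figure and the "block numbers < 2^32" remark above follow from simple arithmetic, sketched here under the assumption of 4 KiB blocks, 128 MiB groups, and 64-byte descriptors. The helper names are illustrative only and do not correspond to kernel or e2fsprogs functions.

```python
BLOCK_SIZE = 4096                 # assumed block size
GROUP_BYTES = 128 * 1024 * 1024   # assumed block group size
DESC_SIZE = 64                    # assumed group descriptor size
MAX_MAPPED_BLOCK = 1 << 32        # block-mapped inodes hold 32-bit block numbers


def block_mapped_limit_bytes():
    """Largest byte offset reachable through a block-mapped inode."""
    return MAX_MAPPED_BLOCK * BLOCK_SIZE


def gdt_blocks(fs_bytes):
    """Blocks needed for one contiguous group descriptor table."""
    groups = -(-fs_bytes // GROUP_BYTES)            # ceiling division
    return -(-groups * DESC_SIZE // BLOCK_SIZE)


if __name__ == "__main__":
    print(block_mapped_limit_bytes() // (1 << 40), "TiB addressable")
    print(gdt_blocks(1 << 50), "GDT blocks for a 1 PiB filesystem")
```

Even for a 1 PiB filesystem, a GDT placed at the start of the device ends far below block 2^32, which is why a block-mapped resize_inode could still reference it.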
| Comment by Emoly Liu [ 31/Aug/23 ] | ||||||||
|
I did some simple tests on the 18k-03 system to verify whether this patch improves mke2fs performance as expected. The results are shown as follows:
| ||||||||
| Comment by Gerrit Updater [ 01/Sep/23 ] | ||||||||
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52215 | ||||||||
| Comment by Gerrit Updater [ 01/Sep/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52219 | ||||||||
| Comment by Gerrit Updater [ 04/Sep/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52219/ | ||||||||
| Comment by Gerrit Updater [ 05/Sep/23 ] | ||||||||
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52273 | ||||||||
| Comment by Gerrit Updater [ 05/Sep/23 ] | ||||||||
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/52274 | ||||||||
| Comment by Gerrit Updater [ 22/Sep/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52215/ | ||||||||
| Comment by Gerrit Updater [ 25/Sep/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52273/ | ||||||||
| Comment by Gerrit Updater [ 25/Sep/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/52274/ | ||||||||
| Comment by Gerrit Updater [ 25/Sep/23 ] | ||||||||
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/51295/ | ||||||||
| Comment by Gerrit Updater [ 16/Oct/23 ] | ||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51723/ | ||||||||
| Comment by Peter Jones [ 16/Oct/23 ] | ||||||||
|
Landed for 2.16 |