[LU-16305] mkfs.lustre fails on devices between 16TiB-32GiB and 16TiB-1B - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels: None

Epic/Theme: ldiskfs
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Running mkfs.lustre on a device that is smaller than 16TiB but larger than the "resize" value computed by mkfs.lustre (for 4KiB blocks, 32GiB shy of 16TiB) fails, because mke2fs requires the resize maximum to be greater than the filesystem size.

Example:
mkfs.lustre --ost --reformat --servicenode <elided> --fsname=lustrefs --index 1 --mgsnode <elided> --backfstype=ldiskfs /dev/ost1
mkfs.lustre FATAL: Unable to build fs /dev/ost1 (256)
mkfs.lustre FATAL: mkfs failed 256
Permanent disk data:
Target: lustrefs:OST0001
Index: 1
Lustre FS: lustrefs
Mount type: ldiskfs
Flags: 0x1062
(OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters: failover.node=<elided> mgsnode=<elided>
device size = 16777152MB
formatting backing filesystem ldiskfs on /dev/ost1
target name lustrefs:OST0001
kilobytes 17179803648
options -J size=1024 -I 512 -i 524288 -q -O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 256 -E resize=\"4290772992\",lazy_journal_init=\"0\",lazy_itable_init=\"0\" -F
mkfs_cmd = mke2fs -j -b 4096 -L lustrefs:OST0001 -J size=1024 -I 512 -i 524288 -q -O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 256 -E resize=\"4290772992\",lazy_journal_init=\"0\",lazy_itable_init=\"0\" -F /dev/ost1 17179803648k
detected raid stride 4194304 too large, use optimum 512
detected raid stripe width 67108864 too large, use optimum 512
The resize maximum must be greater than the filesystem size.

Bad option(s) specified: Extended options are separated by commas, and may take an argument which
is set off by an equals ('=') sign. Valid extended options are:
mmp_update_interval=<interval>
num_backup_sb=<0|1|2>
stride=<RAID per-disk data chunk in blocks>
stripe-width=<RAID stride * data disks in blocks>
offset=<offset to create the file system>
resize=<resize maximum size in blocks>
packed_meta_blocks=<0 to disable, 1 to enable>
lazy_itable_init=<0 to disable, 1 to enable>
lazy_journal_init=<0 to disable, 1 to enable>
root_owner=<uid of root dir>:<gid of root dir>
test_fs
discard
nodiscard
encoding=<encoding>
encoding_flags=<flags>
quotatype=<quota type(s) to be enabled>
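
The failure can be checked by hand from the numbers in the log above. The sketch below is only an illustration of that arithmetic, not code from mkfs.lustre or mke2fs; the helper names are made up for this example. With the logged values, the device has 4294950912 blocks of 4KiB, which is more than the resize maximum of 4290772992 blocks, so mke2fs rejects the request. Note that by this arithmetic the logged resize value sits 4194304 blocks (16GiB at 4KiB per block) below 2^32 blocks, so the exact width of the affected window should be taken from the log rather than from the rounded figure in the description.

```c
#include <stdint.h>

/* Values copied from the log above. */
#define DEVICE_KB     17179803648ULL /* "kilobytes" line */
#define BLOCKSIZE_KB  4ULL           /* mke2fs -b 4096 */
#define RESIZE_BLOCKS 4290772992ULL  /* -E resize=... */

/* Hypothetical helper: number of filesystem blocks requested. */
static inline uint64_t device_blocks(uint64_t device_kb, uint64_t blocksize_kb)
{
	return device_kb / blocksize_kb;
}

/* Hypothetical helper mirroring the error message: the resize
 * maximum must be strictly greater than the filesystem size. */
static inline int resize_is_acceptable(uint64_t resize_blocks, uint64_t fs_blocks)
{
	return resize_blocks > fs_blocks;
}

/* Hypothetical helper: distance of the resize maximum from 2^32 blocks. */
static inline uint64_t gap_below_2_32(uint64_t resize_blocks)
{
	return (1ULL << 32) - resize_blocks;
}
```

Calling `resize_is_acceptable(RESIZE_BLOCKS, device_blocks(DEVICE_KB, BLOCKSIZE_KB))` returns 0 for this device, which matches the "resize maximum must be greater than the filesystem size" error.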

Issue Links

duplicates

LU-17036 mkfs.lustre fails with "The resize maximum must be greater than the filesystem size"

Resolved

Activity

Ellis Wilson added a comment - 16/Apr/24 3:34 PM

Roger. Will keep that in mind going forward.

Andreas Dilger added a comment - 14/Apr/24 10:05 AM

We typically use "Resolved" instead of "Closed" so that it is still possible to do things like add labels and make other changes to the ticket.

Ellis Wilson added a comment - 11/Apr/24 2:12 PM

LU-17036 identified and fixed the same problem as this one. Closing this out.

Peter Jones added a comment - 13/Nov/22 8:26 PM

elliswilson, I have added you to the developers group for the community project, so you should now be able to do things like assign tickets to yourself, etc.

Ellis Wilson added a comment - 10/Nov/22 4:47 PM

Thanks for the clarification Andreas. I've revised my in-house fix, and will run it through the steps on your submitting changes wiki shortly.

Andreas Dilger added a comment - 10/Nov/22 6:50 AM

The resize_inode feature only works up to 16TB, so it is basically useless for the problematic filesystem and may as well be disabled for such filesystems. There is a different feature (meta_bg) that is used for resizing filesystems beyond 16TB. The 1024x resize is based on a starting filesystem size that is much smaller.

Yes, the ext4 metadata is not aligned to 1MB boundaries by default, and this option (along with some others added in the same patch) ensures that the other metadata is located with proper 1MB alignment for HDD RAID alignment. That is not so important for flash MDTs at this point either.

So my approach to fixing this issue would be to disable the resize_inode feature (if this isn't done automatically already) and not specify the "-E resize=nnnn" option for filesystems that are close to 16TB in size.
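
The approach described above could be sketched roughly as follows. This is an illustration only and not the change that was eventually merged for LU-17036: the struct, the function name, and the choice of "filesystem size at or above the computed resize value" as the meaning of "close to 16TB" are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which mke2fs options to emit. */
struct resize_plan {
	bool pass_resize_opt;      /* emit "-E resize=<resize_blocks>" */
	bool disable_resize_inode; /* emit "-O ^resize_inode" */
};

/* Hypothetical helper. Assumption: a filesystem counts as "close to
 * 16TB" when its block count is not strictly below the resize value
 * that would otherwise be passed to mke2fs. */
static inline struct resize_plan plan_resize(uint64_t fs_blocks,
					     uint64_t resize_blocks)
{
	struct resize_plan plan = {
		.pass_resize_opt = true,
		.disable_resize_inode = false,
	};

	if (fs_blocks >= resize_blocks) {
		/* resize_inode cannot help here, so drop both. */
		plan.pass_resize_opt = false;
		plan.disable_resize_inode = true;
	}
	return plan;
}
```

For the device in this ticket (4294950912 blocks against a resize value of 4290772992), the sketch drops the resize option and disables resize_inode; for a much smaller device it leaves both at their defaults.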

Ellis Wilson added a comment - 10/Nov/22 1:47 AM

I believe this only applies to OSTs, and while I can disable it I'd like to better understand what the optimization is attempting to accomplish first. I think you put this block in around 2011 (could totally be wrong – it's moved around a few times). Do you remember what it was accomplishing? I'm really struggling to understand this comment block:
/* In order to align the filesystem metadata on 1MB boundaries,
 * give a resize value that will reserve a power-of-two group
 * descriptor blocks, but leave one block for the superblock.
 * Only useful for filesystems with < 2^32 blocks due to resize
 * limitations.

Is ext metadata really unaligned without specifying resize? Some docs suggest that without giving this, mke2fs plans for up to 1024 times the original size of the filesystem, so I don't feel like this is a case where we're trying to plan ahead more than mke2fs already does.
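
One reading of the quoted comment that reproduces the resize value seen in the log (4290772992 for 4KiB blocks) is sketched below. It is a guess at the arithmetic, not a copy of the Lustre source: it assumes 32-byte group descriptors and block groups of 8 x blocksize blocks (one bitmap block per group), and the helper names are invented for this example.

```c
#include <stdint.h>

/* Assumption: classic 32-byte ext4 group descriptors. */
#define GROUP_DESC_SIZE 32ULL

/* One bitmap block tracks 8 * blocksize blocks, i.e. one block group. */
static inline uint64_t blocks_per_group(uint64_t blocksize)
{
	return blocksize * 8;
}

/* Group descriptors that fit in one filesystem block. */
static inline uint64_t descs_per_block(uint64_t blocksize)
{
	return blocksize / GROUP_DESC_SIZE;
}

/* Hypothetical helper: 2^32 blocks minus the blocks covered by one
 * full block of group descriptors. */
static inline uint64_t guessed_resize_blocks(uint64_t blocksize)
{
	return (1ULL << 32) -
	       descs_per_block(blocksize) * blocks_per_group(blocksize);
}
```

With a 4096-byte block size this gives 128 descriptors per block and 32768 blocks per group, so the result is 2^32 - 4194304 = 4290772992, matching the `-E resize=` value in the log.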

Andreas Dilger added a comment - 10/Nov/22 12:22 AM

Rather than shrink the MDT device, it would be better to just disable the resize_inode feature for such filesystems, since it is not useful for filesystems over 16TiB anyway.

Ellis Wilson added a comment - 09/Nov/22 5:34 PM - edited

No problem! I fixed it with the following patch (going through the mechanics presently to test/submit the patch):

--- a/lustre/utils/libmount_utils_ldiskfs.c
+++ b/lustre/utils/libmount_utils_ldiskfs.c
@@ -885,6 +885,15 @@ int ldiskfs_make_lustre(struct mkfs_opts *mop)
                append_unique(start, ext_opts ? "," : " -E ",
                              "resize", buf, maxbuflen);
                ext_opts = 1;
+
+               /* The resize maximum must be greater than filesystem size, but for disks
+                * or arrays just shy of 16TiB you can get into a situation where capacity
+                * is between resize_blks and 16TiB.    Shrink the drive size to 1MiB less
+                * than resize in these scenarios (at most ~0.1% capacity is lost). 
+                */
+               if (resize_blks <= mop->mo_device_kb / mop->mo_blocksize_kb) {
+                       mop->mo_device_kb = (long long)(resize_blks) * (long long)mop->mo_blocksize_kb - 1024;
+               }
        }
 
        /* Avoid zeroing out the full journal - speeds up mkfs */
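
To see what the patch above does to the device from this ticket, the same arithmetic can be run standalone. The sketch below only restates the condition and assignment from the diff with plain integers; the function names are hypothetical and nothing here touches `struct mkfs_opts`.

```c
#include <stdint.h>

/* Same arithmetic as the patch: if the device has at least
 * resize_blks blocks, shrink it to 1 MiB (1024 KB) below the
 * resize maximum; otherwise leave the size unchanged. */
static inline uint64_t shrunk_device_kb(uint64_t device_kb,
					uint64_t blocksize_kb,
					uint64_t resize_blks)
{
	if (resize_blks <= device_kb / blocksize_kb)
		return resize_blks * blocksize_kb - 1024;
	return device_kb;
}

/* Capacity given up by the shrink, in KB. */
static inline uint64_t lost_kb(uint64_t device_kb, uint64_t blocksize_kb,
			       uint64_t resize_blks)
{
	return device_kb - shrunk_device_kb(device_kb, blocksize_kb,
					    resize_blks);
}
```

For the logged device (17179803648 KB, 4 KB blocks, resize of 4290772992 blocks) the size becomes 17163090944 KB, giving up 16712704 KB, which is a little under 0.1% of the device and consistent with the estimate in the patch comment.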

As posted on lustre-discuss, I have some questions about the intent behind resize, and IDK how to get this assigned to me (maybe that's only for WC people).

Andreas Dilger added a comment - 09/Nov/22 5:11 PM - edited

Ellis, thanks for filing the ticket. Until this is fixed in the code, it should be possible to work around the issue by adding ",^resize_inode" to the "-O" feature list, and removing "resize=4290772992," from the "-E" extended options list on the mkfs.lustre command line.

People

Assignee: Ellis Wilson

Reporter: Ellis Wilson

Votes: 0

Watchers: 4

Dates

Created: 09/Nov/22 4:47 PM

Updated: 16/Apr/24 3:34 PM

Resolved: 14/Apr/24 10:05 AM