[LU-16305] mkfs.lustre fails on devices between 16TiB-32GiB and 16TiB-1B Created: 09/Nov/22  Updated: 13/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ellis Wilson Assignee: Ellis Wilson
Resolution: Unresolved Votes: 0
Labels: None

Epic/Theme: ldiskfs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Attempting to mkfs.lustre a disk that is below 16TiB in size but above the value mkfs.lustre computes for "resize" (for 4KiB blocks, 32GiB shy of 16TiB) results in a failure, because mke2fs requires the resize maximum to be greater than the specified capacity.

Example:
mkfs.lustre --ost --reformat --servicenode <elided> --fsname=lustrefs --index 1 --mgsnode <elided> --backfstype=ldiskfs /dev/ost1
mkfs.lustre FATAL: Unable to build fs /dev/ost1 (256)
mkfs.lustre FATAL: mkfs failed 256

   Permanent disk data:
Target:     lustrefs:OST0001
Index:      1  
Lustre FS:  lustrefs
Mount type: ldiskfs
Flags:      0x1062
              (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters: failover.node=<elided> mgsnode=<elided>

device size = 16777152MB
formatting backing filesystem ldiskfs on /dev/ost1
  target name   lustrefs:OST0001
  kilobytes     17179803648
  options         -J size=1024 -I 512 -i 524288 -q -O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 256 -E resize=\"4290772992\",lazy_journal_init=\"0\",lazy_itable_init=\"0\" -F
mkfs_cmd = mke2fs -j -b 4096 -L lustrefs:OST0001   -J size=1024 -I 512 -i 524288 -q -O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 256 -E resize=\"4290772992\",lazy_journal_init=\"0\",lazy_itable_init=\"0\" -F /dev/ost1 17179803648k
   detected raid stride 4194304 too large, use optimum 512
   detected raid stripe width 67108864 too large, use optimum 512
   The resize maximum must be greater than the filesystem size.
   
   
   Bad option(s) specified:
   Extended options are separated by commas, and may take an argument which
    is set off by an equals ('=') sign.
   Valid extended options are:
    mmp_update_interval=<interval>
    num_backup_sb=<0|1|2>
    stride=<RAID per-disk data chunk in blocks>
    stripe-width=<RAID stride * data disks in blocks> 
    offset=<offset to create the file system>
    resize=<resize maximum size in blocks>
    packed_meta_blocks=<0 to disable, 1 to enable>
    lazy_itable_init=<0 to disable, 1 to enable>
    lazy_journal_init=<0 to disable, 1 to enable>
    root_owner=<uid of root dir>:<gid of root dir>
    test_fs
    discard
    nodiscard
    encoding=<encoding>
    encoding_flags=<flags>
    quotatype=<quota type(s) to be enabled>



 Comments   
Comment by Andreas Dilger [ 09/Nov/22 ]

Ellis, thanks for filing the ticket. Until this is fixed in the code, it should be possible to work around the issue by adding ",^resize_inode" to the "-O" feature list, and removing "resize=4290772992," from the "-E" extended options list on the mkfs.lustre command line.
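A hedged sketch of what the adjusted mke2fs option lists would look like after this workaround (shell; whether passing `--mkfsoptions='-O ^resize_inode'` to mkfs.lustre is enough to make it also omit `-E resize=` depends on the version, so treat this as illustrative only):

```shell
# Feature list from the failing run, with resize_inode explicitly disabled
# (an assumption: your mkfs.lustre/mke2fs versions may behave differently):
FEATURES="extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg"
FEATURES="${FEATURES},^resize_inode"

# With resize_inode off, the -E list keeps only the lazy-init options:
EXT_OPTS="lazy_journal_init=0,lazy_itable_init=0"

echo "-O ${FEATURES} -E ${EXT_OPTS}"
```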

Comment by Ellis Wilson [ 09/Nov/22 ]

No problem!  I fixed it with the following patch (going through the mechanics presently to test/submit the patch):

--- a/lustre/utils/libmount_utils_ldiskfs.c
+++ b/lustre/utils/libmount_utils_ldiskfs.c
@@ -885,6 +885,15 @@ int ldiskfs_make_lustre(struct mkfs_opts *mop)
                append_unique(start, ext_opts ? "," : " -E ",
                              "resize", buf, maxbuflen);
                ext_opts = 1;
+
+               /* The resize maximum must be greater than the filesystem size, but disks
+                * or arrays just shy of 16TiB can land in a situation where the capacity
+                * is between resize_blks and 16TiB. Shrink the device size to 1MiB less
+                * than resize in these scenarios (at most ~0.1% of capacity is lost).
+                */
+               if (resize_blks <= mop->mo_device_kb / mop->mo_blocksize_kb) {
+                       mop->mo_device_kb = (long long)(resize_blks) * (long long)mop->mo_blocksize_kb - 1024;
+               }
        }
 
        /* Avoid zeroing out the full journal - speeds up mkfs */
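
The clamp in the patch can be sketched as follows (Python, with the struct members replaced by plain variables carrying the values from the failing run):

```python
# Stand-in values mirroring the failing run above (hypothetical variables
# replacing the mkfs_opts struct members):
mo_device_kb = 17179803648     # device size in KiB
mo_blocksize_kb = 4            # 4 KiB blocks
resize_blks = 4290772992       # computed -E resize= value

# The patch shrinks the formatted size to 1 MiB below the resize maximum
# whenever the device would otherwise meet or exceed it:
if resize_blks <= mo_device_kb // mo_blocksize_kb:
    mo_device_kb = resize_blks * mo_blocksize_kb - 1024

print(mo_device_kb)                               # 17163090944
print(mo_device_kb // mo_blocksize_kb < resize_blks)  # True
```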


As posted on lustre-discuss, I have some questions about the intent behind resize, and I don't know how to get this ticket assigned to me (maybe that's only for WC people).

Comment by Andreas Dilger [ 10/Nov/22 ]

Rather than shrink the MDT device, it would be better to just disable the resize_inode feature for such filesystems, since it is not useful for filesystems over 16TiB anyway.

Comment by Ellis Wilson [ 10/Nov/22 ]

I believe this only applies to OSTs, and while I can disable it, I'd like to better understand what the optimization is attempting to accomplish first.  I think you put this block in around 2011 (could totally be wrong – it's moved around a few times).  Do you remember what it was accomplishing?  I'm really struggling to understand this comment block:
   /* In order to align the filesystem metadata on 1MB boundaries,
    * give a resize value that will reserve a power-of-two group
    * descriptor blocks, but leave one block for the superblock.
    * Only useful for filesystems with < 2^32 blocks due to resize
    * limitations. */

Is ext metadata really unaligned without specifying resize?  Some docs suggest that without this option, mke2fs already plans for growth of up to 1024 times the original filesystem size, so I don't think this is a case where we're trying to plan ahead more than mke2fs already does.

Comment by Andreas Dilger [ 10/Nov/22 ]

The resize_inode feature only works up to 16TB, so it is basically useless for the problematic filesystem and may as well be disabled for such filesystems. There is a different feature (meta_bg) that is used for resizing filesystems beyond 16TB. The 1024x resize is based on a starting filesystem size that is much smaller.
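The 16TB ceiling falls out of the block-number width: with 4 KiB blocks, the resize_inode scheme can only address 2^32 blocks (a quick check, not from the Lustre code):

```python
# resize_inode uses 32-bit block numbers, so with 4 KiB blocks the
# resizable range tops out at exactly 16 TiB:
max_blocks = 2**32
block_size = 4096
print(max_blocks * block_size == 16 * 2**40)  # True
```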

Yes, ext4 metadata is not aligned to 1MB boundaries by default, and this option (along with some others added in the same patch) ensures that the other metadata is located with proper 1MB alignment for HDD RAID layouts. That is not so important for flash MDTs at this point either.

So my approach to fixing this issue would be to disable the resize_inode feature (if this isn't done automatically already) and not specify the "-E resize=nnnn" option for filesystems that are close to 16TB in size.

Comment by Ellis Wilson [ 10/Nov/22 ]

Thanks for the clarification Andreas.  I've revised my in-house fix, and will run it through the steps on your submitting changes wiki shortly.

Comment by Peter Jones [ 13/Nov/22 ]

elliswilson, I have added you to the developers group for the community project, so you should now be able to do things like assign tickets to yourself.

Generated at Sat Feb 10 03:25:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.