
mkfs.lustre fails on devices between 16TiB-32GiB and 16TiB-1B

Details

    • Bug
    • Resolution: Duplicate
    • Minor

    Description

      Running mkfs.lustre on a device that is smaller than 16TiB but at least as large as the "resize" value that mkfs.lustre computes (4290772992 blocks for 4KiB blocks, which is 16GiB shy of 16TiB) fails, because mke2fs requires the resize maximum to be greater than the filesystem size.

      Example:
      mkfs.lustre --ost --reformat --servicenode <elided> --fsname=lustrefs --index 1 --mgsnode <elided> --backfstype=ldiskfs /dev/ost1
      mkfs.lustre FATAL: Unable to build fs /dev/ost1 (256)
      mkfs.lustre FATAL: mkfs failed 256

         Permanent disk data:
      Target:     lustrefs:OST0001
      Index:      1  
      Lustre FS:  lustrefs
      Mount type: ldiskfs
      Flags:      0x1062
                    (OST first_time update no_primnode )
      Persistent mount opts: ,errors=remount-ro
      Parameters: failover.node=<elided> mgsnode=<elided>

      device size = 16777152MB
      formatting backing filesystem ldiskfs on /dev/ost1
        target name   lustrefs:OST0001
        kilobytes     17179803648
        options         -J size=1024 -I 512 -i 524288 -q -O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 256 -E resize=\"4290772992\",lazy_journal_init=\"0\",lazy_itable_init=\"0\" -F
      mkfs_cmd = mke2fs -j -b 4096 -L lustrefs:OST0001   -J size=1024 -I 512 -i 524288 -q -O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 256 -E resize=\"4290772992\",lazy_journal_init=\"0\",lazy_itable_init=\"0\" -F /dev/ost1 17179803648k
         detected raid stride 4194304 too large, use optimum 512
         detected raid stripe width 67108864 too large, use optimum 512
         The resize maximum must be greater than the filesystem size.
         
         
         Bad option(s) specified:

         Extended options are separated by commas, and may take an argument which
          is set off by an equals ('=') sign.

         Valid extended options are:
          mmp_update_interval=<interval>
          num_backup_sb=<0|1|2>
          stride=<RAID per-disk data chunk in blocks>
          stripe-width=<RAID stride * data disks in blocks> 
          offset=<offset to create the file system>
          resize=<resize maximum size in blocks>
          packed_meta_blocks=<0 to disable, 1 to enable>
          lazy_itable_init=<0 to disable, 1 to enable>
          lazy_journal_init=<0 to disable, 1 to enable>
          root_owner=<uid of root dir>:<gid of root dir>
          test_fs
          discard
          nodiscard
          encoding=<encoding>
          encoding_flags=<flags>
          quotatype=<quota type(s) to be enabled>
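
The numbers in the output above can be checked by hand. A minimal sketch in C, using only values printed in the log (4 KiB blocks from `-b 4096`, the device size in KiB, and the `resize` value). The helper names are illustrative and are not part of mkfs.lustre or mke2fs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the mkfs.lustre output above. */
#define LOGGED_DEVICE_KB   17179803648ULL /* "kilobytes" line */
#define LOGGED_RESIZE_BLKS 4290772992ULL  /* -E resize=... */
#define LOGGED_BLOCK_KB    4ULL           /* mke2fs -b 4096 */

/* Hypothetical helper: filesystem size in blocks for a device size in KiB. */
uint64_t fs_blocks(uint64_t device_kb, uint64_t block_kb)
{
	assert(block_kb > 0);
	return device_kb / block_kb;
}

/* mke2fs requires the resize maximum to be greater than the fs size. */
bool resize_is_valid(uint64_t resize_blks, uint64_t fs_blks)
{
	return resize_blks > fs_blks;
}

/* Hypothetical helper: distance below 16 TiB, in MiB, for a block count. */
uint64_t mib_below_16tib(uint64_t blks, uint64_t block_kb)
{
	/* 16 TiB is 16 << 30 KiB; divide by the block size to get blocks. */
	uint64_t limit_blks = (16ULL << 30) / block_kb;

	assert(blks <= limit_blks);
	return (limit_blks - blks) * block_kb / 1024;
}
```

With the logged values the filesystem would be 4294950912 blocks (64 MiB short of 16 TiB), while the resize maximum is 4290772992 blocks (16384 MiB short of 16 TiB), so the mke2fs requirement is not met.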

          Activity

            [LU-16305] mkfs.lustre fails on devices between 16TiB-32GiB and 16TiB-1B
            elliswilson Ellis Wilson added a comment -

            Roger.  Will keep that in mind going forward.


            adilger Andreas Dilger added a comment -

            We typically use "Resolved" instead of "Closed" so that it is still possible to do things like add labels and make other changes to the ticket.
            elliswilson Ellis Wilson added a comment -

            LU-17036 identified and fixed the same problem as this one.  Closing this out.

            pjones Peter Jones added a comment -

            elliswilson, I have added you to the developers group for the community project, so you should now be able to do things like assign tickets to yourself.

            elliswilson Ellis Wilson added a comment -

            Thanks for the clarification Andreas.  I've revised my in-house fix, and will run it through the steps on your submitting changes wiki shortly.


            adilger Andreas Dilger added a comment -

            The resize_inode feature only works up to 16TB, so it is basically useless for the problematic filesystem and may as well be disabled for such filesystems. There is a different feature (meta_bg) that is used for resizing filesystems beyond 16TB. The 1024x resize is based on a starting filesystem size that is much smaller.

            Yes, the ext4 metadata is not aligned to 1MB boundaries by default, and this option (along with some others added in the same patch) ensures that other metadata was located with proper 1MB alignment for HDD RAID alignment. That is not so important for flash MDTs at this point either.

            So my approach to fixing this issue would be to disable the resize_inode feature (if this isn't done automatically already) and not specify the "-E resize=nnnn" option for filesystems that are close to 16TB in size.
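
The approach described in this comment amounts to a size check before the option is emitted. A rough sketch in C, with hypothetical names throughout and with the resize maximum supplied by the caller. It is not taken from any Lustre patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of the decision: which mke2fs options to emit. */
struct resize_plan {
	bool pass_resize_opt;      /* emit "-E resize=<resize_blks>" */
	bool disable_resize_inode; /* emit "-O ^resize_inode" */
};

/* Skip the resize option, and turn resize_inode off, when the filesystem
 * would already be at or above the resize maximum. */
struct resize_plan plan_resize(uint64_t device_kb, uint64_t block_kb,
			       uint64_t resize_blks)
{
	struct resize_plan plan = { true, false };

	if (block_kb == 0 || device_kb / block_kb >= resize_blks) {
		plan.pass_resize_opt = false;
		plan.disable_resize_inode = true;
	}
	return plan;
}
```

Unlike shrinking the device, this keeps the full capacity and only gives up the resize reservation for devices where the comment above says it is of little use.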
            elliswilson Ellis Wilson added a comment -

            I believe this only applies to OSTs, and while I can disable it I'd like to better understand what the optimization is attempting to accomplish first.  I think you put this block in around 2011 (could totally be wrong – it's moved around a few times).  Do you remember what it was accomplishing?  I'm really struggling to understand this comment block:
               /* In order to align the filesystem metadata on 1MB boundaries,
                * give a resize value that will reserve a power-of-two group
                * descriptor blocks, but leave one block for the superblock.
                * Only useful for filesystems with < 2^32 blocks due to resize
                * limitations.

            Is ext metadata really unaligned without specifying resize?  Some docs suggest that without giving this, mke2fs plans for up to 1024 times the original size of the filesystem, so I don't feel like this is a case where we're trying to plan ahead more than mke2fs already does.
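
One way to read the quoted comment is that the resize maximum is 2^32 blocks minus the blocks covered by one full block of group descriptors. The sketch below assumes 32-byte group descriptors and 8 x block-size blocks per group (the usual ext4 layout without the 64bit feature). It is not copied from libmount_utils_ldiskfs.c, although for 4 KiB blocks it does reproduce the resize=4290772992 value seen in the description:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-byte group descriptors (ext4 without the 64bit feature). */
#define GROUP_DESC_BYTES 32ULL

/* Hypothetical helper: 2^32 blocks minus the blocks described by one
 * full block of group descriptors. */
uint64_t resize_max_blocks(uint64_t block_bytes)
{
	/* Block size must be a power of two of at least 1 KiB. */
	assert(block_bytes >= 1024 && (block_bytes & (block_bytes - 1)) == 0);

	/* 128 descriptors fit in one 4 KiB block. */
	uint64_t desc_per_block = block_bytes / GROUP_DESC_BYTES;
	/* One bitmap block tracks 8 * block_bytes blocks: 32768 for 4 KiB. */
	uint64_t blocks_per_group = 8 * block_bytes;

	return (1ULL << 32) - desc_per_block * blocks_per_group;
}
```

For 4 KiB blocks the subtracted term is 128 x 32768 = 4194304 blocks, which is 16 GiB.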


            adilger Andreas Dilger added a comment -

            Rather than shrink the MDT device, it would be better to just disable the resize_inode feature for such filesystems, since it is not useful for filesystems over 16TiB anyway.
            elliswilson Ellis Wilson added a comment - - edited

            No problem!  I fixed it with the following patch (going through the mechanics presently to test/submit the patch):

             

            --- a/lustre/utils/libmount_utils_ldiskfs.c
            +++ b/lustre/utils/libmount_utils_ldiskfs.c
            @@ -885,6 +885,15 @@ int ldiskfs_make_lustre(struct mkfs_opts *mop)
                            append_unique(start, ext_opts ? "," : " -E ",
                                          "resize", buf, maxbuflen);
                            ext_opts = 1;
            +
            +               /* The resize maximum must be greater than filesystem size, but for disks
            +                * or arrays just shy of 16TiB you can get into a situation where capacity
            +                * is between resize_blks and 16TiB.    Shrink the drive size to 1MiB less
            +                * than resize in these scenarios (at most ~0.1% capacity is lost). 
            +                */
            +               if (resize_blks <= mop->mo_device_kb / mop->mo_blocksize_kb) {
            +                       mop->mo_device_kb = (long long)(resize_blks) * (long long)mop->mo_blocksize_kb - 1024;
            +               }
                    }
             
                    /* Avoid zeroing out the full journal - speeds up mkfs */
            

            As posted on lustre-discuss, I have some questions about the intent behind resize, and IDK how to get this assigned to me (maybe that's only for WC people).
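
The clamp in the patch can be exercised in isolation. The sketch below restates it as a standalone C function. The function name and the plain integer parameters are illustrative, since the patch itself operates on `struct mkfs_opts`:

```c
#include <assert.h>
#include <stdint.h>

/* Standalone restatement of the patch's clamp. If the device holds at
 * least resize_blks blocks, shrink the usable size to 1 MiB (1024 KiB)
 * below the resize maximum; otherwise leave it unchanged. */
uint64_t clamp_device_kb(uint64_t device_kb, uint64_t block_kb,
			 uint64_t resize_blks)
{
	assert(block_kb > 0);
	assert(resize_blks * block_kb >= 1024); /* avoid unsigned underflow */

	if (resize_blks <= device_kb / block_kb)
		return resize_blks * block_kb - 1024;
	return device_kb;
}
```

For the 17179803648 KiB device in the description, with 4 KiB blocks and resize=4290772992, this returns 17163090944 KiB. That gives up 16321 MiB, a little under 0.1% of the capacity, which agrees with the estimate in the patch comment.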

            adilger Andreas Dilger added a comment - - edited

            Ellis, thanks for filing the ticket. Until this is fixed in the code, it should be possible to work around the issue by adding ",^resize_inode" to the "-O" feature list, and removing "resize=4290772992," from the "-E" extended options list on the mkfs.lustre command line.


            People

              elliswilson Ellis Wilson
              elliswilson Ellis Wilson