
Cannot format large MDT with ldiskfs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
    • Labels: None
    • Severity: 3

    Description

      Taken from: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/450

       

      Description

      When trying to create an MDT on a 29TB volume, the mke2fs command receives ^extents instead of extents, which causes the command to fail.

      Repro

       

      Create an MDT on a large LUN.

       

      modprobe osd_ldiskfs: 0
      
      
      mkfs.lustre --mdt --mgsnode=172.21.61.200@tcp0 --mgsnode=172.21.61.206@tcp0 --failnode=172.21.61.200@tcp0 --reformat --index=0 --mkfsoptions=-I 512 -i 2048 -J size=2048 --backfstype=ldiskfs --fsname=BIGSI01 /dev/mapper/mpathb: 1
      
         Permanent disk data:
      Target:     BIGSI01:MDT0000
      Index:      0
      Lustre FS:  BIGSI01
      Mount type: ldiskfs
      Flags:      0x61
                    (MDT first_time update )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters:  mgsnode=172.21.61.200@tcp:172.21.61.206@tcp failover.node=172.21.61.200@tcp
      
      device size = 30501008MB
      formatting backing filesystem ldiskfs on /dev/mapper/mpathb
      	target name   BIGSI01:MDT0000
      	4k blocks     7808258048
      	options       -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,64bit,flex_bg -E lazy_journal_init -F
      mkfs_cmd = mke2fs -j -b 4096 -L BIGSI01:MDT0000 -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,64bit,flex_bg -E lazy_journal_init -F /dev/mapper/mpathb 7808258048
         Found a gpt partition table in /dev/mapper/mpathb
         Extents MUST be enabled for a 64-bit filesystem.  Pass -O extents to rectify.
      
      
      mkfs.lustre FATAL: Unable to build fs /dev/mapper/mpathb (256)
      
      mkfs.lustre FATAL: mkfs failed 256
       
      
      

      If I manually try the command and remove the "^" character, the command succeeds.
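      For reference, removing the "^" amounts to re-running the generated mke2fs command with extents left enabled in the -O list (same geometry as above):

      mke2fs -j -b 4096 -L BIGSI01:MDT0000 -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,extents,mmp,dir_nlink,quota,huge_file,64bit,flex_bg -E lazy_journal_init -F /dev/mapper/mpathb 7808258048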

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31037/
            Subject: LU-10520 mkfs: enable extents for big MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: eb65c3a586f1efd425f8360972b5d365cfecf7e1

            sbuisson Sebastien Buisson (Inactive) added a comment -

            There is no point in formatting an MDT filesystem larger than about 8-16TB, unless the new Data-on-MDT (DoM) feature is used, but that feature will not be available until the Lustre 2.11 release.

            Since the MDT (without the DoM feature) only holds inodes (1KB in size for 2.10 and later) plus directories, xattrs, and some Lustre log files (average 2KB per inode), and there is an upper limit of 4B inodes, 4B * 2KB = 8TB. Having a larger MDT is largely a waste of space, since the extra space above 8TB cannot be used until the DoM feature is available.

            Hi,
            When calculating MDT disk space consumption, we should not forget the space used by Changelog entries. As a rule of thumb, a Changelog entry averages about 125 bytes, and, for instance, when touching a new file with all Changelog entry types recorded, 3 Changelog entries are generated (CREATE, OPEN, CLOSE).
            My point is that the space over 8TB on the MDT device can be useful to store Changelog entries before they are consumed and cleared.

            Cheers,
            Sebastien.
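            To put a rough number on that (an illustrative estimate only, using the 125-byte average above): recording CREATE, OPEN and CLOSE for 1 billion new files would generate roughly 1e9 * 3 * 125 B ≈ 375 GB of Changelog records waiting to be consumed, so some headroom above the 8TB inode budget is not automatically wasted.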

            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/31037
            Subject: LU-10520 mkfs: enable extents for big MDT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e9ce401071c41b07d494291183d92f3beace24c4

            adilger Andreas Dilger added a comment -

            Note that formatting a ZFS MDT of 29 TB is not necessarily a waste of space, since ZFS dynamically allocates inodes in the filesystem, though it uses about twice as much space per inode (4KB vs 2KB) compared to ldiskfs. That means if it is used to create a lot of files (as the MDT is traditionally used), it could hold about 7B inodes, or if it was used to hold 64KB files for DoM it could hold about 450M files.
            ys Yang Sheng added a comment -

            Hi Louis,

            The '^' means the feature is disabled. That is, we disable the extents feature on the MDT by default.

            Thanks,
            YangSheng
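            (For context on the syntax: in the mke2fs -O feature list, a bare feature name enables that feature and a leading '^' clears it, so "^extents" explicitly requests a filesystem without extent-mapped files.)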
            lbailleul Louis Bailleul (Inactive) added a comment - - edited

            Hi,

            Thanks for the quick clarification.
            This is a test system, and basically the MDT was built out of one of the LUNs that was supposed to be an OST.
            There was no intention to use DoM at this point (but it is good to hear that the feature is coming).

            I suspected an issue with the size of the MDT, as reducing the LUN from 29TB to 2TB allows it to format properly.

            Although I still get the weird "^extents" in the mke2fs parameters list (is this a typo, or does '^' have a special meaning?).

            mkfs.lustre --mdt --mgsnode=172.21.61.200@tcp0 --mgsnode=172.21.61.206@tcp0 --failnode=172.21.61.200@tcp0 --reformat --index=0 --mkfsoptions=-I 512 -i 2048 -J size=2048 --backfstype=ldiskfs --fsname=TEST01 /dev/mapper/mpathb: 0

               Permanent disk data:
            Target:     TEST01:MDT0000
            Index:      0
            Lustre FS:  TEST01
            Mount type: ldiskfs
            Flags:      0x61
                          (MDT first_time update )
            Persistent mount opts: user_xattr,errors=remount-ro
            Parameters:  mgsnode=172.21.61.200@tcp:172.21.61.206@tcp failover.node=172.21.61.200@tcp

            device size = 1992294MB
            formatting backing filesystem ldiskfs on /dev/mapper/mpathb
                target name   TEST01:MDT0000
                4k blocks     510027366
                options       -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
            mkfs_cmd = mke2fs -j -b 4096 -L TEST01:MDT0000 -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/mpathb 510027366

             

            One last thing: creating a zpool of 29TB and creating an MDT on top of it works, even if, as you mentioned, this is mostly wasting space while running 2.10.


            adilger Andreas Dilger added a comment -

            There is no point in formatting an MDT filesystem larger than about 8-16TB, unless the new Data-on-MDT (DoM) feature is used, but that feature will not be available until the Lustre 2.11 release.

            Since the MDT (without the DoM feature) only holds inodes (1KB in size for 2.10 and later) plus directories, xattrs, and some Lustre log files (average 2KB per inode), and there is an upper limit of 4B inodes, 4B * 2KB = 8TB. Having a larger MDT is largely a waste of space, since the extra space above 8TB cannot be used until the DoM feature is available.

            If you are formatting this very large MDT in anticipation of the DoM feature, and are aware of this limitation, that is OK. We need to make a patch to libmount_utils_ldiskfs.c to enable the extents feature only for MDT filesystems over 16TB in size.
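            A minimal sketch of the kind of change described above (not the patch that actually landed; see https://review.whamcloud.com/31037 for that). It assumes the existing append_unique() helper from lustre/utils/mount_utils_ldiskfs.c, with its signature inferred from the call sites quoted later in this ticket, and a hypothetical dev_blocks variable holding the device size in 4 KiB blocks:

                #include <stdint.h>
                #include <stddef.h>

                /* Existing helper in mount_utils_ldiskfs.c (see the call sites at
                 * lines 600-601 quoted below); declared here only so the sketch is
                 * self-contained. Signature is inferred, not copied from the tree. */
                void append_unique(char *buf, char *prepend, char *key, char *val,
                                   size_t maxbuflen);

                /* 2^32 blocks of 4 KiB = 16 TiB. Above this size mke2fs needs the
                 * 64bit feature, and a 64-bit ldiskfs filesystem requires extents. */
                #define LDISKFS_MAX_32BIT_BLOCKS 0xffffffffULL

                static void mdt_append_extents_opt(char *anchor, uint64_t dev_blocks,
                                                   size_t maxbuflen)
                {
                        if (dev_blocks > LDISKFS_MAX_32BIT_BLOCKS)
                                /* large MDT: 64bit is in use, so extents must be on */
                                append_unique(anchor, ",", "extents", NULL, maxbuflen);
                        else
                                /* small MDT: keep extent mapping disabled as before */
                                append_unique(anchor, ",", "^extents", NULL, maxbuflen);
                }

            The 16 TiB boundary is exactly where the 64bit feature shows up in the generated option list in the description, and where mke2fs starts refusing with "Extents MUST be enabled for a 64-bit filesystem".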
            pjones Peter Jones added a comment -

            Yang Sheng

            Could you please look into this?

            Thanks

            Peter

            joe.grund Joe Grund added a comment -

            OK, found where the "^extents" comes from:

            It is built into the mkfs.lustre binary from lustre-2.10.2-1.src.rpm.

            Specifically, it seems to come from here:
            lustre/utils/mount_utils_ldiskfs.c:601: append_unique(anchor, ",", "^extents", NULL, maxbuflen);

            The line is the same for all the 2.10 releases, so this might not be entirely related to the issue (or else it probably would have shown up in more places).

            Although, I am not sure that line 601 is supposed to look like that, since line 600, which adds "uninit_bg" to the options list, is:

            lustre/utils/mount_utils_ldiskfs.c:600: append_unique(anchor, ",", "uninit_bg", NULL, maxbuflen);


            People

              Assignee: Yang Sheng (ys)
              Reporter: Joe Grund (joe.grund)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: