[LU-10520] Cannot format large MDT with ldiskfs Created: 16/Jan/18  Updated: 18/Apr/19  Resolved: 08/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Blocker
Reporter: Joe Grund Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-12196 mkfs.lustre should handle large MDTs ... Resolved
Related
Severity: 3

 Description   

Taken from: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/450

 

Description

When trying to create an MDT on a 29TB volume, the mke2fs command receives ^extents instead of extents, which causes the command to fail.

Repro

 

Create an MDT on a large LUN.

 

modprobe osd_ldiskfs: 0


mkfs.lustre --mdt --mgsnode=172.21.61.200@tcp0 --mgsnode=172.21.61.206@tcp0 --failnode=172.21.61.200@tcp0 --reformat --index=0 --mkfsoptions=-I 512 -i 2048 -J size=2048 --backfstype=ldiskfs --fsname=BIGSI01 /dev/mapper/mpathb: 1

   Permanent disk data:
Target:     BIGSI01:MDT0000
Index:      0
Lustre FS:  BIGSI01
Mount type: ldiskfs
Flags:      0x61
              (MDT first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:  mgsnode=172.21.61.200@tcp:172.21.61.206@tcp failover.node=172.21.61.200@tcp

device size = 30501008MB
formatting backing filesystem ldiskfs on /dev/mapper/mpathb
	target name   BIGSI01:MDT0000
	4k blocks     7808258048
	options       -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,64bit,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L BIGSI01:MDT0000 -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,64bit,flex_bg -E lazy_journal_init -F /dev/mapper/mpathb 7808258048
   Found a gpt partition table in /dev/mapper/mpathb
   Extents MUST be enabled for a 64-bit filesystem.  Pass -O extents to rectify.


mkfs.lustre FATAL: Unable to build fs /dev/mapper/mpathb (256)

mkfs.lustre FATAL: mkfs failed 256
 

If I manually try the command and remove the "^" character, the command succeeds.
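
The manual workaround above amounts to a small rewrite of the feature list passed to mke2fs. The following is only a sketch of that hand edit, assuming the feature list is the argument that follows `-O`; the helper name `enable_extents` is hypothetical and is not part of mkfs.lustre.

```python
import shlex


def enable_extents(mkfs_cmd: str) -> str:
    """Return mkfs_cmd with '^extents' turned into 'extents' in the -O list.

    Hypothetical helper: it mirrors the manual edit described in the report,
    not anything mkfs.lustre does itself.
    """
    args = shlex.split(mkfs_cmd)
    for i, arg in enumerate(args[:-1]):
        if arg == "-O":
            features = args[i + 1].split(",")
            # Drop the '^' (disable) prefix from the extents feature only.
            args[i + 1] = ",".join(
                "extents" if f == "^extents" else f for f in features
            )
    return shlex.join(args)
```

All other options and features are passed through unchanged, so the result can be compared side by side with the original `mkfs_cmd` line in the log.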



 Comments   
Comment by Joe Grund [ 16/Jan/18 ]

OK, I found where the "^extents" comes from:

It is built into the mkfs.lustre binary from lustre-2.10.2-1.src.rpm.

More precisely, it seems to come from here:
lustre/utils/mount_utils_ldiskfs.c:601: append_unique(anchor, ",", "^extents", NULL, maxbuflen);

The line is the same for all the 2.10 releases, so this might not be entirely related to the issue (or else this probably would have shown in more places).

However, I am not sure that line 601 is supposed to look like that, since line 600, which adds "uninit_bg" to the options list, is:

lustre/utils/mount_utils_ldiskfs.c:600: append_unique(anchor, ",", "uninit_bg", NULL, maxbuflen);

Comment by Peter Jones [ 16/Jan/18 ]

Yang Sheng

Could you please look into this?

Thanks

Peter

Comment by Andreas Dilger [ 16/Jan/18 ]

There is no point in formatting an MDT filesystem larger than about 8-16TB, unless the new Data-on-MDT (DoM) feature is used, but that feature will not be available until the Lustre 2.11 release.

Since the MDT (without the DoM feature) only holds inodes (1KB in size for 2.10 and later) plus directories, xattrs, and some Lustre log files (average 2KB per inode), and there is an upper limit of 4B inodes, 4B * 2KB = 8TB. Having a larger MDT is largely a waste of space, since the extra space above 8TB cannot be used until the DoM feature is available.
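
The sizing estimate above can be checked with a few lines of arithmetic. This sketch assumes "4B inodes" means 2^32 and that KB and TB are binary units; the constant and function names are illustrative only.

```python
TIB = 2**40

MAX_INODES = 2**32          # assumed reading of the "4B inodes" upper limit
BYTES_PER_INODE = 2 * 1024  # "average 2KB per inode" from the comment above


def usable_mdt_bytes(inodes: int = MAX_INODES,
                     per_inode: int = BYTES_PER_INODE) -> int:
    """Space an ldiskfs MDT can consume when every inode is in use."""
    return inodes * per_inode


# 2**32 inodes * 2 KiB each is 2**43 bytes, i.e. 8 TiB.
usable_tib = usable_mdt_bytes() / TIB
```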

If you are formatting this very large MDT in anticipation of the DoM feature, and are aware of this limitation, that is OK. We need to make a patch to libmount_utils_ldiskfs.c to enable the extents feature only for MDT filesystems over 16TB in size.
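
The 16TB threshold follows from 32-bit block numbers with 4KB blocks: 2^32 blocks * 4KB = 16TB. The sketch below models the proposed selection logic in Python. It is not the patch itself; the function names and the exact boundary (a block count that no longer fits in 32 bits) are assumptions made for illustration.

```python
BLOCK_SIZE = 4096
MAX_32BIT_BLOCK_COUNT = 2**32 - 1  # assumed limit for a non-64bit filesystem


def needs_64bit(block_count: int) -> bool:
    """True when the block count no longer fits in 32 bits."""
    return block_count > MAX_32BIT_BLOCK_COUNT


def mdt_extents_option(block_count: int) -> str:
    """Feature string for an MDT: extents only when 64bit is required."""
    return "extents" if needs_64bit(block_count) else "^extents"


# Block counts taken from the two mkfs.lustre runs in this ticket.
large_lun = mdt_extents_option(7808258048)  # 29TB LUN -> "extents"
small_lun = mdt_extents_option(510027366)   # 2TB LUN  -> "^extents"
```

This matches the logs in this ticket: only the 29TB run had `64bit` in its feature list, and only that run failed.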

Comment by Louis Bailleul [ 16/Jan/18 ]

Hi,

Thanks for the quick clarification.
This is a test system, and the MDT was built from one of the LUNs that was supposed to be an OST.
There was no intention to use DoM at this point (but it is good to hear that the feature is coming).

I suspected an issue with the size of the MDT, as reducing the LUN from 29TB to 2TB allows it to format properly.

However, I still get the odd "^extents" in the mke2fs parameter list (is this a typo, or does '^' have a special meaning?).

mkfs.lustre --mdt --mgsnode=172.21.61.200@tcp0 --mgsnode=172.21.61.206@tcp0 --failnode=172.21.61.200@tcp0 --reformat --index=0 --mkfsoptions=-I 512 -i 2048 -J size=2048 --backfstype=ldiskfs --fsname=TEST01 /dev/mapper/mpathb: 0

Permanent disk data:
Target: TEST01:MDT0000
Index: 0
Lustre FS: TEST01
Mount type: ldiskfs
Flags: 0x61
(MDT first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=172.21.61.200@tcp:172.21.61.206@tcp failover.node=172.21.61.200@tcp

device size = 1992294MB
formatting backing filesystem ldiskfs on /dev/mapper/mpathb
target name TEST01:MDT0000
4k blocks 510027366
options -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L TEST01:MDT0000 -I 512 -i 2048 -J size=2048 -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/mpathb 510027366

 

One last thing: creating a 29TB zpool and creating an MDT on top of it works, even if, as you mentioned, this mostly wastes space while running 2.10.

Comment by Yang Sheng [ 17/Jan/18 ]

Hi, Louis,

The '^' means the feature is disabled. In other words, we disable the extents feature on MDTs by default.

Thanks,
YangSheng
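
To illustrate the '^' convention, here is a minimal sketch that splits an `-O` feature list into enabled and disabled features and flags the combination mke2fs rejected in this ticket. The function names are hypothetical; this is not how mke2fs or mkfs.lustre parse the list internally.

```python
def parse_features(option: str):
    """Split a comma-separated -O list into (enabled, disabled) sets."""
    enabled, disabled = set(), set()
    for name in option.split(","):
        name = name.strip()
        if not name:
            continue
        if name.startswith("^"):
            disabled.add(name[1:])  # '^feature' means "turn feature off"
        else:
            enabled.add(name)
    return enabled, disabled


def has_64bit_extents_conflict(option: str) -> bool:
    """True when 64bit is requested while extents is explicitly disabled."""
    enabled, disabled = parse_features(option)
    return "64bit" in enabled and "extents" in disabled
```

Applied to the two feature lists logged above, the check is true for the 29TB run (which includes `64bit`) and false for the 2TB run.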

Comment by Andreas Dilger [ 17/Jan/18 ]

Note that formatting a ZFS MDT of 29 TB is not necessarily a waste of space, since ZFS dynamically allocates inodes in the filesystem, though it uses about twice as much space per inode (4KB vs 2KB) compared to ldiskfs. That means if it is used to create a lot of files (as the MDT is traditionally used), it could hold about 7B inodes, or if it was used to hold 64KB files for DoM it could hold about 450M files.
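
These ZFS estimates can be reproduced roughly as follows. The sketch assumes binary units and that each DoM file costs its 64KB of data plus one 4KB inode; both are assumptions made for illustration, so the results only agree with the figures above to within rounding.

```python
TIB = 2**40
KIB = 2**10


def zfs_mdt_capacity(pool_bytes: int,
                     bytes_per_inode: int = 4 * KIB,
                     dom_file_bytes: int = 0) -> int:
    """Number of files that fit if each uses one inode plus optional DoM data."""
    return pool_bytes // (bytes_per_inode + dom_file_bytes)


pool = 29 * TIB
inode_only = zfs_mdt_capacity(pool)                           # about 7.8 billion
dom_files = zfs_mdt_capacity(pool, dom_file_bytes=64 * KIB)   # about 458 million
```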

Comment by Gerrit Updater [ 26/Jan/18 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/31037
Subject: LU-10520 mkfs: enable extents for big MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e9ce401071c41b07d494291183d92f3beace24c4

Comment by Sebastien Buisson (Inactive) [ 31/Jan/18 ]

There is no point in formatting an MDT filesystem larger than about 8-16TB, unless the new Data-on-MDT (DoM) feature is used, but that feature will not be available until the Lustre 2.11 release.

Since the MDT (without the DoM feature) only holds inodes (1KB in size for 2.10 and later) plus directories, xattrs, and some Lustre log files (average 2KB per inode), and there is an upper limit of 4B inodes, 4B * 2KB = 8TB. Having a larger MDT is largely a waste of space, since the extra space above 8TB cannot be used until the DoM feature is available.

Hi,
When calculating MDT disk space consumption, we should not forget the space used by Changelog entries. As a rule of thumb, a Changelog entry averages about 125 B. For instance, when touching a new file, if all Changelog entry types are recorded, 3 Changelog entries are generated (CREATE, OPEN, CLOSE).
My point is that space over 8TB on the MDT device can be useful for storing Changelog entries before they are consumed and cleared.

Cheers,
Sebastien.
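
As a rough worked example of the Changelog rule of thumb above (125 B per entry, 3 entries per newly touched file), the following sketch estimates the space held by unconsumed entries. The function names are illustrative, and real entry sizes vary with record type and file name length.

```python
ENTRY_BYTES = 125        # average Changelog entry size quoted above
ENTRIES_PER_CREATE = 3   # CREATE, OPEN, CLOSE when all types are recorded


def changelog_bytes(files_created: int) -> int:
    """Approximate space held by unconsumed entries for newly created files."""
    return files_created * ENTRIES_PER_CREATE * ENTRY_BYTES


def creates_that_fit(spare_bytes: int) -> int:
    """How many file creations fit in spare space before entries are cleared."""
    return spare_bytes // (ENTRIES_PER_CREATE * ENTRY_BYTES)
```

Under these assumptions each created file costs 375 B of Changelog space, so one billion unconsumed creations would hold about 375 GB.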

Comment by Gerrit Updater [ 08/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31037/
Subject: LU-10520 mkfs: enable extents for big MDT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eb65c3a586f1efd425f8360972b5d365cfecf7e1

Comment by Peter Jones [ 08/Mar/18 ]

Landed for 2.11

Generated at Sat Feb 10 02:35:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.