
use ext4 features by default for newly formatted filesystems

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • 5035

    Description

      There are a number of ext4 features that we should be enabling by default for newly-formatted ldiskfs filesystems. In particular, the flex_bg option is important for reducing e2fsck time as well as avoiding the "slow first write" issues that have hit a number of customers with fuller OSTs. Using flex_bg would avoid the 10-minute delay at mount time or for each e2fsck run. It would also be useful to enable other features like huge_file (files > 2TB) and dir_nlink (> 65000 subdirectories) by default.

      All of these features are enabled by default if we format the filesystem with the option "-t ext4". Alternatively, we could enable them individually in enable_default_backfs_features().

      See http://events.linuxfoundation.org/slides/2010/linuxcon_japan/linuxcon_jp2010_fujita.pdf for a summary of improvements. While we won't see the 12h e2fsck -> 5 minute e2fsck improvement shown there (we already use extents and uninit_bg), the flex_bg feature is definitely still a win.
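
      For reference, the same features can already be requested by hand by passing them through to mke2fs with --mkfsoptions; the sketch below only illustrates the syntax, and the device path and exact feature list are illustrative rather than the patch's final defaults.

      # Sketch: enable the proposed ext4 features explicitly via --mkfsoptions.
      # The feature list and device path are examples, not the new defaults.
      mkfs.lustre --verbose --reformat --mgs --mdt \
          --mkfsoptions="-O flex_bg,huge_file,dir_nlink,uninit_bg" \
          /dev/mpath/mdt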


        Activity

          [LU-255] use ext4 features by default for newly formatted filesystems

          adilger Andreas Dilger added a comment -

          Realistically, it is very unlikely that anything from the internal journal would be re-used in this case. The journal superblock will be rewritten with a new journal transaction ID of 1, marking no outstanding transactions to recover, and when the filesystem is mounted the TID will increment from 1.

          If the node crashed before it had overwritten the journal (unlikely even under relatively low usage), there would still need to be transactions left in the journal that align right after the end of the current transaction and carry the next TID in sequence.

          In practice I think the chance of this is very low except in test filesystems that are reformatted repeatedly after a very short lifespan, but if you want I could drop this part of the patch. It avoids 400MB of IO to the device at mke2fs time, but even then this is a small portion of the inode table blocks being written.
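
          For anyone uneasy about skipping the journal zeroing, something like the following sketch should bring the old behaviour back at format time; it assumes the installed mke2fs accepts an explicit value for the lazy_journal_init extended option, and the device path is only an example.

          # Sketch: force full journal initialization when reformatting on top
          # of an old ext4 filesystem (assumes mke2fs understands the =0 form).
          mkfs.lustre --reformat --mgs --mdt \
              --mkfsoptions="-E lazy_journal_init=0" \
              /dev/mpath/mdt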

          green Oleg Drokin added a comment -

          I wonder how safe it is to not zero the journal? Suppose this is mkfs on top of a previous ext4 filesystem. Could it happen then that in certain cases old transactions from the journal would be picked up?


          adilger Andreas Dilger added a comment -

          Ihara, thanks for testing. Did you test on 2.x or 1.8?

          As for the problem hit on the MDT, I agree that the mkfs.lustre command should handle this case better. However, I also think that it doesn't make sense to have a 16TB MDT, because that much space will never be used. One of the changes being made in this patch is to reduce the default inode ratio to 2048 bytes per inode, which is still very safe but allows more inodes for a given LUN size. I would recommend simply using a smaller LUN for the MDT. With the new inode ratio, 8TB is enough for the maximum of 4B inodes.
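
          For reference, a quick sketch of the arithmetic behind that 8TB figure (using binary units; the invocation is illustrative, not from the patch itself):

          # 8 TiB of MDT space at the new default of one inode per 2048 bytes:
          echo $((8 * 1024**4 / 2048))    # 4294967296 inodes, i.e. the ~4B (2^32) ldiskfs inode limit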


          ihara Shuichi Ihara (Inactive) added a comment -

          Formatting the MDT also worked when I added --mkfsoptions="-i 4096" to mkfs.lustre...


          ihara Shuichi Ihara (Inactive) added a comment -

          I'm also interested in these patches and just tested the patched RPMs. When I formatted the MDT (16TB), it failed with the following errors. Any advice? The OST format worked well.

          mkfs.lustre --verbose --reformat --mgs --mdt /dev/mpath/mdt

          Permanent disk data:
          Target: lustre-MDTffff
          Index: unassigned
          Lustre FS: lustre
          Mount type: ldiskfs
          Flags: 0x75
          (MDT MGS needs_index first_time update )
          Persistent mount opts: user_xattr,errors=remount-ro
          Parameters:

          device size = 14934016MB
          formatting backing filesystem ldiskfs on /dev/mpath/mdt
          target name lustre-MDTffff
          4k blocks 3823108096
          options -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init, -F
          mkfs_cmd = mke2fs -j -b 4096 -L lustre-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init, -F /dev/mpath/mdt 3823108096
          cmd: mke2fs -j -b 4096 -L lustre-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init, -F /dev/mpath/mdt 3823108096
          mke2fs 1.41.12.2.ora1 (14-Aug-2010)
          mke2fs: too many inodes (7646216192), raise inode ratio?

          mkfs.lustre FATAL: Unable to build fs /dev/mpath/mdt (256)

          mkfs.lustre FATAL: mkfs failed 256
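
          (The failure is simply the requested inode count overflowing the 2^32 ldiskfs limit; a quick sketch of the arithmetic from the numbers in the output above, which also shows why the -i 4096 workaround noted in the comment above fits:)

          # 3823108096 x 4KB blocks at one inode per 2048 bytes:
          echo $((3823108096 * 4096 / 2048))   # 7646216192 inodes, past the 2^32 (4294967296) limit
          # The same LUN at one inode per 4096 bytes (-i 4096):
          echo $((3823108096 * 4096 / 4096))   # 3823108096 inodes, which fits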


          adilger Andreas Dilger added a comment -

          Oleg, this patch should be included in the 2.1 release - it dramatically speeds up mkfs and should fix (for new filesystems) the slow startup problems seen in LU-15.


          adilger Andreas Dilger added a comment -

          Jeremy, test RPMs are available via http://review.whamcloud.com/#change,480 if you are able to test them. They are built from the lustre-release repository, so the mkfs.lustre is not directly useful to you if you are testing on 1.8.x.

          The default parameters for an OST with this patch (assuming a large-enough LUN size and ext4-based ldiskfs) are:

          mke2fs -j -b 4096 -L lustre-OSTffff -J size=400 -I 256 -i 262144 -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256 -E resize=4290772992,lazy_journal_init, -F {dev}

          For an MDT they are:

          mke2fs -j -b 4096 -L lustre-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init, -F {dev}
          pjones Peter Jones added a comment -

          Andreas seems to be working on this


          jfilizetti Jeremy Filizetti added a comment -

          I'll be running some tests with ~8 TB and larger LUNs over the next few weeks to see the performance impact of various settings for the number of groups in a flexible block group; when I have some results I will post them here. My main focus, though, is to alleviate the slow mounts and other issues from LU-15. At the least, mkfs.lustre for a 9 TB LUN drops from 17 minutes to 6 minutes with >64 for the number of groups.
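
          For anyone wanting to repeat that comparison, the flex_bg group count can be set explicitly at format time; the sketch below assumes a -G value passed through --mkfsoptions takes effect alongside (or in place of) whatever default mkfs.lustre adds, and the MGS NID, filesystem name and device are placeholders (256 is what the patch uses for OSTs by default).

          # Sketch: time an OST format with an explicit flex_bg group count (-G).
          # mgsnode, fsname and device are placeholders for a real configuration.
          time mkfs.lustre --reformat --ost --fsname=lustre --mgsnode=mgs@tcp0 \
              --mkfsoptions="-G 64" \
              /dev/sdX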


          People

            Assignee: adilger Andreas Dilger
            Reporter: adilger Andreas Dilger
            Votes: 0
            Watchers: 5
