[LU-255] use ext4 features by default for newly formatted filesystems - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.1.0
Affects Version/s: Lustre 2.1.0
Labels: None
Rank (Obsolete): 5035

Description

There are a number of ext4 features that we should be enabling by default for newly formatted ldiskfs filesystems. In particular, the flex_bg option is important for reducing e2fsck time as well as avoiding the "slow first write" issues that have hit a number of customers with fuller OSTs. Using flex_bg would avoid a 10-minute delay at mount time and on each e2fsck run. It would also be useful to enable other features like huge_file (files > 2TB) and dir_nlink (> 65000 subdirectories) by default.

All of these features are enabled by default if we format the filesystem with the option "-t ext4". Alternatively, we could enable them individually in enable_default_backfs_features().
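The second approach, enabling features individually, can be sketched as follows. This is only an illustration in Python, not the code in mkfs_lustre.c: the helper name add_default_features and the DEFAULT_EXT4_FEATURES tuple are hypothetical, and the only behavior assumed from mke2fs is that "-O" takes a comma-separated feature list in which a leading "^" disables a feature.

```python
# Hypothetical helper; names are illustrative and not taken from mkfs_lustre.c.
DEFAULT_EXT4_FEATURES = ("flex_bg", "huge_file", "dir_nlink")


def add_default_features(user_features: str) -> str:
    """Return a comma-separated feature list with the defaults appended.

    A feature the user already named, or explicitly disabled with a
    leading '^', is left exactly as the user wrote it.
    """
    requested = [f for f in user_features.split(",") if f]
    # Strip the '^' prefix so "^flex_bg" counts as "the user mentioned flex_bg".
    named = {f.lstrip("^") for f in requested}
    for feature in DEFAULT_EXT4_FEATURES:
        if feature not in named:
            requested.append(feature)
    return ",".join(requested)


if __name__ == "__main__":
    print(add_default_features("extents,uninit_bg"))
    print(add_default_features("extents,^flex_bg"))
```

Leaving user-specified entries alone means an explicit "^flex_bg" still wins over the new default, which is probably the least surprising behavior for anyone who needs the old on-disk layout.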

See http://events.linuxfoundation.org/slides/2010/linuxcon_japan/linuxcon_jp2010_fujita.pdf for a summary of improvements. While we won't see the 12h e2fsck -> 5 minute e2fsck improvement shown there (we already use extents and uninit_bg), the flex_bg feature is definitely still a win.

Activity

Build Master (Inactive) added a comment - 18/May/11 4:41 PM

Integrated in lustre-master » x86_64,client,el5,ofa #122
~~LU-255~~: enable ext4 features by default

Oleg Drokin : eb012d4a10208b26c2d3e795a90f1bb07dde6d91
Files :

lustre/utils/mkfs_lustre.c
lustre/lvfs/fsfilt_ext3.c

Build Master (Inactive) added a comment - 18/May/11 4:38 PM

Integrated in lustre-master » x86_64,client,el6,inkernel #122
~~LU-255~~: enable ext4 features by default

Oleg Drokin : eb012d4a10208b26c2d3e795a90f1bb07dde6d91
Files :

lustre/lvfs/fsfilt_ext3.c
lustre/utils/mkfs_lustre.c

Build Master (Inactive) added a comment - 18/May/11 4:33 PM

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #122
~~LU-255~~: enable ext4 features by default

Oleg Drokin : eb012d4a10208b26c2d3e795a90f1bb07dde6d91
Files :

lustre/lvfs/fsfilt_ext3.c
lustre/utils/mkfs_lustre.c

Build Master (Inactive) added a comment - 18/May/11 4:31 PM

Integrated in lustre-master » i686,client,el5,ofa #122
~~LU-255~~: enable ext4 features by default

Oleg Drokin : eb012d4a10208b26c2d3e795a90f1bb07dde6d91
Files :

lustre/lvfs/fsfilt_ext3.c
lustre/utils/mkfs_lustre.c

Build Master (Inactive) added a comment - 18/May/11 4:30 PM

Integrated in lustre-master » i686,client,el5,inkernel #122
~~LU-255~~: enable ext4 features by default

Oleg Drokin : eb012d4a10208b26c2d3e795a90f1bb07dde6d91
Files :

lustre/lvfs/fsfilt_ext3.c
lustre/utils/mkfs_lustre.c

Build Master (Inactive) added a comment - 18/May/11 4:28 PM

Integrated in lustre-master » x86_64,client,el5,inkernel #122
~~LU-255~~: enable ext4 features by default

Oleg Drokin : eb012d4a10208b26c2d3e795a90f1bb07dde6d91
Files :

lustre/lvfs/fsfilt_ext3.c
lustre/utils/mkfs_lustre.c

Shuichi Ihara (Inactive) added a comment - 16/May/11 3:12 PM

Ah, yes. Is it worth testing "-i 2048" on the un-patched build to confirm the speedup? Also, I'm going to test e2fsck on the MDT and OST (at 0%, 50%, and 80% usage) with both the un-patched and patched builds.

Andreas Dilger added a comment - 16/May/11 11:14 AM

I suspect that the MDT format time is actually more than 2x as fast per inode, because it is writing 2x as many inodes for the same amount of space (using "-i 2048" for patched, and "-i 4096" for unpatched). Even if mke2fs isn't running faster on the MDT, e2fsck should still run faster due to flex_bg.
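The per-inode estimate can be checked with a little arithmetic on the mkfs.lustre timings reported in this ticket (MDT: 3591 s un-patched, 3361 s patched; OST: 1836 s un-patched, 15 s patched). The sketch below assumes both MDT runs used the same device size and the "-i" values quoted above. The function name per_inode_speedup is only illustrative.

```python
def per_inode_speedup(t_old: float, t_new: float,
                      bytes_per_inode_old: int,
                      bytes_per_inode_new: int) -> float:
    """Ratio of old to new format time per inode on a device of fixed size.

    The inode count scales as 1 / bytes-per-inode, so the device size
    cancels out of the ratio.
    """
    # Number of new inodes divided by number of old inodes.
    inode_ratio = bytes_per_inode_old / bytes_per_inode_new
    return (t_old / t_new) * inode_ratio


if __name__ == "__main__":
    # MDT: "-i 4096" un-patched vs "-i 2048" patched (assumed same device size).
    mdt = per_inode_speedup(3591, 3361, 4096, 2048)
    # OST: plain wall-clock ratio, no change in inode density assumed.
    ost = 1836 / 15
    print(f"MDT per-inode speedup: {mdt:.2f}x")
    print(f"OST wall-clock speedup: {ost:.1f}x")
```

Under those assumptions the MDT comes out at roughly 2.1x faster per inode, even though its wall-clock time only dropped by about 6%.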

Shuichi Ihara (Inactive) added a comment - 16/May/11 7:52 AM - edited

I'm testing on 2.x (RPMs from http://review.whamcloud.com/#change,480) and have some test updates. We have an 8TB MDT (changed from 16TB) and 16TB OSTs; here are the mkfs.lustre times.

        un-patched (sec)   patched (sec)
MDT     3591               3361
OST     1836               15

Formatting the OSTs sped up dramatically, but formatting the MDT did not get much faster.

Andreas Dilger added a comment - 15/May/11 11:40 PM

Realistically, it is very unlikely that anything from the old internal journal would be reused in this case. The journal superblock will be rewritten with a new journal transaction ID of 1 and marked as having no outstanding transactions to recover, and when the filesystem is mounted the TID will increment from 1.

If the node crashed before it had overwritten the journal (unlikely even under relatively low usage) there would still need to be transactions left in the journal that aligned right after the end of the current transaction, and also with the next TID in sequence.

In practice I think the chance of this is very low except in test filesystems that are reformatted repeatedly after a very short lifespan, but if you want I could drop this part of the patch. It avoids 400MB of IO to the device at mke2fs time, but even then this is a small portion of the inode table blocks being written.
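The two conditions described above for a stale transaction to be picked up (it starts right after the end of the current transaction, and it carries the next TID in sequence) can be written as a toy model. This is only a simplified sketch of the argument in this comment, not the actual journal recovery code, which involves more checks than this; the StaleTxn type and would_replay function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StaleTxn:
    """A leftover transaction from the previous filesystem's journal."""
    start_block: int  # journal block where the old transaction begins
    tid: int          # transaction ID recorded in the old transaction


def would_replay(stale: StaleTxn, next_block: int, next_tid: int) -> bool:
    """True only if the stale transaction looks like a valid continuation.

    Both conditions from the comment must hold at once: the stale
    transaction begins exactly where the current one ended, and its TID
    is the next one the new journal expects.
    """
    return stale.start_block == next_block and stale.tid == next_tid


if __name__ == "__main__":
    # New journal after a few transactions: next block 40, next TID 5.
    print(would_replay(StaleTxn(start_block=40, tid=5), 40, 5))     # both match
    print(would_replay(StaleTxn(start_block=40, tid=9173), 40, 5))  # TID differs
    print(would_replay(StaleTxn(start_block=64, tid=5), 40, 5))     # misaligned
```

Because a freshly formatted journal restarts its TID at 1, an old filesystem that ran for any length of time will have left TIDs far from the new sequence, which is why the coincidence mostly matters for test filesystems that are reformatted after a very short lifespan.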

Oleg Drokin added a comment - 15/May/11 10:35 PM

I wonder how safe it is to not zero the journal. Suppose this is mkfs on top of a previous ext4 filesystem. Could old transactions from the journal then be picked up in certain cases?

People

Assignee: Andreas Dilger

Reporter: Andreas Dilger

Dates

Created: 29/Apr/11 12:36 PM

Updated: 19/May/11 1:00 AM

Resolved: 19/May/11 1:00 AM