[LU-1366] getting "dirdata length set incorrectly" running e2fsck - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Won't Fix
Priority: Minor
Fix Version/s: Lustre 2.1.2
Affects Version/s: Lustre 2.1.1
Labels:
- llnl
Environment:
DDN SFA10k - Dell R710 - TOSS2.0 OS release

Severity:
3
Rank (Obsolete):
4619

Description

After adding a network to the file system and adding the IP for the failover node to the MDS, it wouldn't mount. (I later found that --param failnode= is no longer valid, much to my chagrin.) I attempted to run fsck against the file system, but it responded that the e2fsprogs was out of date for the file system, so I ran fsck.ldiskfs. The fsck.ldiskfs run found some bad inodes and corrected them, but on a subsequent run with the -n option (done to make sure it was clean) I started seeing a flood of "dirdata length set incorrectly" messages. I stopped it and was able to mount the FS, but later the FS spontaneously unmounted.

What does this mean? Fortunately this file system is in pre-production and can be recreated (which is intended), but I'd like to know if this was caused by running fsck.ldiskfs, since I did not see these messages on the first pass. The version of e2fsprogs (non-Redhat) is ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64. I have downloaded the wc4 version from the WC repo and installed it into a test image, into which I have rebooted the node. I was able to use e2fsck to check the FS and I am using the -fDy options, but the "dirdata length set incorrectly" message continues to stream and has been going for more than an hour.

Any help would be appreciated.

Attachments

fsck-c-cluster
677 kB
09/May/12 6:46 PM

Issue Links

is related to

LU-1774 fsck -fD corrupts filesystem

Resolved

LU-1540 e2fsck remove too many symlinks

Resolved

Trackbacks

Changelog 2.1 Changes from version 2.1.1 to version 2.1.2 Server support for kernels: 2.6.18308.4.1.el5 (RHEL5) 2.6.32220.17.1.el6 (RHEL6) Client support for unpatched kernels: 2.6.18308.4.1.el5 (RHEL5) 2.6.32220.17.1....

Sub-Tasks

Progress

short symlinks on MDT with "extents" have EXT4_EXTENTS_FL set

Resolved

Emoly Liu

Activity

Andreas Dilger added a comment - 16/May/12 1:00 AM

FEATURE_I8 is "mmp" and FEATURE_I12 is "dir_data". These are not being printed because you are using the stock "debugfs" instead of "debugfs.ldiskfs" (or whatever the equivalent is), which doesn't know what these features are called. That is expected when using a separate ldiskfsprogs and leaving the stock e2fsprogs installed.
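As an illustration of the point above, here is a minimal Python sketch that rewrites the placeholder names a stock dumpe2fs prints. The mapping comes only from this comment (the spelling "dirdata" follows the dumpe2fs output quoted elsewhere in this ticket); the function name and table are hypothetical, not part of any e2fsprogs tool.

```python
# Placeholder names printed by tools that do not know these feature bits,
# mapped to the names given in this comment. Hypothetical helper table.
UNNAMED_FEATURES = {
    "FEATURE_I8": "mmp",
    "FEATURE_I12": "dirdata",
}


def translate_features(line):
    """Return the feature names from a 'Filesystem features:' line,
    with known placeholder names replaced by their real names."""
    _, _, value = line.partition(":")
    return [UNNAMED_FEATURES.get(name, name) for name in value.split()]


line = ("Filesystem features: has_journal ext_attr resize_inode dir_index "
        "filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file "
        "huge_file uninit_bg dir_nlink extra_isize")
print(" ".join(translate_features(line)))
```

Running the ldiskfs-aware tools directly remains the simpler option; the sketch is only useful when reading output that was already captured with the stock tools.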

The "fsck.ldiskfs -fDy" problem will still exist, even without the extents option, unless you apply the patch from http://review.whamcloud.com/2661.

Joe Mervini added a comment - 15/May/12 10:17 PM

To be thorough I created the rest of the file system after reformatting the MDT and reran the symlink test. LLNL's fsck.ldiskfs -fy passed without errors.

Joe Mervini added a comment - 15/May/12 7:26 PM

I was able to reformat the MDT with mkfsoptions="-O ^extent" with the TOSS bits. It doesn't show up in the features of dumpe2fs, but there are FEATURE_I8 and FEATURE_I12 entries that I haven't found a reference for:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

So is it your opinion that we should start from scratch again while we have the chance?

Andreas Dilger added a comment - 15/May/12 6:47 PM

In the ldiskfs code (but not in upstream ext4) it is possible to mount a filesystem with the "noextents" mount option, so that new files/directories are not created with the extent flag set. This does not affect existing files/directories. There isn't really a mechanism to "migrate" such files to non-extent files without essentially a file-level backup/restore, but that will not currently work for 2.x MDT filesystems due to the Object Index becoming inconsistent. Only block-device level backup/restore is currently functional for 2.x MDT filesystems. The OI Scrub feature is nearing completion and will be landed for the 2.3 release, which will again allow file-level backup/restore for the MDT.

I think a combination of factors is required here, to avoid this problem for other filesystems:

- explicitly disable "extents" for MDT filesystems in mkfs_lustre.c (should go into 2.1.x)
- fix e2fsck so that it does not corrupt extent-mapped symlinks (this may already be fixed in newer e2fsprogs)
- land the OI Scrub feature for 2.3 (this is likely too much of a "feature" for 2.1.x)

Joe Mervini added a comment - 15/May/12 6:33 PM

All this being said, is there a way to back out the extent feature without reformatting the file system? As before I would prefer to deal with the pain now as opposed to down the road when there's a petabyte of data with no place to move it.

Andreas Dilger added a comment - 15/May/12 6:29 PM

Thinking about this further, I think I understand the root cause. The standard mkfs_lustre.c will call "mke2fs {lots of options}", which starts with an ext2 filesystem and enables the individual features needed to make the filesystem ext4. For the MDT filesystem, it does not turn on the "extents" feature, but it does for the OST.

In the TOSS ldiskfsprogs, I suspect that "mkfs.ldiskfs" starts with an "ext4" filesystem, and (re)sets the same options, but for the MDT it already has "extents" enabled.

It makes sense to explicitly disable the extents feature in mkfs_lustre.c for MDT filesystems, since they provide absolutely no benefit, and may instead be hurting performance. That is a simple matter of appending ",^extents" to the list of MDT features.
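The string change described above can be sketched as follows. This is Python purely for illustration (the real change would be in the C source of mkfs_lustre.c, which is not shown in this ticket); the function name is hypothetical, and the starting feature list is the one printed by mkfs.lustre --verbose in a later comment.

```python
def mdt_feature_option(features):
    """Append '^extents' to a comma-separated mke2fs -O feature list,
    unless it is already present."""
    names = [name for name in features.split(",") if name]
    if "^extents" not in names:
        names.append("^extents")
    return ",".join(names)


# Feature list as shown in the verbose mkfs.lustre output in this ticket.
current = "dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg"
print("-O " + mdt_feature_option(current))
```

With the disable flag passed explicitly, the result should no longer depend on whether the underlying mkfs starts from ext2 or ext4 defaults, which is the suspected difference between the two builds.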

Joe Mervini added a comment - 15/May/12 6:07 PM

Andreas - good call. I just checked the file system that is running with TOSS and here are the features described for the MDT:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

On the RHEL-6.1 created MDT these are the features:
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink

I did not intentionally set extents. So as a sanity check I re-created the MDT on the test machine. Even though it is not included in the command line and does not appear in the options for mkfs.lustre in verbose mode, the subsequent dumpe2fs output definitely shows it as being there.

[root@cmds1 ~]# mkfs.lustre --mgs --mdt --reformat --verbose --fsname=scratch2 --failnode=10.196.135.143@o2ib1 /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000

Permanent disk data:
Target: scratch2-MDTffff
Index: unassigned
Lustre FS: scratch2
Mount type: ldiskfs
Flags: 0x75
(MDT MGS needs_index first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: failover.node=10.196.135.143@o2ib1

device size = 2858688MB
formatting backing filesystem ldiskfs on /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
target name scratch2-MDTffff
4k blocks 731824160
options -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mkfs.ldiskfs -j -b 4096 -L scratch2-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 731824160
cmd: mkfs.ldiskfs -j -b 4096 -L scratch2-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 731824160
mkfs.ldiskfs 1.41.90.3chaos.wc3 (28-May-2011)
Discarding device blocks: failed - Operation not supported
Filesystem label=scratch2-MDTffff
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1463654128 inodes, 731824160 blocks
36591208 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2880079872
44689 block groups
16376 blocks per group, 16376 fragments per group
32752 inodes per group
Superblock backups stored on blocks:
16376, 49128, 81880, 114632, 147384, 409400, 442152, 802424, 1326456,
2047000, 3979368, 5616968, 10235000, 11938104, 35814312, 39318776,
51175000, 107442936, 255875000, 275231432, 322328808

Allocating group tables: done
Writing inode tables: done
Creating journal (102400 blocks): done
Multiple mount protection has been enabled with update interval 5 seconds.
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 0 mounts or
0 days, whichever comes first. Use tunefs.ldiskfs -c or -i to override.
Writing CONFIGS/mountdata

[root@cmds1 ~]# dumpe2fs -h /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
dumpe2fs 1.41.12 (17-May-2010)
Filesystem volume name: scratch2-MDTffff
Last mounted on: /ram/tmp/mntBrCPMe
Filesystem UUID: 48d57c6d-8156-4eb7-bcf8-a298fc0f7af9
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1463654128
Block count: 731824160
Reserved block count: 36591208
Free blocks: 548645355
Free inodes: 1463654115
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1024
Blocks per group: 16376
Fragments per group: 16376
Inodes per group: 32752
Inode blocks per group: 4094
Flex block group size: 16
Filesystem created: Tue May 15 15:31:37 2012
Last mount time: Tue May 15 16:02:53 2012
Last write time: Tue May 15 16:02:55 2012
Mount count: 1
Maximum mount count: 20
Last checked: Tue May 15 15:31:37 2012
Check interval: 0 (<none>)
Lifetime writes: 698 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: f6500d19-ed37-48f6-a446-e8485a2f9edf
Journal backup: inode blocks
Journal features: (none)
Journal size: 400M
Journal length: 102400
Journal sequence: 0x00000005
Journal start: 0
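The two feature lists quoted at the top of this comment can also be compared mechanically. The following is a small Python sketch, assuming the FEATURE_I8 and FEATURE_I12 placeholders correspond to "mmp" and "dirdata" as explained in a later comment; the helper names are hypothetical.

```python
# Assumed placeholder-to-name mapping, taken from a later comment here.
ALIASES = {"FEATURE_I8": "mmp", "FEATURE_I12": "dirdata"}


def feature_set(line):
    """Parse a 'Filesystem features:' line into a set of feature names."""
    _, _, value = line.partition(":")
    return {ALIASES.get(name, name) for name in value.split()}


toss = feature_set(
    "Filesystem features: has_journal ext_attr resize_inode dir_index "
    "filetype needs_recovery extent FEATURE_I8 flex_bg FEATURE_I12 "
    "sparse_super large_file huge_file uninit_bg dir_nlink extra_isize")
rhel = feature_set(
    "Filesystem features: has_journal ext_attr resize_inode dir_index "
    "filetype needs_recovery mmp flex_bg dirdata sparse_super large_file "
    "huge_file uninit_bg dir_nlink")

only_toss = sorted(toss - rhel)
only_rhel = sorted(rhel - toss)
print("only on TOSS MDT:", only_toss)
print("only on RHEL-6.1 MDT:", only_rhel)
```

Under that mapping, the only features present on the TOSS-created MDT and absent from the RHEL-6.1 one are extent and extra_isize.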

Andreas Dilger added a comment - 15/May/12 4:57 PM

Even if the "extents" feature on the MDT is the root cause, this would still be a serious issue in the e2fsck code that needs to be addressed.

Andreas Dilger added a comment - 15/May/12 4:55 PM

Is it possible that the TOSS version of mkfs.lustre is setting the "extents" feature for the MDT filesystem? For a test 2.x filesystem I have here (current git master and e2fsprogs-1.41.90.wc3-7.fc13.x86_64, but I don't think mkfs_lustre.c has changed recently) there is no "extents" feature enabled on the MDT filesystem:

# dumpe2fs -h /tmp/lustre-mdt1  | grep feature
dumpe2fs 1.41.90.wc3 (28-May-2011)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink
Journal features:         (none)

Having extents enabled is not useful for the MDT, and may even hurt performance because there is more metadata overhead for each block (it is rare that directory blocks are allocated contiguously on disk).

Joe Mervini added a comment - 15/May/12 4:47 PM

A little more data: No dirdata length errors with -fDy option on RHEL-6.1

[root@cmds1 osc]# e2fsck -fDy /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
e2fsck 1.41.90.wc4 (01-Sep-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information

scratch2-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
scratch2-MDT0000: 2910/1463654128 files (0.2% non-contiguous), 183179282/731824160 blocks

Joe Mervini added a comment - 15/May/12 4:15 PM

The problem appears to be TOSS2 specific.

As an experiment I created a RHEL6.1 image and installed the WC release of lustre. Repeating the tests that I performed under TOSS did NOT produce the corrupt extent headers for linked files. I have a big concern that the problem is mkfs.lustre related.

I will be opening a bugzilla bug with Livermore.

People

Assignee: Zhenyu Xu

Reporter: Joe Mervini

Votes: 0

Watchers: 9

Dates

Created: 03/May/12 2:02 PM

Updated: 16/Aug/16 4:35 PM

Resolved: 16/Aug/16 4:35 PM