[LU-1366] getting "dirdata length set incorrectly" running e2fsck Created: 03/May/12  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: Lustre 2.1.2

Type: Bug Priority: Minor
Reporter: Joe Mervini Assignee: Zhenyu Xu
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

DDN SFA10k - Dell R710 - TOSS2.0 OS release


Attachments: File fsck-c-cluster    
Issue Links:
Related
is related to LU-1774 fsck -fD corrupts filesystem Resolved
is related to LU-1540 e2fsck remove too many symlinks Resolved
Sub-Tasks:
Key | Summary | Type | Status | Assignee
LU-2634 | short symlinks on MDT with "extents" ... | Technical task | Resolved | Emoly Liu
Severity: 3
Rank (Obsolete): 4619

 Description   

After adding a network to the file system and adding the IP for the failover node to the MDS it wouldn't mount. (I later found that --param failnode= is no longer valid - much to my chagrin) I attempted to run fsck against the file system but it responded that the e2fsprogs was out of date for the file system so I ran fsck.ldiskfs. The fsck.ldiskfs found some bad inodes and corrected them but on a subsequent run with the -n option (done to make sure it was clean) I started seeing a flood of "dirdata length set incorrectly" messages. I stopped it and was able to mount the FS but later the FS spontaneously unmounted.

What does this mean? Fortunately this file system is in pre-production and can be recreated (which is intended), but I'd like to know if this was caused by running fsck.ldiskfs, since I did not see these messages on the first pass. The version of e2fsprogs (non-Redhat) is ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64. I have downloaded the wc4 version from the WC repo and installed it into a test image, into which I have rebooted the node. I was able to use e2fsck to check the FS and I am using the -fDy options, but the "dirdata length set incorrectly" message continues to stream and has been going for more than an hour.

Any help would be appreciated.



 Comments   
Comment by Andreas Dilger [ 03/May/12 ]

The ldiskfsprogs package is from LLNL I think, but I assume it matches our e2fsprogs-1.41.90.wc3 version.

It is possible that running "e2fsck -fDy" (in particular the "-D" option, which is trying to compress and optimize the htree directory structure) is having some kind of bad interaction with the dirdata feature (which is storing the Lustre FID after each filename in the directory entry). The "dirdata" feature was added for Lustre 2.x, and is not currently present in the upstream e2fsprogs.

Unfortunately, the "dirdata length set incorrectly" message doesn't quite report enough information about how or why it thinks the length is bad. It seems that this problem is easily reproduced by running e2fsck with the "-D" option, but not in the case of a normal e2fsck run:

[root@sookie lustre-head]# e2fsck -fy /tmp/lustre-mdt1
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

[root@sookie lustre-head]# e2fsck -fn /tmp/lustre-mdt1
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: 128/100000 files (19.5% non-contiguous), 19545/50000 blocks

[root@sookie lustre-head]# e2fsck -fDy /tmp/lustre-mdt1 
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information

lustre-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
lustre-MDT0000: 128/100000 files (19.5% non-contiguous), 19542/50000 blocks

[root@sookie lustre-head]# e2fsck -fn /tmp/lustre-mdt1 
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '.' in /PENDING (25002) dirdata length set incorrectly.
Clear? no

Entry '.' in /PENDING (25002) dirdata length set incorrectly.
Clear? no

Entry '..' in /PENDING (25002) dirdata length set incorrectly.
Clear? no

Entry '..' in /PENDING (25002) dirdata length set incorrectly.
Clear? no

Entry '.' in /ROOT (25003) dirdata length set incorrectly.
Clear? no

Entry '.' in /ROOT (25003) dirdata length set incorrectly.
Clear? no

Entry '.' in /ROOT (25003) dirdata length set incorrectly.
Clear? no

Entry '..' in /ROOT (25003) dirdata length set incorrectly.
Clear? no

Entry '..' in /ROOT (25003) dirdata length set incorrectly.
Clear? no

Entry '.' in /ROOT/.lustre (25004) dirdata length set incorrectly.
Clear? no

Entry '.' in /ROOT/.lustre (25004) dirdata length set incorrectly.
Clear? no

Entry '..' in /ROOT/.lustre (25004) dirdata length set incorrectly.
Clear? no

Entry '..' in /ROOT/.lustre (25004) dirdata length set incorrectly.
Clear? no

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
lustre-MDT0000: 128/100000 files (18.8% non-contiguous), 19542/50000 blocks

Fortunately, the Lustre FID information stored in the extended directory entry data is not required for proper operation (it is a performance optimization, and the FID will be retrieved from an inode extended attribute if not in the directory entry), but it isn't clear if the "e2fsck -D" run has left the filesystem in some inconsistent state that would confuse the MDS code.

In my testing, even running "e2fsck -fy" repeatedly does not fix the problem.
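When the message streams for a long time, it can help to summarize which directories are affected. The following is a minimal sketch that tallies the "dirdata length set incorrectly" reports per directory from saved e2fsck output. It assumes the message format shown in the transcript above; the function name `count_dirdata_errors` and the sample text are illustrative only and are not part of e2fsprogs.

```python
import re
from collections import Counter

# Matches lines such as:
#   Entry '..' in /ROOT/.lustre (25004) dirdata length set incorrectly.
# The pattern is an assumption based only on the transcript in this ticket.
DIRDATA_RE = re.compile(
    r"^Entry '(?P<name>.*)' in (?P<dir>.+) \((?P<ino>\d+)\) "
    r"dirdata length set incorrectly\.$"
)


def count_dirdata_errors(e2fsck_output: str) -> Counter:
    """Count dirdata length reports per (directory path, directory inode)."""
    counts = Counter()
    for line in e2fsck_output.splitlines():
        match = DIRDATA_RE.match(line.strip())
        if match:
            counts[(match.group("dir"), int(match.group("ino")))] += 1
    return counts


# Shortened sample in the same format as the e2fsck -fn run above.
sample = """\
Pass 2: Checking directory structure
Entry '.' in /PENDING (25002) dirdata length set incorrectly.
Clear? no

Entry '..' in /PENDING (25002) dirdata length set incorrectly.
Clear? no

Entry '.' in /ROOT (25003) dirdata length set incorrectly.
Clear? no
"""

for (directory, inode), n in sorted(count_dirdata_errors(sample).items()):
    print(f"{directory} (inode {inode}): {n}")
```

Feeding it the output of a read-only run (e2fsck -fn redirected to a file) keeps the check itself from modifying the filesystem.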

Comment by Peter Jones [ 03/May/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Joe Mervini [ 03/May/12 ]

I had the same thought about the -D option and ran the newer version of e2fsck with the -fy option. It cleaned up a lot of the errors, but some remained in a couple of directories, consistent with your last comment. I reran e2fsck several more times, though, and it completed in a fairly timely manner.

As I mentioned earlier, this file system is pre-production and was built on lustre 2.0.something from an alpha release of the TOSS OS. I am in the process of backing up the data that people want to keep and then am going to rebuild the file system with the current release code.

But I will take care to NOT use the optimize directory option to e2fsck on the MDT from now on.

Comment by Christopher Morrone [ 03/May/12 ]

It would have been lustre 2.1.x. We skipped over 2.0.x.

And yes, ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64 is from LLNL. We recently shared the patches to make ldiskfsprogs in LU-929.

Comment by Zhenyu Xu [ 07/May/12 ]

patch tracking at http://review.whamcloud.com/2661

I tested it locally: with the patch, "e2fsck -fD" no longer stores directory entries with erroneous dirdata.

Comment by Joe Mervini [ 09/May/12 ]

Just wanted to provide some feedback regarding potential fallout from running e2fsck prior to the patch:

I ran the fsck with the -y option which attempted to correct the dirdata problem. After backing up and restoring all the user data it appears that all symlinks that were previously on the file system are gone.

Is it reasonable to assume the dirdata errors (there were hundreds) were referencing the symlinks since they are metadata only? All the real data seems to be intact.

Comment by Joe Mervini [ 09/May/12 ]

I wanted to test the assumption so on a like system I created a fresh lustre file system, created a directory on the newly created file system, and populated the directory with the content of /usr/bin. I then created a second directory and symlinked all the files in the first directory to the second.

I then stopped lustre on the MDS and first ran e2fsck -n on the MDT, which came back clean. However, when I ran fsck -fy on the file system (I intentionally avoided the directory optimization flag to start), the fsck complained about corrupt extent headers on all of those symlinked files and removed them. I restarted lustre to verify that this was indeed what happened.

I repeated this test a number of times so it is highly reproducible. In addition on the final test after the initial e2fsck -fy was run, I ran e2fsck -fDy and it finished cleanly (i.e., no dirdata length errors).

This is definitely a bug that should be escalated. Fortunately it was discovered preproduction in our case, but as is, it is extremely dangerous to run a file system check.

I have already attached the output from e2fsck for examination.

Comment by Zhenyu Xu [ 09/May/12 ]

What e2fsck version do you use?

I tried with my e2fsck and didn't hit the extent tree read error.

# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdc                187464     24796    152668  14% /mnt/ost1
/dev/sdd                187464     24768    152696  14% /mnt/ost2
test3@tcp:/lustre       374928     49564    305236  14% /mnt/lustre
/dev/sdb                149944     17388    122556  13% /mnt/mds1

# /bin/cp /bin/login /mnt/lustre && mkdir -p /mnt/lustre/dir && ln -f /mnt/lustre/login /mnt/lustre/dir/link && ln -sf /mnt/lustre/login /mnt/lustre/dir/sym

# umount /mnt/mds1

# e2fsck -n /dev/sdb
e2fsck 1.41.90.wc4 (01-Sep-2011)
lustre-MDT0000: clean, 106/100000 files, 16861/50000 blocks

# e2fsck -fy /dev/sdb
e2fsck 1.41.90.wc4 (01-Sep-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: 106/100000 files (2.8% non-contiguous), 16861/50000 blocks

Comment by Joe Mervini [ 09/May/12 ]

Sorry - I thought I had included that info. In this instance we are running:

[root@cmds1 ~]# rpm -qa |grep e2fs
e2fsprogs-libs-1.41.90.wc4-7.el6.x86_64
e2fsprogs-devel-1.41.90.wc4-7.el6.x86_64
e2fsprogs-1.41.90.wc4-7.el6.x86_64

The version of lustre (mkfs.lustre) is:

lustre-2.1.1-2chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64

The storage unit we are using is a DDN-3015 configured as a RAID10 array, fiber channel dual attached to 2 Dell R710 servers. Have you heard of any issues with a similar configuration?

The only thing that I see that may be different between our two processes is that your link directory is a child of the main directory. In my test I had both directories at the same hierarchical level. I will test again with links in a child directory in the morning and report the results.

Comment by Joe Mervini [ 14/May/12 ]

I continue to see errors on my file system similar to the output that I sent last week. If I create a directory and populate it, then create another directory and symlink the files from the first directory into the second, running fsck.ldiskfs against the MDT produces an error like this for every link:

[root@cmds1 osc]# fsck.ldiskfs -fn /dev/sda >> /tmp/fsck.out
fsck.ldiskfs 1.41.90.3chaos.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Error while reading over extent tree in inode 1051732234: Corrupt extent header
Clear inode? no
Pass 2: Checking directory structure
Symlink /ROOT/jamervi/bin/sfdp (inode #1051733038) is invalid.
Clear? no

I have run this against another diskless image and have been able to duplicate this multiple times. I thought that perhaps dm-multipath could be a contributor, so I flushed the maps and ran fsck.ldiskfs against the sd device, with the same results. In this instance, here are the particulars of the lustre software environment:

[root@cmds1 osc]# rpm -qa|grep kernel
dracut-kernel-004-256.el6_2.1.noarch
kernel-headers-2.6.32-220.7.1.7chaos.ch5.x86_64
kernel-2.6.32-220.7.1.7chaos.ch5.x86_64
kernel-debuginfo-common-x86_64-2.6.32-220.7.1.7chaos.ch5.x86_64
kernel-firmware-2.6.32-220.7.1.7chaos.ch5.x86_64
kernel-debuginfo-2.6.32-220.7.1.7chaos.ch5.x86_64
kernel-devel-2.6.32-220.7.1.7chaos.ch5.x86_64

[root@cmds1 osc]# rpm -qa |grep lustre
lustre-tools-llnl-1.4-2.ch5.x86_64
lustre-modules-2.1.1-3chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64
lustre-2.1.1-3chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64

[root@cmds1 osc]# rpm -qa |grep ldiskfs
ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64
ldiskfs-devel-4.0.6-0_2.6.32_220.7.1.7chaos.ch5.x86_64
ldiskfs-4.0.6-0_2.6.32_220.7.1.7chaos.ch5.x86_64

[root@cmds1 osc]# rpm -qa |grep e2fs
e2fsprogs-1.41.12-11.el6.x86_64
e2fsprogs-libs-1.41.12-11.el6.x86_64
e2fsprogs-devel-1.41.12-11.el6.x86_64

WRT the location of the link directory: it doesn't matter. According to fsck, every symlink has a corrupted extent header, and fsck wants to remove it along with the inode. In the most recent tests I have been using the '-fn' flags to fsck so that nothing happens to the file system. I can then remount the file system and see all the files, including symlinks, from the client.

Should I push this to Livermore?

Comment by Joe Mervini [ 15/May/12 ]

The problem appears to be TOSS2 specific.

As an experiment I created a RHEL6.1 image and installed the WC release of lustre. Repeating the tests that I performed under TOSS did NOT produce the corrupt extent headers for linked files. I have a big concern that the problem is mkfs.lustre related.

I will be opening a bugzilla bug with Livermore.

Comment by Joe Mervini [ 15/May/12 ]

A little more data: No dirdata length errors with -fDy option on RHEL-6.1

[root@cmds1 osc]# e2fsck -fDy /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
e2fsck 1.41.90.wc4 (01-Sep-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information

scratch2-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
scratch2-MDT0000: 2910/1463654128 files (0.2% non-contiguous), 183179282/731824160 blocks

Comment by Andreas Dilger [ 15/May/12 ]

Is it possible that the TOSS version of mkfs.lustre is setting the "extents" feature for the MDT filesystem? For a test 2.x filesystem I have here (current git master and e2fsprogs-1.41.90.wc3-7.fc13.x86_64, but I don't think mkfs_lustre.c has changed recently) there is no "extents" feature enabled on the MDT filesystem:

# dumpe2fs -h /tmp/lustre-mdt1  | grep feature
dumpe2fs 1.41.90.wc3 (28-May-2011)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink
Journal features:         (none)

Having extents enabled is not useful for the MDT, and may even hurt performance because there is more metadata overhead for each block (it is rare that directory blocks are allocated contiguously on disk).
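To make the check above repeatable, here is a minimal sketch that pulls the "Filesystem features:" line out of saved `dumpe2fs -h` output and reports whether extents are enabled. The helper names `parse_features` and `mdt_has_extents` are hypothetical. Accepting both the "extent" and "extents" spellings is a precaution taken here, not something this thread confirms, and the second sample line is constructed for illustration.

```python
def parse_features(dumpe2fs_header: str) -> set:
    """Return the feature names listed on the 'Filesystem features:' line."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return set(value.split())
    return set()


def mdt_has_extents(dumpe2fs_header: str) -> bool:
    """True if the feature list contains the extents feature.

    Both spellings are checked as a precaution (an assumption of this sketch).
    """
    return bool(parse_features(dumpe2fs_header) & {"extent", "extents"})


# First sample: the feature line quoted in the comment above.
without_extents = (
    "Filesystem features:      has_journal ext_attr resize_inode dir_index "
    "filetype needs_recovery flex_bg dirdata sparse_super large_file "
    "huge_file uninit_bg dir_nlink\n"
    "Journal features:         (none)\n"
)
# Second sample: a hypothetical feature line with extents enabled.
with_extents = (
    "Filesystem features:      has_journal ext_attr dir_index filetype "
    "extent flex_bg sparse_super large_file\n"
)

print(mdt_has_extents(without_extents))
print(mdt_has_extents(with_extents))
```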

Comment by Andreas Dilger [ 15/May/12 ]

Even if the "extents" feature on the MDT is the root cause, this would still be a serious issue in the e2fsck code that needs to be addressed.

Comment by Joe Mervini [ 15/May/12 ]

Andreas - good call. I just checked the file system that is running with TOSS and here are the features described for the MDT:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

On the RHEL-6.1 created MDT these are the features:
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink

I did not intentionally set extents. So as a sanity check I re-created the MDT on the test machine. Even though extents is not included on the command line and does not appear in the options that mkfs.lustre prints in verbose mode, the subsequent dumpe2fs definitely shows it as being there.

[root@cmds1 ~]# mkfs.lustre --mgs --mdt --reformat --verbose --fsname=scratch2 --failnode=10.196.135.143@o2ib1 /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000

Permanent disk data:
Target: scratch2-MDTffff
Index: unassigned
Lustre FS: scratch2
Mount type: ldiskfs
Flags: 0x75
(MDT MGS needs_index first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: failover.node=10.196.135.143@o2ib1

device size = 2858688MB
formatting backing filesystem ldiskfs on /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
target name scratch2-MDTffff
4k blocks 731824160
options -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mkfs.ldiskfs -j -b 4096 -L scratch2-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 731824160
cmd: mkfs.ldiskfs -j -b 4096 -L scratch2-MDTffff -J size=400 -I 512 -i 2048 -O dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 731824160
mkfs.ldiskfs 1.41.90.3chaos.wc3 (28-May-2011)
Discarding device blocks: failed - Operation not supported
Filesystem label=scratch2-MDTffff
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1463654128 inodes, 731824160 blocks
36591208 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2880079872
44689 block groups
16376 blocks per group, 16376 fragments per group
32752 inodes per group
Superblock backups stored on blocks:
16376, 49128, 81880, 114632, 147384, 409400, 442152, 802424, 1326456,
2047000, 3979368, 5616968, 10235000, 11938104, 35814312, 39318776,
51175000, 107442936, 255875000, 275231432, 322328808

Allocating group tables: done
Writing inode tables: done
Creating journal (102400 blocks): done
Multiple mount protection has been enabled with update interval 5 seconds.
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 0 mounts or
0 days, whichever comes first. Use tunefs.ldiskfs -c or -i to override.
Writing CONFIGS/mountdata

[root@cmds1 ~]# dumpe2fs -h /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
dumpe2fs 1.41.12 (17-May-2010)
Filesystem volume name: scratch2-MDTffff
Last mounted on: /ram/tmp/mntBrCPMe
Filesystem UUID: 48d57c6d-8156-4eb7-bcf8-a298fc0f7af9
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1463654128
Block count: 731824160
Reserved block count: 36591208
Free blocks: 548645355
Free inodes: 1463654115
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1024
Blocks per group: 16376
Fragments per group: 16376
Inodes per group: 32752
Inode blocks per group: 4094
Flex block group size: 16
Filesystem created: Tue May 15 15:31:37 2012
Last mount time: Tue May 15 16:02:53 2012
Last write time: Tue May 15 16:02:55 2012
Mount count: 1
Maximum mount count: 20
Last checked: Tue May 15 15:31:37 2012
Check interval: 0 (<none>)
Lifetime writes: 698 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: f6500d19-ed37-48f6-a446-e8485a2f9edf
Journal backup: inode blocks
Journal features: (none)
Journal size: 400M
Journal length: 102400
Journal sequence: 0x00000005
Journal start: 0

Comment by Andreas Dilger [ 15/May/12 ]

Thinking about this further, I think I understand the root cause. The standard mkfs_lustre.c will call "mke2fs {lots of options}", which starts with an ext2 filesystem and enables the individual features needed to make the filesystem ext4. For the MDT filesystem, it does not turn on the "extents" feature, but it does for the OST.

In the TOSS ldiskfsprogs, I suspect that "mkfs.ldiskfs" starts with an "ext4" filesystem, and (re)sets the same options, but for the MDT it already has "extents" enabled.

It makes sense to explicitly disable the extents feature in mkfs_lustre.c for MDT filesystems, since they provide absolutely no benefit, and may instead be hurting performance. That is a simple matter of appending ",^extents" to the list of MDT features.

Comment by Joe Mervini [ 15/May/12 ]

All this being said, is there a way to back out the extent feature without reformatting the file system? As before I would prefer to deal with the pain now as opposed to down the road when there's a petabyte of data with no place to move it.

Comment by Andreas Dilger [ 15/May/12 ]

In the ldiskfs code (but not in upstream ext4) it is possible to mount a filesystem with the "noextents" mount option, so that new files/directories are not created with the extent flag set. This does not affect existing files/directories. There isn't really a mechanism to "migrate" such files to non-extent files without essentially a file-level backup/restore, but that will not currently work for 2.x MDT filesystems due to the Object Index becoming inconsistent. Only block-device level backup/restore is currently functional for 2.x MDT filesystems. The OI Scrub feature is nearing completion and will be landed for the 2.3 release, which will again allow file-level backup/restore for the MDT.

I think a combination of factors is required here, to avoid this problem for other filesystems:

  • explicitly disable "extents" for MDT filesystems in mkfs_lustre.c (should go into 2.1.x)
  • fix e2fsck so that it does not corrupt extent-mapped symlinks (this may already be fixed in newer e2fsprogs)
  • land the OI Scrub feature for 2.3 (this is likely too much of a "feature" for 2.1.x)

Comment by Joe Mervini [ 15/May/12 ]

I was able to reformat the MDT with mkfsoptions="-O ^extent" with the TOSS bits. The extent feature no longer shows up in the features reported by dumpe2fs, but there are FEATURE_I8 and FEATURE_I12 entries that I haven't found a reference for:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

So is it your opinion that we should start from scratch again while we have the chance?

Comment by Joe Mervini [ 15/May/12 ]

To be thorough I created the rest of the file system after reformatting the MDT and reran the symlink test. LLNL's fsck.ldiskfs -fy passed without errors.

Comment by Andreas Dilger [ 16/May/12 ]

FEATURE_I8 is "mmp" and FEATURE_I12 is "dirdata". These are not being printed by name because you are using the stock "debugfs" instead of "debugfs.ldiskfs" (or whatever the equivalent is), which doesn't know what these features are called. That is expected when using a separate ldiskfsprogs and leaving the stock e2fsprogs installed.
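For reading feature lines printed by the stock tools, a small translation step can substitute the names given above. This is a minimal sketch: the table holds only the two mappings stated in this comment (using the "dirdata" spelling found elsewhere in this thread), any other FEATURE_I token is left untouched, and the function names are hypothetical.

```python
import re

# Only the two mappings stated in this ticket; nothing else is assumed.
INCOMPAT_NAMES = {8: "mmp", 12: "dirdata"}


def translate_feature(token: str) -> str:
    """Replace FEATURE_I<n> with a known name, else return the token as-is."""
    match = re.fullmatch(r"FEATURE_I(\d+)", token)
    if match:
        return INCOMPAT_NAMES.get(int(match.group(1)), token)
    return token


def translate_feature_line(features: str) -> str:
    """Translate every whitespace-separated token in a feature list."""
    return " ".join(translate_feature(t) for t in features.split())


# Feature list as reported by the stock dumpe2fs earlier in this ticket.
stock_output = (
    "has_journal ext_attr resize_inode dir_index filetype FEATURE_I8 "
    "flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg "
    "dir_nlink extra_isize"
)
print(translate_feature_line(stock_output))
```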

The "fsck.ldiskfs -fDy" problem will still exist, even without the extents option, unless you apply the patch from http://review.whamcloud.com/2661.

Comment by Christopher Morrone [ 16/May/12 ]

Thinking about this further, I think I understand the root cause. The standard mkfs_lustre.c will call "mke2fs {lots of options}", which starts with an ext2 filesystem and enables the individual features needed to make the filesystem ext4. For the MDT filesystem, it does not turn on the "extents" feature, but it does for the OST.

In the TOSS ldiskfsprogs, I suspect that "mkfs.ldiskfs" starts with an "ext4" filesystem, and (re)sets the same options, but for the MDT it already has "extents" enabled.

I don't think that we are modifying mkfs.lustre. We just configure lustre "--with-ldiskfsprogs", but that code is entirely in the upstream lustre.

The ldiskfsprogs's mkfs.ldiskfs does not intentionally change the default filesystem type from ext2 to ext4. The patch that introduces the ldiskfsprogs changes is here:

http://review.whamcloud.com/2582

Comment by Christopher Morrone [ 16/May/12 ]

Ned pointed out to me that we are adding an "/etc/mkfs.ldiskfs.conf" file. Here is an excerpt:

[fs_types]
       ext3 = {
               features = has_journal
       }
       ldiskfs = {
               features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
               auto_64-bit_support = 1
               inode_size = 256
       }

Comment by Andreas Dilger [ 17/May/12 ]

So that explains why the "extent" option was set for the MDT filesystem. That said, with the patch in http://review.whamcloud.com/2798 it will explicitly unset the extents feature for the MDT filesystem to avoid this problem for new filesystems.
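The interaction can be illustrated with a simplified model: start from the feature list in the configuration file, then apply the comma-separated "-O" list, where a leading "^" clears a feature. This is a sketch of the reasoning only, not the actual mke2fs logic; it ignores built-in defaults, treats "extent" and "extents" as the same feature (an assumption), and the function names are hypothetical.

```python
def normalize(name: str) -> str:
    # Assumption of this sketch: "extents" is an alias for "extent".
    return "extent" if name == "extents" else name


def effective_features(conf_features: str, option_features: str) -> set:
    """Apply a '-O'-style feature list on top of a config-file feature list."""
    enabled = {normalize(f.strip()) for f in conf_features.split(",") if f.strip()}
    for item in option_features.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("^"):
            enabled.discard(normalize(item[1:]))  # "^feature" clears it
        else:
            enabled.add(normalize(item))
    return enabled


# "ldiskfs" features from the mkfs.ldiskfs.conf excerpt in this ticket.
conf = "has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize"
# MDT feature list passed by mkfs.lustre in the verbose output in this ticket.
mdt_opts = "dirdata,uninit_bg,mmp,dir_nlink,huge_file,flex_bg"

print("extent" in effective_features(conf, mdt_opts))
print("extent" in effective_features(conf, mdt_opts + ",^extents"))
```

In this model the MDT option list never mentions extents, so the value inherited from the configuration file survives unless the option list clears it explicitly.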

We still need to understand/address the extents symlink problem. I see commits related to symlinks with extents (below), but it isn't clear whether the problem only applies to short symlinks, or long symlinks as well? Given that there are reports of many symlinks being deleted, I would suspect that the problem is with fast symlinks, and somehow the MDT is setting the "EXTENTS_FL" for symlinks, when it shouldn't be doing that.

Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Mar 13 23:13:18 2008 -0400

    e2fsck: Check for fast symlinks that have EXTENTS_FL set
    
    These shouldn't show up in the wild, but if they do, e2fsck will offer
    to clear them.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit 7cadc57780f3e3e8e644e8976e11a336902d4a25
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Mar 13 23:05:00 2008 -0400

    e2fsck: Support long symlinks which use extents
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Comment by Joe Mervini [ 17/May/12 ]

Not to detour from the subject of this ticket, but could you explain the difference between fast, short and long symlinks? I wanted to keep my ignorance on the down-low by checking the web and with several people here, but no one seems to know.

Comment by Andreas Dilger [ 18/May/12 ]

Sorry, I wasn't really using my terms consistently. The fast symlinks are those stored directly in the inode, while slow symlinks are stored in an external block. These correspond to short and long symlinks (the boundary being at 60 bytes).

I think the issue may be that if the symlink is stored in the inode (fast symlink) but the EXTENTS flag is set, e2fsck incorrectly interprets the symlink text as extent data and considers the inode corrupt.

To test this theory, an MDT filesystem with extents enabled should get some symlinks created, then be mounted as ldiskfs and have lsattr run on the symlinks to see if the extent flag is set. Alternatively, debugfs "stat" can be used on the inodes to print the flags.
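As a rough illustration of the fast/slow distinction described in this comment, the sketch below classifies a symlink by the length of its target. The 60-byte boundary comes from the comment; treating a target of exactly 60 bytes as slow (on the assumption that the in-inode area also needs room for a terminating NUL) is an assumption of this sketch, and the names and sample targets are hypothetical.

```python
# Boundary stated in the comment above.
FAST_SYMLINK_LIMIT = 60


def is_fast_symlink(target: str) -> bool:
    """True if the target is short enough to be stored inside the inode.

    Assumption: a target of exactly FAST_SYMLINK_LIMIT bytes is treated as
    slow, leaving room for a terminating NUL.
    """
    return len(target.encode("utf-8")) < FAST_SYMLINK_LIMIT


short_target = "../bin/passwd"        # hypothetical short target
long_target = "../" + "d" * 80 + "/passwd"  # hypothetical long target

print(is_fast_symlink(short_target))
print(is_fast_symlink(long_target))
```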

Comment by Joe Mervini [ 18/May/12 ]

I ran the test (mostly out of curiosity and for my own understanding). When I ran lsattr against a linked file I got operation not supported:

[root@cmds1 bin2]# lsattr /mnt/ROOT/jamervi/bin/passwd
------------e /mnt/ROOT/jamervi/bin/passwd
[root@cmds1 bin2]# lsattr /mnt/ROOT/jamervi/bin2/passwd
lsattr: Operation not supported While reading flags on /mnt/ROOT/jamervi/bin2/passwd

But when I ran debugfs (and I really don't know how to interpret the output) it appears to me that there are no extents associated with the symlink. At least none are explicitly called out. Am I interpreting this correctly?

[root@cmds1 bin2]# debugfs /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
debugfs 1.41.12 (17-May-2010)
debugfs: cd /mnt/ROOT
/mnt/ROOT: File not found by ext2_lookup
debugfs: cd /mnt
/mnt: File not found by ext2_lookup
debugfs: cd ROOT
debugfs: cd jamervi
debugfs: cd bin
debugfs: stat passwd
Inode: 405079025 Type: regular Mode: 0755 Flags: 0x80000
Generation: 3906041213 Version: 0x00000001:000011dd
User: 0 Group: 0 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4fb68353:00000000 – Fri May 18 11:13:55 2012
atime: 0x4fb68353:00000000 – Fri May 18 11:13:55 2012
mtime: 0x4fb68353:00000000 – Fri May 18 11:13:55 2012
crtime: 0x4fb68353:ad6a06dc – Fri May 18 11:13:55 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 00 04 00 00 02 00 00 00 f0 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (64)
link = "df f1 ea 11 01 00 00 00 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 18 00 00 00 02 00 00 04 00 00 00 00 02 00 00 00 00 70 61 73 73 77 64 " (48)
lov = "d0 0b d1 0b 01 00 00 00 f0 08 00 00 00 00 00 00 00 04 00 00 02 00 00 00 00 00 10 00 01 00 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2e 00 00 00 " (56)
EXTENTS:
debugfs: cd ../bin2
debugfs: stat passwd
Inode: 405082862 Type: symlink Mode: 0777 Flags: 0x80000
Generation: 3906045050 Version: 0x00000001:000026ee
User: 0 Group: 0 Size: 13
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4fb68377:00000000 – Fri May 18 11:14:31 2012
atime: 0x4fb683cb:1cb0d110 – Fri May 18 11:15:55 2012
mtime: 0x4fb68377:00000000 – Fri May 18 11:14:31 2012
crtime: 0x4fb68377:6c55b22c – Fri May 18 11:14:31 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 00 04 00 00 02 00 00 00 ed 17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (64)
link = "df f1 ea 11 01 00 00 00 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 18 00 00 00 02 00 00 04 00 00 00 00 04 00 00 00 00 70 61 73 73 77 64 " (48)
Fast_link_dest: ../bin/passwd

Comment by Andreas Dilger [ 18/May/12 ]

The "Flags: 0x80000" line maps to EXT4_EXTENTS_FL, so in fact it seems this is being set/inherited incorrectly on the MDT fast symlinks. Note "Fast_link_dest: ../bin/passwd" indicates that the symlink is indeed stored inside the inode.

My first guess is a defect in the osd-ldiskfs code that is unconditionally setting LDISKFS_EXTENTS_FL on all inodes, when this should only be set on regular files.
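The inconsistent combination can be spotted mechanically in debugfs "stat" output. The following is a minimal sketch that flags an inode reported as a symlink with a "Fast_link_dest:" line whose Flags value has the 0x80000 bit set. The field layout is assumed from the output pasted in this ticket, the function name is hypothetical, and the samples are abbreviated copies of that output.

```python
import re

# Flag value that the comment above maps to EXT4_EXTENTS_FL.
EXTENTS_FL = 0x80000


def fast_symlink_has_extents_flag(stat_output: str) -> bool:
    """True if debugfs 'stat' text shows a fast symlink with the extents flag."""
    type_match = re.search(r"Type:\s*(\S+)", stat_output)
    flags_match = re.search(r"Flags:\s*(0x[0-9a-fA-F]+)", stat_output)
    if not type_match or not flags_match:
        return False
    is_symlink = type_match.group(1) == "symlink"
    is_fast = "Fast_link_dest:" in stat_output  # target stored in the inode
    flags = int(flags_match.group(1), 16)
    return is_symlink and is_fast and bool(flags & EXTENTS_FL)


# Abbreviated from the debugfs output earlier in this ticket.
symlink_stat = """\
Inode: 405082862 Type: symlink Mode: 0777 Flags: 0x80000
User: 0 Group: 0 Size: 13
Fast_link_dest: ../bin/passwd
"""
regular_stat = """\
Inode: 405079025 Type: regular Mode: 0755 Flags: 0x80000
User: 0 Group: 0 Size: 0
EXTENTS:
"""

print(fast_symlink_has_extents_flag(symlink_stat))
print(fast_symlink_has_extents_flag(regular_stat))
```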

Comment by Andreas Dilger [ 05/Jun/12 ]

The e2fsck fix for this is included into the rebased e2fsprogs-1.42.3.wc1 build, currently undergoing testing.

Comment by Christopher Morrone [ 05/Jun/12 ]

I see the v1.42.3-lustre branch, but not the 1.42.3.wc1 tag.

Comment by Andreas Dilger [ 05/Jun/12 ]

The v1.42.3.wc1 tag is on the master-lustre branch.

Comment by Christopher Morrone [ 05/Jun/12 ]

Whoops, I needed an explicit "fetch --tags". Must have that remote configured wrong.

Comment by Christopher Morrone [ 07/Jun/12 ]

Ah, I see what happened, the v1.42.3.wc1 tag is actually a different commit than the commit on master-lustre.

* 9a5ba10 (tag: v1.42.3.wc1) e2fsck: allow checking on mounted root filesystem
| * f7a92f9 (wc/master-lustre) e2fsck: allow checking on mounted root filesystem
|/  

You might want to just force-update master-lustre to be the commit that v1.42.3.wc1 tags. It looks like the only difference is the addition of the gerrit commit ID in the commit message in the tagged one.

So where does this leave us? Do we still think that something in osd-ldiskfs or somewhere else in lustre needs fixing, or do we now believe that e2fsck is entirely to blame?

Comment by Andreas Dilger [ 07/Jun/12 ]

The e2fsck bug that breaks dirdata with "-fD" is fixed in 1.42.3.wc1. The mkfs_lustre.c code now also explicitly disables extents (in b2_1 and master), which will avoid this problem for new filesystems in the future.

What still appears to need fixing is the use of the EXT4_EXTENTS_FL on short symlinks in the osd-ldiskfs code. This would need a special conf-sanity.sh test that tries to format the MDT with extents enabled, since we don't do that by default (specifying '--mkfsoptions="-O extents"' would override the "^extents" option specified internal to mkfs_lustre.c).

Comment by James A Simmons [ 16/Aug/16 ]

Old ticket for unsupported version
