[LU-1366] getting "dirdata length set incorrectly" running e2fsck Created: 03/May/12 Updated: 16/Aug/16 Resolved: 16/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.1.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Joe Mervini | Assignee: | Zhenyu Xu |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
DDN SFA10k - Dell R710 - TOSS2.0 OS release |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Sub-Tasks: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 4619 | ||||||||||||
| Description |
|
After adding a network to the file system and adding the IP for the failover node to the MDS, it wouldn't mount. (I later found that --param failnode= is no longer valid, much to my chagrin.) I attempted to run fsck against the file system, but it responded that e2fsprogs was out of date for the file system, so I ran fsck.ldiskfs. The fsck.ldiskfs run found some bad inodes and corrected them, but on a subsequent run with the -n option (done to make sure it was clean) I started seeing a flood of "dirdata length set incorrectly" messages. I stopped it and was able to mount the FS, but later the FS spontaneously unmounted. What does this mean? Fortunately this file system is in pre-production and can be recreated (which is intended), but I'd like to know if this was caused by running fsck.ldiskfs, since I did not see these messages on the first pass. The version of e2fsprogs (non-Redhat) is ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64. I have downloaded the wc4 version from the WC repo and installed it into a test image, into which I have rebooted the node. I was able to use e2fsck to check the FS, and I am using the -fDy options, but the "dirdata length set incorrectly" message continues to stream and has been going for more than an hour. Any help would be appreciated. |
| Comments |
| Comment by Andreas Dilger [ 03/May/12 ] |
|
The ldiskfsprogs package is from LLNL I think, but I assume it matches our e2fsprogs-1.41.90.wc3 version. It is possible that running "e2fsck -fDy" (in particular the "-D" option, which is trying to compress and optimize the htree directory structure) is having some kind of bad interaction with the dirdata feature (which is storing the Lustre FID after each filename in the directory entry). The "dirdata" feature was added for Lustre 2.x, and is not currently present in the upstream e2fsprogs. Unfortunately, the "dirdata length set incorrectly" message doesn't quite report enough information about how or why it thinks the length is bad. It seems that this problem is easily reproduced by running e2fsck with the "-D" option, but not in the case of a normal e2fsck run:
[root@sookie lustre-head]# e2fsck -fy /tmp/lustre-mdt1
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
[root@sookie lustre-head]# e2fsck -fn /tmp/lustre-mdt1
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: 128/100000 files (19.5% non-contiguous), 19545/50000 blocks
[root@sookie lustre-head]# e2fsck -fDy /tmp/lustre-mdt1
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
lustre-MDT0000: 128/100000 files (19.5% non-contiguous), 19542/50000 blocks
[root@sookie lustre-head]# e2fsck -fn /tmp/lustre-mdt1
e2fsck 1.41.90.wc3 (28-May-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '.' in /PENDING (25002) dirdata length set incorrectly. Clear? no
Entry '.' in /PENDING (25002) dirdata length set incorrectly. Clear? no
Entry '..' in /PENDING (25002) dirdata length set incorrectly. Clear? no
Entry '..' in /PENDING (25002) dirdata length set incorrectly. Clear? no
Entry '.' in /ROOT (25003) dirdata length set incorrectly. Clear? no
Entry '.' in /ROOT (25003) dirdata length set incorrectly. Clear? no
Entry '.' in /ROOT (25003) dirdata length set incorrectly. Clear? no
Entry '..' in /ROOT (25003) dirdata length set incorrectly. Clear? no
Entry '..' in /ROOT (25003) dirdata length set incorrectly. Clear? no
Entry '.' in /ROOT/.lustre (25004) dirdata length set incorrectly. Clear? no
Entry '.' in /ROOT/.lustre (25004) dirdata length set incorrectly. Clear? no
Entry '..' in /ROOT/.lustre (25004) dirdata length set incorrectly. Clear? no
Entry '..' in /ROOT/.lustre (25004) dirdata length set incorrectly. Clear? no
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
lustre-MDT0000: 128/100000 files (18.8% non-contiguous), 19542/50000 blocks
Fortunately, the Lustre FID information stored in the extended directory entry data is not required for proper operation (it is a performance optimization, and the FID will be retrieved from an inode extended attribute if not in the directory entry), but it isn't clear if the "e2fsck -D" run has left the filesystem in some inconsistent state that would confuse the MDS code. In my testing, even running "e2fsck -fy" repeatedly does not fix the problem. |
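When the message streams for an hour, as in the original report, it can help to summarize which directories are affected rather than read the raw log. The following is a minimal sketch, assuming the e2fsck output was saved as text and that the message format matches the transcript above; the function name `count_dirdata_errors` is hypothetical and not part of e2fsprogs.

```python
import re
from collections import Counter

# Matches the Pass 2 message as it appears in the transcript above, e.g.
#   Entry '..' in /ROOT (25003) dirdata length set incorrectly. Clear? no
DIRDATA_MSG = re.compile(
    r"Entry '(?P<name>[^']*)' in (?P<dir>.+?) \((?P<inode>\d+)\) "
    r"dirdata length set incorrectly"
)


def count_dirdata_errors(e2fsck_output):
    """Count 'dirdata length set incorrectly' reports per (directory, inode)."""
    counts = Counter()
    for match in DIRDATA_MSG.finditer(e2fsck_output):
        counts[(match.group("dir"), int(match.group("inode")))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Entry '.' in /PENDING (25002) dirdata length set incorrectly. Clear? no\n"
        "Entry '..' in /PENDING (25002) dirdata length set incorrectly. Clear? no\n"
        "Entry '.' in /ROOT/.lustre (25004) dirdata length set incorrectly. Clear? no\n"
    )
    for (directory, inode), n in sorted(count_dirdata_errors(sample).items()):
        print(directory, inode, n)
```

Because the pattern is applied with `finditer`, it also works on logs where the messages have been run together on one line.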
| Comment by Peter Jones [ 03/May/12 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Joe Mervini [ 03/May/12 ] |
|
I had the same thought about the -D option and ran the newer version of e2fsck with the -fy option. It did clean up a lot of the errors, but some still remained in a couple of directories, consistent with your last comment. I reran e2fsck several more times, and it completed in a fairly timely manner. As I mentioned earlier, this file system is pre-production and was built on lustre 2.0.something from an alpha release of the TOSS OS. I am in the process of backing up the data that people want to keep and then am going to rebuild the file system with the current release code. But I will take care to NOT use the optimize directory option to e2fsck on the MDT from now on. |
| Comment by Christopher Morrone [ 03/May/12 ] |
|
It would have been lustre 2.1.x. We skipped over 2.0.x. And yes, ldiskfsprogs-1.41.90.3chaos.wc3-0.ch5.x86_64 is from LLNL. We recently shared the patches to make ldiskfsprogs in LU-929. |
| Comment by Zhenyu Xu [ 07/May/12 ] |
|
Patch tracking at http://review.whamcloud.com/2661. I tested it locally: with the patch, "e2fsck -fD" doesn't store dirents with erroneous dirdata. |
| Comment by Joe Mervini [ 09/May/12 ] |
|
Just wanted to provide some information regarding potential fallout from the e2fsck prior to the patch: I ran the fsck with the -y option, which attempted to correct the dirdata problem. After backing up and restoring all the user data, it appears that all symlinks that were previously on the file system are gone. Is it reasonable to assume the dirdata errors (there were hundreds) were referencing the symlinks, since they are metadata only? All the real data seems to be intact. |
| Comment by Joe Mervini [ 09/May/12 ] |
|
I wanted to test the assumption, so on a like system I created a fresh lustre file system, created a directory on the newly created file system, and populated the directory with the content of /usr/bin. I then created a second directory and symlinked all the files in the first directory to the second. I then stopped lustre on the MDS and first ran e2fsck -n on the MDT, which came back clean. However, when I ran fsck -fy on the file system (I intentionally avoided the directory optimization flag to start), the fsck complained about corrupt extent headers on all of those symlinked files and removed them. I restarted lustre to verify that was indeed what happened. I repeated this test a number of times, so it is highly reproducible. In addition, on the final test, after the initial e2fsck -fy was run, I ran e2fsck -fDy and it finished cleanly (i.e., no dirdata length errors). This is definitely a bug that should be escalated. Fortunately it was discovered preproduction in our case, but as is, it is extremely dangerous to run a file system check. I have already attached the output from e2fsck for examination. |
| Comment by Zhenyu Xu [ 09/May/12 ] |
|
What e2fsck version do you use? I tried my e2fsck and didn't hit the extent tree read error issue.
# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc 187464 24796 152668 14% /mnt/ost1
/dev/sdd 187464 24768 152696 14% /mnt/ost2
test3@tcp:/lustre 374928 49564 305236 14% /mnt/lustre
/dev/sdb 149944 17388 122556 13% /mnt/mds1
# /bin/cp /bin/login /mnt/lustre && mkdir -p /mnt/lustre/dir && ln -f /mnt/lustre/login /mnt/lustre/dir/link && ln -sf /mnt/lustre/login /mnt/lustre/dir/sym
# umount /mnt/mds1
# e2fsck -n /dev/sdb
e2fsck 1.41.90.wc4 (01-Sep-2011)
lustre-MDT0000: clean, 106/100000 files, 16861/50000 blocks
# e2fsck -fy /dev/sdb
e2fsck 1.41.90.wc4 (01-Sep-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: 106/100000 files (2.8% non-contiguous), 16861/50000 blocks |
| Comment by Joe Mervini [ 09/May/12 ] |
|
Sorry - I thought I had included that info. In this instance we are running:
[root@cmds1 ~]# rpm -qa |grep e2fs
The version of lustre (mkfs.lustre) is: lustre-2.1.1-2chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64
The storage unit we are using is a DDN-3015 configured as a RAID10 array, fiber channel dual attached to 2 Dell R710 servers. Have you heard of any issues with a similar configuration? The only thing that I see that may be different between our two processes is that your link directory is a child of the main directory. In my test I had both directories at the same hierarchical level. I will test again with links in a child directory in the morning and report the results. |
| Comment by Joe Mervini [ 14/May/12 ] |
|
I continue to see errors similar to the output that I sent last week on my file system. If I create a directory and populate it, then create another directory and symlink the files from the first directory to the second, and then run fsck.ldiskfs against the MDT, I get errors like this for every link:
[root@cmds1 osc]# fsck.ldiskfs -fn /dev/sda >> /tmp/fsck.out
I have run this against another diskless image and have been able to duplicate this multiple times. I thought that perhaps dm-multipath could be a contributor, so I flushed the maps and ran fsck.ldiskfs against the sd device, with the same results. In this instance here are the particulars of the lustre software environment:
[root@cmds1 osc]# rpm -qa|grep kernel
[root@cmds1 osc]# rpm -qa |grep lustre
[root@cmds1 osc]# rpm -qa |grep ldiskfs
[root@cmds1 osc]# rpm -qa |grep e2fs
WRT the location of the link directory, that doesn't matter. Any symlink has a corrupted extent header according to fsck, which wants to remove it along with the inode. In the most current tests I have been using the '-fn' flags to fsck so that nothing happens to the file system. I can then remount the file system and see all the files, including symlinks, from the client. Should I push this to Livermore? |
| Comment by Joe Mervini [ 15/May/12 ] |
|
The problem appears to be TOSS2 specific. As an experiment I created a RHEL6.1 image and installed the WC release of lustre. Repeating the tests that I performed under TOSS did NOT produce the corrupt extent headers for linked files. I have a big concern that the problem is mkfs.lustre related. I will be opening a bugzilla bug with Livermore. |
| Comment by Joe Mervini [ 15/May/12 ] |
|
A little more data: No dirdata length errors with -fDy option on RHEL-6.1
[root@cmds1 osc]# e2fsck -fDy /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
scratch2-MDT0000: ***** FILE SYSTEM WAS MODIFIED ***** |
| Comment by Andreas Dilger [ 15/May/12 ] |
|
Is it possible that the TOSS version of mkfs.lustre is setting the "extents" feature for the MDT filesystem? For a test 2.x filesystem I have here (current git master and e2fsprogs-1.41.90.wc3-7.fc13.x86_64, but I don't think mkfs_lustre.c has changed recently) there is no "extents" feature enabled on the MDT filesystem:
# dumpe2fs -h /tmp/lustre-mdt1 | grep feature
dumpe2fs 1.41.90.wc3 (28-May-2011)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink
Journal features: (none)
Having extents enabled is not useful for the MDT, and may even hurt performance because there is more metadata overhead for each block (it is rare that directory blocks are allocated contiguously on disk). |
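The check described above can be scripted when several MDTs need auditing. This is a minimal sketch, assuming the text of `dumpe2fs -h` has already been captured into a string; `parse_features` and `has_extents` are hypothetical helper names, and accepting both spellings "extent" and "extents" is a precaution on my part, not something the tools are documented to require.

```python
def parse_features(dumpe2fs_header):
    """Return the set of names on the 'Filesystem features:' line, if any."""
    for line in dumpe2fs_header.splitlines():
        if line.startswith("Filesystem features:"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_extents(features):
    """True if the extents feature appears in the parsed feature set."""
    # dumpe2fs output quoted in this ticket spells the feature "extent";
    # "extents" is accepted as well, as an assumption.
    return bool(features & {"extent", "extents"})


if __name__ == "__main__":
    header = (
        "dumpe2fs 1.41.90.wc3 (28-May-2011)\n"
        "Filesystem features: has_journal ext_attr resize_inode dir_index "
        "filetype needs_recovery flex_bg dirdata sparse_super large_file "
        "huge_file uninit_bg dir_nlink\n"
        "Journal features: (none)\n"
    )
    print("extents enabled:", has_extents(parse_features(header)))
```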
| Comment by Andreas Dilger [ 15/May/12 ] |
|
Even if the "extents" feature on the MDT is the root cause, this would still be a serious issue in the e2fsck code that needs to be addressed. |
| Comment by Joe Mervini [ 15/May/12 ] |
|
Andreas - good call. I just checked the file system that is running with TOSS and here are the features described for the MDT:
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file
On the RHEL-6.1 created MDT these are the features:
I did not intentionally set extents. So as a sanity check I re-created the MDT on the test machine. Even though it is not included in the command line and does not appear in the options for mkfs.lustre in verbose mode, the subsequent dumpe2fs definitely shows it as being there.
[root@cmds1 ~]# mkfs.lustre --mgs --mdt --reformat --verbose --fsname=scratch2 --failnode=10.196.135.143@o2ib1 /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000
Permanent disk data:
device size = 2858688MB
Allocating group tables: done
This filesystem will be automatically checked every 0 mounts or
[root@cmds1 ~]# dumpe2fs -h /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 |
| Comment by Andreas Dilger [ 15/May/12 ] |
|
Thinking about this further, I think I understand the root cause. The standard mkfs_lustre.c will call "mke2fs {lots of options}", which starts with an ext2 filesystem and enables the individual features needed to make the filesystem ext4. For the MDT filesystem, it does not turn on the "extents" feature, but it does for the OST. In the TOSS ldiskfsprogs, I suspect that "mkfs.ldiskfs" starts with an "ext4" filesystem, and (re)sets the same options, but for the MDT it already has "extents" enabled. It makes sense to explicitly disable the extents feature in mkfs_lustre.c for MDT filesystems, since they provide absolutely no benefit, and may instead be hurting performance. That is a simple matter of appending ",^extents" to the list of MDT features. |
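The proposed change amounts to adding one more entry to the comma-separated feature list passed to mke2fs when the target is an MDT. The sketch below illustrates that logic only; it is written in Python rather than the C of mkfs_lustre.c, the function name `mke2fs_feature_arg` is hypothetical, and the base feature list is an illustrative subset taken from the dumpe2fs output quoted earlier, not the actual list mkfs_lustre.c uses.

```python
def mke2fs_feature_arg(base_features, is_mdt):
    """Build the value for 'mke2fs -O', clearing extents on MDT targets."""
    features = list(base_features)
    # A leading '^' asks mke2fs to clear a feature instead of setting it.
    if is_mdt and "^extents" not in features:
        features.append("^extents")
    return ",".join(features)


if __name__ == "__main__":
    base = ["dirdata", "uninit_bg", "dir_nlink"]  # illustrative subset only
    print("MDT:", mke2fs_feature_arg(base, is_mdt=True))
    print("OST:", mke2fs_feature_arg(base, is_mdt=False))
```

Placing "^extents" at the end of the list is a deliberate choice in this sketch, so that it is the last word on that feature regardless of what the defaults enabled.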
| Comment by Joe Mervini [ 15/May/12 ] |
|
All this being said, is there a way to back out the extent feature without reformatting the file system? As before I would prefer to deal with the pain now as opposed to down the road when there's a petabyte of data with no place to move it. |
| Comment by Andreas Dilger [ 15/May/12 ] |
|
In the ldiskfs code (but not in upstream ext4) it is possible to mount a filesystem with the "noextents" mount option, so that new files/directories are not created with the extent flag set. This does not affect existing files/directories. There isn't really a mechanism to "migrate" such files to non-extent files without essentially a file-level backup/restore, but that will not currently work for 2.x MDT filesystems due to the Object Index becoming inconsistent. Only block-device level backup/restore is currently functional for 2.x MDT filesystems. The OI Scrub feature is nearing completion and will be landed for the 2.3 release, which will again allow file-level backup/restore for the MDT. I think a combination of factors is required here, to avoid this problem for other filesystems:
|
| Comment by Joe Mervini [ 15/May/12 ] |
|
I was able to reformat the MDT with mkfsoptions="-O ^extent" with the TOSS bits. It doesn't show up in the features of dumpe2fs, but there are FEATURE_I8 and FEATURE_I12, which I haven't found a reference for:
Filesystem features: has_journal ext_attr resize_inode dir_index filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
So is it your opinion that we should start from scratch again while we have the chance? |
| Comment by Joe Mervini [ 15/May/12 ] |
|
To be thorough, I created the rest of the file system after reformatting the MDT and reran the symlink test. LLNL's fsck.ldiskfs -fy passed without errors. |
| Comment by Andreas Dilger [ 16/May/12 ] |
|
FEATURE_I8 is "mmp" and FEATURE_I12 is "dir_data". These are not being printed because you are using the stock "debugfs" instead of "debugfs.ldiskfs" (or whatever the equivalent is), which doesn't know what these features are called. That is expected when using a separate ldiskfsprogs and leaving the stock e2fsprogs installed. The "fsck.ldiskfs -fDy" problem will still exist, even without the extents option, unless you apply the patch from http://review.whamcloud.com/2661. |
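For reading feature lists produced by a stock e2fsprogs, the two placeholders named above can be translated back mechanically. This is a minimal sketch; `decode_feature` and the `KNOWN_PLACEHOLDERS` table are hypothetical, the table contains only the two mappings given in this comment, and the spelling "dirdata" follows the dumpe2fs output quoted earlier in the ticket (the comment above writes it as "dir_data"). Reading the letter as the feature class (C, I, R) and the number as a bit position is my assumption about the placeholder format.

```python
import re

# Only the two mappings stated in this ticket; anything else is left untouched.
KNOWN_PLACEHOLDERS = {
    ("I", 8): "mmp",
    ("I", 12): "dirdata",
}

PLACEHOLDER = re.compile(r"FEATURE_([CIR])(\d+)")


def decode_feature(name):
    """Translate a FEATURE_<class><bit> placeholder if its meaning is known."""
    match = PLACEHOLDER.fullmatch(name)
    if match is None:
        return name
    key = (match.group(1), int(match.group(2)))
    return KNOWN_PLACEHOLDERS.get(key, name)


def decode_feature_line(features):
    """Decode every name in a space-separated feature list."""
    return " ".join(decode_feature(f) for f in features.split())


if __name__ == "__main__":
    line = "has_journal filetype FEATURE_I8 flex_bg FEATURE_I12 sparse_super"
    print(decode_feature_line(line))
```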
| Comment by Christopher Morrone [ 16/May/12 ] |
I don't think that we are modifying mkfs.lustre. We just configure lustre "--with-ldiskfsprogs", but that code is entirely in the upstream lustre. The ldiskfsprogs's mkfs.ldiskfs does not intentionally change the default filesystem type from ext2 to ext4. The patch that introduces the ldiskfsprogs changes is here: |
| Comment by Christopher Morrone [ 16/May/12 ] |
|
Ned pointed out to me that we are adding an "/etc/mkfs.ldiskfs.conf" file. Here is an excerpt:
[fs_types]
ext3 = {
features = has_journal
}
ldiskfs = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
auto_64-bit_support = 1
inode_size = 256
}
|
| Comment by Andreas Dilger [ 17/May/12 ] |
|
So that explains why the "extent" option was set for the MDT filesystem. That said, with the patch in http://review.whamcloud.com/2798 it will explicitly unset the extents feature for the MDT filesystem to avoid this problem for new filesystems. We still need to understand/address the extents symlink problem. I see commits related to symlinks with extents (below), but it isn't clear whether the problem only applies to short symlinks, or long symlinks as well? Given that there are reports of many symlinks being deleted, I would suspect that the problem is with fast symlinks, and somehow the MDT is setting the "EXTENTS_FL" for symlinks, when it shouldn't be doing that.
Author: Theodore Ts'o <tytso@mit.edu>
Date: Thu Mar 13 23:13:18 2008 -0400
e2fsck: Check for fast symlinks that have EXTENTS_FL set
These shouldn't show up in the wild, but if they do, e2fsck will offer
to clear them.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
commit 7cadc57780f3e3e8e644e8976e11a336902d4a25
Author: Theodore Ts'o <tytso@mit.edu>
Date: Thu Mar 13 23:05:00 2008 -0400
e2fsck: Support long symlinks which use extents
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
| Comment by Joe Mervini [ 17/May/12 ] |
|
Not to detour from the subject of this ticket, but could you explain the difference between fast, short and long symlinks? I wanted to keep my ignorance on the down-low by checking the web and with several people here, but no one seems to know. |
| Comment by Andreas Dilger [ 18/May/12 ] |
|
Sorry, I wasn't really using my terms consistently. The fast symlinks are those stored directly in the inode, while slow symlinks are stored in an external block. These correspond to short and long symlinks (the boundary being at 60 bytes). I think the issue may be that if the symlink is stored in the inode (fast symlink) but the EXTENTS flag is set, e2fsck may incorrectly interpret the symlink text as extent data and consider this a corrupt inode. To test this theory, an MDT filesystem with extents enabled should get some symlinks created, then be mounted as ldiskfs and lsattr run on the symlinks to see if the extent flag is set. Alternately, debugfs "stat" can be used on the inodes to print the flags. |
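The fast/slow distinction can be stated as a simple length test on the link target. This is a minimal sketch, assuming the 60-byte boundary quoted above and assuming that a target strictly shorter than the boundary is stored in the inode; the exact off-by-one behaviour at the boundary is not confirmed by this ticket, and `is_fast_symlink` is a hypothetical name.

```python
FAST_SYMLINK_LIMIT = 60  # bytes; the boundary quoted in the comment above


def is_fast_symlink(target, limit=FAST_SYMLINK_LIMIT):
    """Guess whether a symlink target would be stored inside the inode.

    Assumes targets strictly shorter than `limit` bytes are fast symlinks.
    """
    return len(target.encode("utf-8")) < limit


if __name__ == "__main__":
    for target in ("../bin/passwd", "/a/very/long/" + "x" * 80):
        kind = "fast (in inode)" if is_fast_symlink(target) else "slow (external block)"
        print(len(target), kind)
```

Under this assumption, the links created in the /usr/bin test described earlier would have short relative targets and fall on the fast side, which fits the suspicion that fast symlinks are the ones affected.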
| Comment by Joe Mervini [ 18/May/12 ] |
|
I ran the test (mostly out of curiosity and for my own understanding). When I ran lsattr against a linked file I got operation not supported:
[root@cmds1 bin2]# lsattr /mnt/ROOT/jamervi/bin/passwd
But when I ran debugfs (and I really don't know how to interpret the output) it appears to me that there are no extents associated with the symlink. At least none are explicitly called out. Am I interpreting this correctly?
[root@cmds1 bin2]# debugfs /dev/mapper/3600c0ff00011bdb4b12c0b4f01000000 |
| Comment by Andreas Dilger [ 18/May/12 ] |
|
The "Flags: 0x80000" line maps to EXT4_EXTENTS_FL, so in fact it seems this is being set/inherited incorrectly on the MDT fast symlinks. Note "Fast_link_dest: ../bin/passwd" indicates that the symlink is indeed stored inside the inode. My first guess is a defect in the osd-ldiskfs code that is unconditionally setting LDISKFS_EXTENTS_FL on all inodes, when this should only be set on regular files. |
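The interpretation above can be reproduced by testing one bit of the inode flags. This is a minimal sketch, assuming debugfs "stat" text that contains a "Flags: 0x..." field and, for in-inode symlinks, a "Fast_link_dest:" field, as quoted in this comment; the helper names are hypothetical, and the sample text in the usage section is constructed from those two quoted fields rather than copied from a real run.

```python
import re

EXT4_EXTENTS_FL = 0x80000  # value given in the comment above


def parse_flags(stat_output):
    """Extract the inode flags from debugfs 'stat' text, or None if absent."""
    match = re.search(r"Flags:\s*(0x[0-9a-fA-F]+)", stat_output)
    return int(match.group(1), 16) if match else None


def fast_symlink_has_extents_flag(stat_output):
    """True when an in-inode (fast) symlink also carries the extents flag."""
    flags = parse_flags(stat_output)
    if flags is None or "Fast_link_dest:" not in stat_output:
        return False
    return bool(flags & EXT4_EXTENTS_FL)


if __name__ == "__main__":
    # Constructed sample: only the two fields quoted in the ticket.
    sample = "Flags: 0x80000\nFast_link_dest: ../bin/passwd\n"
    print(fast_symlink_has_extents_flag(sample))
```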
| Comment by Andreas Dilger [ 05/Jun/12 ] |
|
The e2fsck fix for this is included in the rebased e2fsprogs-1.42.3.wc1 build, currently undergoing testing. |
| Comment by Christopher Morrone [ 05/Jun/12 ] |
|
I see the v1.42.3-lustre branch, but not the 1.42.3.wc1 tag. |
| Comment by Andreas Dilger [ 05/Jun/12 ] |
|
The v1.42.3.wc1 tag is on the master-lustre branch. |
| Comment by Christopher Morrone [ 05/Jun/12 ] |
|
Whoops, I needed an explicit "fetch --tags". Must have that remote configured wrong. |
| Comment by Christopher Morrone [ 07/Jun/12 ] |
|
Ah, I see what happened: the v1.42.3.wc1 tag is actually a different commit than the commit on master-lustre.
* 9a5ba10 (tag: v1.42.3.wc1) e2fsck: allow checking on mounted root filesystem
| * f7a92f9 (wc/master-lustre) e2fsck: allow checking on mounted root filesystem
|/
You might want to just force-update master-lustre to be the commit that v1.42.3.wc1 tags. It looks like the only difference is the addition of the gerrit commit ID in the commit message in the tagged one. So where does this leave us? Do we still think that something in osd-ldiskfs or somewhere else in lustre needs fixing, or do we now believe that e2fsck is entirely to blame? |
| Comment by Andreas Dilger [ 07/Jun/12 ] |
|
The e2fsck bug that broke dirdata with "-fD" is fixed in 1.42.3.wc1. The mkfs_lustre.c code now also explicitly disables extents (in b2_1 and master), which will avoid this problem for new filesystems in the future. What still appears to need fixing is the use of the EXT4_EXTENTS_FL on short symlinks in the osd-ldiskfs code. This would need a special conf-sanity.sh test that tries to format the MDT with extents enabled, since we don't do that by default (specifying '--mkfsoptions="-O extents"' would override the "^extents" option specified internal to mkfs_lustre.c). |
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Old ticket for unsupported version |