/bin/ls gets Input/output error (LU-2627)

[LU-2638] corruption of MDT ".." entry in some ldiskfs directories Created: 17/Jan/13  Updated: 04/May/15  Resolved: 21/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.3, Lustre 2.1.5
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5

Type: Technical task Priority: Blocker
Reporter: Andreas Dilger Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: LB
Environment:

The MDT is formatted/modified to have the "extents" feature enabled, but I'm not sure if that is relevant for this part of the problem.


Attachments: Text File 0001-LU-2638-osd-ldiskfs-clean-up-insertion-of-.-FID.patch    
Rank (Obsolete): 6170

 Description   

It appears that there is a systematic corruption of the ".." entry in the directory, possibly only affecting htree directories when the "dirdata" feature is enabled.

e2fsck 1.42.6.wc2 (10-Dec-2012)
nbp1-MDT0000 has been mounted 121 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
[...snip...]
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Second entry 'Header' (inode=40967687) in directory inode 40967686 should be '..'
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Second entry 'Header' (inode=40971040) in directory inode 40971039 should be '..'
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Second entry 'Header' (inode=44588784) in directory inode 44588782 should be '..'
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is duplicate '..' entry.
Fix? no
Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Second entry 'Header' (inode=47195750) in directory inode 47195749 should be '..'
Fix? no
[...snip...]
Pass 3: Checking directory connectivity
'..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Fix? no
'..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Fix? no
'..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Fix? no
'..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
Fix? no
[...snip...]


 Comments   
Comment by Andreas Dilger [ 17/Jan/13 ]

I'm not sure of the root cause of this corruption yet, though it might relate to renaming a directory in an upgraded 1.8 filesystem. Fan Yong has a patch for that in http://review.whamcloud.com/4899, but I'm not yet sure if it is the same problem.

Comment by nasf (Inactive) [ 17/Jan/13 ]

There is a bug in old Lustre-2.x: when the ".." is re-insert for rename, it does not check whether there are enough space to hold FID-in-dirent, if no, then it will crash the whole directory. Such issue has been fixed in the patch http://review.whamcloud.com/4899.

Comment by Andreas Dilger [ 21/Jan/13 ]

I'm looking at this also in the context of the changes in http://review.whamcloud.com/#patch,sidebyside,5134,2,lustre/osd-ldiskfs/osd_handler.c and also LU-838.

Looking at this code, it is actually quite twisty:

  • osd_ea_add_rec() calls osd_add_dot_dotdot() for inserting "." and ".."
    • osd_add_dot_dotdot()
      • adding "." only sets a flag that "." was "inserted"
      • adding ".." fails if "." was not "inserted" first
      • adding ".." gets FID-in-dirent (ldp) if not IGIF FID, though I can't imagine when a newly created directory IS an IGIF FID?
      • adding ".." calls ldiskfs_add_dot_dotdot() (ldiskfs patched function) if directory is newly created (this will always work correctly)
      • adding ".." calls __osd_ea_add_rec() for existing directories on rename
  • osd_ea_add_rec() calls __osd_ea_add_rec() for inserting other entries
    • __osd_ea_add_rec() has a special case for adding the FID-in-dirent (ldp) for normal and IGIF FIDs

I'd prefer to fix this at the osd-ldiskfs level instead of adding more changes to ldiskfs itself, if that is practical. It may be possible to store the ".." rec_len into the directory object when calling osd_ea_index_delete() for "..", and then use this to decide if the rec_len is large enough when storing the ".." entry back again in _osd_ea_add_rec(). That way, the code wouldn't _incorrectly check IGIF FIDs to decide whether to store FID-in-dirent, since regular FID directories may not have space for FID-in-dirent if there was a backup/restore either. Similarly, in the upstream ext4 there is a new feature for "inline data" that stores small files and directories inside the inode, and only has room for the ".." inode number and not a real dirent, which may break in a similar manner.

Another option is simply just avoid storing dirdata for ALL "." and/or ".." entries? That would also avoid complexity related to the dx_root information, and all of the special case code for handling this. The "." FID can be found in the LMA. However, the FID for the ".." entry would always need to be looked up in the OI, so I'm not necessarily in favour of this if we can store this data easily for new directories (which will be the majority of cases).

Comment by nasf (Inactive) [ 21/Jan/13 ]

Firstly, I agreed with you that current handle "." and ".." insert/delete/rename are quite confused. I have already fixed some of them during LFSCK 1.5 project:

1) NOT add FID-in-dirent for "." object, which is unnecessary. Because we can get "."'s FID directly from the object itself. Please check the patches:
http://review.whamcloud.com/#patch,sidebyside,4902,11,lustre/osd-ldiskfs/osd_handler.c
http://review.whamcloud.com/#patch,sidebyside,4912,17,lustre/osd-ldiskfs/osd_handler.c

2) Try to add FID-in-dirent for ".." object, which will not introduce new patches for ldiskfs, because we already have the ldiskfs patch for that, which was named ext4-hash-indexed-dir-dotdot-update-rhel5.patch. I just replaced it with the new one ext4-dir-dotdot-update.patch, which also cleanup others. Please check:
http://review.whamcloud.com/#change,4899

3) NOT distinguish IGIF anymore for FID-in-dirent. Please check:
http://review.whamcloud.com/#patch,sidebyside,4912,17,lustre/osd-ldiskfs/osd_handler.c

Comment by Andreas Dilger [ 22/Jan/13 ]

Patch is something in the direction of what I'd like to see for this fix. It doesn't compile, and I'm not sure that d_fsdata is actually set at this point, but it was at least checked earlier during osd_index_ea_lookup(), and that could save a flag to determine if the FID was in the dirent at lookup time or not.

Comment by nasf (Inactive) [ 22/Jan/13 ]

To be clear, in my patch (http://review.whamcloud.com/#change,4899), it just rename the old ldiskfs patch "ext4-hash-indexed-dir-dotdot-update-rhel5.patch" to the new one "ext4-dir-dotdot-update.patch", no additional ldiskfs patch.

My patch tries to insert FID-in-dirent for "..", which may cause data moving in the directory block. What you worry about is that such data moving may cause directory crash, right? If yes, I will consider your way and avoid such moving.

Comment by nasf (Inactive) [ 23/Jan/13 ]

When osd_indsx_ea_insert() is called, which will trigger dotdot reinsert (for rename case), at that time, from the OSD view, it does not know whether there is enough space to hold FID-in-dirent for the dotdot entry, unless it calls ldiskfs_bread() to read the directory block #0 and locate to the old dotdot entry. But such processing is just repeat the work for ldiskfs_update_dotdot() in ldiskfs.

On the other hand, I have very simple solution to resolve your issues inside ldiskfs as following:

--- a/ldiskfs/kernel_patches/patches/ext4_data_in_dirent-rhel6.patch
+++ b/ldiskfs/kernel_patches/patches/ext4_data_in_dirent-rhel6.patch
@@ -420,12 +420,13 @@ Index: linux-stage/fs/ext4/namei.c
        else
 -              assert(le16_to_cpu(de->rec_len) >= EXT4_DIR_REC_LEN(2));
 +              assert(le16_to_cpu(de->rec_len) >= __EXT4_DIR_REC_LEN(2));
-       de->name_len = 2;
-       strcpy (de->name, "..");
-       ext4_set_de_type(dir->i_sb, de, S_IFDIR);
-+      if (data) {
+       de->name_len = 2;
+       strcpy (de->name, "..");
+-      ext4_set_de_type(dir->i_sb, de, S_IFDIR);
++      if (data != NULL && EXT4_DIR_REC_LEN(de) >= dlen) {
 +              de->name[2] = 0;
 +              memcpy(&de->name[2 + 1], data, dlen);
++              ext4_set_de_type(dir->i_sb, de, S_IFDIR);
 +              de->file_type |= EXT4_DIRENT_LUFID;
 +      }

So I will drop the patch (http://review.whamcloud.com/#change,4899) of updating dotdot in LFSCK 1.5, and just as you suggested, keep the dotdot entry there without FID-in-dirent if there is not enough space to hold the FID-in-dirent. The way is just like above. (contained in the patch http://review.whamcloud.com/#change,5046)

Comment by Andreas Dilger [ 24/Jan/13 ]

Nasf, can you please also submit a separate patch for b2_1.

Comment by nasf (Inactive) [ 25/Jan/13 ]

I have split the patch for dotdot updating in this one:

http://review.whamcloud.com/#change,5179

I will port it to b2_1 also.

Comment by Andreas Dilger [ 30/Jan/13 ]

The patches that modify the handling of "." and ".." are overly complex, and are distributed across several different patches:

  • ext4-hash-indexed-dir-dotdot-update-rhel5.patch
  • ext4_data_in_dirent-rhel6.patch
  • ext4-osd-iop-common-rhel6.patch

This problem isn't caused by the fix in http://review.whamcloud.com/#change,5179, but should be cleaned up as a separate bug fix patch once the other feature landings are done.

It makes sense to clean this code up and put the changes in a single place. My preference would also be to avoid patching ldiskfs if possible, and instead handle this more like Alex's ZFS "."/".." patch. Essentially, ldiskfs will create "." and ".." by itself when the directory is first created, as ext4_mkdir() normally would. Inserting "." is a no-op, removing ".." is also a no-op, and inserting ".." will rewrite the inode and lu_fid information in the existing ".." record in-place.

I suspect this will reduce the complexity of changes to ext4, and result in fewer patches that need to be maintained for every kernel going forward.

Comment by Andreas Dilger [ 02/Feb/13 ]

I see a similar intermittent failure in https://maloo.whamcloud.com/test_sets/229b2948-59f2-11e2-8604-52540035b04c for sanity.sh test_214():

[client]
10:00:05:LustreError: 2398:0:(dir.c:421:ll_get_dir_page()) read cache page: [0x3c0001b71:0x1fe7:0x0] at 0: rc -5
10:00:05:LustreError: 2398:0:(dir.c:583:ll_dir_read()) error reading dir [0x3c0001b71:0x1fe7:0x0] at 0: rc -5

drwxr-xr-x 2 root root 20480 Jan  8 10:00 d214c
ls: reading directory /mnt/lustre/d214c: Input/output error

[MDS console]
10:00:07:Lustre: DEBUG MARKER: == sanity test 214: hash-indexed directory test - bug 20133 == 09:59:54 (1357667994)
10:00:07:LDISKFS-fs warning (device dm-0): dx_probe: Unrecognised inode hash code 192 for directory #524991
10:00:07:LDISKFS-fs warning (device dm-0): dx_probe: Corrupt dir inode 524991, running e2fsck is recommended.
10:00:07:Lustre: 8324:0:(mdd_object.c:2046:mdd_dir_page_build()) build page failed: -5!

It looks like the dx_root_info got overwritten. It looks like it was failing for a while, then stopped on 01/23 (which was a full week before the 5179 patch landed). Not sure what caused that.

Comment by Andreas Dilger [ 08/Feb/13 ]

For testing this patch resolves LU-2627, the following things must be done:

  • format filesystem with b1_8, and mount it
  • create several large subdirectories (10k entries is probably enough)
  • unmount filesystem
  • "tune2fs -O dirdata /dev/mdtdev"
  • mount filesystem with 2.1
  • rename the subdirectories
  • "ls -l" the subdirectories (should fail)
  • run "e2fsck -f" to verify corruption

If this test is run with master instead of 2.1, or the http://review.whamcloud.com/5179 patch applied to b2_1 (with modifications as needed) no corruption should be seen.

Comment by Oleg Drokin [ 17/Feb/13 ]

patch 5179 cheanly cherrypicks into b2_1, so I am going to test it with a bunch of other stuff.

Comment by nasf (Inactive) [ 21/Feb/13 ]

the patch for b2_1:

http://review.whamcloud.com/#change,5498

Comment by Jodi Levi (Inactive) [ 21/Feb/13 ]

Landed to master and b2_1

Comment by Patrick Farrell (Inactive) [ 12/Sep/14 ]

When upgrading file systems from Cray's 1.8.6 to 2.4.1, we've seen exactly this bug.

Today, I tested upgrading a file system from WC 1.8.9 (latest as of a months ago) to WC 2.4.0 as released, and I'm seeing this bug.

The patch (http://review.whamcloud.com/#/c/5179/) is definitely present in both cases, but the ".." directory entry is not being placed correctly.

We cause this problem simply by formatting a file system under 1.8, creating a directory with at least one file in it, stopping the file system & adding the dirdata attribute to the MDT, then starting the same file system with 2.4/2.4.1 servers (we have not tried 2.5 or 2.6 yet; planning to test 2.5 just in case) and moving that directory to a new location.

We see exactly the errors described above when we run e2fsck. Looking at the directory block before and after the rename (move of the directory to a new location), we see the old ".." entry is still present but the record length of the "." has been extended, making that effectively unused space. The new ".." entry is placed in the first location where enough space has been found, with the FID included in it. (In our simple test case, it's the last entry in the directory. On real file systems, it ends up at some intermediate point that was previously unused space.)

When we take the same test but instead make several files in the directory and delete the first one, that leaves a larger space after the "." entry available for the new ".." entry. In that case, the new ".." entry is placed in that location - And is the second entry, so everything works fine.

It is very much as though the 5179 patch isn't there at all. (Remember that this problem is seen going between WC 1.8.9 and WC 2.4.0; this is not a problem just in the Cray source.)

My examination of the code hasn't helped me figure out what's wrong there, other than that the ldiskfs_update_dotdot code is extremely complex. I may post some questions about that as well.

When we use e2fsck to 'fix' this, it simply forces the new ".." entry in to the second place, and overwrites the first actual 'file' entry in the directory. The result is a consistent directory that will survive further usage - And a file dumped to lost+found.

Additionally, this bug can cause the MDT to go read-only. It's not quite clear to us what's causing that, but the reported errors from e2fsck all look related, and after running e2fsck, the MDT can be mounted again.

To make clear the importance of this:
It looks like all 1.8 file systems upgraded to 2.4+ (haven't tested 2.5 or 2.6 yet) will experience directory corruption upon enabling the dirdata feature and moving old directories. This will not affect all old directories, but it will affect many of them.

Comment by Patrick Farrell (Inactive) [ 12/Sep/14 ]

This is an example of a directory block AFTER the problem has occurred. Note the presence of the first entry, ".". Its length (24 decimal) includes the old ".." entry, which is seen in the second 12 bytes, but is not read because it's inside the rec_len of the first entry. Looking further down the directory block, we see a normal file dentry, then after that, we see the new ".." entry (look for '2e2e'), which includes the FID of the new parent. (It is followed by other file dentries.)

debugfs: bd 237582
0000 b782 0300 1800 0102 2e00 0000 8f8b cb02 ................
^^^^^^^^^ ^^ ^^ ^^
inode | | "."
reclen |
namelen
0020 0c00 0202 2e2e 0000 b982 0300 2400 0812 ............$...
^^^^^^^^^ ^^ ^^
inode | |
reclen |
namelen
0040 4361 7461 6c79 7374 0011 0000 0000 0003 Catalyst........
^^^^^^^^^^^^^^^^^^^
"Catalyst"
0060 82b9 2475 25ab 0000 0000 7374 1f80 0702 ..$u%.....st....
^^^^^^^^^
inode
0100 2000 0212 2e2e 0011 0000 0002 0000 6dd8 .............m.
^^ ^^ ^^ ^^ ^^^^^^^^^^^^^^^^^^^
reclen | ".." LEN ^^^^^^^^^^^^^^^^^^^
namelen new parent fid ****
0120 0001 9551 0000 0000 6500 0000 bc82 0300 ...Q....e.......
^^^^^^^^^^^^^^^^^^^^^^
**********************
0140 2800 0e11 434d 616b 654c 6973 7473 2e74 (...CMakeLists.t
...

Comment by Patrick Farrell (Inactive) [ 12/Sep/14 ]

I should clarify a bit further: The corruption is not immediately harmful. It is silent, and until e2fsck is run to repair the issue, no user-visible problems occur. But e2fsck complains that ".." is not the second entry, and in repairing the directory, overwrites old information.

Comment by Patrick Farrell (Inactive) [ 12/Sep/14 ]

Further investigation this afternoon shows why we're still seeing a problem despite having the 5179 patch. We're down a different code path:

These are not htree directories, so the "ldiskfs_update_dotdot" code is not called (the "is_dx" test fails - I noticed this by kernel function tracing). Rather, the standard ldiskfs_add_entry path is used, which then calls in to "add_dirent_to_buf". So this is a similar bug, but in that code. As far as I can tell, that code has no special handling of "..". Since ".." cannot be allowed to move from the second position in the directory, this is a significant problem.

It looks to me like it's not one ext4 needs to address, since ext4 ".." dentries should never change length, and the other dentries - which can change length when their name changes - can safely be relocated.

It seems like maybe we can just add this check from ldiskfs_update_dotdot to add_dirent_to_buf:
&& ldiskfs_get_dirent_data_len(de) >= dlen

But we have to special case it to "..". So I suppose a second if statement, checking that and the name against ".."... That should be easy enough to do, though I don't have time at the moment.

Comment by Rick Mohr [ 04/May/15 ]

I have just seen this exact same problem today, but am not sure how to go about addressing it. Should I just avoid having e2fsck fix the problem (until a software fix has been created), or do I let e2fsck "fix" things and then try to recover items from lost+found?

Comment by Patrick Farrell (Inactive) [ 04/May/15 ]

Rick - You can avoid this problem happening to any more directories by unmounting your MDT, turning off dirdata, and remounting. However, those which are damaged are damaged and e2fsck and recovery from lost+found is your only option. The software fix* will prevent further damage, but it won't help with existing damaged directories.

*About that fix: You're almost certainly seeing the closely related https://jira.hpdd.intel.com/browse/LU-5626, rather than LU-2638. LU-2638 was fixed before release of 2.4, so unless you updated to 2.1 from 1.8 and are still running 2.1 (or 2.2 or 2.3, I guess), it's LU-5626. LU-5626 is similar but subtly differently and did get in to 2.4 and 2.5 as released. Note also that if you're hitting this situation and running without the fix, there's also the possibility of kernel panic'ing your MDS, so the workaround of turning off dirdata temporarily is a very good idea (if you can't get a software update quickly).

Generated at Sat Feb 10 01:26:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.