Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0, Lustre 2.1.5
    • Lustre 2.4.0, Lustre 2.1.3, Lustre 2.1.5
    • The MDT is formatted/modified to have the "extents" feature enabled, but I'm not sure if that is relevant for this part of the problem.
    • 6170

    Description

      It appears that there is a systematic corruption of the ".." entry in the directory, possibly only affecting htree directories when the "dirdata" feature is enabled.

      e2fsck 1.42.6.wc2 (10-Dec-2012)
      nbp1-MDT0000 has been mounted 121 times without being checked, check forced.
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      [...snip...]
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=40967687) in directory inode 40967686 should be '..'
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=40971040) in directory inode 40971039 should be '..'
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=44588784) in directory inode 44588782 should be '..'
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=47195750) in directory inode 47195749 should be '..'
      Fix? no
      [...snip...]
      Pass 3: Checking directory connectivity
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      [...snip...]
      

      Attachments

        Activity

          [LU-2638] corruption of MDT ".." entry in some ldiskfs directories

          Rick - You can avoid this problem happening to any more directories by unmounting your MDT, turning off dirdata, and remounting. However, those which are damaged are damaged and e2fsck and recovery from lost+found is your only option. The software fix* will prevent further damage, but it won't help with existing damaged directories.

          *About that fix: You're almost certainly seeing the closely related https://jira.hpdd.intel.com/browse/LU-5626, rather than LU-2638. LU-2638 was fixed before release of 2.4, so unless you updated to 2.1 from 1.8 and are still running 2.1 (or 2.2 or 2.3, I guess), it's LU-5626. LU-5626 is similar but subtly differently and did get in to 2.4 and 2.5 as released. Note also that if you're hitting this situation and running without the fix, there's also the possibility of kernel panic'ing your MDS, so the workaround of turning off dirdata temporarily is a very good idea (if you can't get a software update quickly).

          paf Patrick Farrell (Inactive) added a comment - Rick - You can avoid this problem happening to any more directories by unmounting your MDT, turning off dirdata, and remounting. However, those which are damaged are damaged and e2fsck and recovery from lost+found is your only option. The software fix* will prevent further damage, but it won't help with existing damaged directories. *About that fix: You're almost certainly seeing the closely related https://jira.hpdd.intel.com/browse/LU-5626 , rather than LU-2638 . LU-2638 was fixed before release of 2.4, so unless you updated to 2.1 from 1.8 and are still running 2.1 (or 2.2 or 2.3, I guess), it's LU-5626 . LU-5626 is similar but subtly differently and did get in to 2.4 and 2.5 as released. Note also that if you're hitting this situation and running without the fix, there's also the possibility of kernel panic'ing your MDS, so the workaround of turning off dirdata temporarily is a very good idea (if you can't get a software update quickly).
          rmohr Rick Mohr added a comment -

          I have just seen this exact same problem today, but am not sure how to go about addressing it. Should I just avoid having e2fsck fix the problem (until a software fix has been created), or do I let e2fsck "fix" things and then try to recover items from lost+found?

          rmohr Rick Mohr added a comment - I have just seen this exact same problem today, but am not sure how to go about addressing it. Should I just avoid having e2fsck fix the problem (until a software fix has been created), or do I let e2fsck "fix" things and then try to recover items from lost+found?

          Further investigation this afternoon shows why we're still seeing a problem despite having the 5179 patch. We're down a different code path:

          These are not htree directories, so the "ldiskfs_update_dotdot" code is not called (the "is_dx" test fails - I noticed this by kernel function tracing). Rather, the standard ldiskfs_add_entry path is used, which then calls in to "add_dirent_to_buf". So this is a similar bug, but in that code. As far as I can tell, that code has no special handling of "..". Since ".." cannot be allowed to move from the second position in the directory, this is a significant problem.

          It looks to me like it's not one ext4 needs to address, since ext4 ".." dentries should never change length, and the other dentries - which can change length when their name changes - can safely be relocated.

          It seems like maybe we can just add this check from ldiskfs_update_dotdot to add_dirent_to_buf:
          && ldiskfs_get_dirent_data_len(de) >= dlen

          But we have to special case it to "..". So I suppose a second if statement, checking that and the name against ".."... That should be easy enough to do, though I don't have time at the moment.

          paf Patrick Farrell (Inactive) added a comment - - edited Further investigation this afternoon shows why we're still seeing a problem despite having the 5179 patch. We're down a different code path: These are not htree directories, so the "ldiskfs_update_dotdot" code is not called (the "is_dx" test fails - I noticed this by kernel function tracing). Rather, the standard ldiskfs_add_entry path is used, which then calls in to "add_dirent_to_buf". So this is a similar bug, but in that code. As far as I can tell, that code has no special handling of "..". Since ".." cannot be allowed to move from the second position in the directory, this is a significant problem. It looks to me like it's not one ext4 needs to address, since ext4 ".." dentries should never change length, and the other dentries - which can change length when their name changes - can safely be relocated. It seems like maybe we can just add this check from ldiskfs_update_dotdot to add_dirent_to_buf: && ldiskfs_get_dirent_data_len(de) >= dlen But we have to special case it to "..". So I suppose a second if statement, checking that and the name against ".."... That should be easy enough to do, though I don't have time at the moment.

          I should clarify a bit further: The corruption is not immediately harmful. It is silent, and until e2fsck is run to repair the issue, no user-visible problems occur. But e2fsck complains that ".." is not the second entry, and in repairing the directory, overwrites old information.

          paf Patrick Farrell (Inactive) added a comment - I should clarify a bit further: The corruption is not immediately harmful. It is silent, and until e2fsck is run to repair the issue, no user-visible problems occur. But e2fsck complains that ".." is not the second entry, and in repairing the directory, overwrites old information.

          This is an example of a directory block AFTER the problem has occurred. Note the presence of the first entry, ".". Its length (24 decimal) includes the old ".." entry, which is seen in the second 12 bytes, but is not read because it's inside the rec_len of the first entry. Looking further down the directory block, we see a normal file dentry, then after that, we see the new ".." entry (look for '2e2e'), which includes the FID of the new parent. (It is followed by other file dentries.)

          debugfs: bd 237582
          0000 b782 0300 1800 0102 2e00 0000 8f8b cb02 ................
          ^^^^^^^^^ ^^ ^^ ^^
          inode | | "."
          reclen |
          namelen
          0020 0c00 0202 2e2e 0000 b982 0300 2400 0812 ............$...
          ^^^^^^^^^ ^^ ^^
          inode | |
          reclen |
          namelen
          0040 4361 7461 6c79 7374 0011 0000 0000 0003 Catalyst........
          ^^^^^^^^^^^^^^^^^^^
          "Catalyst"
          0060 82b9 2475 25ab 0000 0000 7374 1f80 0702 ..$u%.....st....
          ^^^^^^^^^
          inode
          0100 2000 0212 2e2e 0011 0000 0002 0000 6dd8 .............m.
          ^^ ^^ ^^ ^^ ^^^^^^^^^^^^^^^^^^^
          reclen | ".." LEN ^^^^^^^^^^^^^^^^^^^
          namelen new parent fid ****
          0120 0001 9551 0000 0000 6500 0000 bc82 0300 ...Q....e.......
          ^^^^^^^^^^^^^^^^^^^^^^
          **********************
          0140 2800 0e11 434d 616b 654c 6973 7473 2e74 (...CMakeLists.t
          ...

          paf Patrick Farrell (Inactive) added a comment - This is an example of a directory block AFTER the problem has occurred. Note the presence of the first entry, ".". Its length (24 decimal) includes the old ".." entry, which is seen in the second 12 bytes, but is not read because it's inside the rec_len of the first entry. Looking further down the directory block, we see a normal file dentry, then after that, we see the new ".." entry (look for '2e2e'), which includes the FID of the new parent. (It is followed by other file dentries.) debugfs: bd 237582 0000 b782 0300 1800 0102 2e00 0000 8f8b cb02 ................ ^^^^^^^^^ ^^ ^^ ^^ inode | | "." reclen | namelen 0020 0c00 0202 2e2e 0000 b982 0300 2400 0812 ............$... ^^^^^^^^^ ^^ ^^ inode | | reclen | namelen 0040 4361 7461 6c79 7374 0011 0000 0000 0003 Catalyst........ ^^^^^^^^^^^^^^^^^^^ "Catalyst" 0060 82b9 2475 25ab 0000 0000 7374 1f80 0702 ..$u%.....st.... ^^^^^^^^^ inode 0100 2000 0212 2e2e 0011 0000 0002 0000 6dd8 .............m. ^^ ^^ ^^ ^^ ^^^^^^^^^^^^^^^^^^^ reclen | ".." LEN ^^^^^^^^^^^^^^^^^^^ namelen new parent fid **** 0120 0001 9551 0000 0000 6500 0000 bc82 0300 ...Q....e....... ^^^^^^^^^^^^^^^^^^^^^^ ********************** 0140 2800 0e11 434d 616b 654c 6973 7473 2e74 (...CMakeLists.t ...

          When upgrading file systems from Cray's 1.8.6 to 2.4.1, we've seen exactly this bug.

          Today, I tested upgrading a file system from WC 1.8.9 (latest as of a months ago) to WC 2.4.0 as released, and I'm seeing this bug.

          The patch (http://review.whamcloud.com/#/c/5179/) is definitely present in both cases, but the ".." directory entry is not being placed correctly.

          We cause this problem simply by formatting a file system under 1.8, creating a directory with at least one file in it, stopping the file system & adding the dirdata attribute to the MDT, then starting the same file system with 2.4/2.4.1 servers (we have not tried 2.5 or 2.6 yet; planning to test 2.5 just in case) and moving that directory to a new location.

          We see exactly the errors described above when we run e2fsck. Looking at the directory block before and after the rename (move of the directory to a new location), we see the old ".." entry is still present but the record length of the "." has been extended, making that effectively unused space. The new ".." entry is placed in the first location where enough space has been found, with the FID included in it. (In our simple test case, it's the last entry in the directory. On real file systems, it ends up at some intermediate point that was previously unused space.)

          When we take the same test but instead make several files in the directory and delete the first one, that leaves a larger space after the "." entry available for the new ".." entry. In that case, the new ".." entry is placed in that location - And is the second entry, so everything works fine.

          It is very much as though the 5179 patch isn't there at all. (Remember that this problem is seen going between WC 1.8.9 and WC 2.4.0; this is not a problem just in the Cray source.)

          My examination of the code hasn't helped me figure out what's wrong there, other than that the ldiskfs_update_dotdot code is extremely complex. I may post some questions about that as well.

          When we use e2fsck to 'fix' this, it simply forces the new ".." entry in to the second place, and overwrites the first actual 'file' entry in the directory. The result is a consistent directory that will survive further usage - And a file dumped to lost+found.

          Additionally, this bug can cause the MDT to go read-only. It's not quite clear to us what's causing that, but the reported errors from e2fsck all look related, and after running e2fsck, the MDT can be mounted again.

          To make clear the importance of this:
          It looks like all 1.8 file systems upgraded to 2.4+ (haven't tested 2.5 or 2.6 yet) will experience directory corruption upon enabling the dirdata feature and moving old directories. This will not affect all old directories, but it will affect many of them.

          paf Patrick Farrell (Inactive) added a comment - When upgrading file systems from Cray's 1.8.6 to 2.4.1, we've seen exactly this bug. Today, I tested upgrading a file system from WC 1.8.9 (latest as of a months ago) to WC 2.4.0 as released, and I'm seeing this bug. The patch ( http://review.whamcloud.com/#/c/5179/ ) is definitely present in both cases, but the ".." directory entry is not being placed correctly. We cause this problem simply by formatting a file system under 1.8, creating a directory with at least one file in it, stopping the file system & adding the dirdata attribute to the MDT, then starting the same file system with 2.4/2.4.1 servers (we have not tried 2.5 or 2.6 yet; planning to test 2.5 just in case) and moving that directory to a new location. We see exactly the errors described above when we run e2fsck. Looking at the directory block before and after the rename (move of the directory to a new location), we see the old ".." entry is still present but the record length of the "." has been extended, making that effectively unused space. The new ".." entry is placed in the first location where enough space has been found, with the FID included in it. (In our simple test case, it's the last entry in the directory. On real file systems, it ends up at some intermediate point that was previously unused space.) When we take the same test but instead make several files in the directory and delete the first one, that leaves a larger space after the "." entry available for the new ".." entry. In that case, the new ".." entry is placed in that location - And is the second entry, so everything works fine. It is very much as though the 5179 patch isn't there at all. (Remember that this problem is seen going between WC 1.8.9 and WC 2.4.0; this is not a problem just in the Cray source.) My examination of the code hasn't helped me figure out what's wrong there, other than that the ldiskfs_update_dotdot code is extremely complex. I may post some questions about that as well. When we use e2fsck to 'fix' this, it simply forces the new ".." entry in to the second place, and overwrites the first actual 'file' entry in the directory. The result is a consistent directory that will survive further usage - And a file dumped to lost+found. Additionally, this bug can cause the MDT to go read-only. It's not quite clear to us what's causing that, but the reported errors from e2fsck all look related, and after running e2fsck, the MDT can be mounted again. To make clear the importance of this: It looks like all 1.8 file systems upgraded to 2.4+ (haven't tested 2.5 or 2.6 yet) will experience directory corruption upon enabling the dirdata feature and moving old directories. This will not affect all old directories, but it will affect many of them.

          Landed to master and b2_1

          jlevi Jodi Levi (Inactive) added a comment - Landed to master and b2_1
          yong.fan nasf (Inactive) added a comment - the patch for b2_1: http://review.whamcloud.com/#change,5498
          green Oleg Drokin added a comment -

          patch 5179 cheanly cherrypicks into b2_1, so I am going to test it with a bunch of other stuff.

          green Oleg Drokin added a comment - patch 5179 cheanly cherrypicks into b2_1, so I am going to test it with a bunch of other stuff.
          adilger Andreas Dilger added a comment - - edited

          For testing this patch resolves LU-2627, the following things must be done:

          • format filesystem with b1_8, and mount it
          • create several large subdirectories (10k entries is probably enough)
          • unmount filesystem
          • "tune2fs -O dirdata /dev/mdtdev"
          • mount filesystem with 2.1
          • rename the subdirectories
          • "ls -l" the subdirectories (should fail)
          • run "e2fsck -f" to verify corruption

          If this test is run with master instead of 2.1, or the http://review.whamcloud.com/5179 patch applied to b2_1 (with modifications as needed) no corruption should be seen.

          adilger Andreas Dilger added a comment - - edited For testing this patch resolves LU-2627 , the following things must be done: format filesystem with b1_8, and mount it create several large subdirectories (10k entries is probably enough) unmount filesystem " tune2fs -O dirdata /dev/mdtdev " mount filesystem with 2.1 rename the subdirectories " ls -l " the subdirectories (should fail) run " e2fsck -f " to verify corruption If this test is run with master instead of 2.1, or the http://review.whamcloud.com/5179 patch applied to b2_1 (with modifications as needed) no corruption should be seen.

          People

            yong.fan nasf (Inactive)
            adilger Andreas Dilger
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: