Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0, Lustre 2.1.5
    • Lustre 2.4.0, Lustre 2.1.3, Lustre 2.1.5
    • The MDT is formatted/modified to have the "extents" feature enabled, but I'm not sure if that is relevant for this part of the problem.
    • 6170

    Description

      It appears that there is a systematic corruption of the ".." entry in the directory, possibly only affecting htree directories when the "dirdata" feature is enabled.

      e2fsck 1.42.6.wc2 (10-Dec-2012)
      nbp1-MDT0000 has been mounted 121 times without being checked, check forced.
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      [...snip...]
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=40967687) in directory inode 40967686 should be '..'
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=40971040) in directory inode 40971039 should be '..'
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=44588784) in directory inode 44588782 should be '..'
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is duplicate '..' entry.
      Fix? no
      Entry '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is a link to directory /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Second entry 'Header' (inode=47195750) in directory inode 47195749 should be '..'
      Fix? no
      [...snip...]
      Pass 3: Checking directory connectivity
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00000 (40930307) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00033 (40967686) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00029 (40971039) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      '..' in /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles/plt00036 (44588782) is <The NULL inode> (0), should be /ROOT/arosen2/vmcluster/nodrp/T20K_alpha1_prodrun/pltfiles (40919273).
      Fix? no
      [...snip...]
      

      Attachments

        Activity

          [LU-2638] corruption of MDT ".." entry in some ldiskfs directories

          I should clarify a bit further: The corruption is not immediately harmful. It is silent, and until e2fsck is run to repair the issue, no user-visible problems occur. But e2fsck complains that ".." is not the second entry, and in repairing the directory, overwrites old information.

          paf Patrick Farrell (Inactive) added a comment - I should clarify a bit further: The corruption is not immediately harmful. It is silent, and until e2fsck is run to repair the issue, no user-visible problems occur. But e2fsck complains that ".." is not the second entry, and in repairing the directory, overwrites old information.

          This is an example of a directory block AFTER the problem has occurred. Note the presence of the first entry, ".". Its length (24 decimal) includes the old ".." entry, which is seen in the second 12 bytes, but is not read because it's inside the rec_len of the first entry. Looking further down the directory block, we see a normal file dentry, then after that, we see the new ".." entry (look for '2e2e'), which includes the FID of the new parent. (It is followed by other file dentries.)

          debugfs: bd 237582
          0000 b782 0300 1800 0102 2e00 0000 8f8b cb02 ................
          ^^^^^^^^^ ^^ ^^ ^^
          inode | | "."
          reclen |
          namelen
          0020 0c00 0202 2e2e 0000 b982 0300 2400 0812 ............$...
          ^^^^^^^^^ ^^ ^^
          inode | |
          reclen |
          namelen
          0040 4361 7461 6c79 7374 0011 0000 0000 0003 Catalyst........
          ^^^^^^^^^^^^^^^^^^^
          "Catalyst"
          0060 82b9 2475 25ab 0000 0000 7374 1f80 0702 ..$u%.....st....
          ^^^^^^^^^
          inode
          0100 2000 0212 2e2e 0011 0000 0002 0000 6dd8 .............m.
          ^^ ^^ ^^ ^^ ^^^^^^^^^^^^^^^^^^^
          reclen | ".." LEN ^^^^^^^^^^^^^^^^^^^
          namelen new parent fid ****
          0120 0001 9551 0000 0000 6500 0000 bc82 0300 ...Q....e.......
          ^^^^^^^^^^^^^^^^^^^^^^
          **********************
          0140 2800 0e11 434d 616b 654c 6973 7473 2e74 (...CMakeLists.t
          ...

          paf Patrick Farrell (Inactive) added a comment - This is an example of a directory block AFTER the problem has occurred. Note the presence of the first entry, ".". Its length (24 decimal) includes the old ".." entry, which is seen in the second 12 bytes, but is not read because it's inside the rec_len of the first entry. Looking further down the directory block, we see a normal file dentry, then after that, we see the new ".." entry (look for '2e2e'), which includes the FID of the new parent. (It is followed by other file dentries.) debugfs: bd 237582 0000 b782 0300 1800 0102 2e00 0000 8f8b cb02 ................ ^^^^^^^^^ ^^ ^^ ^^ inode | | "." reclen | namelen 0020 0c00 0202 2e2e 0000 b982 0300 2400 0812 ............$... ^^^^^^^^^ ^^ ^^ inode | | reclen | namelen 0040 4361 7461 6c79 7374 0011 0000 0000 0003 Catalyst........ ^^^^^^^^^^^^^^^^^^^ "Catalyst" 0060 82b9 2475 25ab 0000 0000 7374 1f80 0702 ..$u%.....st.... ^^^^^^^^^ inode 0100 2000 0212 2e2e 0011 0000 0002 0000 6dd8 .............m. ^^ ^^ ^^ ^^ ^^^^^^^^^^^^^^^^^^^ reclen | ".." LEN ^^^^^^^^^^^^^^^^^^^ namelen new parent fid **** 0120 0001 9551 0000 0000 6500 0000 bc82 0300 ...Q....e....... ^^^^^^^^^^^^^^^^^^^^^^ ********************** 0140 2800 0e11 434d 616b 654c 6973 7473 2e74 (...CMakeLists.t ...

          When upgrading file systems from Cray's 1.8.6 to 2.4.1, we've seen exactly this bug.

          Today, I tested upgrading a file system from WC 1.8.9 (latest as of a months ago) to WC 2.4.0 as released, and I'm seeing this bug.

          The patch (http://review.whamcloud.com/#/c/5179/) is definitely present in both cases, but the ".." directory entry is not being placed correctly.

          We cause this problem simply by formatting a file system under 1.8, creating a directory with at least one file in it, stopping the file system & adding the dirdata attribute to the MDT, then starting the same file system with 2.4/2.4.1 servers (we have not tried 2.5 or 2.6 yet; planning to test 2.5 just in case) and moving that directory to a new location.

          We see exactly the errors described above when we run e2fsck. Looking at the directory block before and after the rename (move of the directory to a new location), we see the old ".." entry is still present but the record length of the "." has been extended, making that effectively unused space. The new ".." entry is placed in the first location where enough space has been found, with the FID included in it. (In our simple test case, it's the last entry in the directory. On real file systems, it ends up at some intermediate point that was previously unused space.)

          When we take the same test but instead make several files in the directory and delete the first one, that leaves a larger space after the "." entry available for the new ".." entry. In that case, the new ".." entry is placed in that location - And is the second entry, so everything works fine.

          It is very much as though the 5179 patch isn't there at all. (Remember that this problem is seen going between WC 1.8.9 and WC 2.4.0; this is not a problem just in the Cray source.)

          My examination of the code hasn't helped me figure out what's wrong there, other than that the ldiskfs_update_dotdot code is extremely complex. I may post some questions about that as well.

          When we use e2fsck to 'fix' this, it simply forces the new ".." entry in to the second place, and overwrites the first actual 'file' entry in the directory. The result is a consistent directory that will survive further usage - And a file dumped to lost+found.

          Additionally, this bug can cause the MDT to go read-only. It's not quite clear to us what's causing that, but the reported errors from e2fsck all look related, and after running e2fsck, the MDT can be mounted again.

          To make clear the importance of this:
          It looks like all 1.8 file systems upgraded to 2.4+ (haven't tested 2.5 or 2.6 yet) will experience directory corruption upon enabling the dirdata feature and moving old directories. This will not affect all old directories, but it will affect many of them.

          paf Patrick Farrell (Inactive) added a comment - When upgrading file systems from Cray's 1.8.6 to 2.4.1, we've seen exactly this bug. Today, I tested upgrading a file system from WC 1.8.9 (latest as of a months ago) to WC 2.4.0 as released, and I'm seeing this bug. The patch ( http://review.whamcloud.com/#/c/5179/ ) is definitely present in both cases, but the ".." directory entry is not being placed correctly. We cause this problem simply by formatting a file system under 1.8, creating a directory with at least one file in it, stopping the file system & adding the dirdata attribute to the MDT, then starting the same file system with 2.4/2.4.1 servers (we have not tried 2.5 or 2.6 yet; planning to test 2.5 just in case) and moving that directory to a new location. We see exactly the errors described above when we run e2fsck. Looking at the directory block before and after the rename (move of the directory to a new location), we see the old ".." entry is still present but the record length of the "." has been extended, making that effectively unused space. The new ".." entry is placed in the first location where enough space has been found, with the FID included in it. (In our simple test case, it's the last entry in the directory. On real file systems, it ends up at some intermediate point that was previously unused space.) When we take the same test but instead make several files in the directory and delete the first one, that leaves a larger space after the "." entry available for the new ".." entry. In that case, the new ".." entry is placed in that location - And is the second entry, so everything works fine. It is very much as though the 5179 patch isn't there at all. (Remember that this problem is seen going between WC 1.8.9 and WC 2.4.0; this is not a problem just in the Cray source.) My examination of the code hasn't helped me figure out what's wrong there, other than that the ldiskfs_update_dotdot code is extremely complex. I may post some questions about that as well. When we use e2fsck to 'fix' this, it simply forces the new ".." entry in to the second place, and overwrites the first actual 'file' entry in the directory. The result is a consistent directory that will survive further usage - And a file dumped to lost+found. Additionally, this bug can cause the MDT to go read-only. It's not quite clear to us what's causing that, but the reported errors from e2fsck all look related, and after running e2fsck, the MDT can be mounted again. To make clear the importance of this: It looks like all 1.8 file systems upgraded to 2.4+ (haven't tested 2.5 or 2.6 yet) will experience directory corruption upon enabling the dirdata feature and moving old directories. This will not affect all old directories, but it will affect many of them.

          Landed to master and b2_1

          jlevi Jodi Levi (Inactive) added a comment - Landed to master and b2_1
          yong.fan nasf (Inactive) added a comment - the patch for b2_1: http://review.whamcloud.com/#change,5498
          green Oleg Drokin added a comment -

          patch 5179 cheanly cherrypicks into b2_1, so I am going to test it with a bunch of other stuff.

          green Oleg Drokin added a comment - patch 5179 cheanly cherrypicks into b2_1, so I am going to test it with a bunch of other stuff.
          adilger Andreas Dilger added a comment - - edited

          For testing this patch resolves LU-2627, the following things must be done:

          • format filesystem with b1_8, and mount it
          • create several large subdirectories (10k entries is probably enough)
          • unmount filesystem
          • "tune2fs -O dirdata /dev/mdtdev"
          • mount filesystem with 2.1
          • rename the subdirectories
          • "ls -l" the subdirectories (should fail)
          • run "e2fsck -f" to verify corruption

          If this test is run with master instead of 2.1, or the http://review.whamcloud.com/5179 patch applied to b2_1 (with modifications as needed) no corruption should be seen.

          adilger Andreas Dilger added a comment - - edited For testing this patch resolves LU-2627 , the following things must be done: format filesystem with b1_8, and mount it create several large subdirectories (10k entries is probably enough) unmount filesystem " tune2fs -O dirdata /dev/mdtdev " mount filesystem with 2.1 rename the subdirectories " ls -l " the subdirectories (should fail) run " e2fsck -f " to verify corruption If this test is run with master instead of 2.1, or the http://review.whamcloud.com/5179 patch applied to b2_1 (with modifications as needed) no corruption should be seen.

          I see a similar intermittent failure in https://maloo.whamcloud.com/test_sets/229b2948-59f2-11e2-8604-52540035b04c for sanity.sh test_214():

          [client]
          10:00:05:LustreError: 2398:0:(dir.c:421:ll_get_dir_page()) read cache page: [0x3c0001b71:0x1fe7:0x0] at 0: rc -5
          10:00:05:LustreError: 2398:0:(dir.c:583:ll_dir_read()) error reading dir [0x3c0001b71:0x1fe7:0x0] at 0: rc -5
          
          drwxr-xr-x 2 root root 20480 Jan  8 10:00 d214c
          ls: reading directory /mnt/lustre/d214c: Input/output error
          
          [MDS console]
          10:00:07:Lustre: DEBUG MARKER: == sanity test 214: hash-indexed directory test - bug 20133 == 09:59:54 (1357667994)
          10:00:07:LDISKFS-fs warning (device dm-0): dx_probe: Unrecognised inode hash code 192 for directory #524991
          10:00:07:LDISKFS-fs warning (device dm-0): dx_probe: Corrupt dir inode 524991, running e2fsck is recommended.
          10:00:07:Lustre: 8324:0:(mdd_object.c:2046:mdd_dir_page_build()) build page failed: -5!
          

          It looks like the dx_root_info got overwritten. It looks like it was failing for a while, then stopped on 01/23 (which was a full week before the 5179 patch landed). Not sure what caused that.

          adilger Andreas Dilger added a comment - I see a similar intermittent failure in https://maloo.whamcloud.com/test_sets/229b2948-59f2-11e2-8604-52540035b04c for sanity.sh test_214(): [client] 10:00:05:LustreError: 2398:0:(dir.c:421:ll_get_dir_page()) read cache page: [0x3c0001b71:0x1fe7:0x0] at 0: rc -5 10:00:05:LustreError: 2398:0:(dir.c:583:ll_dir_read()) error reading dir [0x3c0001b71:0x1fe7:0x0] at 0: rc -5 drwxr-xr-x 2 root root 20480 Jan 8 10:00 d214c ls: reading directory /mnt/lustre/d214c: Input/output error [MDS console] 10:00:07:Lustre: DEBUG MARKER: == sanity test 214: hash-indexed directory test - bug 20133 == 09:59:54 (1357667994) 10:00:07:LDISKFS-fs warning (device dm-0): dx_probe: Unrecognised inode hash code 192 for directory #524991 10:00:07:LDISKFS-fs warning (device dm-0): dx_probe: Corrupt dir inode 524991, running e2fsck is recommended. 10:00:07:Lustre: 8324:0:(mdd_object.c:2046:mdd_dir_page_build()) build page failed: -5! It looks like the dx_root_info got overwritten. It looks like it was failing for a while, then stopped on 01/23 (which was a full week before the 5179 patch landed). Not sure what caused that.

          The patches that modify the handling of "." and ".." are overly complex, and are distributed across several different patches:

          • ext4-hash-indexed-dir-dotdot-update-rhel5.patch
          • ext4_data_in_dirent-rhel6.patch
          • ext4-osd-iop-common-rhel6.patch

          This problem isn't caused by the fix in http://review.whamcloud.com/#change,5179, but should be cleaned up as a separate bug fix patch once the other feature landings are done.

          It makes sense to clean this code up and put the changes in a single place. My preference would also be to avoid patching ldiskfs if possible, and instead handle this more like Alex's ZFS "."/".." patch. Essentially, ldiskfs will create "." and ".." by itself when the directory is first created, as ext4_mkdir() normally would. Inserting "." is a no-op, removing ".." is also a no-op, and inserting ".." will rewrite the inode and lu_fid information in the existing ".." record in-place.

          I suspect this will reduce the complexity of changes to ext4, and result in fewer patches that need to be maintained for every kernel going forward.

          adilger Andreas Dilger added a comment - The patches that modify the handling of "." and ".." are overly complex, and are distributed across several different patches: ext4-hash-indexed-dir-dotdot-update-rhel5.patch ext4_data_in_dirent-rhel6.patch ext4-osd-iop-common-rhel6.patch This problem isn't caused by the fix in http://review.whamcloud.com/#change,5179 , but should be cleaned up as a separate bug fix patch once the other feature landings are done. It makes sense to clean this code up and put the changes in a single place. My preference would also be to avoid patching ldiskfs if possible, and instead handle this more like Alex's ZFS "."/".." patch. Essentially, ldiskfs will create "." and ".." by itself when the directory is first created, as ext4_mkdir() normally would. Inserting "." is a no-op, removing ".." is also a no-op, and inserting ".." will rewrite the inode and lu_fid information in the existing ".." record in-place. I suspect this will reduce the complexity of changes to ext4, and result in fewer patches that need to be maintained for every kernel going forward.

          I have split the patch for dotdot updating in this one:

          http://review.whamcloud.com/#change,5179

          I will port it to b2_1 also.

          yong.fan nasf (Inactive) added a comment - I have split the patch for dotdot updating in this one: http://review.whamcloud.com/#change,5179 I will port it to b2_1 also.

          Nasf, can you please also submit a separate patch for b2_1.

          adilger Andreas Dilger added a comment - Nasf, can you please also submit a separate patch for b2_1.

          People

            yong.fan nasf (Inactive)
            adilger Andreas Dilger
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: