Corruption of MDT “..” entry in non-HTree ldiskfs directories

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0, Lustre 2.5.4
    • Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
    • Lustre 2.4 or newer, file system upgraded from 1.8
    • 3
    • 15741

    Description

      LU-2638 reported directory entry corruption related to FID-in-dirent and the “..” entry in HTree directories.
      We have since discovered an identical problem in non-HTree directories.

      This is essentially the same problem, but it manifests slightly differently in non-HTree directories. The “..” entry must remain the second entry in the directory block (FSCK demands this). When a directory created under 1.8 (now on a 2.4+ server with dirdata enabled) is moved to a new parent, the “..” entry is updated. Exactly as happened in LU-2638, the FID is added to the “..” entry without regard to whether there is sufficient space in the second position in the directory block.

      In the lucky case where space is already available in the second entry in a directory, the “..” entry is recreated in the same place with the FID attached. If not, it is created in the next available space of sufficient size. This causes complaints from FSCK, and when FSCK repairs this, it places the updated “..” immediately after “.” again, which causes it to overlap the next entry in the directory block. That entry, which is for a real user-created file rather than “.” or “..”, is moved to lost+found.

      This is because add_dirent_to_buf (used when not in a dx directory) has the same bug as “ldiskfs_update_dotdot”, which was fixed in LU-2638. Because the structure of add_dirent_to_buf is a bit different, the fix looks different as well.
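The space problem can be made concrete with a little arithmetic. The sketch below is a Python model, not the ldiskfs code. It assumes the usual ext4 rule that a directory entry needs an 8-byte header plus the name, rounded up to a multiple of 4, and it assumes that a FID-in-dirent entry additionally carries a NUL terminator, a length byte, and a 16-byte FID. That second assumption is inferred from the directory block dump in the reproduction comment on this ticket, where it reproduces the rec_len values 36 and 40. All names in the code are illustrative.

```python
# Illustrative model of directory entry sizing. The names below are
# hypothetical and are not taken from namei.c.
DIRENT_HEADER = 8      # inode (4) + rec_len (2) + name_len (1) + file_type (1)
FID_DATA_LEN = 17      # assumed: 1 length byte + 16-byte FID
NUL_TERMINATOR = 1     # assumed: separates the name from the data section


def min_rec_len(name_len: int, with_fid: bool) -> int:
    """Smallest rec_len that can hold an entry, rounded up to 4 bytes."""
    payload = name_len
    if with_fid:
        payload += NUL_TERMINATOR + FID_DATA_LEN
    return (DIRENT_HEADER + payload + 3) & ~3


plain_dotdot = min_rec_len(2, with_fid=False)   # ".." without a data section
fid_dotdot = min_rec_len(2, with_fid=True)      # ".." with the FID attached

# In the dump on this ticket the old ".." slot has rec_len 12, so the
# second position offers only the plain size.
second_slot = plain_dotdot
fits_in_place = fid_dotdot <= second_slot

print(plain_dotdot, fid_dotdot, fits_in_place)
```

Under these assumptions the FID-carrying ".." needs more than twice the space the old slot provides, which is why it ends up somewhere other than the second position.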

      I don’t have time at the moment to commit & update the new ldiskfs patch file to Gerrit, but I will do so shortly. In the meantime, I’m attaching the new patch file & the resulting namei.c to this bug.

      The patch is a bit ugly and could probably use improvement, but in my testing, it does fix the bug.

      I'll share replication details in a comment.

      One ‘technical debt’ problem with this patch:
      This patch, and the one for LU-2638, do not simply avoid writing the FID into the “..” entry. In fact, they avoid writing the entire data section to the “..” entry, so if a pre-existing “..” entry carried anything other than the FID in its data section, that would be lost on directory moves. Currently, it appears that FID-in-dirent is the only user of this extra section.

      Attachments

        1. ext4-data-in-dirent-dotdot-fixes.patch
          2 kB
        2. ll_fix_mdt_lost_found.sh
          1 kB
        3. namei.c
          94 kB
        4. namei.c
          92 kB

          Activity

            simmonsja James A Simmons added a comment - Looks like we will need this for SLES11 SP3 as well.

            simmonsja James A Simmons added a comment - To clarify, RHEL7 and SLES12 server-side support are both 2.8 features, so this ticket is safe to close. We just need to integrate this work into those distros for 2.8.

            simmonsja James A Simmons added a comment - Now these changes need to be integrated into the upcoming SLES12 and RHEL7 ldiskfs work.

            yong.fan nasf (Inactive) added a comment - James, you mean the ldiskfs/kernel_patches/patches/sles11sp2/ext4-data-in-dirent.patch should be updated? or anything else?

            simmonsja James A Simmons added a comment - Once this lands the ldiskfs patches for SLES12 and RHEL7 will need to be updated.

            paf Patrick Farrell (Inactive) added a comment (edited) - We also saw file systems taking errors and getting remounted read-only, and we were initially unable to figure out why. It turns out that when a damaged non-HTree directory (i.e., one with the ".." entry in the wrong place) is converted to an HTree directory, the conversion goes badly wrong, and the resulting directory is badly corrupted.

            With the patch for this bug, it's no longer possible to get into the bad state. I thought I'd share the errors here so others who hit this bug have a better chance of finding this JIRA ticket.

            Here's what the resulting errors look like. The key thing is "rec_len=2049", which we've always seen in this situation:
            LDISKFS-fs error (device sdi): ldiskfs_dx_find_entry: bad entry in directory #32789: rec_len % 4 != 0 - block=16755offset=24(24), inode=0, rec_len=2049, name_len=0
            Aborting journal on device sdi-8.
            LDISKFS-fs (sdi): Remounting filesystem read-only
            LDISKFS-fs error (device sdi): ldiskfs_dx_find_entry: bad entry in directory #32789: rec_len % 4 != 0 - block=16755offset=24(24), inode=0, rec_len=2049, name_len=0
            Lustre: 2208:0:(mdd_dir.c:2926:mdd_rename()) cent5602-MDD0000: sp obj dotdot delete error: rc = -2
            Lustre: 2208:0:(mdd_dir.c:2933:mdd_rename()) cent5602-MDD0000: sp obj dotdot insert error: rc = -30
            LDISKFS-fs error (device sdi) in add_dirent_to_buf: Journal has aborted
            Lustre: 2208:0:(mdd_dir.c:2942:mdd_rename()) sp obj fix error: rc = -30
            LustreError: 2208:0:(osd_io.c:1595:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
            LustreError: 2208:0:(osd_handler.c:1126:osd_trans_stop()) Failure in transaction hook: -30
            LustreError: 2208:0:(osd_handler.c:1135:osd_trans_stop()) Failure to stop transaction: -30
            LustreError: 2205:0:(osd_handler.c:910:osd_trans_commit_cb()) transaction @0xffff88022e004c00 commit error: 2
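For anyone scanning logs for this signature, a small helper along the following lines may be convenient. It is only a sketch: the regular expression is written against the message format shown in the log excerpt in this comment, and other kernel versions may format the line differently. `find_bad_rec_len` is a hypothetical helper name, not part of any Lustre tool.

```python
import re

# Matches the "bad entry in directory" line as it appears on this ticket.
# Assumption: the device name and directory inode are formatted as shown there.
BAD_ENTRY = re.compile(
    r"LDISKFS-fs error \(device (?P<dev>\w+)\): \w+: "
    r"bad entry in directory #(?P<dir>\d+): .*?"
    r"rec_len=(?P<rec_len>\d+)"
)


def find_bad_rec_len(lines):
    """Yield (device, directory inode, rec_len) for every reported entry
    whose rec_len is not a multiple of 4."""
    for line in lines:
        m = BAD_ENTRY.search(line)
        if m and int(m.group("rec_len")) % 4 != 0:
            yield m.group("dev"), int(m.group("dir")), int(m.group("rec_len"))


sample = [
    "LDISKFS-fs error (device sdi): ldiskfs_dx_find_entry: bad entry in "
    "directory #32789: rec_len % 4 != 0 - block=16755offset=24(24), "
    "inode=0, rec_len=2049, name_len=0",
    "LDISKFS-fs (sdi): Remounting filesystem read-only",
]
print(list(find_bad_rec_len(sample)))
```

The directory inode number it reports (32789 in the sample) is the one to inspect with debugfs.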

            paf Patrick Farrell (Inactive) added a comment - Patch is here: http://review.whamcloud.com/11939 Local testing suggests this resolves the issue.

            paf Patrick Farrell (Inactive) added a comment - The attached patch attempts to resolve the issue by special-casing "..". A special, alternate length for ".." is calculated, which does not include the data section. When a dotdot entry is identified, the space-checking code first checks whether there is sufficient space for the data section; if there is not, it then checks for space for the special alternate length. This guarantees ".." will be placed on top of the pre-existing ".." entry, even when there is no additional space for the FID.

            The result of this space check is recorded and is used to determine whether or not to write the data section.
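The two-step space check described in this comment can be sketched as follows. This is a simplified Python model of the decision only, not the patch itself. `choose_dotdot_length` and its parameters are hypothetical names, and the byte counts in the usage lines are assumed values for a bare ".." (12) and a ".." carrying a FID (28).

```python
from typing import Optional, Tuple


def choose_dotdot_length(free_space: int,
                         full_len: int,
                         alt_len: int) -> Optional[Tuple[int, bool]]:
    """Decide how to write ".." into a slot offering `free_space` bytes.

    Returns (length_to_use, write_data_section), or None if even the
    alternate length does not fit.
    """
    if free_space >= full_len:
        return full_len, True     # room for the name and the data section
    if free_space >= alt_len:
        return alt_len, False     # reuse the slot, skip the data section
    return None


# Assumed sizes for illustration only.
print(choose_dotdot_length(12, full_len=28, alt_len=12))  # (12, False)
print(choose_dotdot_length(32, full_len=28, alt_len=12))  # (28, True)
```

The boolean in the result plays the role of the recorded outcome that later decides whether the data section is written.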

            paf Patrick Farrell (Inactive) added a comment - Patch file & resulting namei.c from ldiskfs "make" of current master+this patch.

            paf Patrick Farrell (Inactive) added a comment - This problem can be reproduced by formatting a file system under 1.8 (or, probably, earlier versions of 2.x), creating a directory with at least one file in it, stopping the file system and adding the dirdata attribute to the MDT, then starting the same file system with 2.4 or newer (the bug exists in master as well) and moving that directory to a new location.

            Running fsck will show errors similar to those reported in LU-2638.

            This is an example of a directory block AFTER the problem has occurred. Note the first entry, ".". Its length (24 decimal) includes the old ".." entry, which occupies the second 12 bytes but is not read because it lies inside the rec_len of the first entry. Looking further down the directory block, we see a normal file dentry, and after that the new ".." entry (look for '2e2e'), which includes the FID of the new parent. (It is followed by other file dentries.)
            debugfs: bd 237582
            0000 b782 0300 1800 0102 2e00 0000 8f8b cb02 ................
                 ^^^^^^^^^ ^^   ^^   ^^
                 inode     |    |    "."
                           |    namelen
                           reclen
            0020 0c00 0202 2e2e 0000 b982 0300 2400 0812 ............$...
                                     ^^^^^^^^^ ^^   ^^
                                     inode     |    |
                                               |    namelen
                                               reclen
            0040 4361 7461 6c79 7374 0011 0000 0000 0003 Catalyst........
                 ^^^^^^^^^^^^^^^^^^^
                 "Catalyst"
            0060 82b9 2475 25ab 0000 0000 7374 1f80 0702 ..$u%.....st....
                                               ^^^^^^^^^
                                               inode
            0100 2000 0212 2e2e 0011 0000 0002 0000 6dd8  .............m.
                 ^^   ^^   ^^^^   ^^ ^^^^^^^^^^^^^^^^^^^
                 |    |    ".."  LEN new parent FID
                 |    namelen
                 reclen
            0120 0001 9551 0000 0000 6500 0000 bc82 0300 ...Q....e.......
                 ^^^^^^^^^^^^^^^^^^^
                 new parent FID (continued)
            0140 2800 0e11 434d 616b 654c 6973 7473 2e74 (...CMakeLists.t
            ...
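To see how a reader that follows rec_len views this block, the bytes from the dump can be walked programmatically. The sketch below is a Python illustration, not a Lustre or e2fsprogs tool. It assumes the standard little-endian ext4 entry header (4-byte inode, 2-byte rec_len, 1-byte name_len, 1-byte file type) and uses only the seven 16-byte rows visible in the dump. The left-hand offsets in the dump advance by 20 per row, so they appear to be octal. `walk` is a hypothetical helper name.

```python
import struct

# The 112 bytes visible in the debugfs dump (seven rows of 16 bytes).
BLOCK = bytes.fromhex(
    "b782 0300 1800 0102 2e00 0000 8f8b cb02"
    "0c00 0202 2e2e 0000 b982 0300 2400 0812"
    "4361 7461 6c79 7374 0011 0000 0000 0003"
    "82b9 2475 25ab 0000 0000 7374 1f80 0702"
    "2000 0212 2e2e 0011 0000 0002 0000 6dd8"
    "0001 9551 0000 0000 6500 0000 bc82 0300"
    "2800 0e11 434d 616b 654c 6973 7473 2e74"
)


def walk(block):
    """Follow rec_len from entry to entry, as a directory reader would."""
    entries, off = [], 0
    while off + 8 <= len(block):
        inode, rec_len, name_len, ftype = struct.unpack_from("<IHBB", block, off)
        if rec_len < 8 or off + rec_len > len(block):
            break  # entry runs past the bytes shown in the dump
        name = block[off + 8: off + 8 + name_len].decode("ascii")
        entries.append((off, rec_len, name, hex(ftype)))
        off += rec_len
    return entries


for entry in walk(BLOCK):
    print(entry)

# The old ".." is still physically present at byte 12, but it sits inside
# the 24-byte rec_len of "." and is therefore never visited by walk().
print(struct.unpack_from("<IHBB", BLOCK, 12), BLOCK[20:22])
```

The walk visits ".", then "Catalyst", then the new ".." at byte offset 60, which is the ordering fsck objects to. The walk stops before the fourth entry because its rec_len extends beyond the bytes included in the dump.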

            People

              bogl Bob Glossman (Inactive)
              paf Patrick Farrell (Inactive)