  1. Lustre
  2. LU-17711

ldiskfs corruption on el9 (dx_probe: Corrupt directory)


Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      Hello,

      Opening this for Stanford (thanks @sthiell!).
      We're seeing ldiskfs corruptions on MDTs on 2.15.59_32 / el9.2 servers (also confirmed with 2.15.61_225 on el9.3 with LU-17700 fixes).

      There's a message like this in dmesg:

      Apr 03 22:58:53 elm-rcf-md1-s1 kernel: LDISKFS-fs warning (device dm-1): dx_probe:1138: inode #1259905629: comm mdt_rdpg01_001: dx entry: limit 0 != root limit 509
      Apr 03 22:58:53 elm-rcf-md1-s1 kernel: LDISKFS-fs warning (device dm-1): dx_probe:1289: inode #1259905629: comm mdt_rdpg01_001: Corrupt directory, running e2fsck is recommended

      Running e2fsck as suggested does find actual corruption (this output is from a test system, where the inode it fixes is the same one reported in dmesg):

      # e2fsck -f mdt0
      e2fsck 1.47.0-wc6 (07-Dec-2023)
      Pass 1: Checking inodes, blocks, and sizes
      HTREE directory inode 27370 has an invalid root node.
      Clear HTree index<y>? yes
      Pass 2: Checking directory structure
      Setting filetype for entry '..' in /ROOT/repro/d4 (27370) to 2.
      [rest as normal]
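To compare the inode reported by the kernel with the one e2fsck repairs, the inode numbers can be pulled out of the dx_probe warnings. The following is a minimal sketch, not part of Lustre or e2fsprogs: the function name dx_probe_inodes is hypothetical, and it assumes the warnings keep the "dx_probe:<line>: inode #<number>:" shape shown in the dmesg excerpt above.

```shell
# Hypothetical helper: read kernel log text on stdin and print the unique
# inode numbers that appear in LDISKFS dx_probe warnings.
# Assumes the message format "dx_probe:<line>: inode #<number>:" seen above.
dx_probe_inodes() {
        sed -n 's/.*dx_probe:[0-9]*: inode #\([0-9][0-9]*\):.*/\1/p' | sort -u
}

# Example use on the MDS (assumes the kernel log is readable via dmesg):
#   dmesg | dx_probe_inodes
```

Each number printed can then be matched against the "HTREE directory inode ... has an invalid root node" line in the e2fsck output.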

      We normally hit this with MinIO when uploading a large file and then accessing it, but I've trimmed the reproducer down to a few lines of shell:

      # - remove /mnt/lustre0/foo
      # - create /mnt/lustre0/foo/tmp/subdir and fill it
      # - move it to /mnt/lustre0/foo/subdir
      # - list its contents
      dir="/mnt/lustre0/foo"
      # number of items in subdir -- might need more depending on setup
      count=120
      
      # cleanup
      rm -rf "$dir"
      
      set -e
      
      mkdir -p "$dir/tmp/subdir"
      seq 1 "$count" | while read -r i; do
              touch "$dir/tmp/subdir/part.$i"
      done
      mv "$dir/tmp/subdir" "$dir/"
      ls "$dir/subdir" > /dev/null 

      Since it's an ldiskfs corruption, it obviously doesn't happen with ZFS. It also doesn't happen with el8.5 on the same Lustre version, so the problem would be in the newer el9 ldiskfs (either the newer kernel or the patches).

      I've spent about half a dozen hours investigating this in detail, and it turns out it's caused by the memsets in namei.c introduced upstream in 6c0912739699 ("ext4: wipe ext4_dir_entry2 upon file deletion"). The snippet below "fixes" this:

      diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
      index efac7e276d4e..2085fa384240 100644
      --- a/fs/ext4/namei.c
      +++ b/fs/ext4/namei.c
      @@ -3285,18 +3285,8 @@ int ldiskfs_generic_delete_entry(struct inode *dir,
                                              ldiskfs_rec_len_from_disk(de->rec_len,
                                                                     blocksize),
                                              blocksize);
      -
      -                               /* wipe entire dir_entry */
      -                               memset(de, 0, ldiskfs_rec_len_from_disk(de->rec_len,
      -                                                               blocksize));
                              } else {
      -                               /* wipe dir_entry excluding the rec_len field */
                                      de->inode = 0;
      -                               memset(&de->name_len, 0,
      -                                       ldiskfs_rec_len_from_disk(de->rec_len,
      -                                                               blocksize) -
      -                                       offsetof(struct ldiskfs_dir_entry_2,
      -                                                               name_len));
                              }
       
                              inode_inc_iversion(dir);

      There's probably a better fix (and there's at least one other memset, in dx_move_dirents, that will need adjusting), but for now I've confirmed that this diff makes the problem go away and, conversely, that adding identical memsets to the el8 code produces the same problem.

      From testing, it looks like we can't erase the type either (I assume because of LDISKFS_DIRENT_LUFID in the type?), and some code puts a length in the first byte of the data, so we can't clear that either (I tried clearing inode plus memset(&de->name, 0, rec_len - offsetof(name)), to no avail)... I'll leave that to people who understand ldiskfs better than I do.

      For now I guess we can just revert 6c0912739699 ("ext4: wipe ext4_dir_entry2 upon file deletion"). I'll send a patch adding such a revert to the el9* series if there isn't a better idea.

      Longer term, if someone can tell me what we need to preserve, we can do more precise surgery here, but I'm running out of time for today.

      Thanks!

          People

            dongyang Dongyang Li
            asmadeus Dominique Martinet