[LU-11922] mkfs.lustre in 1.44.3.wc1 causes corruption if 'metadata_csum' option enabled Created: 05/Feb/19  Updated: 05/Oct/20  Resolved: 16/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: LTS12, ldiskfs, nasa

Issue Links:
Related
is related to LU-12196 mkfs.lustre should handle large MDTs ... Resolved
is related to LU-12827 sanity: test 900 read-only filesystem... Reopened
is related to LU-13650 ldiskfs: enable metadata_csum on MDT/OST Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
# mkfs.lustre --mgs --fsname=lustre /dev/sda 
# mkfs.lustre --ost --servicenode=172.16.251.20@o2ib --mgsnode=172.16.251.20@o2ib --fsname=lustre --index=0 --reformat --mkfsoptions='-text4' /dev/sdc 

# mount -t lustre /dev/sda /tmp/mgs/
# mount -t lustre /dev/sdc /tmp/ost0/
Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): file extents enabled, maximum tree depth=5
Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): file extents enabled, maximum tree depth=5
Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs error (device sdc): htree_dirblock_to_tree:1278: inode #11272193: block 721437184:
         comm mount.lustre: bad entry in directory: rec_len is too small for name_len - offset=4084(4084), inode=0, rec_len=12, name_len=0
Feb  5 06:08:59 ai200-7f94-vm00 kernel: Aborting journal on device sdc-8.
Feb  5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): Remounting filesystem read-only
Feb  5 06:09:04 ai200-7f94-vm00 kernel: LDISKFS-fs warning (device sdc): kmmpd:187: kmmpd being stopped since filesystem has been remounted as readonly.

Without '-t ext4', it was no problem.
Also, 1.42.13.wc6 didn't cause problem even with '-t ext4'. it seems lustre version doesn't matter. hit problem with 2.10, 2.12 and master.



 Comments   
Comment by Andreas Dilger [ 05/Feb/19 ]

I'm able to reproduce this locally. It looks like it is caused by the use of metadata_csum, which is inherited from /etc/mke2fs.conf when "-t ext4" is enabled. Removing metadata_csum from /etc/mke2fs.conf avoids the problem.

formatting backing filesystem ldiskfs on /dev/loop0
        target name   testfs:OST0000
        kilobytes     400000
        options       -t ext4 -I 512 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg
                      -G 256 -E resize="4290772992",lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L testfs:OST0000 -t ext4 -I 512 -q
           -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F /dev/loop0 400000k

Filesystem volume name:   testfs:OST0000
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype
                   needs_recovery extent 64bit flex_bg sparse_super large_file
                   huge_file dir_nlink extra_isize quota metadata_csum
Checksum type:            crc32c
Checksum:                 0x4ae324f9

The later "-O" features that mkfs.lustre enables to not disable the default features that /etc/mke2fs.conf enables.

I suspect it is some kind of bad interaction between the dir_data feature (even though it is not enabled on the OST) and the metadata_csum feature. Using metadata_csum stores a checksum in a fake directory entry at the end of each directory block. The dirdata feature tries to reserve enough space in the dirent to store extra FID data, so my guess is that htree_dirblock_to_tree() is trying to get enough space for the dirdata field, even though this feature is disabled on the OST.

Looking at the code more closely, it definitely seems that this is the case:

/*      
 * This is a bogus directory entry at the end of each leaf block that
 * records checksums.
 */
struct ext4_dir_entry_tail {
        __le32  det_reserved_zero1;     /* Pretend to be unused */
        __le16  det_rec_len;            /* 12 */
        __u8    det_reserved_zero2;     /* Zero name length */
        __u8    det_reserved_ft;        /* 0xDE, fake file type */
        __le32  det_checksum;           /* crc32c(uuid+inum+dirblock) */
};

Sadly, it seems that setting det_reserved_ft = 0xde is causing EXT4_DIR_REC_LEN(de)->ext4_get_dirent_data_len() to think there are extra dirdata records in the directory, adding bogus record lengths onto the dirent and triggering this error. This is quite problematic, because the metadata_csum feature is already the default for e2fsprogs-1.44, and the kernel code using det_reserved_ft = 0xde has been in ext4 since commit v3.4-rc5-2-ge61539189606, so there is no way to fix this to work transparently with dirdata.

I think a few options exist:

  • only check ext4_get_dirent_data_len() if the dirdata feature is enabled, which would probably be smart, but doesn't fix this problem in the future when we actually want metadata_csum enabled. This is also hard to implement without passing the superblock pointer to EXT4_DIR_REC_LEN(), or masking off the high bits of de->file_type in the caller (which should be zero if this feature is disabled, but will break metadata_csum?)
  • check in ext4_get_dirent_data_len() for the case de->file_type == EXT4_FT_DIR_CSUM and return 0 immediately. This would prevent other features from being enabled in this fake dirent, but that in itself is OK... but...
  • There may be a problem at some point in the future if there really is a filetype E and all the dirdata flags D are set (0x80, 0x40, 0x10 = EXT4_DIRENT_LUFID), so a simple check EXT4_FT_DIR_CSUM may get confused, so only skip this filetype if it is the last entry in the directory block.
Comment by Dongyang Li [ 05/Feb/19 ]

I like the last one which allows us to enable metadata_csum on the targets.

and yes only checking if file_type is EXT4_FT_DIR_CSUM is not enough.

however within ext4_get_dirent_data_len() we can not figure out if the de is the last in the block, we need the beginning of the de block to determine that.

we can check other things instead, if it is indeed a ext4_dir_entry_tail, then the inode number should be 0, also the name len. and the rec_len should be sizeof(struct ext4_dir_entry_tail) as well as the file_type should be 0xde. I think this is enough for us to return 0 in ext4_get_dirent_data_len().

Comment by Dongyang Li [ 06/Feb/19 ]

I made the changes mentioned in the previous comment, the targets can be mounted now.

however when I started mdtest from the client it ran into problems:

[439597.196196] LDISKFS-fs error (device vdb): ldiskfs_dx_csum_set:473: inode #524298: comm mdt00_003: dir seems corrupt?  Run e2fsck -D.
[439597.199074] Aborting journal on device vdb-8.
[439597.200800] LDISKFS-fs (vdb): Remounting filesystem read-only
[439597.202036] LDISKFS-fs error (device vdb) in ldiskfs_reserve_inode_write:5313: Journal has aborted
[439597.203684] LDISKFS-fs error (device vdb) in iam_txn_add:575: Journal has aborted
[439597.204777] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
[439597.205744] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) Skipped 1 previous similar message
[439597.206623] LustreError: 14478:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff8ecfd7315400
[439597.207585] LustreError: 14478:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) testfs-MDT0000: can't update reply_data file: rc = -30
[439597.209803] LustreError: 14478:0:(osd_handler.c:2008:osd_trans_stop()) testfs-MDT0000: failed in transaction hook: rc = -30
[439597.211020] LustreError: 14478:0:(osd_handler.c:2018:osd_trans_stop()) testfs-MDT0000: failed to stop transaction: rc = -30
[439597.211149] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) transaction @0xffff8ecfdfb52700 commit error: 2
[439597.211151] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) Skipped 183 previous similar messages

and mdtest on a ldiskfs mount is showing some different errors:

[450387.326280] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[450485.679052] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum.  Please run e2fsck -D.
[450485.679055] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum.  Please run e2fsck -D.
[450485.679058] LDISKFS-fs error (device loop0): __ldiskfs_read_dirblock:1110: inode #524290: block 508: comm mdtest: Directory index failed checksum

I need to look further into it.

Comment by Mahmoud Hanafi [ 08/Feb/19 ]

Just ran into to this on a newly created filesystem. Lustre2.12 and e2fsprogs-1.44.5.wc1-0.el7.x86_64. Removing the metadata_csum from the /etc/mke2fs.conf the workaround?

Comment by Gerrit Updater [ 09/Feb/19 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/34219
Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 466a6c245551fde0b414956e236a0a383a39d3c0

Comment by Dongyang Li [ 09/Feb/19 ]

We need to do more testing on this, especially performance tests. I've just done some simple mdtest.

@Mahmoud correct, for now you can removing metadata_csum from /etc/mke2fs.conf.

Comment by Gerrit Updater [ 15/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34219/
Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ec7a166a498be607c3882ff11e98b625839e69d0

Comment by Peter Jones [ 16/Mar/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 19/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35833
Subject: LU-11922 ldiskfs: make dirdata work with metadata_csum
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 12e2fe9da3e7f80177070a9e5f2906ea5d5cd4f1

Comment by Andreas Dilger [ 28/May/20 ]

Mahmoud, could you please comment on how you are seeing the metadata_csum feature being enabled for your filesystem?  Is it possible that you are supplying additional formatting options to mkfs.lustre that would enable metadata_csum?

Lustre does not enable this feature in mkfs.lustre, since this feature has never been tested, and the patch here is meant only to fix an obvious bug in the combination of metadata_csum and dirdata, but is not in any way an endorsement of the use of metadata_csum. The use of metadata checksums is surprisingly less useful for ldiskfs than it is for e.g. ZFS, because there is no backup copy of the metadata that can be used to recover from checksum errors, as there is in ZFS.

In the case where a checksum error is hit by the mounted filesystem, the best that it can do is report an error and make the filesystem read-only, and e2fsck only has the option of recalculating the checksum based on the current metadata contents, or considering the metadata corrupt and discarding the inode/block/directory entirely. In essence, recomputing the checksum is no better than the kernel just ignoring the bad checksum and continuing on to use the metadata as-is, except with a system outage in the middle. Since ldiskfs already validates metadata content (since metadata_csum is only a recent addition), it can already typically determine whether the content is corrupted. ZFS on the other hand blindly assumes that if the block checksum is correct that the contents must be valid, and will happily use bad data within the block (e.g. dereference index values read from disk that exceed the bounds of an array).

Comment by Mahmoud Hanafi [ 29/May/20 ]

The format option was getting picked up from /etc/mke2fs.conf. I just removed the option from the file as workaround.

Comment by Andreas Dilger [ 29/May/20 ]

Sure, I understand that the option was coming from /etc/mke2fs.conf.

My question is why was metadata_csum taken from /etc/mke2fs.conf? mkfs.lustre doesn't specify any options to mke2fs that will normally cause this feature to be used. Did you have a custom mke2fs.conf file that added it to the [defaults] section, or specify some extra option like "mkfs.lustre --mkfsoptions='-t ext4'" that took this option from the "[fstypes].ext4" section, as was reported in this ticket originally? Do you have the output from mkfs.lustre that shows the command-line options for mke2fs?

Comment by Mahmoud Hanafi [ 29/May/20 ]

Oh yes we do pass "-t ext4" during our format. We use csv with lustre_configure. Here is an example. (IP address are blocked out.)

service401-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_1-MGS0,/mnt/lustre/nbp10_1-MGS0,mgs,"nbp10",x.x.x.x@o2ib:x.x.x.x@o2ib,,,"-m 0 ","errors=panic,user_xattr,max_sectors_kb=0",x.x.x.159@o2ib:x.x.x.x@o2ib
service401-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_1-MDT0,/mnt/lustre/nbp10_1-MDT0,mdt,nbp10,"x.x.x.x@o2ib:x.x.x.x@o2ib",0,,"-m 0 -N 200000000 -t ext4","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib
service403-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_2-MDT1,/mnt/lustre/nbp10_2-MDT1,mdt,nbp10,"x.x.x.x@o2ib:x.x.x.x@o2ib",1,,"-m 0 -N 200000000 -t ext4","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib
service401-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_1-OST0,/mnt/lustre/nbp10_1-OST0,ost,nbp10,"x.x.x.x@o2ib:x.x.x.x@o2ib",0,,"-m 0 -N 34000000  -t ext4 -E packed_meta_blocks=1","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib
service403-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_2-OST1,/mnt/lustre/nbp10_2-OST1,ost,nbp10,"x.x.x.x@o2ib:10.x.26.x@o2ib",1,,"-m 0 -N 34000000  -t ext4 -E packed_meta_blocks=1","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib
Comment by Andreas Dilger [ 30/May/20 ]

What is the reason for adding "-t ext4"? That enables variable and untested features to the filesystem, as shown by this ticket, and is not required AFAIK.

Generated at Sat Feb 10 02:48:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.