[LU-11922] mkfs.lustre in 1.44.3.wc1 causes corruption if 'metadata_csum' option enabled Created: 05/Feb/19 Updated: 05/Oct/20 Resolved: 16/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12, ldiskfs, nasa | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
# mkfs.lustre --mgs --fsname=lustre /dev/sda # mkfs.lustre --ost --servicenode=172.16.251.20@o2ib --mgsnode=172.16.251.20@o2ib --fsname=lustre --index=0 --reformat --mkfsoptions='-text4' /dev/sdc # mount -t lustre /dev/sda /tmp/mgs/ # mount -t lustre /dev/sdc /tmp/ost0/ Feb 5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): file extents enabled, maximum tree depth=5
Feb 5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Feb 5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): file extents enabled, maximum tree depth=5
Feb 5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
Feb 5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs error (device sdc): htree_dirblock_to_tree:1278: inode #11272193: block 721437184:
comm mount.lustre: bad entry in directory: rec_len is too small for name_len - offset=4084(4084), inode=0, rec_len=12, name_len=0
Feb 5 06:08:59 ai200-7f94-vm00 kernel: Aborting journal on device sdc-8.
Feb 5 06:08:59 ai200-7f94-vm00 kernel: LDISKFS-fs (sdc): Remounting filesystem read-only
Feb 5 06:09:04 ai200-7f94-vm00 kernel: LDISKFS-fs warning (device sdc): kmmpd:187: kmmpd being stopped since filesystem has been remounted as readonly.
Without '-t ext4', it was no problem. |
| Comments |
| Comment by Andreas Dilger [ 05/Feb/19 ] |
|
I'm able to reproduce this locally. It looks like it is caused by the use of metadata_csum, which is inherited from /etc/mke2fs.conf when "-t ext4" is enabled. Removing metadata_csum from /etc/mke2fs.conf avoids the problem. formatting backing filesystem ldiskfs on /dev/loop0
target name testfs:OST0000
kilobytes 400000
options -t ext4 -I 512 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg
-G 256 -E resize="4290772992",lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L testfs:OST0000 -t ext4 -I 512 -q
-O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F /dev/loop0 400000k
Filesystem volume name: testfs:OST0000
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery extent 64bit flex_bg sparse_super large_file
huge_file dir_nlink extra_isize quota metadata_csum
Checksum type: crc32c
Checksum: 0x4ae324f9
The later "-O" features that mkfs.lustre enables to not disable the default features that /etc/mke2fs.conf enables. I suspect it is some kind of bad interaction between the dir_data feature (even though it is not enabled on the OST) and the metadata_csum feature. Using metadata_csum stores a checksum in a fake directory entry at the end of each directory block. The dirdata feature tries to reserve enough space in the dirent to store extra FID data, so my guess is that htree_dirblock_to_tree() is trying to get enough space for the dirdata field, even though this feature is disabled on the OST. Looking at the code more closely, it definitely seems that this is the case:
/*
* This is a bogus directory entry at the end of each leaf block that
* records checksums.
*/
struct ext4_dir_entry_tail {
__le32 det_reserved_zero1; /* Pretend to be unused */
__le16 det_rec_len; /* 12 */
__u8 det_reserved_zero2; /* Zero name length */
__u8 det_reserved_ft; /* 0xDE, fake file type */
__le32 det_checksum; /* crc32c(uuid+inum+dirblock) */
};
Sadly, it seems that setting det_reserved_ft = 0xde is causing EXT4_DIR_REC_LEN(de)->ext4_get_dirent_data_len() to think there are extra dirdata records in the directory, adding bogus record lengths onto the dirent and triggering this error. This is quite problematic, because the metadata_csum feature is already the default for e2fsprogs-1.44, and the kernel code using det_reserved_ft = 0xde has been in ext4 since commit v3.4-rc5-2-ge61539189606, so there is no way to fix this to work transparently with dirdata. I think a few options exist:
|
| Comment by Dongyang Li [ 05/Feb/19 ] |
|
I like the last one which allows us to enable metadata_csum on the targets. and yes only checking if file_type is EXT4_FT_DIR_CSUM is not enough. however within ext4_get_dirent_data_len() we can not figure out if the de is the last in the block, we need the beginning of the de block to determine that. we can check other things instead, if it is indeed a ext4_dir_entry_tail, then the inode number should be 0, also the name len. and the rec_len should be sizeof(struct ext4_dir_entry_tail) as well as the file_type should be 0xde. I think this is enough for us to return 0 in ext4_get_dirent_data_len(). |
| Comment by Dongyang Li [ 06/Feb/19 ] |
|
I made the changes mentioned in the previous comment, the targets can be mounted now. however when I started mdtest from the client it ran into problems: [439597.196196] LDISKFS-fs error (device vdb): ldiskfs_dx_csum_set:473: inode #524298: comm mdt00_003: dir seems corrupt? Run e2fsck -D. [439597.199074] Aborting journal on device vdb-8. [439597.200800] LDISKFS-fs (vdb): Remounting filesystem read-only [439597.202036] LDISKFS-fs error (device vdb) in ldiskfs_reserve_inode_write:5313: Journal has aborted [439597.203684] LDISKFS-fs error (device vdb) in iam_txn_add:575: Journal has aborted [439597.204777] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 [439597.205744] LustreError: 14478:0:(osd_io.c:2008:osd_ldiskfs_write_record()) Skipped 1 previous similar message [439597.206623] LustreError: 14478:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff8ecfd7315400 [439597.207585] LustreError: 14478:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) testfs-MDT0000: can't update reply_data file: rc = -30 [439597.209803] LustreError: 14478:0:(osd_handler.c:2008:osd_trans_stop()) testfs-MDT0000: failed in transaction hook: rc = -30 [439597.211020] LustreError: 14478:0:(osd_handler.c:2018:osd_trans_stop()) testfs-MDT0000: failed to stop transaction: rc = -30 [439597.211149] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) transaction @0xffff8ecfdfb52700 commit error: 2 [439597.211151] LustreError: 14367:0:(osd_handler.c:1708:osd_trans_commit_cb()) Skipped 183 previous similar messages and mdtest on a ldiskfs mount is showing some different errors: [450387.326280] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) [450485.679052] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum. Please run e2fsck -D. [450485.679055] LDISKFS-fs warning (device loop0): warn_no_space_for_csum:336: no space in directory inode 524290 leaf for checksum. Please run e2fsck -D. [450485.679058] LDISKFS-fs error (device loop0): __ldiskfs_read_dirblock:1110: inode #524290: block 508: comm mdtest: Directory index failed checksum I need to look further into it. |
| Comment by Mahmoud Hanafi [ 08/Feb/19 ] |
|
Just ran into to this on a newly created filesystem. Lustre2.12 and e2fsprogs-1.44.5.wc1-0.el7.x86_64. Removing the metadata_csum from the /etc/mke2fs.conf the workaround? |
| Comment by Gerrit Updater [ 09/Feb/19 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/34219 |
| Comment by Dongyang Li [ 09/Feb/19 ] |
|
We need to do more testing on this, especially performance tests. I've just done some simple mdtest. @Mahmoud correct, for now you can removing metadata_csum from /etc/mke2fs.conf. |
| Comment by Gerrit Updater [ 15/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34219/ |
| Comment by Peter Jones [ 16/Mar/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 19/Aug/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35833 |
| Comment by Andreas Dilger [ 28/May/20 ] |
|
Mahmoud, could you please comment on how you are seeing the metadata_csum feature being enabled for your filesystem? Is it possible that you are supplying additional formatting options to mkfs.lustre that would enable metadata_csum? Lustre does not enable this feature in mkfs.lustre, since this feature has never been tested, and the patch here is meant only to fix an obvious bug in the combination of metadata_csum and dirdata, but is not in any way an endorsement of the use of metadata_csum. The use of metadata checksums is surprisingly less useful for ldiskfs than it is for e.g. ZFS, because there is no backup copy of the metadata that can be used to recover from checksum errors, as there is in ZFS. In the case where a checksum error is hit by the mounted filesystem, the best that it can do is report an error and make the filesystem read-only, and e2fsck only has the option of recalculating the checksum based on the current metadata contents, or considering the metadata corrupt and discarding the inode/block/directory entirely. In essence, recomputing the checksum is no better than the kernel just ignoring the bad checksum and continuing on to use the metadata as-is, except with a system outage in the middle. Since ldiskfs already validates metadata content (since metadata_csum is only a recent addition), it can already typically determine whether the content is corrupted. ZFS on the other hand blindly assumes that if the block checksum is correct that the contents must be valid, and will happily use bad data within the block (e.g. dereference index values read from disk that exceed the bounds of an array). |
| Comment by Mahmoud Hanafi [ 29/May/20 ] |
|
The format option was getting picked up from /etc/mke2fs.conf. I just removed the option from the file as workaround. |
| Comment by Andreas Dilger [ 29/May/20 ] |
|
Sure, I understand that the option was coming from /etc/mke2fs.conf. My question is why was metadata_csum taken from /etc/mke2fs.conf? mkfs.lustre doesn't specify any options to mke2fs that will normally cause this feature to be used. Did you have a custom mke2fs.conf file that added it to the [defaults] section, or specify some extra option like "mkfs.lustre --mkfsoptions='-t ext4'" that took this option from the "[fstypes].ext4" section, as was reported in this ticket originally? Do you have the output from mkfs.lustre that shows the command-line options for mke2fs? |
| Comment by Mahmoud Hanafi [ 29/May/20 ] |
|
Oh yes we do pass "-t ext4" during our format. We use csv with lustre_configure. Here is an example. (IP address are blocked out.) service401-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_1-MGS0,/mnt/lustre/nbp10_1-MGS0,mgs,"nbp10",x.x.x.x@o2ib:x.x.x.x@o2ib,,,"-m 0 ","errors=panic,user_xattr,max_sectors_kb=0",x.x.x.159@o2ib:x.x.x.x@o2ib service401-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_1-MDT0,/mnt/lustre/nbp10_1-MDT0,mdt,nbp10,"x.x.x.x@o2ib:x.x.x.x@o2ib",0,,"-m 0 -N 200000000 -t ext4","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib service403-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_2-MDT1,/mnt/lustre/nbp10_2-MDT1,mdt,nbp10,"x.x.x.x@o2ib:x.x.x.x@o2ib",1,,"-m 0 -N 200000000 -t ext4","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib service401-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_1-OST0,/mnt/lustre/nbp10_1-OST0,ost,nbp10,"x.x.x.x@o2ib:x.x.x.x@o2ib",0,,"-m 0 -N 34000000 -t ext4 -E packed_meta_blocks=1","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib service403-ib1,"options lnet networks=o2ib(ib1)",/dev/mapper/nbp10_2-OST1,/mnt/lustre/nbp10_2-OST1,ost,nbp10,"x.x.x.x@o2ib:10.x.26.x@o2ib",1,,"-m 0 -N 34000000 -t ext4 -E packed_meta_blocks=1","acl,errors=panic,user_xattr,max_sectors_kb=0",x.x.x.x@o2ib:x.x.x.x@o2ib |
| Comment by Andreas Dilger [ 30/May/20 ] |
|
What is the reason for adding "-t ext4"? That enables variable and untested features to the filesystem, as shown by this ticket, and is not required AFAIK. |