[LU-16082] old-style Lustre EA inode support is broken in newer kernels Created: 09/Aug/22 Updated: 13/Jan/23 Resolved: 22/Nov/22
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The system, upgraded from Lustre/rhel7.6/kernel-3.10, experiences the following problems: files are inaccessible with a "Structure needs cleaning" error, and the kernel log on the MDS is full of:

[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430381: comm mdt04_005: EA inode hash validation failed
[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430355: comm mdt04_032: EA inode hash validation failed
[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430357: comm mdt04_013: EA inode hash validation failed
[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430383: comm mdt07_024: EA inode hash validation failed
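For context, the warning above is emitted from the EA inode hash verification path in ext4_xattr_inode_get() (ldiskfs_xattr_inode_get() in the ldiskfs build). The sketch below paraphrases the 4.18-era upstream fs/ext4/xattr.c; it is an approximation of that code, not the exact ldiskfs source behind line 497:

	/*
	 * Paraphrased from ext4_xattr_inode_get() in the 4.18-era upstream
	 * fs/ext4/xattr.c; an approximation, not the exact ldiskfs source.
	 */
	if (!ext4_test_inode_state(ea_inode, EXT4_STATE_LUSTRE_EA_INODE)) {
		/*
		 * Verify that the hash stored in the EA inode matches the
		 * hash of the value just read.  Old-style Lustre EA inodes
		 * store 0 there, so if the inode was not recognized as a
		 * Lustre EA inode earlier, this check fails and the xattr
		 * becomes unreadable.
		 */
		err = ext4_xattr_inode_verify_hashes(ea_inode, NULL,
						     buffer, size);
		if (err) {
			ext4_warning_inode(ea_inode,
					   "EA inode hash validation failed");
			goto out;
		}
	}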
| Comments |
| Comment by Alexander Zarochentsev [ 09/Aug/22 ] |
|
Reproducing the issue: create a /foo file with a user.fooattr xattr on a system running rhel7.6:

[root@devvm1 tests]# setfattr -n user.fooattr -v $(perl -e 'print "x" x 5000') /mnt/lustre/foo
[root@devvm1 tests]# sync
[root@devvm1 tests]# debugfs -c /dev/mapper/
centos-root  centos-swap  control  mds1_flakey  mds2_flakey  ost1_flakey  ost2_flakey
[root@devvm1 tests]# debugfs -c /dev/mapper/mds1_flakey
debugfs 1.46.2.wc1 (28-Feb-2021)
/dev/mapper/mds1_flakey: catastrophic mode - not reading inode or group bitmaps
debugfs: stat /ROOT/foo
Inode: 184   Type: regular   Mode: 0644   Flags: 0x0
Generation: 4249414524   Version: 0x00000001:00000008
User: 0   Group: 0   Project: 0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment: Address: 0   Number: 0   Size: 0
 ctime: 0x62f15799:00000000 -- Mon Aug 8 21:36:09 2022
 atime: 0x62f15696:00000000 -- Mon Aug 8 21:31:50 2022
 mtime: 0x62f15696:00000000 -- Mon Aug 8 21:31:50 2022
crtime: 0x62f15696:c358e084 -- Mon Aug 8 21:31:50 2022
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x200000402:0x1:0x0] compat=0 incompat=0
  trusted.lov (56) = d0 0b d1 0b 01 00 00 00 01 00 00 00 00 00 00 00 02 04 00 00 02 00 00 00 00 00 10 00 01 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
  linkea: idx=0 parent=[0x200000007:0x1:0x0] name='foo'
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  inode <185> user.fooattr (5000)
BLOCKS:
debugfs: stat <185>
Inode: 185   Type: regular   Mode: 0600   Flags: 0x200000
Generation: 4249414524   Version: 0x00000000:00000000
User: 0   Group: 0   Project: 0   Size: 5000
File ACL: 0
Links: 1   Blockcount: 16
Fragment: Address: 0   Number: 0   Size: 0
 ctime: 0x00000000:e7ec0af4 -- Thu Jan 1 03:00:00 1970
 atime: 0x00000000:e7ec0af4 -- Thu Jan 1 03:00:00 1970
 mtime: 0x000000b8:e7ec0af4 -- Thu Jan 1 03:03:04 1970
crtime: 0x62f15799:e7ec0af4 -- Mon Aug 8 21:36:09 2022
Size of extra inode fields: 32
BLOCKS:
(0-1):32590-32591
TOTAL: 2
debugfs: inode_dump <184>
0000  a481 0000 0000 0000 9656 f162 9957 f162  .........V.b.W.b
0020  9656 f162 0000 0000 0000 0100 0000 0000  .V.b............
0040  0000 0000 0800 0000 0000 0000 0000 0000  ................
0060  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 7ceb 48fd 0000 0000 0000 0000  ....|.H.........
0160  0000 0000 0000 0000 0000 0000 0000 0000  ................
0200  2000 0000 0000 0000 0000 0000 0000 0000   ...............
0220  9656 f162 84e0 58c3 0100 0000 0000 0000  .V.b..X.........
0240  0000 02ea 0304 4403 0000 0000 1800 0000  ......D.........
0260  0000 0000 6c6d 6100 0304 0c03 0000 0000  ....lma.........
0300  3800 0000 0000 0000 6c6f 7600 0404 dc02  8.......lov.....
0320  0000 0000 2d00 0000 0000 0000 6c69 6e6b  ....-.......link
0340  0304 c402 0000 0000 1800 0000 0000 0000  ................
0360  736f 6d00 0701 0000 b900 0000 8813 0000  som.............
0400  0000 0000 666f 6f61 7474 7200 0000 0000  ....fooattr.....
0420  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
1540  0000 0000 0000 0000 0400 0000 0000 0000  ................
1560  0000 0000 0000 0000 0000 0000 0000 0000  ................
1600  dff1 ea11 0100 0000 2d00 0000 0000 0000  ........-.......
1620  0000 0000 0000 0000 0015 0000 0002 0000  ................
1640  0007 0000 0001 0000 0000 666f 6f00 0000  ..........foo...
1660  d00b d10b 0100 0000 0100 0000 0000 0000  ................
1700  0204 0000 0200 0000 0000 1000 0100 0000  ................
1720  0200 0000 0000 0000 0000 0000 0000 0000  ................
1740  0000 0000 0100 0000 0000 0000 0000 0000  ................
1760  0204 0000 0200 0000 0100 0000 0000 0000  ................
debugfs:
[root@devvm1 tests]#

Please note that xattr_entry->e_hash and ea_inode->i_atime.tv_sec are both 0.

After copying the mdt1 image to a system with kernel-4.18 and mounting the image as ldiskfs:

[root@devvm3 lustre-wc-rel]# getfattr -n user.fooattr /mnt/ROOT/foo
/mnt/ROOT/foo: user.fooattr: Structure needs cleaning
[root@devvm3 lustre-wc-rel]# uname -a
Linux devvm3.localnet 4.18.0-305.10.2.x6.0.24.x86_64 #1 SMP Fri Oct 1 07:51:49 MDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@devvm3 lustre-wc-rel]#
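For reference, this is why the timestamps of EA inode <185> look odd in the dump above: the large-xattr scheme reuses the EA inode's timestamp fields. i_atime.tv_sec holds the hash of the xattr value (0 here), and the Lustre backpointer is the parent inode number in i_mtime.tv_sec (0x000000b8 = 184, the parent inode) together with a matching i_generation (4249414524 on both inodes). Paraphrased from the 4.18-era upstream fs/ext4/xattr.c; the ldiskfs names differ only by prefix:

	/* Paraphrased from fs/ext4/xattr.c (4.18-era upstream ext4). */

	/* Lustre backpointer: parent inode number kept in i_mtime.tv_sec. */
	#define EXT4_XATTR_INODE_GET_PARENT(inode) ((__u32)(inode)->i_mtime.tv_sec)

	/*
	 * The hash of the xattr value is kept in i_atime.tv_sec;
	 * old-style Lustre EA inodes wrote 0 here.
	 */
	static u32 ext4_xattr_inode_get_hash(struct inode *ea_inode)
	{
		return (u32)ea_inode->i_atime.tv_sec;
	}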
| Comment by Alexander Zarochentsev [ 09/Aug/22 ] |
|
The culprit is the following code in fs/ext4/xattr.c:
/*
* Check whether this is an old Lustre-style xattr inode. Lustre
* implementation does not have hash validation, rather it has a
* backpointer from ea_inode to the parent inode.
*/
if (ea_inode_hash != ext4_xattr_inode_get_hash(inode) &&
EXT4_XATTR_INODE_GET_PARENT(inode) == parent->i_ino &&
inode->i_generation == parent->i_generation) {
ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
ext4_xattr_inode_set_ref(inode, 1);
} else {
inode_lock(inode);
inode->i_flags |= S_NOQUOTA;
inode_unlock(inode);
}
The old-style Lustre EA support computes the hash as follows:

ldiskfs_xattr_inode_hash(struct ldiskfs_sb_info *sbi, const void *buffer, size_t size)
{
	if (ldiskfs_has_metadata_csum(sbi->s_sb))
		return ldiskfs_chksum(sbi, sbi->s_csum_seed, buffer, size);
	return 0;
}

The metadata_csum feature is not easy to turn on, as it is not compatible with the "dirdata" feature, so assume the hash is 0 in all cases. The xattr_entry->e_hash field is also a 4-byte 0 in most cases (except for xattr entries in an external xattr block). So the check in newer kernels always fails to recognize old-style EA inodes, and access then fails because the hash computed from the buffer and the hash read from inode->i_atime.tv_sec do not match.
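For illustration, a minimal sketch of how that check could be relaxed to also recognize these old-style inodes (both hashes being 0 plus a valid backpointer). This is an editor's sketch of the approach described above, not the merged change; see https://review.whamcloud.com/48174 for the actual patch:

	/*
	 * Sketch only, not the merged ldiskfs patch: an EA inode written
	 * by the old ldiskfs code stores a 0 hash and a 0 entry hash, so
	 * treat a 0/0 hash pair with a matching Lustre backpointer as an
	 * old-style EA inode, in addition to the existing hash-mismatch
	 * case.
	 */
	if ((ea_inode_hash != ext4_xattr_inode_get_hash(inode) ||
	     (ea_inode_hash == 0 &&
	      ext4_xattr_inode_get_hash(inode) == 0)) &&
	    EXT4_XATTR_INODE_GET_PARENT(inode) == parent->i_ino &&
	    inode->i_generation == parent->i_generation) {
		ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
		ext4_xattr_inode_set_ref(inode, 1);
	} else {
		inode_lock(inode);
		inode->i_flags |= S_NOQUOTA;
		inode_unlock(inode);
	}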
| Comment by Gerrit Updater [ 09/Aug/22 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48174 |
| Comment by Cory Spitz [ 09/Aug/22 ] |
|
Note: there was some previous discussion about this in |
| Comment by Andreas Dilger [ 10/Aug/22 ] |
|
Is there something that we can do to fix the existing code to format large xattrs better and/or fix existing xattrs in-place (LFSCK or e2fsck?) to avoid this error? Otherwise we need to patch ldiskfs to fix this issue going forward indefinitely (as long as any filesystem that was ever mounted under el7.9 continues to exist). |
| Comment by Alexander Zarochentsev [ 12/Aug/22 ] |
|
> Is there something that we can do to fix the existing code to format large xattrs better and/or fix existing xattrs in-place (LFSCK or e2fsck?) to avoid this error?

I am only thinking of setting EXT4_STATE_LUSTRE_EA_INODE permanently by an LFSCK/scrub.
| Comment by Andreas Dilger [ 13/Aug/22 ] |
EXT4_STATE_LUSTRE_EA_INODE is an in-memory flag, so it can't be set permanently. That is the purpose of the code being patched - to detect the special case when this flag should be set without consuming an extra on-disk flag for it. I was thinking more of setting the xattr hash properly in the inode, but that would probably be more trouble than it is worth. I think it would be better to keep a patch for ldiskfs, and fix the code in upstream ext4.

The main unanswered question, I think, is what to do for filesystems that have metadata checksums enabled. I was thinking that the current patch needs to be changed so that it isn't broken for filesystems with metadata_csum, but I don't think that would be correct either. The metadata_csum flag is per-filesystem, so checking whether it is enabled/disabled is not helpful for inodes that were written before the feature was enabled.

By the time we implement support for metadata_csum + dirdata in ldiskfs, it will be at least RHEL8 on the server and it will be using the new-format xattr inode, so there is no point in "fixing" these xattrs in-place as long as ldiskfs continues to support the old format properly.
| Comment by Andreas Dilger [ 13/Aug/22 ] |
|
PS: please also add a test case to conf-sanity test_32 that creates a large xattr on a file and verifies that it can be accessed after update, probably using overstriping so that . This should be done with one of the older disk images (2.10 or 2.12) and el7.9 servers so that it uses the old format, and it should be caught by current testing with el8.5 servers. |
| Comment by Gerrit Updater [ 26/Aug/22 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48350 |
| Comment by Alexander Zarochentsev [ 31/Aug/22 ] |
|
Test failure with the patch https://review.whamcloud.com/48350:

== check Large EA ==
/tmp/t32/mnt/lustre/large_xattr_test_dir/large_xattr_file: user.fooattr: Structure needs cleaning
 conf-sanity test_32g: @@@@@@ FAIL: Large EA cannot be read
  Trace dump:
  = ./../tests/test-framework.sh:6505:error_noexit()
  = conf-sanity.sh:2541:t32_test()
  = conf-sanity.sh:2980:test_32g()
  = ./../tests/test-framework.sh:6858:run_one()
  = ./../tests/test-framework.sh:6908:run_one_logged()
  = ./../tests/test-framework.sh:6730:run_test()
  = conf-sanity.sh:2985:main()
Dumping lctl log to /tmp/test_logs/1661958175/conf-sanity.test_32g.*.1661958356.log
Dumping logs only on local client.
== cleanup with rc=0 ==
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48174/ |
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48412 |
| Comment by Gerrit Updater [ 09/Sep/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48496 |
| Comment by Gerrit Updater [ 09/Sep/22 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48498 |
| Comment by Gerrit Updater [ 17/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48496/ |
| Comment by Peter Jones [ 17/Sep/22 ] |
|
Does the patch that just merged replace the earlier ones pushed or are they all needed? |
| Comment by Alexander Zarochentsev [ 17/Sep/22 ] |
|
> Does the patch that just merged replace the earlier ones pushed or are they all needed? |
| Comment by Gerrit Updater [ 20/Sep/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48611 |
| Comment by Gerrit Updater [ 26/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48611/ |
| Comment by Peter Jones [ 22/Nov/22 ] |
|
Test patch still to come, but the code changes are merged. |
| Comment by Gerrit Updater [ 13/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48350/ |