[LU-16082] old-style Lustre EA inodes support is broken in newer kernel Created: 09/Aug/22  Updated: 13/Jan/23  Resolved: 22/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Bug Priority: Critical
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Attachments: File test_32newtarball_2.12.0.diff    
Issue Links:
Related
is related to LU-15171 corrupted in-inode xattr Resolved
is related to LU-11643 create disk images for Lustre 2.10 an... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A system upgraded from Lustre/rhel7.6/kernel-3.10 experiences the following problems:

Files are inaccessible with a "Structure needs cleaning" error, and the kernel log on the MDS is full of messages like:

[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430381: comm mdt04_005: EA inode hash validation failed
[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430355: comm mdt04_032: EA inode hash validation failed
[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430357: comm mdt04_013: EA inode hash validation failed
[Tue Aug 2 11:15:32 2022] LDISKFS-fs warning (device md1): ldiskfs_xattr_inode_get:497: inode #4290430383: comm mdt07_024: EA inode hash validation failed


 Comments   
Comment by Alexander Zarochentsev [ 09/Aug/22 ]

Reproducing the issue:

Create a /foo file with a user.fooattr xattr on a system running rhel7.6:

[root@devvm1 tests]# setfattr -n user.fooattr -v $(perl -e 'print "x" x 5000') /mnt/lustre/foo
[root@devvm1 tests]# sync
[root@devvm1 tests]# debugfs -c /dev/mapper/
centos-root  centos-swap  control      mds1_flakey  mds2_flakey  ost1_flakey  ost2_flakey  
[root@devvm1 tests]# debugfs -c /dev/mapper/mds1_flakey 
debugfs 1.46.2.wc1 (28-Feb-2021)
/dev/mapper/mds1_flakey: catastrophic mode - not reading inode or group bitmaps
debugfs:  stat /ROOT/foo
Inode: 184   Type: regular    Mode:  0644   Flags: 0x0
Generation: 4249414524    Version: 0x00000001:00000008
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x62f15799:00000000 -- Mon Aug  8 21:36:09 2022
 atime: 0x62f15696:00000000 -- Mon Aug  8 21:31:50 2022
 mtime: 0x62f15696:00000000 -- Mon Aug  8 21:31:50 2022
crtime: 0x62f15696:c358e084 -- Mon Aug  8 21:31:50 2022
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x200000402:0x1:0x0] compat=0 incompat=0
  trusted.lov (56) = d0 0b d1 0b 01 00 00 00 01 00 00 00 00 00 00 00 02 04 00 00 02 00 00 00 00 00 10 00 01 00 00 00 02 00 00 00 0
0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 
  linkea: idx=0 parent=[0x200000007:0x1:0x0] name='foo'
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
  inode <185> user.fooattr (5000)
BLOCKS:

debugfs:  stat <185>
Inode: 185   Type: regular    Mode:  0600   Flags: 0x200000
Generation: 4249414524    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 5000
File ACL: 0
Links: 1   Blockcount: 16
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x00000000:e7ec0af4 -- Thu Jan  1 03:00:00 1970
 atime: 0x00000000:e7ec0af4 -- Thu Jan  1 03:00:00 1970
 mtime: 0x000000b8:e7ec0af4 -- Thu Jan  1 03:03:04 1970
crtime: 0x62f15799:e7ec0af4 -- Mon Aug  8 21:36:09 2022
Size of extra inode fields: 32
BLOCKS:
(0-1):32590-32591
TOTAL: 2

debugfs:  inode_dump <184>
0000  a481 0000 0000 0000 9656 f162 9957 f162  .........V.b.W.b
0020  9656 f162 0000 0000 0000 0100 0000 0000  .V.b............
0040  0000 0000 0800 0000 0000 0000 0000 0000  ................
0060  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 7ceb 48fd 0000 0000 0000 0000  ....|.H.........
0160  0000 0000 0000 0000 0000 0000 0000 0000  ................
0200  2000 0000 0000 0000 0000 0000 0000 0000   ...............
0220  9656 f162 84e0 58c3 0100 0000 0000 0000  .V.b..X.........
0240  0000 02ea 0304 4403 0000 0000 1800 0000  ......D.........
0260  0000 0000 6c6d 6100 0304 0c03 0000 0000  ....lma.........
0300  3800 0000 0000 0000 6c6f 7600 0404 dc02  8.......lov.....
0320  0000 0000 2d00 0000 0000 0000 6c69 6e6b  ....-.......link
0340  0304 c402 0000 0000 1800 0000 0000 0000  ................
0360  736f 6d00 0701 0000 b900 0000 8813 0000  som.............
0400  0000 0000 666f 6f61 7474 7200 0000 0000  ....fooattr.....
0420  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
1540  0000 0000 0000 0000 0400 0000 0000 0000  ................
1560  0000 0000 0000 0000 0000 0000 0000 0000  ................
1600  dff1 ea11 0100 0000 2d00 0000 0000 0000  ........-.......
1620  0000 0000 0000 0000 0015 0000 0002 0000  ................
1640  0007 0000 0001 0000 0000 666f 6f00 0000  ..........foo...
1660  d00b d10b 0100 0000 0100 0000 0000 0000  ................
1700  0204 0000 0200 0000 0000 1000 0100 0000  ................
1720  0200 0000 0000 0000 0000 0000 0000 0000  ................
1740  0000 0000 0100 0000 0000 0000 0000 0000  ................
1760  0204 0000 0200 0000 0100 0000 0000 0000  ................

debugfs:  [root@devvm1 tests]# 

Please note that xattr_entry->e_hash and ea_inode->i_atime.tv_sec are both 0.

After copying the mdt1 image to a system with kernel-4.18 and mounting the image as ldiskfs:

[root@devvm3 lustre-wc-rel]# getfattr -n user.fooattr /mnt/ROOT/foo 
/mnt/ROOT/foo: user.fooattr: Structure needs cleaning
[root@devvm3 lustre-wc-rel]# uname -a
Linux devvm3.localnet 4.18.0-305.10.2.x6.0.24.x86_64 #1 SMP Fri Oct 1 07:51:49 MDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@devvm3 lustre-wc-rel]# 
Comment by Alexander Zarochentsev [ 09/Aug/22 ]

The culprit is the following code in fs/ext4/xattr.c:

        /*
         * Check whether this is an old Lustre-style xattr inode. Lustre
         * implementation does not have hash validation, rather it has a
         * backpointer from ea_inode to the parent inode.
         */
        if (ea_inode_hash != ext4_xattr_inode_get_hash(inode) &&
            EXT4_XATTR_INODE_GET_PARENT(inode) == parent->i_ino &&
            inode->i_generation == parent->i_generation) {
                ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
                ext4_xattr_inode_set_ref(inode, 1);
        } else {
                inode_lock(inode);
                inode->i_flags |= S_NOQUOTA;
                inode_unlock(inode);
        }

The old-style Lustre EA implementation stores 0 in ea_inode->i_atime.tv_sec, because Lustre's own hash function returns 0 when metadata checksums are disabled:

ldiskfs_xattr_inode_hash(struct ldiskfs_sb_info *sbi, const void *buffer, size_t size)
{
        if (ldiskfs_has_metadata_csum(sbi->s_sb))
                return ldiskfs_chksum(sbi, sbi->s_csum_seed, buffer, size);
        return 0;
}

Enabling metadata checksums is not easy either, as the feature is incompatible with the "dirdata" feature, so the stored hash can be assumed to be 0 in all cases. The xattr_entry->e_hash field is likewise a 4-byte 0 in most cases (except for xattr entries in an external xattr block). As a result, the check in newer kernels always fails to recognize old-style EA inodes: e_hash equals the hash stored in inode->i_atime.tv_sec (both are 0), so the Lustre branch is skipped, and access then fails because the hash computed over the xattr buffer does not match the stored 0.

Comment by Gerrit Updater [ 09/Aug/22 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48174
Subject: LU-16082 ldiskfs: old-style EA inode handling fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dba5481acb9573f89389cb41fe93a62d926259e2

Comment by Cory Spitz [ 09/Aug/22 ]

Note: there was some previous discussion about this in LU-15171, https://jira.whamcloud.com/browse/LU-15171?focusedCommentId=317067&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-317067

Comment by Andreas Dilger [ 10/Aug/22 ]

Is there something that we can do to fix the existing code to format large xattrs better and/or fix existing xattrs in-place (LFSCK or e2fsck?) to avoid this error? Otherwise we need to patch ldiskfs to fix this issue going forward indefinitely (as long as any filesystem continues to exist that was mounted via el7.9).

Comment by Alexander Zarochentsev [ 12/Aug/22 ]

>Is there something that we can do to fix the existing code to format large xattrs better and/or fix existing xattrs in-place (LFSCK or e2fsck?) to avoid this error?

I am only thinking of setting EXT4_STATE_LUSTRE_EA_INODE permanently by an LFSCK/scrub.

Comment by Andreas Dilger [ 13/Aug/22 ]

I am only thinking of setting EXT4_STATE_LUSTRE_EA_INODE permanently by an LFSCK/scrub.

EXT4_STATE_LUSTRE_EA_INODE is an in-memory flag, so it can't be set permanently. That is the purpose of the code being patched - to detect the special case when this flag should be set without consuming an extra on-disk flag for it.

I was thinking more of setting the xattr hash properly in the inode, but that would probably be more trouble than it is worth. I think it would be better to keep a patch for ldiskfs, and fix the code in upstream ext4. The main unanswered question I think is what to do for filesystems that have metadata checksums enabled?

I was thinking that the current patch needs to be changed so that it isn't broken for filesystems with metadata_csum, but I don't think that would be correct either. The metadata csum flag is per-filesystem so checking if it is enabled/disabled is not helpful for inodes that were written before the feature was enabled. By the time we implement support for metadata csum + dirdata in ldiskfs it will be at least RHEL8 on the server and it will be using the new format xattr inode, so there is no point to "fix" these xattrs in-place as long as ldiskfs continues to support these xattrs properly.

Comment by Andreas Dilger [ 13/Aug/22 ]

PS: please also add a test case to conf-sanity test_32 that creates a large xattr on a file and verifies that it can still be accessed after the upgrade, probably using overstriping. This should be done with one of the older disk images (2.10 or 2.12) and el7.9 servers so that it uses the old format, and it should be caught by current testing with el8.5 servers.

Comment by Gerrit Updater [ 26/Aug/22 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48350
Subject: LU-16082 ldiskfs: Large EA upgrade test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5711c926acc327d3d77645485df555a0a42fd390

Comment by Alexander Zarochentsev [ 31/Aug/22 ]

Test failure with the patch https://review.whamcloud.com/48350:

== check Large EA ==
/tmp/t32/mnt/lustre/large_xattr_test_dir/large_xattr_file: user.fooattr: Structure needs cleaning
 conf-sanity test_32g: @@@@@@ FAIL: Large EA cannot be read 
  Trace dump:
  = ./../tests/test-framework.sh:6505:error_noexit()
  = conf-sanity.sh:2541:t32_test()
  = conf-sanity.sh:2980:test_32g()
  = ./../tests/test-framework.sh:6858:run_one()
  = ./../tests/test-framework.sh:6908:run_one_logged()
  = ./../tests/test-framework.sh:6730:run_test()
  = conf-sanity.sh:2985:main()
Dumping lctl log to /tmp/test_logs/1661958175/conf-sanity.test_32g.*.1661958356.log
Dumping logs only on local client.
== cleanup with rc=0 ==

Comment by Gerrit Updater [ 01/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48174/
Subject: LU-16082 ldiskfs: old-style EA inode handling fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 76c3fa96dc30f21e95d80f9119972d7358975258

Comment by Gerrit Updater [ 01/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48412
Subject: LU-16082 ldiskfs: old-style EA inode handling fix
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 31a73bf8c3371c44628e584d7e2234440dcc13f9

Comment by Gerrit Updater [ 09/Sep/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48496
Subject: LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 235395d7b39adcc2c21006a44b27788ebf83e631

Comment by Gerrit Updater [ 09/Sep/22 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48498
Subject: LU-16082 ldiskfs: more targets for old ea_inode fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b387b80f9dcffbdc484283b62dec0792c6c452fb

Comment by Gerrit Updater [ 17/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48496/
Subject: LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ba9845274c8ea5c55f57b7fa0e839f18d76031ea

Comment by Peter Jones [ 17/Sep/22 ]

Does the patch that just merged replace the earlier ones pushed or are they all needed?

Comment by Alexander Zarochentsev [ 17/Sep/22 ]

> Does the patch that just merged replace the earlier ones pushed or are they all needed?
Each patch adds more rhel8 targets.

Comment by Gerrit Updater [ 20/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48611
Subject: LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: c4dcdeff3f5514fc68e71f6a7b911298932b13c5

Comment by Gerrit Updater [ 26/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48611/
Subject: LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 818e2cc867f948693bec9c5d64314c7546dc858d

Comment by Peter Jones [ 22/Nov/22 ]

Test patch still to come but code changes merged

Comment by Gerrit Updater [ 13/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48350/
Subject: LU-16082 ldiskfs: Large EA upgrade test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ffef81008a290dbfeeea86944d03f12c290871b

Generated at Sat Feb 10 03:23:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.