[LU-17380] b2_15 sanity test suite running gets stuck at test 64e Created: 20/Dec/23  Updated: 27/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Xinliang Liu Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

OpenEuler 22.03 LTS SP2
kernel 153.35.0.112.oe2203sp2+


Issue Links:
Related
is related to LU-16350 Updated server support for new linux ... Open
is related to LU-15542 lfsck have panic with ea inode linked... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When running the sanity test suite on openEuler 22.03 with the updated kernel 153.35.0.112.oe2203sp2 and the latest Lustre b2_15 (2.15.4 RC1), the run gets stuck:

[openeuler@oe2203-test-02 ~]$ sudo  RUNAS_ID="1000" ~/lustre/lustre-release/lustre/tests/auster  -rvsk sanity --only 50-64
...
== sanity test 64b: check out-of-space detection on client ========================================================== 09:53:13 (1702979593)
STRIPECOUNT=2 ORIGFREE=551632 MAXFREE=800000
BEFORE dd started
lustre-OST0000 avl=273288 grnt=260362 diff=12926 limit=11131
lustre-OST0001 avl=278344 grnt=276135 diff=2209 limit=11131 FULL
lt-lfs setstripe: unable to open '/mnt/lustre/oosfile': Read-only file system (30)
  Trace dump:
  = /home/openeuler/lustre/lustre-release/lustre/tests/oos.sh:37:main()
oos: FAIL: test-framework exiting on error
 sanity test_64b: @@@@@@ FAIL: oos.sh failed: 30 
  Trace dump:
  = /home/openeuler/lustre/lustre-release/lustre/tests/test-framework.sh:6427:error()
  = /home/openeuler/lustre/lustre-release/lustre/tests/sanity.sh:8816:test_64b()
  = /home/openeuler/lustre/lustre-release/lustre/tests/test-framework.sh:6744:run_one()
  = /home/openeuler/lustre/lustre-release/lustre/tests/test-framework.sh:6791:run_one_logged()
  = /home/openeuler/lustre/lustre-release/lustre/tests/test-framework.sh:6617:run_test()
  = /home/openeuler/lustre/lustre-release/lustre/tests/sanity.sh:8818:main()
Dumping lctl log to /tmp/test_logs/2023-12-19/094338/sanity.test_64b.*.1702979595.log
Dumping logs only on local client.
FAIL 64b (3s)
== sanity test 64c: verify grant shrink ================== 09:53:16 (1702979596)
osc.lustre-OST0000-osc-ffff3c548183a800.cur_grant_bytes=0
checking grant......
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       125056        8608      105212   8% /mnt/lustre[MDT:0] R
lustre-OST0000_UUID       313104       12656      273288   5% /mnt/lustre[OST:0] 
lustre-OST0001_UUID       313104        7600      278344   3% /mnt/lustre[OST:1]
filesystem_summary:       626208       20256      551632   4% /mnt/lustre
pass grant check: client:286814208 server:286814208
PASS 64c (0s)
== sanity test 64d: check grant limit exceed ============= 09:53:16 (1702979596)
lt-lfs setstripe: unable to open '/mnt/lustre/f64d.sanity': Read-only file system (30)
dd: failed to open '/mnt/lustre/f64d.sanity': Read-only file system
/home/openeuler/lustre/lustre-release/lustre/tests/sanity.sh: line 8898: kill: (213825) - No such process
checking grant......
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       125056        8608      105212   8% /mnt/lustre[MDT:0] R
lustre-OST0000_UUID       313104       12656      273288   5% /mnt/lustre[OST:0] 
lustre-OST0001_UUID       313104        7600      278344   3% /mnt/lustre[OST:1]
filesystem_summary:       626208       20256      551632   4% /mnt/lustre
pass grant check: client:286814208 server:286814208
Waiting for MDT destroys to complete
PASS 64d (4s)
== sanity test 64e: check grant consumption (no grant allocation) ========================================================== 09:53:20 (1702979600)
debug=+cache
Stopping client oe2203-test-02 /mnt/lustre (opts:)
Starting client: oe2203-test-02:  -o user_xattr,flock oe2203-test-02@tcp:/lustre /mnt/lustre
**Stuck here**


 Comments   
Comment by Xinliang Liu [ 20/Dec/23 ]

I narrowed this down to the recently backported ext4 patch "ext4: add EA_INODE checking to ext4_iget()".

The check this patch adds to ext4_iget() is:

 

+static const char *check_igot_inode(struct inode *inode, ext4_iget_flags flags)
+
+{
+	if (flags & EXT4_IGET_EA_INODE) {
+		if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
+			return "missing EA_INODE flag";
+	} else {
+		if ((EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
+			return "unexpected EA_INODE flag";
+	}
+	if (is_bad_inode(inode) && !(flags & EXT4_IGET_BAD))
+		return "unexpected bad inode w/o EXT4_IGET_BAD";
+	return NULL;
+} 

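From the kernel log of the sanity run and the osd_ldiskfs_iget() definition below, it appears that with this ext4 patch applied Lustre can no longer get an EA inode through osd_iget(): the macro passes LDISKFS_IGET_HANDLE | LDISKFS_IGET_SPECIAL without the EA_INODE flag, so the new check rejects any inode that has the EA_INODE flag set.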

[113677.017540] LDISKFS-fs error (device dm-0): osd_iget:503: inode #289: comm OI_scrub: unexpected EA_INODE flag
[113677.020146] LDISKFS-fs error (device dm-0): osd_iget:503: inode #289: comm lfsck: unexpected EA_INODE flag
[113677.021426] Aborting journal on device dm-0-8.
[113677.023049] LDISKFS-fs (dm-0): Remounting filesystem read-only
[113677.174850] LustreError: 136912:0:(osd_handler.c:1790:osd_trans_commit_cb()) transaction @0x00000000ce666d28 commit error: 2
[113677.196990] LustreError: 209429:0:(scrub.c:243:scrub_file_store()) lustre-MDT0000: store scrub file: rc = -30
 

lustre/osd-ldiskfs/osd_internal.h

#ifdef HAVE_LDISKFS_IGET_WITH_FLAGS
# define osd_ldiskfs_iget(sb, ino) \
		ldiskfs_iget((sb), (ino), \
			     LDISKFS_IGET_HANDLE | LDISKFS_IGET_SPECIAL)
#else
# define osd_ldiskfs_iget(sb, ino) ldiskfs_iget((sb), (ino))
#endif

This is the root cause: osd_iget() fails, ldiskfs aborts the journal and remounts the filesystem read-only, and the later tests break.

Note that the patch "ext4: add EA_INODE checking to ext4_iget()" was introduced in kernel 6.4. Any kernel that carries this patch should be affected, for example v6.4+, v6.1.33, v5.15.116, and openEuler 22.03 kernel 153.35.0.112.oe2203sp2.

 

Comment by Xinliang Liu [ 20/Dec/23 ]

Maybe we should explicitly set the iget flags (where possible) when calling ldiskfs_iget(), as ext4 now does at its own call sites, so that check_igot_inode() passes.

linux$ grep "= ext4_iget(" -rn fs/ext4/ 
fs/ext4/ioctl.c:383:    inode_bl = ext4_iget(sb, EXT4_BOOT_LOADER_INO,
fs/ext4/resize.c:1720:          inode = ext4_iget(sb, EXT4_RESIZE_INO, EXT4_IGET_SPECIAL);
fs/ext4/resize.c:2061:                  resize_inode = ext4_iget(sb, EXT4_RESIZE_INO,
fs/ext4/orphan.c:586:   inode = ext4_iget(sb, orphan_ino, EXT4_IGET_SPECIAL);
fs/ext4/block_validity.c:160:   inode = ext4_iget(sb, ino, EXT4_IGET_SPECIAL);
fs/ext4/super.c:1550:   inode = ext4_iget(sb, ino, EXT4_IGET_HANDLE);
fs/ext4/super.c:5482:   root = ext4_iget(sb, EXT4_ROOT_INO, EXT4_IGET_SPECIAL);
fs/ext4/super.c:5788:   journal_inode = ext4_iget(sb, journal_inum, EXT4_IGET_SPECIAL);
fs/ext4/super.c:7063:   qf_inode = ext4_iget(sb, qf_inums[type], EXT4_IGET_SPECIAL);
fs/ext4/namei.c:1855:           inode = ext4_iget(dir->i_sb, ino, EXT4_IGET_NORMAL);
fs/ext4/ialloc.c:1394:  inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1379:     inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1386:     old_parent = ext4_iget(sb, darg.parent_ino,
fs/ext4/fast_commit.c:1412:     dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1473:     inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1534:     inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1584:     inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1638:     inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1651:             dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1738:     inode = ext4_iget(sb, le32_to_cpu(fc_add_ex.fc_ino), EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1855:     inode = ext4_iget(sb, le32_to_cpu(lrange.fc_ino), EXT4_IGET_NORMAL);
fs/ext4/fast_commit.c:1911:             inode = ext4_iget(sb, state->fc_modified_inodes[i],
fs/ext4/xattr.c:440:    inode = ext4_iget(parent->i_sb, ea_ino, EXT4_IGET_EA_INODE);
fs/ext4/xattr.c:1542:           ea_inode = ext4_iget(inode->i_sb, ce->e_value, 

Comment by Xinliang Liu [ 27/Dec/23 ]

Alternatively, revert check_igot_inode() in the ldiskfs patch series if the code in osd_scrub.c cannot know which kind of inode it is getting, since an EA inode is really an internal ext4 inode.

Generated at Sat Feb 10 03:34:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.