[LU-17397] mdtest failed (Lustre became read-only) under high stress Created: 06/Jan/24 Updated: 17/Jan/24 Resolved: 17/Jan/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0, Lustre 2.15.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Zuoru Yang | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client/server: CentOS-8.5.2111 + Lustre 2.15.3 |
||
| Attachments: |
|
| Epic/Theme: | ldiskfs, lustre-2.15.0, lustre-2.15.3 |
| Severity: | 3 |
| Epic: | ldiskfs, metadata, performance |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We test metadata performance in a simple Lustre environment, where we deploy two servers (#server01, #server02), both connected to SAN storage:
Here, the MDS and OSS run on the same servers, and the Lustre filesystem includes two MDTs and eight OSTs.
[root@client02 lustre]# lfs df -h
filesystem_summary: 95.1T 28.6T 61.8T 32% /lustre
We use mdtest via mpirun from two clients to test metadata performance under the configuration above; the test command is as follows:
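(The exact command was not captured in this ticket. A hypothetical reconstruction, consistent with the parameters discussed in the comments below -- 160000 items per rank, branching factor 2, depth 3, one working subdirectory per rank; the rank count, hostfile, and target path are assumed:)
    # Hypothetical reconstruction -- not the exact command used in the test.
    # -n: items per rank, -b: directory branching factor, -z: tree depth,
    # -u: unique working directory per rank, -F: operate on files only,
    # -d: target directory on the Lustre mount (path assumed).
    mpirun --hostfile ./hosts -np <NPROCS> \
        mdtest -n 160000 -b 2 -z 3 -u -F -d /lustre/mdtest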
After running stably for around 15 minutes, Lustre becomes read-only (blocking the whole test) and generates syslog entries such as:
[Fri Jan 5 17:29:36 2024] Lustre: l_lfs-OST0001: deleting orphan objects from 0x440000400:26730785 to 0x440000400:26744321
We repeated the test many times and still got a similar result (i.e., an LDISKFS-fs error on MDT0 or MDT1). The workload scale is as follows:
[root@client01 lustre]# lfs quota -u root /lustre/
We originally found this issue with 2.15.0 and upgraded to 2.15.3, but the issue still exists and blocks our test.
|
| Comments |
| Comment by Andreas Dilger [ 06/Jan/24 ] |
|
I haven't specifically checked if the issue is fixed in 2.15.4, but it was just released last week and may help. It would only need to be installed on the server nodes. |
| Comment by Andreas Dilger [ 06/Jan/24 ] |
|
Just to confirm the test being run, each rank is creating 160000 files in a separate subdirectory from the other ranks, and there are 2^3 leaf subdirectories (branching factor 2, depth 3)? That would create about 82M files, but it looks like there are some existing files in the filesystem. What does e2fsck show when run on the corrupt MDT? |
| Comment by Zuoru Yang [ 09/Jan/24 ] |
|
@Andreas Dilger Sorry for the late reply; we spent some time checking our RAID to ensure this is not caused by the storage backend. We now suspect it might be a bug in ext4. Yes, there were some files in the filesystem from previous experiments; we removed them and tried again with the same test command. The issue still occurs, and the info is as follows (the test did not create all the files because of this issue):
[root@client02 ~]# lfs quota -u root /lustre/
[Tue Jan 9 20:51:22 2024] LDISKFS-fs error (device ultrapathb): dx_probe:1169: inode #104316384: block 149479: comm mdt05_002: directory leaf block found instead of index block
Note that device ultrapathb is the backend of MDT1; the following is the session record from running e2fsck on device ultrapathb:
Script started on 2024-01-09 21:09:49+08:00
Script done on 2024-01-09 21:16:33+08:00
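(The body of the typescript is not included above. For reference, a typical read-only check of the MDT backing device looks like the following; the mount point is hypothetical and the device node is assumed from the error message:)
    # Unmount the MDT first, then run a forced, read-only e2fsck pass:
    # -f checks even if the filesystem looks clean, -n answers "no" to
    # all prompts so nothing on disk is modified.
    umount /lustre/mdt1            # hypothetical mount point
    e2fsck -fn /dev/ultrapathb     # device node assumed from the log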
Could this be an issue in ext4 with large_dir? |
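(For reference, one way to confirm whether large_dir is enabled on the MDT, with the same assumed device node:)
    # Print the superblock and look for "large_dir" in the feature list.
    dumpe2fs -h /dev/ultrapathb | grep -i features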
| Comment by Andreas Dilger [ 11/Jan/24 ] |
|
Lustre does not modify the on-disk data structures of ldiskfs directly, although it accesses the filesystem somewhat differently than a regular ext4 mount does. I don't think the issue is with large_dir; it is more likely related to parallel directory locking and updates. There would need to be some kind of bug in ext4 or in the ldiskfs patches applied on top of it. It is not possible for the clients to corrupt the server filesystem directly.

That said, it appears from the e2fsck output that the on-disk data structures are not corrupted, so this may be some kind of in-memory corruption. The free blocks/inodes count and quota usage messages are normal for a filesystem that is in use.

There is a tunable parameter to disable the parallel directory locking and updates: "lctl set_param osd-ldiskfs.lustre-MDT*.pdo=0" on the MDS nodes. Note that this code path is essentially untested and could potentially have issues of its own, beyond being much slower, but it would be useful to see whether it avoids the problem. |
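(For reference, a minimal sketch of applying and verifying that tunable on each MDS node; the wildcard form avoids hard-coding the filesystem name, which here appears to be l_lfs rather than lustre:)
    # Disable parallel directory operations on all local MDTs at runtime:
    lctl set_param osd-ldiskfs.*-MDT*.pdo=0
    # Confirm the new value:
    lctl get_param osd-ldiskfs.*-MDT*.pdo
    # (lctl set_param -P ... on the MGS node would make the setting persistent.)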
| Comment by Andreas Dilger [ 11/Jan/24 ] |
|
Also, have you tried updating to a newer kernel? It is possible that the ext4 in the kernel (and the ldiskfs that is generated from it) has a bug that has since been fixed. |
| Comment by Zuoru Yang [ 11/Jan/24 ] |
|
@Andreas Dilger Thanks Andreas! We will follow this direction and try the same test with a newer kernel. |
| Comment by Zuoru Yang [ 12/Jan/24 ] |
|
@Andreas Dilger BTW, the reason I initially suspected this issue is related to large_dir is this link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1933074, which also reports "directory leaf block found instead of index block" when there are millions of files on ext4. In any case, we will test this issue with a newer kernel (e.g., AlmaLinux 8.8 + 2.15.3). |
| Comment by Andreas Dilger [ 17/Jan/24 ] |
|
yzr95924, thank you for the launchpad reference. Indeed, that bug looks like it could be related. The patch is reported to be included in upstream kernel 5.14 and backported to stable kernels back to 5.11, fixing a bug originally introduced in kernel 5.11 (and also backported to the RHEL kernel):
commit 877ba3f729fd3d8ef0e29bc2a55e57cfa54b2e43
Author: Theodore Ts'o <tytso@mit.edu>
AuthorDate: Wed Aug 4 14:23:55 2021 -0400
ext4: fix potential htree corruption when growing large_dir directories
Commit b5776e7524af ("ext4: fix potential htree index checksum
corruption) removed a required restart when multiple levels of index
nodes need to be split. Fix this to avoid directory htree corruptions
when using the large_dir feature.
Cc: stable@kernel.org # v5.11
Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Fixes: b5776e7524af ("ext4: fix potential htree index checksum corruption)
Reported-by: Denis <denis@voxelsoft.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
I can confirm that the patch is applied in 4.18.0-425.13.1.el8_7.x86_64 in fs/ext4/namei.c:
		if (err)
			goto journal_error;
		err = ext4_handle_dirty_dx_node(handle, dir,
						frame->bh);
		/* the "restart ||" check below is what the fix adds */
		if (restart || err)
			goto journal_error;
but I'm not sure whether it is applied in your kernel 4.18.0-348.2.1.el8_lustre.x86_64. |
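(For reference, a quick way to check whether a given kernel source tree carries the fix; the path is illustrative:)
    # The fix from commit 877ba3f729fd adds a "restart || err" test in
    # ext4_dx_add_entry(); no match means the patch is absent.
    cd /path/to/kernel/source
    grep -n 'restart || err' fs/ext4/namei.c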
| Comment by Zuoru Yang [ 17/Jan/24 ] |
|
@Andreas Dilger, Hi Andreas, thanks for your insights. We double-checked the Linux kernel in our environment (we installed the kernel package from the Whamcloud 2.15.0 repo and later upgraded the Lustre servers to 2.15.3): https://downloads.whamcloud.com/public/lustre/lustre-2.15.0-ib/MOFED-5.6-1.0.3.3/el8.5.2111/server/RPMS/x86_64/. We confirm that the kernel from that link does not have the patch.
|
| Comment by Andreas Dilger [ 17/Jan/24 ] |
|
Time to upgrade your server kernel and rebuild in that case. |
| Comment by Zuoru Yang [ 17/Jan/24 ] |
|
@Andreas Dilger Sure. We have evaluated the same test case on AlmaLinux 8.8 + 2.15.3 with the new kernel (4.18.0-477.10.1.el8_lustre.x86_64), and the issue no longer occurs. Thanks again! |