  Lustre / LU-17397

mdtest failed (Lustre became read-only) under high stress

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0, Lustre 2.15.3
    • Labels: None
    • Environment: client/server: CentOS-8.5.2111 + Lustre 2.15.3
      Linux 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Fri Jun 17 00:10:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

    Description

We test metadata performance in a simple Lustre environment, where we deploy two servers (#server01, #server02), both connected to a SAN storage:

      • On #server01: we mount an MGT, an MDT, and four OSTs
      • On #server02: we mount an MDT and four OSTs

      Here, the MDS and OSS roles run on the same servers, and the filesystem comprises two MDTs and eight OSTs.
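
      For concreteness, the mount layout looks roughly like this (a sketch with hypothetical device names and mount points; the real SAN paths differ):

      # server01
      $> mount -t lustre /dev/mapper/mgt  /mnt/mgt
      $> mount -t lustre /dev/mapper/mdt0 /mnt/mdt0
      $> mount -t lustre /dev/mapper/ost0 /mnt/ost0    # ... likewise ost1-ost3
      # server02
      $> mount -t lustre /dev/mapper/mdt1 /mnt/mdt1
      $> mount -t lustre /dev/mapper/ost4 /mnt/ost4    # ... likewise ost5-ost7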

      [root@client02 lustre]# lfs df -h
      UUID                       bytes        Used   Available Use% Mounted on
      l_lfs-MDT0000_UUID          1.8T       39.2G        1.6T   3% /lustre[MDT:0] 
      l_lfs-MDT0001_UUID          1.8T       39.4G        1.6T   3% /lustre[MDT:1] 
      l_lfs-OST0000_UUID         11.9T        3.5T        7.8T  31% /lustre[OST:0] 
      l_lfs-OST0001_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:1] 
      l_lfs-OST0002_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:2] 
      l_lfs-OST0003_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:3] 
      l_lfs-OST0004_UUID         11.9T        3.8T        7.5T  34% /lustre[OST:4] 
      l_lfs-OST0005_UUID         11.9T        3.5T        7.8T  32% /lustre[OST:5] 
      l_lfs-OST0006_UUID         11.9T        3.5T        7.8T  31% /lustre[OST:6] 
      l_lfs-OST0007_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:7] 

      filesystem_summary:        95.1T       28.6T       61.8T  32% /lustre

       

We use mdtest via mpirun across two clients to test metadata performance under the configuration above; the test command is as follows:

      • $> mpirun --allow-run-as-root --oversubscribe -mca btl ^openib --mca btl_tcp_if_include 40.40.22.0/24 -np 64 -host client01:32,client02:32 --map-by node mdtest -L -z 3 -b 2 -I 160000 -i 1 -d /lustre/mdtest_demo | tee 2client_64np_3z_2b_160000I.log
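
      For scale: with -z 3 -b 2, each rank's directory tree has 2^3 = 8 leaf directories, and -L places files only at the leaves, so one iteration creates about 64 ranks x 8 dirs x 160000 items = 81,920,000 files (assuming mdtest's usual -z/-b/-I semantics), roughly consistent with the ~96M inodes in the quota output below. A quick sanity check:

      $> echo $(( 64 * 2**3 * 160000 ))
      81920000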

       

After running stably for around 15 minutes, Lustre becomes read-only (blocking the whole test) and generates the following syslog:

      [Fri Jan  5 17:29:36 2024] Lustre: l_lfs-OST0001: deleting orphan objects from 0x440000400:26730785 to 0x440000400:26744321
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] Aborting journal on device ultrapatha-8.
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs (ultrapatha): Remounting filesystem read-only
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): ldiskfs_journal_check_start:61: Detected aborted journal
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LustreError: 61165:0:(osd_handler.c:1790:osd_trans_commit_cb()) transaction @0x0000000082b2d9d3 commit error: 2
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt08_003: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt21_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt08_003: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt05_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:20 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt21_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:20 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt18_002: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error: 355 callbacks suppressed
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt21_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt07_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt07_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt18_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt07_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_000: directory leaf block found instead of index block

       

We repeated the test many times and still got a similar result (i.e., the LDISKFS-fs error on MDT0 or MDT1); the workload scale is as follows:

       

      [root@client01 lustre]# lfs quota -u root /lustre/
      Disk quotas for usr root (uid 0):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             /lustre/ 30805903960       0       0       - 96453928       0       0       -

       

We originally found this issue with 2.15.0 and upgraded to 2.15.3, but the issue still exists and blocks our testing.
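
      When the MDT aborts its journal like this, the target has to be unmounted and given a full e2fsck before it can be remounted; a sketch of the recovery steps, assuming the backing device is /dev/mapper/ultrapatha (taken from the "ultrapatha" name in the dmesg output) and a hypothetical mount point:

      $> umount /mnt/mdt0                        # stop the affected MDT
      $> e2fsck -f -y /dev/mapper/ultrapatha     # full check/repair of the ldiskfs device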



        Activity

          yzr95924 Zuoru Yang added a comment -

@Andreas Dilger Sure, we have evaluated the same test case on AlmaLinux 8.8 + 2.15.3 with the newer kernel (4.18.0-477.10.1.el8_lustre.x86_64), and the issue no longer occurs. Thanks again!


adilger Andreas Dilger added a comment -

          Time to upgrade your server kernel and rebuild in that case.
          yzr95924 Zuoru Yang added a comment -

@Andreas Dilger Hi Andreas, thanks for your insights. We double-checked the Linux kernel in our environment (we installed the kernel package from the Whamcloud 2.15.0 repo, later upgrading the Lustre servers to 2.15.3: https://downloads.whamcloud.com/public/lustre/lustre-2.15.0-ib/MOFED-5.6-1.0.3.3/el8.5.2111/server/RPMS/x86_64/), and we confirm that the kernel in that link does not have the patch.

adilger Andreas Dilger added a comment - edited

yzr95924, thank you for the launchpad reference. Indeed, that bug looks like it could be related. That patch is reportedly included in upstream kernel 5.14 and backported to stable kernels since 5.11, fixing a bug originally introduced in kernel 5.11 (and also backported to the RHEL kernel):

          commit 877ba3f729fd3d8ef0e29bc2a55e57cfa54b2e43
          Author:     Theodore Ts'o <tytso@mit.edu>
          AuthorDate: Wed Aug 4 14:23:55 2021 -0400
          
              ext4: fix potential htree corruption when growing large_dir directories
              
              Commit b5776e7524af ("ext4: fix potential htree index checksum
              corruption) removed a required restart when multiple levels of index
              nodes need to be split.  Fix this to avoid directory htree corruptions
              when using the large_dir feature.
              
              Cc: stable@kernel.org # v5.11
              Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
              Fixes: b5776e7524af ("ext4: fix potential htree index checksum corruption)
              Reported-by: Denis <denis@voxelsoft.com>
              Signed-off-by: Theodore Ts'o <tytso@mit.edu>
          

          I can confirm that the patch is applied in 4.18.0-425.13.1.el8_7.x86_64 in fs/ext4/namei.c:

                                  if (err)
                                          goto journal_error;
                                  err = ext4_handle_dirty_dx_node(handle, dir,
                                                                  frame->bh);
                                  if (restart || err)
                                          goto journal_error;
          

          but I'm not sure whether it is applied in your kernel 4.18.0-348.2.1.el8_lustre.x86_64.
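
          One way to check is to search the installed kernel's RPM changelog for the commit subject (a sketch; backported fixes usually show up there, and the exact package name depends on the installed kernel):

          $> rpm -q --changelog kernel | grep -i "htree corruption"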

          yzr95924 Zuoru Yang added a comment -

@Andreas Dilger BTW, the reason I initially suspected this issue is related to large_dir is this report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1933074

          which also reports "directory leaf block found instead of index block" when there are millions of files on ext4. In any case, we will retest with a newer kernel (e.g., AlmaLinux 8.8 + 2.15.3).
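
          To confirm whether the feature is actually enabled on our MDTs, we can dump the superblock and look for large_dir (a sketch; the device path is assumed from the dmesg output):

          $> dumpe2fs -h /dev/mapper/ultrapatha | grep -i "features"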

          yzr95924 Zuoru Yang added a comment -

          @Andreas Dilger Thanks Andreas! We will follow this direction and try the same test with a newer kernel.


adilger Andreas Dilger added a comment -

          Also, have you tried updating to a newer kernel? It is possible that the ext4 code in the kernel (and the ldiskfs that is generated from it) has a bug that has since been fixed.

People

            Assignee: wc-triage WC Triage
            Reporter: yzr95924 Zuoru Yang
            Votes: 0
            Watchers: 4
