  1. Lustre
  2. LU-17397

mdtest failed (Lustre became read-only) under high stress

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0, Lustre 2.15.3
    • Labels: None
    • Environment: client/server: CentOS-8.5.2111 + Lustre 2.15.3
      Linux 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Fri Jun 17 00:10:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

    Description

      We are testing metadata performance in a simple Lustre environment with two servers (#server01, #server02), both connected to SAN storage:

      • On #server01, we mount an MGT, an MDT, and four OSTs
      • On #server02, we mount an MDT and four OSTs

      The MDS and OSS run on the same server, so the file system has two MDTs and eight OSTs in total.

      [root@client02 lustre]# lfs df -h
      UUID                       bytes        Used   Available Use% Mounted on
      l_lfs-MDT0000_UUID          1.8T       39.2G        1.6T   3% /lustre[MDT:0] 
      l_lfs-MDT0001_UUID          1.8T       39.4G        1.6T   3% /lustre[MDT:1] 
      l_lfs-OST0000_UUID         11.9T        3.5T        7.8T  31% /lustre[OST:0] 
      l_lfs-OST0001_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:1] 
      l_lfs-OST0002_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:2] 
      l_lfs-OST0003_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:3] 
      l_lfs-OST0004_UUID         11.9T        3.8T        7.5T  34% /lustre[OST:4] 
      l_lfs-OST0005_UUID         11.9T        3.5T        7.8T  32% /lustre[OST:5] 
      l_lfs-OST0006_UUID         11.9T        3.5T        7.8T  31% /lustre[OST:6] 
      l_lfs-OST0007_UUID         11.9T        3.6T        7.7T  32% /lustre[OST:7] 

      filesystem_summary:        95.1T       28.6T       61.8T  32% /lustre

       

      We run mdtest through mpirun on two clients to test metadata performance under the configuration above. The test command is as follows:

      • $> mpirun --allow-run-as-root --oversubscribe -mca btl ^openib --mca btl_tcp_if_include 40.40.22.0/24 -np 64 -host client01:32,client02:32 --map-by node mdtest -L -z 3 -b 2 -I 160000 -i 1 -d /lustre/mdtest_demo | tee 2client_64np_3z_2b_160000I.log
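
      To give a rough idea of the workload size, the sketch below estimates how many files this command creates. It assumes that with -L each task creates -I items in every leaf directory of the tree, and that a tree of depth -z with branching factor -b has b^z leaves. The variable names are ours and purely illustrative.

```shell
#!/bin/sh
# Rough estimate of the number of files created by the mdtest run above.
# Assumption: with -L, every task creates -I items in each leaf directory,
# and the directory tree has branch^depth leaf directories.
ntasks=64        # -np 64
depth=3          # -z 3
branch=2         # -b 2
items=160000     # -I 160000

# leaves = branch^depth, computed with a plain loop to stay POSIX-compatible
leaves=1
i=0
while [ "$i" -lt "$depth" ]; do
    leaves=$((leaves * branch))
    i=$((i + 1))
done

files_per_task=$((items * leaves))
total_files=$((files_per_task * ntasks))

echo "leaf directories per tree: $leaves"
echo "files per task:            $files_per_task"
echo "files across all tasks:    $total_files"
```

      Under these assumptions a single run creates on the order of 80 million files, in addition to the directories created by the directory phase of the test.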

       

      After running stably for about 15 minutes, Lustre becomes read-only (which blocks the whole test) and the following messages appear in the system log:

      [Fri Jan  5 17:29:36 2024] Lustre: l_lfs-OST0001: deleting orphan objects from 0x440000400:26730785 to 0x440000400:26744321
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] Aborting journal on device ultrapatha-8.
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs (ultrapatha): Remounting filesystem read-only
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): ldiskfs_journal_check_start:61: Detected aborted journal
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LustreError: 61165:0:(osd_handler.c:1790:osd_trans_commit_cb()) transaction @0x0000000082b2d9d3 commit error: 2
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt08_003: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt21_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt08_003: directory leaf block found instead of index block
      [Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt05_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:20 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt21_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:20 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt18_002: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error: 355 callbacks suppressed
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt21_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt07_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_000: directory leaf block found instead of index block
      [Fri Jan  5 17:43:24 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt07_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt18_001: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt07_004: directory leaf block found instead of index block
      [Fri Jan  5 17:43:25 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_000: directory leaf block found instead of index block
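
      Every dx_probe error in the log above points at the same inode and block. A small sketch like the following can confirm this by counting the errors per device, inode, and block. The sample lines are copied from the log above; on a live server the input would instead come from the kernel log (for example the output of dmesg -T saved to a file), which is an assumption about how the log is collected.

```shell
#!/bin/sh
# Count LDISKFS dx_probe errors per (device, inode, block).
# The sample below is copied from the log in this report.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
[Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt19_001: directory leaf block found instead of index block
[Fri Jan  5 17:43:19 2024] Aborting journal on device ultrapatha-8.
[Fri Jan  5 17:43:19 2024] LDISKFS-fs (ultrapatha): Remounting filesystem read-only
[Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt20_004: directory leaf block found instead of index block
[Fri Jan  5 17:43:19 2024] LDISKFS-fs error (device ultrapatha): dx_probe:1169: inode #61343829: block 151386: comm mdt08_003: directory leaf block found instead of index block
EOF

# Extract "device inode block" from each dx_probe line, then count duplicates.
# Output columns: device, inode, block, number of errors.
summary=$(sed -n 's/.*(device \([^)]*\)): dx_probe:[0-9]*: inode #\([0-9]*\): block \([0-9]*\): .*/\1 \2 \3/p' "$log_file" |
    sort | uniq -c | awk '{print $2, $3, $4, $1}')
rm -f "$log_file"

echo "$summary"
```

      A single output row means that all reported errors come from one directory inode, which is the case for the excerpt shown here.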

       

      We have repeated the test many times and always get a similar result (i.e., the LDISKFS-fs error on MDT0 or MDT1). The scale of the workload is as follows:

       

      [root@client01 lustre]# lfs quota -u root /lustre/
      Disk quotas for usr root (uid 0):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             /lustre/ 30805903960       0       0       - 96453928       0       0       -

       

      We originally hit this issue with 2.15.0 and upgraded to 2.15.3, but the issue persists and blocks our testing.

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Zuoru Yang (yzr95924)