Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12268

LDISKFS-fs error: ldiskfs_find_dest_de:2066: bad entry in directory: rec_len is smaller than minimal - offset=0( 0), inode=201, rec_len=0, name_len=0

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • Lustre 2.15.0
    • Lustre 2.12.0
    • None
    • 3
    • 9223372036854775807

    Description

      Hello,
      I've hit an ldiskfs error today while doing more 'mdtest' benchmarking runs against our all-flash nvme-based filesystem, running Lustre 2.12.0. Below is the syslog output from the affected server:

      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_find_dest_de:2066: inode #25165841: block 15874389: comm mdt00_054: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=201, rec_len=0, name_len=0
      May 07 16:47:23 dac-e-3 kernel: Aborting journal on device dm-5-8.
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0002: can't update reply_data file: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) Skipped 1 previous similar message
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0002: failed in transaction hook: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 181259:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff913ee0551b00
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 136266:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff913cc5e73e00 commit error: 2
      

      I detailed the environment of this filesystem in https://jira.whamcloud.com/browse/LU-12265 in more detail - I'm doing the exact same tests as in that ticket as part of an IO500 submission, and specifically I'm seeing problems when running mdtest, but this appeared to be a different error so I thought I would raise a new ticket.

      The filesystem itself is configured using DNE and specifically we are using DNE2 striped directories for all mdtest runs. We are using a large number of MDTs, 24 at the moment, one-per server, (which other than this problem, is otherwise working excellently), and the directory-stripe is '-1', so we are striping all the directories over all 24 MDTs, one per server. Each server contains 12 NVMe drives, and we partition one of the drives so it has both an OST and MDT partition.

      Lustre and Kernel versions are as follows:

      Server: kernel-3.10.0-957.el7_lustre.x86_64
      Server: lustre-2.12.0-1.el7.x86_64
      
      Clients: kernel-3.10.0-957.10.1.el7.x86_64
      Clients: lustre-client-2.10.7-1.el7.x86_64
      

      The test itself:

      mdtest-1.9.3 was launched with 4096 total task(s) on 128 node(s)
      Command line used: /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/mdtest "-C" "-n" "70000" "-u" "-L" "-F" "-d" "/dac/fs1/mjr208/job11335245-2019-05-07-1537/mdt_easy"
      Path: /dac/fs1/mjr208/job11335245-2019-05-07-1537
      FS: 412.6 TiB   Used FS: 24.2%   Inodes: 960.0 Mi   Used Inodes: 0.0%
      
      4096 tasks, 286720000 files
      

      Is there anything from the above log that indicates the cause of the problem? Do I perhaps have a bad device here, or is this perhaps a bug?

      Attachments

        Issue Links

          Activity

            People

              hongchao.zhang Hongchao Zhang
              mrb Matt Rásó-Barnett (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: