
[LU-12268] LDISKFS-fs error: ldiskfs_find_dest_de:2066: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=201, rec_len=0, name_len=0

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      Hello,
      I've hit an ldiskfs error today while doing more 'mdtest' benchmarking runs against our all-flash NVMe-based filesystem running Lustre 2.12.0. Below is the syslog output from the affected server:

      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_find_dest_de:2066: inode #25165841: block 15874389: comm mdt00_054: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=201, rec_len=0, name_len=0
      May 07 16:47:23 dac-e-3 kernel: Aborting journal on device dm-5-8.
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0002: can't update reply_data file: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) Skipped 1 previous similar message
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0002: failed in transaction hook: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 181259:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff913ee0551b00
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 136266:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff913cc5e73e00 commit error: 2
      
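      For context on the message itself: a rec_len of 0 can never be valid, since every directory entry needs at least its 8-byte header plus one byte of name. The check that emits this error is sketched below, paraphrased from the upstream ext4 directory-entry validation (__ext4_check_dir_entry in fs/ext4/dir.c); ldiskfs carries the same logic under LDISKFS_-prefixed names:

      /* Sketch of the ext4/ldiskfs directory-entry sanity check that
       * produces the message above (paraphrased, not verbatim). */
      unsigned int rlen = ext4_rec_len_from_disk(de->rec_len, blocksize);

      if (rlen < EXT4_DIR_REC_LEN(1))                  /* rec_len=0 lands here */
              error_msg = "rec_len is smaller than minimal";
      else if (rlen % 4 != 0)
              error_msg = "rec_len % 4 != 0";
      else if (rlen < EXT4_DIR_REC_LEN(de->name_len))
              error_msg = "rec_len is too small for name_len";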

      I described the environment of this filesystem in more detail in https://jira.whamcloud.com/browse/LU-12265 - I'm doing the exact same tests as in that ticket as part of an IO500 submission. Specifically, I'm seeing problems when running mdtest, but this appeared to be a different error, so I thought I would raise a new ticket.

      The filesystem itself is configured using DNE, and specifically we are using DNE2 striped directories for all mdtest runs. We are using a large number of MDTs, 24 at the moment, one per server (which, other than this problem, is working excellently), and the directory stripe count is '-1', so all directories are striped over all 24 MDTs. Each server contains 12 NVMe drives, and we partition one of the drives so that it carries both an OST and an MDT partition.

      Lustre and Kernel versions are as follows:

      Server: kernel-3.10.0-957.el7_lustre.x86_64
      Server: lustre-2.12.0-1.el7.x86_64
      
      Clients: kernel-3.10.0-957.10.1.el7.x86_64
      Clients: lustre-client-2.10.7-1.el7.x86_64
      

      The test itself:

      mdtest-1.9.3 was launched with 4096 total task(s) on 128 node(s)
      Command line used: /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/mdtest "-C" "-n" "70000" "-u" "-L" "-F" "-d" "/dac/fs1/mjr208/job11335245-2019-05-07-1537/mdt_easy"
      Path: /dac/fs1/mjr208/job11335245-2019-05-07-1537
      FS: 412.6 TiB   Used FS: 24.2%   Inodes: 960.0 Mi   Used Inodes: 0.0%
      
      4096 tasks, 286720000 files
      

      Is there anything in the above log that indicates the cause of the problem? Do I perhaps have a bad device here, or is this a bug?


          Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45072/
            Subject: LU-12268 osd: BUG_ON for IAM corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5daf86607877ea81d0295a9d49a1fe06572e0352

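            The patch subject suggests a hard assertion when an IAM index node is found in an inconsistent state. A minimal sketch of that kind of check, assuming a count/limit header at the start of the index block (an illustration only; the actual change is at the Gerrit URL above):

            /* Hypothetical illustration of a "BUG_ON for IAM corruption" check;
             * see https://review.whamcloud.com/45072 for the real patch. */
            static void iam_frame_check(const struct iam_frame *frame)
            {
                    const struct dx_countlimit *cl =
                            (const struct dx_countlimit *)frame->entries;

                    /* an index node must never hold more entries than fit in one block */
                    BUG_ON(le16_to_cpu(cl->count) > le16_to_cpu(cl->limit));
            }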

            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45072
            Subject: LU-12268 osd: BUG_ON for IAM corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3cc624012d272088f98f3f91e155d562a0eff066


            adilger Andreas Dilger added a comment:

            If this IAM code is based on the upstream ext4/ldiskfs htree code, then there were a few patches from Eric Sandeen that fixed bugs in the htree split code. It would be worthwhile to do a "git log fs/ext4/htree.c" to see what has changed since this code was created (which was at least 10 years ago, maybe Lustre 2.0?).


            aboyko Alexander Boyko added a comment:

            I've been working on a similar issue, and I'm adding my findings.

             
            [2401623.056209] LDISKFS-fs error (device md1): ldiskfs_find_dest_de:2066: inode #1943585330: block 1932812092: comm mdt01_029: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=792330240, rec_len=0, name_len=0
            

            direntry had a next data

            ffff8a9b27df3000:  00 00 3a 2f 00 00 00 00 2e 02 00 00 00 00 00 02   ..:/............
            ffff8a9b27df3010:  40 00 71 f2 00 00 3a 84 00 00 00 00 32 02 00 00   @.q...:.....2...
            ffff8a9b27df3020:  00 00 00 02 40 00 71 f2 00 00 3a d9 00 00 00 00   ....@.q...:.....
            ffff8a9b27df3030:  35 02 00 00 00 00 00 02 40 00 71 f2 00 00 3b 2e   5.......@.q...;.
            ffff8a9b27df3040:  00 00 00 00 38 02 00 00 00 00 00 02 40 00 71 f2   ....8.......@.q.
            ffff8a9b27df3050:  00 00 3b 83 00 00 00 00 3b 02 00 00 00 00 00 02   ..;.....;.......
            

            It was corrupted with a specific pattern: an lu_fid plus a u32. From offset 0xc: [0x2400071f2:0x3a84:0x0], then 0x0232.
            This looks very close to the dir_data FIDs, but it is not encapsulated in a direntry struct. I've checked the previous memory page, and it contains the same pattern, as do two more pages before it.
            So it is a simple buffer overflow. After searching memory for the affected address, only one thread was found:
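            For reference, the 20-byte period of the pattern matches struct lu_fid (16 bytes, as defined in the Lustre headers) followed by a 32-bit value; decoding the dump this way (my reading of the bytes above, with the FID stored big-endian):

            /* struct lu_fid as defined in the Lustre headers. */
            struct lu_fid {
                    __u64 f_seq;    /* sequence: 0x2400071f2 in the dump */
                    __u32 f_oid;    /* object id: 0x3a84, 0x3ad9, 0x3b2e, ... */
                    __u32 f_ver;    /* version: 0 */
            };

            /* Decoding ffff8a9b27df3000 + 0xc:
             *   00 00 00 02 40 00 71 f2  -> f_seq = 0x2400071f2  (big-endian)
             *   00 00 3a 84              -> f_oid = 0x3a84       (big-endian)
             *   00 00 00 00              -> f_ver = 0
             *   32 02 00 00              -> trailing u32 = 0x232 (little-endian)
             * i.e. [0x2400071f2:0x3a84:0x0] + 0x232, repeating every 20 bytes.
             */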

            crash> bt 66295
            PID: 66295  TASK: ffff8acb3e7830c0  CPU: 5   COMMAND: "mdt01_042"
             #0 [ffff8acbebf47028] __schedule at ffffffff9276ab17
             #1 [ffff8acbebf470b8] schedule at ffffffff9276b019
             #2 [ffff8acbebf470c8] schedule_timeout at ffffffff92768b21
             #3 [ffff8acbebf47170] io_schedule_timeout at ffffffff9276a6bd
             #4 [ffff8acbebf471a0] io_schedule at ffffffff9276a758
             #5 [ffff8acbebf471b0] bit_wait_io at ffffffff92769141
             #6 [ffff8acbebf471c8] __wait_on_bit_lock at ffffffff92768cf1
             #7 [ffff8acbebf47208] out_of_line_wait_on_bit_lock at ffffffff92768f41
             #8 [ffff8acbebf47280] __lock_buffer at ffffffff922792b2
             #9 [ffff8acbebf47290] ldiskfs_commit_super at ffffffffc1483444 [ldiskfs]
            #10 [ffff8acbebf472d8] __ldiskfs_std_error at ffffffffc1483cbd [ldiskfs]
            #11 [ffff8acbebf47320] iam_txn_add at ffffffffc151033c [osd_ldiskfs]
            #12 [ffff8acbebf47340] iam_add_rec at ffffffffc1512e63 [osd_ldiskfs]
            #13 [ffff8acbebf473c0] iam_insert at ffffffffc15139ce [osd_ldiskfs]
            #14 [ffff8acbebf47530] osd_oi_iam_refresh at ffffffffc150a325 [osd_ldiskfs]
            #15 [ffff8acbebf47580] osd_oi_insert at ffffffffc150ceec [osd_ldiskfs]
            #16 [ffff8acbebf47600] osd_create at ffffffffc150558d [osd_ldiskfs]
            #17 [ffff8acbebf47690] lod_sub_create at ffffffffc114cd25 [lod]
            #18 [ffff8acbebf47738] lod_create at ffffffffc113d519 [lod]
            #19 [ffff8acbebf47778] mdd_create_object_internal at ffffffffc11d7f23 [mdd]
            #20 [ffff8acbebf477b0] mdd_create_object at ffffffffc11c12fb [mdd]
            #21 [ffff8acbebf47828] mdd_create at ffffffffc11cb351 [mdd]
            #22 [ffff8acbebf47920] mdt_reint_open at ffffffffc1278da3 [mdt]
            #23 [ffff8acbebf47a20] mdt_reint_rec at ffffffffc126b663 [mdt]
            #24 [ffff8acbebf47a48] mdt_reint_internal at ffffffffc1245543 [mdt]
            #25 [ffff8acbebf47a88] mdt_intent_open at ffffffffc1251ea3 [mdt]
            

            The IAM index frame held far more entries than its limit allows.

            crash> iam_iterator ffff8acbebf473c8
            struct iam_iterator {
              ii_flags = 2,
              ii_state = IAM_IT_ATTACHED,
              ii_path = {
                ip_container = 0xffff8aaced341c08,
                ip_indirect = 1,
                ip_frames = {{
                    bh = 0xffff8acc9c734208,
                    entries = 0xffff8acca34b8010,
                    at = 0xffff8acca34b8024,
                    leaf = 203,
                    curidx = 0,
                    at_shifted = 0
                  }, {
                    bh = 0xffff8a9871fe7f48,
                    entries = 0xffff8a9b27df0000,
                    at = 0xffff8a9b27df10f4,
                    leaf = 640,
                    curidx = 203,
                    at_shifted = 0
                  }, {
                    bh = 0x0,
            ...
            
            dx_countlimit 0xffff8a9b27df0000
            struct dx_countlimit {
              limit = 204,
              count = 640
            }
            

            The IAM index started at 0xffff8a9b27df0000 and ends around ffff8a9b27df3xxx, so it occupies 3+ pages instead of one. Based on b_end_io = 0, the buffer head for it was not read from disk; it was newly allocated. So the root cause is not on-disk corruption. I have not found the root cause in the IAM sources, but all imprints lead to iam_add_rec()->split_index_node()/iam_new_leaf().
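            For reference, the count/limit header at the start of an htree/IAM index block is just two 16-bit fields (this definition is from upstream ext4; ldiskfs uses the same layout), so count = 640 against limit = 204 means this node grew to roughly three times the capacity of a single 4 KiB block, which matches the 3+ pages observed:

            /* From fs/ext4/namei.c; ldiskfs uses the same on-disk layout. */
            struct dx_countlimit {
                    __le16 limit;   /* max entries fitting in one block: 204 here */
                    __le16 count;   /* entries in use: 640 here, i.e. ~3.1 blocks */
            };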
             


            jgmitter Joseph Gmitter (Inactive) added a comment:

            Hi Hongchao,

            Can you please advise?

            Thanks.
            Joe


            mrb Matt Rásó-Barnett (Inactive) added a comment:

            Looking over my server logs, I see another instance of this error from yesterday during similar runs, but on a different server/disk:

            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3): ldiskfs_find_dest_de:2066: inode #26738690: block 16857697: comm mdt01_053: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=191, rec_len=0, name_len=0
            May 06 12:57:51 dac-e-9 kernel: Aborting journal on device dm-3-8.
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs (dm-3): Remounting filesystem read-only
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3): ldiskfs_journal_check_start:56: Detected aborted journal
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs (dm-3): Remounting filesystem read-only
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3) in iam_txn_add:575: Journal has aborted
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff9434a2333100
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0008: can't update reply_data file: rc = -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0008: failed in transaction hook: rc = -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0008: failed to stop transaction: rc = -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 112648:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff9431c4a39500 commit error: 2
            

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: mrb Matt Rásó-Barnett (Inactive)
              Votes: 0
              Watchers: 9
