[LU-12268] LDISKFS-fs error: ldiskfs_find_dest_de:2066: bad entry in directory: rec_len is smaller than minimal - offset=0( 0), inode=201, rec_len=0, name_len=0 Created: 07/May/19 Updated: 02/Mar/23 Resolved: 24/May/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Matt Rásó-Barnett (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Hello, May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_find_dest_de:2066: inode #25165841: block 15874389: comm mdt00_054: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=201, rec_len=0, name_len=0 May 07 16:47:23 dac-e-3 kernel: Aborting journal on device dm-5-8. May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0002: can't update reply_data file: rc = -30 May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) Skipped 1 previous similar message May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0002: failed in transaction hook: rc = -30 May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30 May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30 May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LustreError: 181259:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff913ee0551b00 May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal May 07 16:47:23 dac-e-3 kernel: LustreError: 136266:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff913cc5e73e00 commit error: 2 I detailed the environment of this filesystem in https://jira.whamcloud.com/browse/LU-12265 in more detail - I'm doing the exact same tests as in that ticket as part of an IO500 submission, and specifically I'm seeing problems when running mdtest, but this appeared to be a different error so I thought I would raise a new ticket.
The test itself: mdtest-1.9.3 was launched with 4096 total task(s) on 128 node(s) Command line used: /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/mdtest "-C" "-n" "70000" "-u" "-L" "-F" "-d" "/dac/fs1/mjr208/job11335245-2019-05-07-1537/mdt_easy" Path: /dac/fs1/mjr208/job11335245-2019-05-07-1537 FS: 412.6 TiB Used FS: 24.2% Inodes: 960.0 Mi Used Inodes: 0.0% 4096 tasks, 286720000 files Is there anything from the above log that indicates the cause of the problem? Do I perhaps have a bad device here, or is this perhaps a bug? |
| Comments |
| Comment by Matt Rásó-Barnett (Inactive) [ 07/May/19 ] |
|
Looking over my server logs I see another instance of this error from yesterday during similar runs, but on a different server/disk: May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3): ldiskfs_find_dest_de:2066: inode #26738690: block 16857697: comm mdt01_053: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=191, rec_len=0, name_len=0 May 06 12:57:51 dac-e-9 kernel: Aborting journal on device dm-3-8. May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs (dm-3): Remounting filesystem read-only May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3): ldiskfs_journal_check_start:56: Detected aborted journal May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs (dm-3): Remounting filesystem read-only May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3) in iam_txn_add:575: Journal has aborted May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff9434a2333100 May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0008: can't update reply_data file: rc = -30 May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0008: failed in transaction hook: rc = -30 May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0008: failed to stop transaction: rc = -30 May 06 12:57:51 dac-e-9 kernel: LustreError: 112648:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff9431c4a39500 commit error: 2 |
| Comment by Joseph Gmitter (Inactive) [ 08/May/19 ] |
|
Hi Hongchao, Can you please advise? Thanks. |
| Comment by Alexander Boyko [ 13/Jul/21 ] |
|
I've been working on a similar issue, and I'm adding findings. [2401623.056209] LDISKFS-fs error (device md1): ldiskfs_find_dest_de:2066: inode #1943585330: block 1932812092: comm mdt01_029: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=792330240, rec_len=0, name_len=0 direntry had a next data ffff8a9b27df3000: 00 00 3a 2f 00 00 00 00 2e 02 00 00 00 00 00 02 ..:/............ ffff8a9b27df3010: 40 00 71 f2 00 00 3a 84 00 00 00 00 32 02 00 00 @.q...:.....2... ffff8a9b27df3020: 00 00 00 02 40 00 71 f2 00 00 3a d9 00 00 00 00 ....@.q...:..... ffff8a9b27df3030: 35 02 00 00 00 00 00 02 40 00 71 f2 00 00 3b 2e 5.......@.q...;. ffff8a9b27df3040: 00 00 00 00 38 02 00 00 00 00 00 02 40 00 71 f2 ....8.......@.q. ffff8a9b27df3050: 00 00 3b 83 00 00 00 00 3b 02 00 00 00 00 00 02 ..;.....;....... It was corrupted with a specific pattern, lu_fid + u32. From offset 0xc [0x2400071f2:0x3a84:0], 0x0232 crash> bt 66295 PID: 66295 TASK: ffff8acb3e7830c0 CPU: 5 COMMAND: "mdt01_042" #0 [ffff8acbebf47028] __schedule at ffffffff9276ab17 #1 [ffff8acbebf470b8] schedule at ffffffff9276b019 #2 [ffff8acbebf470c8] schedule_timeout at ffffffff92768b21 #3 [ffff8acbebf47170] io_schedule_timeout at ffffffff9276a6bd #4 [ffff8acbebf471a0] io_schedule at ffffffff9276a758 #5 [ffff8acbebf471b0] bit_wait_io at ffffffff92769141 #6 [ffff8acbebf471c8] __wait_on_bit_lock at ffffffff92768cf1 #7 [ffff8acbebf47208] out_of_line_wait_on_bit_lock at ffffffff92768f41 #8 [ffff8acbebf47280] __lock_buffer at ffffffff922792b2 #9 [ffff8acbebf47290] ldiskfs_commit_super at ffffffffc1483444 [ldiskfs] #10 [ffff8acbebf472d8] __ldiskfs_std_error at ffffffffc1483cbd [ldiskfs] #11 [ffff8acbebf47320] iam_txn_add at ffffffffc151033c [osd_ldiskfs] #12 [ffff8acbebf47340] iam_add_rec at ffffffffc1512e63 [osd_ldiskfs] #13 [ffff8acbebf473c0] iam_insert at ffffffffc15139ce [osd_ldiskfs] #14 [ffff8acbebf47530] osd_oi_iam_refresh at ffffffffc150a325 [osd_ldiskfs] #15 [ffff8acbebf47580] osd_oi_insert at ffffffffc150ceec [osd_ldiskfs] #16 [ffff8acbebf47600] osd_create at ffffffffc150558d [osd_ldiskfs] #17 [ffff8acbebf47690] lod_sub_create at ffffffffc114cd25 [lod] #18 [ffff8acbebf47738] lod_create at ffffffffc113d519 [lod] #19 [ffff8acbebf47778] mdd_create_object_internal at ffffffffc11d7f23 [mdd] #20 [ffff8acbebf477b0] mdd_create_object at ffffffffc11c12fb [mdd] #21 [ffff8acbebf47828] mdd_create at ffffffffc11cb351 [mdd] #22 [ffff8acbebf47920] mdt_reint_open at ffffffffc1278da3 [mdt] #23 [ffff8acbebf47a20] mdt_reint_rec at ffffffffc126b663 [mdt] #24 [ffff8acbebf47a48] mdt_reint_internal at ffffffffc1245543 [mdt] #25 [ffff8acbebf47a88] mdt_intent_open at ffffffffc1251ea3 [mdt] IAM index frame had much more entries than limit was. crash> iam_iterator ffff8acbebf473c8
struct iam_iterator {
ii_flags = 2,
ii_state = IAM_IT_ATTACHED,
ii_path = {
ip_container = 0xffff8aaced341c08,
ip_indirect = 1,
ip_frames = {{
bh = 0xffff8acc9c734208,
entries = 0xffff8acca34b8010,
at = 0xffff8acca34b8024,
leaf = 203,
curidx = 0,
at_shifted = 0
}, {
bh = 0xffff8a9871fe7f48,
entries = 0xffff8a9b27df0000,
at = 0xffff8a9b27df10f4,
leaf = 640,
curidx = 203,
at_shifted = 0
}, {
bh = 0x0,
...
dx_countlimit 0xffff8a9b27df0000
struct dx_countlimit {
limit = 204,
count = 640
}
IAM index started at 0xffff8a9b27df0000 and ends at ffff8a9b27df3xxx. It occupied 3+ pages instead of one. Base on b_end_io=0, buffer head for it was not red from a disk, it was a new. So r/c is not a disk corruption. I have not found a root cause at IAM sources, but all imprints lead to iam_add_rec()->split_index_node()/iam_new_leaf(). |
| Comment by Andreas Dilger [ 13/Jul/21 ] |
|
If this IAM code is based on the upstream ext4/ldiskfs htree code, then there were a few patches from Eric Sandeen that fixed bugs in the htree split code. It would be worthwhile to do a "git log fs/ext4/htree.c" to see what has changed since this code was created (which was at least 10 years ago, maybe Lustre 2.0?). |
| Comment by Gerrit Updater [ 28/Sep/21 ] |
|
"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45072 |
| Comment by Gerrit Updater [ 10/Oct/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45072/ |