
[LU-12268] LDISKFS-fs error: ldiskfs_find_dest_de:2066: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=201, rec_len=0, name_len=0

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      Hello,
      I've hit an ldiskfs error today while doing more 'mdtest' benchmarking runs against our all-flash NVMe-based filesystem running Lustre 2.12.0. Below is the syslog output from the affected server:

      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_find_dest_de:2066: inode #25165841: block 15874389: comm mdt00_054: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=201, rec_len=0, name_len=0
      May 07 16:47:23 dac-e-3 kernel: Aborting journal on device dm-5-8.
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs (dm-5): Remounting filesystem read-only
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5) in ldiskfs_write_dquot:5425: IO failure
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0002: can't update reply_data file: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_io.c:2059:osd_ldiskfs_write_record()) Skipped 1 previous similar message
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0002: failed in transaction hook: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275116:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30
      May 07 16:47:23 dac-e-3 kernel: LustreError: 275209:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0002: failed to stop transaction: rc = -30
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: Quota error (device dm-5): qtree_write_dquot: dquota write failed
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 181259:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff913ee0551b00
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LDISKFS-fs error (device dm-5): ldiskfs_journal_check_start:56: Detected aborted journal
      May 07 16:47:23 dac-e-3 kernel: LustreError: 136266:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff913cc5e73e00 commit error: 2
      
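      For context on the message itself: a rec_len of 0 can never be valid, since every directory entry needs at least its 8-byte header plus one byte of name. The check that emits this error is sketched below, paraphrased from the upstream ext4 directory-entry validation (__ext4_check_dir_entry in fs/ext4/dir.c); ldiskfs carries the same logic under LDISKFS_-prefixed names:

      /* Sketch of the ext4/ldiskfs directory-entry sanity check that
       * produces the message above (paraphrased, not verbatim). */
      unsigned int rlen = ext4_rec_len_from_disk(de->rec_len, blocksize);

      if (rlen < EXT4_DIR_REC_LEN(1))                  /* rec_len=0 lands here */
              error_msg = "rec_len is smaller than minimal";
      else if (rlen % 4 != 0)
              error_msg = "rec_len % 4 != 0";
      else if (rlen < EXT4_DIR_REC_LEN(de->name_len))
              error_msg = "rec_len is too small for name_len";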

      I described the environment of this filesystem in more detail in https://jira.whamcloud.com/browse/LU-12265 - I'm doing the exact same tests as in that ticket as part of an IO500 submission. Specifically, I'm seeing problems when running mdtest, but this appeared to be a different error, so I thought I would raise a new ticket.

      The filesystem itself is configured using DNE, and specifically we are using DNE2 striped directories for all mdtest runs. We are using a large number of MDTs, 24 at the moment, one per server (which, other than this problem, is working excellently), and the directory stripe count is '-1', so all directories are striped over all 24 MDTs. Each server contains 12 NVMe drives, and we partition one of the drives so that it carries both an OST and an MDT partition.

      Lustre and Kernel versions are as follows:

      Server: kernel-3.10.0-957.el7_lustre.x86_64
      Server: lustre-2.12.0-1.el7.x86_64
      
      Clients: kernel-3.10.0-957.10.1.el7.x86_64
      Clients: lustre-client-2.10.7-1.el7.x86_64
      

      The test itself:

      mdtest-1.9.3 was launched with 4096 total task(s) on 128 node(s)
      Command line used: /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/mdtest "-C" "-n" "70000" "-u" "-L" "-F" "-d" "/dac/fs1/mjr208/job11335245-2019-05-07-1537/mdt_easy"
      Path: /dac/fs1/mjr208/job11335245-2019-05-07-1537
      FS: 412.6 TiB   Used FS: 24.2%   Inodes: 960.0 Mi   Used Inodes: 0.0%
      
      4096 tasks, 286720000 files
      

      Is there anything in the above log that indicates the cause of the problem? Do I perhaps have a bad device here, or is this a bug?


          Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45072/
            Subject: LU-12268 osd: BUG_ON for IAM corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5daf86607877ea81d0295a9d49a1fe06572e0352

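            The patch subject suggests a hard assertion when an IAM index node is found in an inconsistent state. A minimal sketch of that kind of check, assuming a count/limit header at the start of the index block (an illustration only; the actual change is at the Gerrit URL above):

            /* Hypothetical illustration of a "BUG_ON for IAM corruption" check;
             * see https://review.whamcloud.com/45072 for the real patch. */
            static void iam_frame_check(const struct iam_frame *frame)
            {
                    const struct dx_countlimit *cl =
                            (const struct dx_countlimit *)frame->entries;

                    /* an index node must never hold more entries than fit in one block */
                    BUG_ON(le16_to_cpu(cl->count) > le16_to_cpu(cl->limit));
            }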

            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45072
            Subject: LU-12268 osd: BUG_ON for IAM corruption
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3cc624012d272088f98f3f91e155d562a0eff066


            adilger Andreas Dilger added a comment:

            If this IAM code is based on the upstream ext4/ldiskfs htree code, then there were a few patches from Eric Sandeen that fixed bugs in the htree split code. It would be worthwhile to do a "git log fs/ext4/htree.c" to see what has changed since this code was created (which was at least 10 years ago, maybe Lustre 2.0?).


            aboyko Alexander Boyko added a comment:

            I've been working on a similar issue, and I'm adding my findings.

             
            [2401623.056209] LDISKFS-fs error (device md1): ldiskfs_find_dest_de:2066: inode #1943585330: block 1932812092: comm mdt01_029: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=792330240, rec_len=0, name_len=0
            

            direntry had a next data

            ffff8a9b27df3000:  00 00 3a 2f 00 00 00 00 2e 02 00 00 00 00 00 02   ..:/............
            ffff8a9b27df3010:  40 00 71 f2 00 00 3a 84 00 00 00 00 32 02 00 00   @.q...:.....2...
            ffff8a9b27df3020:  00 00 00 02 40 00 71 f2 00 00 3a d9 00 00 00 00   ....@.q...:.....
            ffff8a9b27df3030:  35 02 00 00 00 00 00 02 40 00 71 f2 00 00 3b 2e   5.......@.q...;.
            ffff8a9b27df3040:  00 00 00 00 38 02 00 00 00 00 00 02 40 00 71 f2   ....8.......@.q.
            ffff8a9b27df3050:  00 00 3b 83 00 00 00 00 3b 02 00 00 00 00 00 02   ..;.....;.......
            

            It was corrupted with a specific pattern: an lu_fid plus a u32. From offset 0xc: [0x2400071f2:0x3a84:0x0], then 0x0232.
            This looks very close to the dir_data FIDs, but it is not encapsulated in a direntry struct. I've checked the previous memory page, and it contains the same pattern, as do two more pages before it.
            So it is a simple buffer overflow. After searching memory for the affected address, only one thread was found:
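            For reference, the 20-byte period of the pattern matches struct lu_fid (16 bytes, as defined in the Lustre headers) followed by a 32-bit value; decoding the dump this way (my reading of the bytes above, with the FID stored big-endian):

            /* struct lu_fid as defined in the Lustre headers. */
            struct lu_fid {
                    __u64 f_seq;    /* sequence: 0x2400071f2 in the dump */
                    __u32 f_oid;    /* object id: 0x3a84, 0x3ad9, 0x3b2e, ... */
                    __u32 f_ver;    /* version: 0 */
            };

            /* Decoding ffff8a9b27df3000 + 0xc:
             *   00 00 00 02 40 00 71 f2  -> f_seq = 0x2400071f2  (big-endian)
             *   00 00 3a 84              -> f_oid = 0x3a84       (big-endian)
             *   00 00 00 00              -> f_ver = 0
             *   32 02 00 00              -> trailing u32 = 0x232 (little-endian)
             * i.e. [0x2400071f2:0x3a84:0x0] + 0x232, repeating every 20 bytes.
             */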

            crash> bt 66295
            PID: 66295  TASK: ffff8acb3e7830c0  CPU: 5   COMMAND: "mdt01_042"
             #0 [ffff8acbebf47028] __schedule at ffffffff9276ab17
             #1 [ffff8acbebf470b8] schedule at ffffffff9276b019
             #2 [ffff8acbebf470c8] schedule_timeout at ffffffff92768b21
             #3 [ffff8acbebf47170] io_schedule_timeout at ffffffff9276a6bd
             #4 [ffff8acbebf471a0] io_schedule at ffffffff9276a758
             #5 [ffff8acbebf471b0] bit_wait_io at ffffffff92769141
             #6 [ffff8acbebf471c8] __wait_on_bit_lock at ffffffff92768cf1
             #7 [ffff8acbebf47208] out_of_line_wait_on_bit_lock at ffffffff92768f41
             #8 [ffff8acbebf47280] __lock_buffer at ffffffff922792b2
             #9 [ffff8acbebf47290] ldiskfs_commit_super at ffffffffc1483444 [ldiskfs]
            #10 [ffff8acbebf472d8] __ldiskfs_std_error at ffffffffc1483cbd [ldiskfs]
            #11 [ffff8acbebf47320] iam_txn_add at ffffffffc151033c [osd_ldiskfs]
            #12 [ffff8acbebf47340] iam_add_rec at ffffffffc1512e63 [osd_ldiskfs]
            #13 [ffff8acbebf473c0] iam_insert at ffffffffc15139ce [osd_ldiskfs]
            #14 [ffff8acbebf47530] osd_oi_iam_refresh at ffffffffc150a325 [osd_ldiskfs]
            #15 [ffff8acbebf47580] osd_oi_insert at ffffffffc150ceec [osd_ldiskfs]
            #16 [ffff8acbebf47600] osd_create at ffffffffc150558d [osd_ldiskfs]
            #17 [ffff8acbebf47690] lod_sub_create at ffffffffc114cd25 [lod]
            #18 [ffff8acbebf47738] lod_create at ffffffffc113d519 [lod]
            #19 [ffff8acbebf47778] mdd_create_object_internal at ffffffffc11d7f23 [mdd]
            #20 [ffff8acbebf477b0] mdd_create_object at ffffffffc11c12fb [mdd]
            #21 [ffff8acbebf47828] mdd_create at ffffffffc11cb351 [mdd]
            #22 [ffff8acbebf47920] mdt_reint_open at ffffffffc1278da3 [mdt]
            #23 [ffff8acbebf47a20] mdt_reint_rec at ffffffffc126b663 [mdt]
            #24 [ffff8acbebf47a48] mdt_reint_internal at ffffffffc1245543 [mdt]
            #25 [ffff8acbebf47a88] mdt_intent_open at ffffffffc1251ea3 [mdt]
            

            The IAM index frame held far more entries than its limit allows.

            crash> iam_iterator ffff8acbebf473c8
            struct iam_iterator {
              ii_flags = 2,
              ii_state = IAM_IT_ATTACHED,
              ii_path = {
                ip_container = 0xffff8aaced341c08,
                ip_indirect = 1,
                ip_frames = {{
                    bh = 0xffff8acc9c734208,
                    entries = 0xffff8acca34b8010,
                    at = 0xffff8acca34b8024,
                    leaf = 203,
                    curidx = 0,
                    at_shifted = 0
                  }, {
                    bh = 0xffff8a9871fe7f48,
                    entries = 0xffff8a9b27df0000,
                    at = 0xffff8a9b27df10f4,
                    leaf = 640,
                    curidx = 203,
                    at_shifted = 0
                  }, {
                    bh = 0x0,
            ...
            
            dx_countlimit 0xffff8a9b27df0000
            struct dx_countlimit {
              limit = 204,
              count = 640
            }
            

            The IAM index started at 0xffff8a9b27df0000 and ends around ffff8a9b27df3xxx, so it occupies 3+ pages instead of one. Based on b_end_io = 0, the buffer head for it was not read from disk; it was newly allocated. So the root cause is not on-disk corruption. I have not found the root cause in the IAM sources, but all imprints lead to iam_add_rec()->split_index_node()/iam_new_leaf().
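            For reference, the count/limit header at the start of an htree/IAM index block is just two 16-bit fields (this definition is from upstream ext4; ldiskfs uses the same layout), so count = 640 against limit = 204 means this node grew to roughly three times the capacity of a single 4 KiB block, which matches the 3+ pages observed:

            /* From fs/ext4/namei.c; ldiskfs uses the same on-disk layout. */
            struct dx_countlimit {
                    __le16 limit;   /* max entries fitting in one block: 204 here */
                    __le16 count;   /* entries in use: 640 here, i.e. ~3.1 blocks */
            };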
             


            jgmitter Joseph Gmitter (Inactive) added a comment:

            Hi Hongchao,

            Can you please advise?

            Thanks.
            Joe


            mrb Matt Rásó-Barnett (Inactive) added a comment:

            Looking over my server logs, I see another instance of this error from yesterday during similar runs, but on a different server/disk:

            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3): ldiskfs_find_dest_de:2066: inode #26738690: block 16857697: comm mdt01_053: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=191, rec_len=0, name_len=0
            May 06 12:57:51 dac-e-9 kernel: Aborting journal on device dm-3-8.
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs (dm-3): Remounting filesystem read-only
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3): ldiskfs_journal_check_start:56: Detected aborted journal
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs (dm-3): Remounting filesystem read-only
            May 06 12:57:51 dac-e-9 kernel: LDISKFS-fs error (device dm-3) in iam_txn_add:575: Journal has aborted
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_io.c:2059:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -30: lh=ffff9434a2333100
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(tgt_lastrcvd.c:1176:tgt_add_reply_data()) fs1-MDT0008: can't update reply_data file: rc = -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_handler.c:2007:osd_trans_stop()) fs1-MDT0008: failed in transaction hook: rc = -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 381727:0:(osd_handler.c:2017:osd_trans_stop()) fs1-MDT0008: failed to stop transaction: rc = -30
            May 06 12:57:51 dac-e-9 kernel: LustreError: 112648:0:(osd_handler.c:1707:osd_trans_commit_cb()) transaction @0xffff9431c4a39500 commit error: 2
            

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: mrb Matt Rásó-Barnett (Inactive)
              Votes: 0
              Watchers: 9
