Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2041

SWL ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.4.0
    • None
    • SWL - Hyperion/LLNL
    • 3
    • 4217

    Description

      Running SWL test, after ~6 hours seeing ldiskfs failure and EROFS on the MDT.

       
      Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4918:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
      Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4233:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
      Sep 27 20:16:39 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
      Sep 27 20:16:39 hyperion-rst6 kernel: Remounting filesystem read-only
      Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
      

      Ran fsck on device, recovered

      LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
      LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
      Lustre: MGC192.168.127.6@o2ib: Reactivating import
      Lustre: lustre-MDT0000: used disk, loading
      Lustre: 5311:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 4541171590
      Lustre: lustre-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
      Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Skipped 2 previous similar messages
      Lustre: lustre-MDT0000: Will be in recovery for at least 2:30, or until 91 clients reconnect
      LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541171591, ql: 62, comp: 29, conn: 91, next: 4541177085, last_committed: 4541171590)
      LustreError: 5315:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
      LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541177086, ql: 62, comp: 29, conn: 91, next: 4541181324, last_committed: 4541171590)
      Lustre: lustre-MDT0000: disconnecting 1 stale clients
      Lustre: lustre-MDT0000: Recovery over after 1:24, of 91 clients 90 recovered and 1 was evicted.
      Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0028_UUID now active, resetting orphans
      
      

      Will continue to run, see if we get repeat.

      Attachments

        Issue Links

          Activity

            People

              yong.fan nasf (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: