
Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Labels: None
Environment: SWL - Hyperion/LLNL
Severity: 3
Rank (Obsolete): 4217

While running the SWL test, after ~6 hours we see an ldiskfs failure and EROFS (read-only filesystem) on the MDT.

 
Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4918:0:(mdt_recovery.c:622:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
Sep 27 20:13:20 hyperion-rst6 kernel: Lustre: 4233:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff880108e55000 x1414295525519816.t4537566185 o0 NID 192.168.116.112@o2ib1
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
Sep 27 20:16:39 hyperion-rst6 kernel: Aborting journal on device md1-8.
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4774:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs (md1): bad entry in directory #127928380: rec_len is smaller than minimal - block=64030811offset=536(536), inode=127946027, rec_len=0, name_len=4
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4865:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
Sep 27 20:16:39 hyperion-rst6 kernel: LustreError: 4731:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
Sep 27 20:16:39 hyperion-rst6 kernel: Remounting filesystem read-only
Sep 27 20:16:39 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
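
For anyone triaging similar logs, the repeated `-30` return codes above are the negated errno for EROFS, which matches the read-only remount. A minimal Python sketch for decoding such codes follows. The `decode_rc` helper and its regular expression are hypothetical and not part of Lustre, and the sketch assumes it runs on a Linux host whose userspace errno table matches the kernel's numbering.

```python
import errno
import os
import re

# Two lines copied from the log above (timestamp and host prefix dropped).
LOG_LINES = [
    "LustreError: 4774:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) "
    "error starting handle for op 10 (114 credits): rc -30",
    "LustreError: 4865:0:(osd_io.c:1014:osd_ldiskfs_write_record()) "
    "journal_get_write_access() returned error -30",
]

# Hypothetical pattern: a negative integer at the very end of the line.
RC_PATTERN = re.compile(r"(-\d+)\s*$")


def decode_rc(line):
    """Return (errno name, description) for a trailing negative rc, else None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = -int(match.group(1))  # kernel code reports errno values negated
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    for line in LOG_LINES:
        print(decode_rc(line))
```

The same helper applied to the `error -22` line in the recovery log below would report EINVAL, assuming the same errno numbering.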

Ran fsck on the device; recovered.

LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
LDISKFS-fs (md1): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: MGC192.168.127.6@o2ib: Reactivating import
Lustre: lustre-MDT0000: used disk, loading
Lustre: 5311:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 4541171590
Lustre: lustre-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
Lustre: 4326:0:(mgc_request.c:1534:mgc_process_recover_log()) Skipped 2 previous similar messages
Lustre: lustre-MDT0000: Will be in recovery for at least 2:30, or until 91 clients reconnect
LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541171591, ql: 62, comp: 29, conn: 91, next: 4541177085, last_committed: 4541171590)
LustreError: 5315:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
LustreError: 5315:0:(ldlm_lib.c:1717:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 4541177086, ql: 62, comp: 29, conn: 91, next: 4541181324, last_committed: 4541171590)
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: lustre-MDT0000: Recovery over after 1:24, of 91 clients 90 recovered and 1 was evicted.
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0028_UUID now active, resetting orphans

Will continue to run and see whether the failure repeats.

Is duplicated by: LU-2080 SWL -ldiskfs_add_entry: bad entry in directory #: rec_len is smaller than minimal (Resolved)
