Details
Bug
Resolution: Fixed
Minor
Lustre 2.5.3
None
CentOS 6.5, Linux 2.6.32-431.23.3.el6_lustre.x86_64
3
Description
Our MDS keeps crashing. This started after we recovered from a full MDT filesystem. We've been deleting files from storage to free up metadata space, but we keep running into these kernel panics.
The dmesg log shows the following:
<2>LDISKFS-jfs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0corrupted: 57 blocks free in bitmap, 6 - in gd
<4>
<3>Aborting journal on device md0-8.
<2>LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
<2>LDISKFS-fs error (device md0) in iam_txn_add: Journal has aborted
<2>LDISKFS-fs (md0): Remounting filesystem read-only
<2>LDISKFS-fs (md0): Remounting filesystem read-only
<3>LustreError: 6919:0:(osd_io.c:1173:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
<3>LustreError: 6919:0:(osd_handler.c:1054:osd_trans_stop()) Failure in transaction hook: -30
<3>LustreError: 6919:0:(osd_handler.c:1063:osd_trans_stop()) Failure to stop transaction: -30
<2>LDISKFS-fs error (device md0): ldiskfs_mb_new_blocks: Updating bitmap error: [err -30] [pa ffff8860350c8ba8] [phy 34992896] [logic 256] [len 256] [free 256] [error 1] [inode 1917]
<3>LustreError: 8967:0:(osd_io.c:1166:osd_ldiskfs_write_record()) md0: error reading offset 2093056 (block 511): rc = -30
<3>LustreError: 8967:0:(llog_osd.c:156:llog_osd_write_blob()) echo-MDT0000-osd: error writing log record: rc = -30
<2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
<2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
<3>LustreError: 8967:0:(llog_cat.c:356:llog_cat_add_rec()) llog_write_rec -30: lh=ffff88601d1e4b40
<4>
<3>LustreError: 5801:0:(osd_handler.c:863:osd_trans_commit_cb()) transaction @0xffff882945fc28c0 commit error: 2
<0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 11 changes, 31 in progress, 0 in flight: -5
<0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) LBUG
<4>Pid: 6145, comm: osp-syn-98-0
<4>
<4>Call Trace:
<4> [<ffffffffa03b3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa03b3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0eff2e3>] osp_sync_thread+0x753/0x7d0 [osp]
<4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
<4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<3>LustreError: 6135:0:(llog.c:159:llog_cancel_rec()) echo-OST005d-osc-MDT0000: fail to write header for llog #0x5552:1#00000000: rc = -30
<3>LustreError: 6135:0:(llog_cat.c:538:llog_cat_cancel_records()) echo-OST005d-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
<3>LustreError: 6135:0:(osp_sync.c:721:osp_sync_process_committed()) echo-OST005d-osc-MDT0000: can't cancel record: -30
<0>Kernel panic - not syncing: LBUG
<4>Pid: 6145, comm: osp-syn-98-0 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
<4> [<ffffffffa03b3eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0eff2e3>] ? osp_sync_thread+0x753/0x7d0 [osp]
<4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
<4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
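For anyone reading the log: the recurring -30 and -5 values are negated Linux errno codes. On Linux, 30 is EROFS (read-only file system) and 5 is EIO (I/O error), which fits the "Remounting filesystem read-only" messages that follow the journal abort. Below is a minimal Python sketch for decoding such codes from a log line. The helper name `decode_codes` and the regular expression are illustrative assumptions, not part of Lustre or any existing tool.

```python
import errno
import re

# Matches a negative integer that is not embedded in a longer token,
# so names such as "osp-syn-98-0" or "md0-8" are skipped.
NEGATIVE_CODE = re.compile(r"(?<![\w-])-(\d+)\b")


def decode_codes(line):
    """Return (code, errno name) pairs for each negative code in a log line."""
    pairs = []
    for match in NEGATIVE_CODE.finditer(line):
        code = int(match.group(1))
        # errno.errorcode maps numeric values to symbolic names on this platform.
        pairs.append((code, errno.errorcode.get(code, "UNKNOWN")))
    return pairs


if __name__ == "__main__":
    sample = (
        "<3>LustreError: 6919:0:(osd_handler.c:1063:osd_trans_stop()) "
        "Failure to stop transaction: -30"
    )
    print(decode_codes(sample))
```

The errno names come from the machine running the script, so this should be run on a Linux host to match the values the MDS kernel reported.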
We're quickly reaching the point where we'll have to consider clearing things and starting from scratch, which we'd really rather not do. We'd appreciate any other options you can suggest. If remote access would be useful, we can provide it.
If clearing things is the only real option, can we provide any extra information to help determine why this happened? As far as we can tell, the only thing that happened was the MDT filling up.