
MDS kernel panic after aborting journal

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Component/s: None
    • Environment: CentOS 6.5, Linux 2.6.32-431.23.3.el6_lustre.x86_64
    • Severity: 3

    Description

      We're having an issue with our MDS crashing. This started after recovering from a full MDT filesystem. We've been deleting files from storage to free up metadata space, but have run into these kernel panics.

      dmesg logs have the following:

      <2>LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0 corrupted: 57 blocks free in bitmap, 6 - in gd
      <4>
      <3>Aborting journal on device md0-8.
      <2>LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
      <2>LDISKFS-fs error (device md0) in iam_txn_add: Journal has aborted
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <3>LustreError: 6919:0:(osd_io.c:1173:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      <3>LustreError: 6919:0:(osd_handler.c:1054:osd_trans_stop()) Failure in transaction hook: -30
      <3>LustreError: 6919:0:(osd_handler.c:1063:osd_trans_stop()) Failure to stop transaction: -30
      <2>LDISKFS-fs error (device md0): ldiskfs_mb_new_blocks: Updating bitmap error: [err -30] [pa ffff8860350c8ba8] [phy 34992896] [logic 256] [len 256] [free 256] [error 1] [inode 1917]
      <3>LustreError: 8967:0:(osd_io.c:1166:osd_ldiskfs_write_record()) md0: error reading offset 2093056 (block 511): rc = -30
      <3>LustreError: 8967:0:(llog_osd.c:156:llog_osd_write_blob()) echo-MDT0000-osd: error writing log record: rc = -30
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <3>LustreError: 8967:0:(llog_cat.c:356:llog_cat_add_rec()) llog_write_rec -30: lh=ffff88601d1e4b40
      <4>
      <3>LustreError: 5801:0:(osd_handler.c:863:osd_trans_commit_cb()) transaction @0xffff882945fc28c0 commit error: 2
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 11 changes, 31 in progress, 0 in flight: -5
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) LBUG
      <4>Pid: 6145, comm: osp-syn-98-0
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03b3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03b3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <3>LustreError: 6135:0:(llog.c:159:llog_cancel_rec()) echo-OST005d-osc-MDT0000: fail to write header for llog #0x5552:1#00000000: rc = -30
      <3>LustreError: 6135:0:(llog_cat.c:538:llog_cat_cancel_records()) echo-OST005d-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      <3>LustreError: 6135:0:(osp_sync.c:721:osp_sync_process_committed()) echo-OST005d-osc-MDT0000: can't cancel record: -30
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6145, comm: osp-syn-98-0 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa03b3eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] ? osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        1. dir.3040644673.bin
          4 kB
        2. inode.3040644673.bin
          4 kB
        3. vmcore-dmesg.20160607.txt
          138 kB
        4. vmcore-dmesg.20160608.txt
          157 kB


          Activity

            [LU-8252] MDS kernel panic after aborting journal

            adilger Andreas Dilger added a comment -

            There were a couple of patches landed for 2.8.0 that make ldiskfs more robust in the face of similar corruption:

            http://review.whamcloud.com/16679 "LU-1026 ldiskfs: make bitmaps corruption not fatal"
            http://review.whamcloud.com/16312 "LU-7114 ldiskfs: corrupted bitmaps handling patches"

            adilger Andreas Dilger added a comment -

            The ext2fs_warn_bitmap() function is the source of the message "Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map", but it turns out this isn't the reason e2fsck aborted, since in the first run this message appeared some time before the abort. The abort is caused by the later error "Internal error: couldn't find dir_info for 3040644673", which was one of the inodes repaired in a previous run.

            The "Internal error" message did appear for a different inode the first time, so it does seem possible that e2fsck is repairing these inodes on each run, but not as thoroughly as it should, causing an abort each run.

            ys Yang Sheng added a comment -

            Looking into the e2fsck code, it fails at:

            int ext2fs_test_inode_bitmap_range(ext2fs_inode_bitmap bitmap,
                                               ino_t inode, int num)
            {
                    EXT2_CHECK_MAGIC(bitmap, EXT2_ET_MAGIC_INODE_BITMAP);
                    if ((inode < bitmap->start) || (inode+num-1 > bitmap->real_end)) {
                            ext2fs_warn_bitmap(EXT2_ET_BAD_INODE_TEST,
                                               inode, bitmap->description);
                            return 0;
                    }
                    return ext2fs_test_clear_generic_bitmap_range((ext2fs_generic_bitmap)
                                                                  bitmap, inode, num);
            }
            
            

            So it looks like the metadata is inconsistent. It may pass after being fixed manually.


            adilger Andreas Dilger added a comment -

            Looking into this issue, it appears that the MDT filesystem is becoming full because of the use of hard-link trees for backup. This results in each file having a large link xattr that spills into an external block, as well as multiple directories referencing each file.

            Inode count:              3079569408
            Free inodes:              2166646521  == 912922887 files used
            Block count:              1539776448
            Free blocks:              41729879 == 159GB free, 5992186276 KB used == 5714 GB used, 6721 bytes used/inode
            

            It would normally not be possible to have more than about 4500 bytes used per inode, even with the external xattr block, but the extra directory trees are consuming this space.
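            For reference, the arithmetic behind those figures can be reproduced as a quick sanity check (not from the ticket; it only reuses the superblock numbers above and assumes the MDT's 4 KB ldiskfs block size):

            # values taken from the superblock summary above
            total_blocks=1539776448
            free_blocks=41729879
            total_inodes=3079569408
            free_inodes=2166646521

            used_kb=$(( (total_blocks - free_blocks) * 4 ))       # 5992186276 KB used
            echo "$(( used_kb / 1024 / 1024 )) GB used"           # ~5714 GB
            echo "$(( free_blocks * 4 / 1024 / 1024 )) GB free"   # ~159 GB
            files_used=$(( total_inodes - free_inodes ))          # 912922887 files in use
            echo "$(( used_kb * 1024 / files_used )) bytes used per inode"   # ~6721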

            A file taken at random has a link count of 10 and an external xattr block (the File ACL block):

            debugfs:  stat F_006_fx_smokep
            F_006_fx_smokep: File not found by ext2_lookup
            debugfs:  stat F_006_fx_smokeplumes_0010_comp_v001_01.nk
            Inode: 2467900774   Type: regular    Mode:  0666   Flags: 0x0
            Generation: 3837434978    Version: 0x00000014:f756e940
            User:     0   Group:     0   Size: 0
            File ACL: 1234063101    Directory ACL: 0
            Links: 10   Blockcount: 8
            Fragment:  Address: 0    Number: 0    Size: 0
             ctime: 0x57541f8e:00000000 -- Sun Jun  5 05:48:14 2016
             atime: 0x571b5195:00000000 -- Sat Apr 23 03:42:29 2016
             mtime: 0x539eb883:00000000 -- Mon Jun 16 02:27:31 2014
            crtime: 0x571b5195:7f0d72f8 -- Sat Apr 23 03:42:29 2016
            Size of extra inode fields: 28
            Extended attributes stored in inode body:
              lma = "00 00 00 00 00 00 00 00 40 66 00 00 02 00 00 00 81 b6 01 00 00 00 00 00
             " (24)
              lma: fid=[0x200006640:0x1b681:0x0] compat=0 incompat=0
              lov = "d0 0b d1 0b 01 00 00 00 81 b6 01 00 00 00 00 00 40 66 00 00 02 00 00 00
             00 00 10 00 01 00 00 00 6c 7c 0e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
            0 00 56 00 00 00 " (56)
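
            (For background on the "hard-link trees for backup" pattern mentioned above: each snapshot directory hard-links unchanged files against the previous snapshot, so a file that survives ten snapshots ends up with ten parent directories, a link count of 10, and ten entries in its Lustre link xattr. A minimal illustration, assuming rsync-style snapshots; the paths and file name below are hypothetical:)

            # first snapshot: full copy
            rsync -a /scratch/project/ /scratch/backup/2016-06-01/
            # later snapshots: unchanged files are hard-linked from the previous snapshot
            rsync -a --link-dest=/scratch/backup/2016-06-01 /scratch/project/ /scratch/backup/2016-06-02/
            # each extra snapshot adds one hard link (and one link-xattr entry) per unchanged file
            stat -c '%h hard links: %n' /scratch/backup/2016-06-02/some_file.nk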
            

            As for the e2fsck problem, I haven't been able to debug it yet because the problem takes several hours to hit. In the meantime, I've fixed the (first) problem that was causing the MDT to be remounted read-only:

            Jun  8 00:56:41 emds1 kernel: LDISKFS-fs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0 corrupted: 57 blocks free in bitmap, 6 in gd
            Jun  8 00:56:41 emds1 kernel: 
            Jun  8 00:56:41 emds1 kernel: Aborting journal on device md0-8.
            Jun  8 00:56:41 emds1 kernel: LDISKFS-fs (md0): Remounting filesystem read-only
            

            I manually marked all of the blocks in group 0 used, and changed the group summary to match, as well as recomputed the block group checksum, which is the safest workaround given that I don't know which block(s) are actually in use, or which of those values is correct. It isn't clear if there are more errors like this, but I verified the next few groups had consistent block counts in the bitmap and group descriptors.
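
            (A rough sketch of the kind of debugfs session that fix corresponds to. The exact commands and field names here are assumptions rather than a record of what was actually run, and this should only ever be attempted on an unmounted device:)

            debugfs -w /dev/md0
            debugfs:  setb 0 32768                  # mark every block of group 0 in use (32768 blocks/group at 4 KB)
            debugfs:  set_bg 0 free_blocks_count 0  # make the group descriptor summary match the bitmap
            debugfs:  set_bg 0 checksum calc        # recompute the block group checksum
            debugfs:  quit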

            There is an e2fsck read-only check running under GDB to hopefully be able to debug the problem, but it will take about 8h to hit the point of the prior corruption.
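
            (A minimal sketch of such a GDB run, assuming the abort goes through e2fsck's fatal_error() helper and that e2fsck was built with debug symbols:)

            gdb --args e2fsck -fn -C 0 /dev/md0
            (gdb) break fatal_error
            (gdb) run
            # ...several hours later, once the "Internal error" abort is reached:
            (gdb) bt full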

            pjones Peter Jones added a comment -

            Stephen

            I've sent you an email to get this going

            Peter


            sfw Stephen Willey (Inactive) added a comment -

            We can get that in place. We're actually in Vancouver, so Pacific hours. I'll set up SSH access and get back to you. Do you have an incoming IP I can restrict this to?

            Please mail me on sfw@dneg.com - Thanks.

            pjones Peter Jones added a comment -

            Stephen

            How long would it take to get remote access in place? I know that you are based in the UK so what hours could a contact on site be available to work in realtime with one of our engineers?

            Peter


            sfw Stephen Willey (Inactive) added a comment -

            We're quickly reaching the point where we'll have to consider clearing everything and starting from scratch, which is something we'd really rather not do. We'd appreciate any other options you can present. If remote access would be useful, we can provide that.

            If clearing things is the only real option here, can we provide any extra info to determine why this might have happened? As far as we can tell, the only thing that happened was the MDT filling up.


            sfw Stephen Willey (Inactive) added a comment -

            The e2fsck seemed to stop at the same place again:

            emds1 /root # e2fsck -fvy -C 0 /dev/md0
            e2fsck 1.42.13.wc5 (15-Apr-2016)
            echo-MDT0000: recovering journal
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map
            Internal error: couldn't find dir_info for 3040644673.
            e2fsck: aborted

            Any suggestions...?


            adilger Andreas Dilger added a comment -

            It looks like you will need to run "e2fsck -fy" again to repair the corrupted directory. I'm not sure why e2fsck aborted; I haven't seen a problem like that before.

            pjones Peter Jones added a comment -

            Yang Sheng

            Could you please assist with this issue?

            Thanks

            Peter


            People

              Assignee: ys Yang Sheng
              Reporter: cyb Cory Brassington (Inactive)
              Votes: 0
              Watchers: 8
