  Lustre / LU-8252

MDS kernel panic after aborting journal

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.5.3
    • Labels: None
    • Environment: CentOS 6.5
      Linux 2.6.32-431.23.3.el6_lustre.x86_64
    • Severity: 3

    Description

      We're having an issue with our MDS crashing. This started after recovering from a full MDT filesystem. We've been deleting files from storage to free up metadata space, but have run into these kernel panics.

      dmesg logs have the following:

      <2>LDISKFS-jfs error (device md0): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 0 corrupted: 57 blocks free in bitmap, 6 - in gd
      <4>
      <3>Aborting journal on device md0-8.
      <2>LDISKFS-fs error (device md0): ldiskfs_journal_start_sb: Detected aborted journal
      <2>LDISKFS-fs error (device md0) in iam_txn_add: Journal has aborted
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <2>LDISKFS-fs (md0): Remounting filesystem read-only
      <3>LustreError: 6919:0:(osd_io.c:1173:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      <3>LustreError: 6919:0:(osd_handler.c:1054:osd_trans_stop()) Failure in transaction hook: -30
      <3>LustreError: 6919:0:(osd_handler.c:1063:osd_trans_stop()) Failure to stop transaction: -30
      <2>LDISKFS-fs error (device md0): ldiskfs_mb_new_blocks: Updating bitmap error: [err -30] [pa ffff8860350c8ba8] [phy 34992896] [logic 256] [len 256] [free 256] [error 1] [inode 1917]
      <3>LustreError: 8967:0:(osd_io.c:1166:osd_ldiskfs_write_record()) md0: error reading offset 2093056 (block 511): rc = -30
      <3>LustreError: 8967:0:(llog_osd.c:156:llog_osd_write_blob()) echo-MDT0000-osd: error writing log record: rc = -30
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <2>LDISKFS-fs error (device md0) in start_transaction: Journal has aborted
      <3>LustreError: 8967:0:(llog_cat.c:356:llog_cat_add_rec()) llog_write_rec -30: lh=ffff88601d1e4b40
      <4>
      <3>LustreError: 5801:0:(osd_handler.c:863:osd_trans_commit_cb()) transaction @0xffff882945fc28c0 commit error: 2
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 11 changes, 31 in progress, 0 in flight: -5
      <0>LustreError: 6145:0:(osp_sync.c:874:osp_sync_thread()) LBUG
      <4>Pid: 6145, comm: osp-syn-98-0
      <4>
      <4>Call Trace:
      <4> [<ffffffffa03b3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa03b3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <3>LustreError: 6135:0:(llog.c:159:llog_cancel_rec()) echo-OST005d-osc-MDT0000: fail to write header for llog #0x5552:1#00000000: rc = -30
      <3>LustreError: 6135:0:(llog_cat.c:538:llog_cat_cancel_records()) echo-OST005d-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      <3>LustreError: 6135:0:(osp_sync.c:721:osp_sync_process_committed()) echo-OST005d-osc-MDT0000: can't cancel record: -30
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6145, comm: osp-syn-98-0 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa03b3eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0eff2e3>] ? osp_sync_thread+0x753/0x7d0 [osp]
      <4> [<ffffffff81528df6>] ? schedule+0x176/0x3b0
      <4> [<ffffffffa0efeb90>] ? osp_sync_thread+0x0/0x7d0 [osp]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
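
      The rc = -30 that repeats through this log is -EROFS ("Read-only file system"), returned by ldiskfs once the journal abort forces the device read-only. The mapping can be confirmed with a one-liner (assumes python3 is available on the node):

      ```shell
      # Decode errno 30 to its symbolic name and message
      python3 -c "import errno, os; print(errno.errorcode[30], '-', os.strerror(30))"
      # EROFS - Read-only file system
      ```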
      

      Attachments

        1. dir.3040644673.bin
          4 kB
        2. inode.3040644673.bin
          4 kB
        3. vmcore-dmesg.20160607.txt
          138 kB
        4. vmcore-dmesg.20160608.txt
          157 kB

        Issue Links

          Activity


            We're quickly reaching the point where we're going to have to consider clearing things and starting from scratch which is something we'd really rather not do. We'd appreciate any other options you can present. If remote access would be useful, we can provide that.

            If clearing things is the only real option here, can we provide any extra info to determine why this might have happened? As far as we can tell, the only thing that happened was the MDT filling up.

            sfw Stephen Willey (Inactive) added a comment -

            The e2fsck seemed to stop at the same place again:

            emds1 /root # e2fsck -fvy -C 0 /dev/md0
            e2fsck 1.42.13.wc5 (15-Apr-2016)
            echo-MDT0000: recovering journal
            Pass 1: Checking inodes, blocks, and sizes
            Pass 2: Checking directory structure
            Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map
            Internal error: couldn't find dir_info for 3040644673.
            e2fsck: aborted

            Any suggestions...?

            sfw Stephen Willey (Inactive) added a comment -

            It looks like you will need to run "e2fsck -fy" again to repair the corrupted directory. I'm not sure why e2fsck aborted; I haven't seen a problem like that before.

            adilger Andreas Dilger added a comment -
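
            Rerunning e2fsck until it checks clean can be scripted. A minimal sketch (the device path is the MDT device from this ticket, and the exit-code handling follows e2fsck's documented convention: 0 = clean, 1/2 = errors corrected, 4 and up = errors remain or the run failed):

            ```shell
            #!/bin/sh
            # Repeat e2fsck until the filesystem checks clean (exit code 0).
            # /dev/md0 is the MDT device from this ticket; adjust to suit.
            DEV=${1:-/dev/md0}

            for pass in 1 2 3 4 5; do
                e2fsck -fy "$DEV"
                rc=$?
                echo "pass $pass: e2fsck exit code $rc"
                # 0 = clean; stop here
                [ "$rc" -eq 0 ] && exit 0
                # 1/2 = errors were corrected, 4+ = aborted or uncorrected;
                # either way, run again, as in this ticket where e2fsck
                # stopped partway through Pass 2
            done
            echo "filesystem still not clean after 5 passes" >&2
            exit 1
            ```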
            pjones Peter Jones added a comment -

            Yang Sheng

            Could you please assist with this issue?

            Thanks

            Peter


            Looks like the e2fsck failed out with:

            Inode 3002126196 has a bad extended attribute block 14139.  Clear? yes
            
            Inode 3002126196, i_blocks is 8, should be 0.  Fix? yes
            
            Inode 3002126200 has a bad extended attribute block 14140.  Clear? yes
            
            Inode 3002126200, i_blocks is 8, should be 0.  Fix? yes
            
            Pass 2: Checking directory structure
            Illegal inode number passed to ext2fs_test_inode_bitmap #0 for in-use inode map
            Directory inode 3040644673, block #0, offset 0: directory corrupted
            Salvage? yes
            
            First entry '' (inode=14094) in directory inode 3040644673 (???) should be '.'
            Fix? yes
            
            Setting filetype for entry '.' in ??? (3040644673) to 2.
            Missing '..' in directory inode 3040644673.
            Fix? yes
            
            Setting filetype for entry '..' in ??? (3040644673) to 2.
            Directory inode 3040644678, block #0, offset 0: directory corrupted
            Salvage? yes
            
            Missing '.' in directory inode 3040644678.
            Fix? yes
            
            Setting filetype for entry '.' in ??? (3040644678) to 2.
            Missing '..' in directory inode 3040644678.
            Fix? yes
            
            Setting filetype for entry '..' in ??? (3040644678) to 2.
            Internal error: couldn't find dir_info for 3040644597.
            
            echo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
            e2fsck: aborted
            
            echo-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
            emds1 /root #

            I've subsequently mounted it successfully albeit with warnings:

            Jun 10 08:44:13 emds1 kernel: LDISKFS-fs (md0): warning: mounting fs with errors, running e2fsck is recommended
            Jun 10 08:44:13 emds1 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. quota=off. Opts: 
            Jun 10 08:44:13 emds1 kernel: Lustre: echo-MDT0000: used disk, loading
            Jun 10 08:44:14 emds1 kernel: Lustre: MGS: non-config logname received: params
            Jun 10 08:44:14 emds1 kernel: Lustre: Skipped 1 previous similar message
            Jun 10 08:44:14 emds1 kernel: LustreError: 11-0: echo-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
            Jun 10 08:44:14 emds1 kernel: LustreError: 94030:0:(mdt_handler.c:6274:mdt_iocontrol()) echo-MDT0000: Aborting recovery for device
            Jun 10 08:44:19 emds1 kernel: Lustre: MGS: non-config logname received: params
            Jun 10 08:44:19 emds1 kernel: Lustre: Skipped 1 previous similar message

            We're running some large rsyncs onto it now but I expect more issues of course.
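
            Since the filesystem mounted with the error flag still set ("mounting fs with errors"), the superblock state is worth checking once e2fsck eventually finishes clean; dumpe2fs (part of e2fsprogs) reports it directly:

            ```shell
            # Print the superblock state; "clean" is good, while
            # "clean with errors" means another e2fsck run is still needed.
            dumpe2fs -h /dev/md0 2>/dev/null | grep '^Filesystem state'
            ```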

            sfw Stephen Willey (Inactive) added a comment -

            The core of the problem is an inconsistency in the MDT filesystem that caused the filesystem to be remounted read-only (-30 = -EROFS) to prevent cascading corruption. This will need a full e2fsck run to repair: e2fsck -fy /dev/<MDTDEV>. Please ensure you have the latest e2fsprogs-1.42.13.wc5 installed, which fixes a number of issues in older versions. You may want to consider doing a dry-run e2fsck -fn /dev/<MDTDEV> to log and review what problems are reported before starting the repair. If you want to be especially careful, you can create a snapshot of the MDT device, or do a full device-level backup (dd if=/dev/<MDTDEV> of=/dev/backup bs=1M) in case of problems.

            The LBUG is the result of the read-only filesystem not being handled robustly, which is fixed by the patch http://review.whamcloud.com/19856 "LU-6696 llog: improve error handling". That said, there is little benefit to applying the patch to your system once the MDT problem is resolved, since by the time this assertion is hit the read-only MDT is not very useful anymore.
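
            The precautions above can be combined into one sequence. A sketch, where /dev/md0 stands in for <MDTDEV> and the backup and log paths are assumptions to adjust:

            ```shell
            #!/bin/sh
            # Cautious MDT repair sequence, per the advice above. The backup
            # target /backup/mdt.img is an assumed path; size it to hold the
            # whole MDT device.
            set -eu
            DEV=/dev/md0

            # 1. Device-level backup so the repair can be rolled back.
            dd if="$DEV" of=/backup/mdt.img bs=1M

            # 2. Dry run: -n answers "no" to every prompt, so nothing is
            #    written; review the log before the real repair.
            e2fsck -fn "$DEV" 2>&1 | tee /root/e2fsck-dryrun.log || true

            # 3. The actual repair.
            e2fsck -fy "$DEV"
            ```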

            adilger Andreas Dilger added a comment -

            People

              ys Yang Sheng
              cyb Cory Brassington (Inactive)
              Votes: 0
              Watchers: 8
