[LU-486] ldiskfs_valid_block_bitmap: Invalid block bitmap Created: 05/Jul/11  Updated: 17/Dec/14  Resolved: 02/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: David Vasil (Inactive) Assignee: Hongchao Zhang
Resolution: Won't Fix Votes: 0
Labels: patch
Environment:

Lustre 1.8.4ddn3.1
Kernel 2.6.18-194.32.1.el5
CentOS 5.x


Attachments: File hamster8.log     File messages.20120105.bz2     Text File widow2-e2fsck.log.clean.txt    
Severity: 2
Bugzilla ID: 23959
Rank (Obsolete): 6112

 Description   

The OSS throws an LDISKFS-fs error stating that it encountered an invalid block bitmap. This results in the OST being remounted read-only and requires a reboot of the OSS to recover. A subsequent 'e2fsck -fp <dev>' replays the journal and finds no errors on the OST.

This issue has been seen sporadically during internal stress testing by Bernd and by some customers in the field. It has been seen by other Lustre users as well and reported on the lustre-discuss list. There is a Bugzilla ticket open, but it has not had any support activity since November 2010. I'm opening a Jira bug so this can be worked on.

https://bugzilla.lustre.org/show_bug.cgi?id=23959

Logs from the start of the invalid block bitmap:
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521146] LDISKFS-fs error (device dm-21): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 57, block = 1867778
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521155] Aborting journal on device dm-21-8.
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521183] LDISKFS-fs error (device dm-21): ldiskfs_journal_start_sb: Detected aborted journal
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521188] LDISKFS-fs (dm-21): Remounting filesystem read-only
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521205] LustreError: 16643:0:(fsfilt-ldiskfs.c:1320:fsfilt_ldiskfs_write_record()) can't start transaction for 37 blocks (128 bytes)
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521212] LustreError: 16643:0:(filter.c:192:filter_finish_transno()) wrote trans 21483454236 for client 1279815f-edd6-33ed-a1d2-a6685e1060af at #1606: err = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521219] LustreError: 16643:0:(filter_io_26.c:520:filter_direct_io()) can't close transaction: -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521228] LDISKFS-fs error (device dm-21) in fsfilt_ldiskfs_commit: IO failure
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521303] LustreError: 16465:0:(fsfilt-ldiskfs.c:496:fsfilt_ldiskfs_brw_start()) can't get handle for 582 credits: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521314] LustreError: 16465:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521329] LustreError: 16618:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521337] LustreError: 16455:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521347] LustreError: 16645:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521687] LustreError: 16532:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.522014] LustreError: 16513:0:(fsfilt-ldiskfs.c:496:fsfilt_ldiskfs_brw_start()) can't get handle for 582 credits: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.522019] LustreError: 16513:0:(fsfilt-ldiskfs.c:496:fsfilt_ldiskfs_brw_start()) Skipped 4 previous similar messages
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.522026] LustreError: 16513:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.523770] LustreError: 16578:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.524759] LDISKFS-fs (dm-21): Remounting filesystem read-only
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529215] LDISKFS-fs error (device dm-21) in ldiskfs_ext_new_extent_cb: Journal has aborted
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529237] LustreError: 16405:0:(fsfilt-ldiskfs.c:1320:fsfilt_ldiskfs_write_record()) can't start transaction for 37 blocks (128 bytes)
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529244] LustreError: 16405:0:(filter.c:192:filter_finish_transno()) wrote trans 21483454237 for client f612f805-2ae3-e606-2bc6-074c557919a6 at #1608: err = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529250] LustreError: 16405:0:(filter_io_26.c:520:filter_direct_io()) can't close transaction: -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529452] LustreError: 18162:0:(obd.h:1394:obd_transno_commit_cb()) lfs0-OST0018: transno 21483454237 commit error: 2
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529690] LustreError: 16435:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30
Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.529703] LustreError: 16410:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30



 Comments   
Comment by Peter Jones [ 05/Jul/11 ]

HongChao

Can you please look into this one?

Thanks

Peter

Comment by Johann Lombardi (Inactive) [ 05/Jul/11 ]

Is the problem persistent across reboots? Also, would you have an e2image of the corrupted filesystem?

Comment by David Vasil (Inactive) [ 05/Jul/11 ]

Johann,
The problem is not persistent across reboots. It seems that a reboot is the only way to actually clear the issue, as unmounting the device hangs indefinitely. When the OSS is rebooted, an e2fsck -fp is run on the device; e2fsck reports no errors or corruption. I do not have an e2image of the device; I will request that one be made the next time this issue is encountered.

Comment by Hongchao Zhang [ 06/Jul/11 ]

There are only two kinds of block bitmap corruption detected here: a bad inode bitmap block or a bad block bitmap block. But in this case the failing block# (1867778) is two blocks past the start of group 57 (57 * 4096 * 8 = 1867776), which should be the block group descriptor block, and that is weird!

More debug info is needed to make clear where the problem is, and a debug patch is underway to collect more info about
the block group descriptor.
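
For orientation, a minimal sketch of the group-layout arithmetic for a 4 KiB-block, non-flex_bg filesystem (not ldiskfs source; the group 57 layout is confirmed by the dumpe2fs output quoted later in this ticket):

    /*
     * With 4096-byte blocks, one bitmap block covers 4096 * 8 = 32768 blocks,
     * so group 57 starts at block 57 * 32768 = 1867776.  Without flex_bg the
     * in-group layout (per the later dumpe2fs output) is:
     *     +0  block bitmap   (1867776)
     *     +1  inode bitmap   (1867777)
     *     +2  inode table    (1867778-1868289, 512 blocks)
     */
    static unsigned long group_first_block(unsigned long group,
                                           unsigned long block_size)
    {
            unsigned long blocks_per_group = block_size * 8;

            return group * blocks_per_group;
    }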

Comment by David Vasil (Inactive) [ 06/Jul/11 ]

I am currently gathering an e2image of the LUN that hit this issue; it has not had the e2fsck run against it yet. I will provide the e2image when it completes. Please let me know what debug patch you would like to try and we will work on getting it on the system.

Comment by Nathan Dauchy (Inactive) [ 07/Jul/11 ]

From linux/fs/ext4/balloc.c, ext4_valid_block_bitmap()...

        if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
                /* with FLEX_BG, the inode/block bitmaps and itable
                 * blocks may not be in the group at all
                 * so the bitmap validation will be skipped for those groups
                 * or it has to also read the block group where the bitmaps
                 * are located to verify they are set.
                 */
                return 1;
        }

So this may explain why we have hit these errors on our older file system, which I think was originally formatted with an ext3-based ldiskfs, but not on the more recent ones. How do I verify whether FLEX_BG is enabled or not?

Given that the check is skipped for many file systems altogether anyway (apparently without much damage), would it make sense to just put in a short-term patch to always "return 1", rather than wait for a new check/repair feature to be added to fsck.ext3?

Thanks,
Nathan

Comment by Peter Jones [ 07/Jul/11 ]

Johann

Could you please comment?

Thanks

Peter

Comment by Johann Lombardi (Inactive) [ 07/Jul/11 ]

> So this may explain why we have hit these errors on our older file system, which I think was originally formatted
> with an ext3-based ldiskfs, but not on the more recent ones. How do I verify whether FLEX_BG is enabled or not?

You can see this with dumpe2fs -h, e.g.:
$ dumpe2fs -h /dev/sdd1 | grep "Filesystem features"
dumpe2fs 1.41.12 (17-May-2010)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

> Given that the check is skipped for many file systems altogether anyway (apparently without much damage),

Well, it is only skipped for flex_bg which can use a different layout. Unfortunately, the error message is not really helpful and Hongchao's debug patch might help to understand what is going on.

> would it make sense to just put in a short-term patch to always "return 1", rather than wait for a new check/repair feature to be added to fsck.ext3?

Well, you can disable the checks if this really hurts production, but it would be great to understand what is going on since it might be a real corruption which can spread into the rest of the filesystem w/o the sanity check.
Is the e2image available?

Comment by David Vasil (Inactive) [ 08/Jul/11 ]

Johann,
The e2image has completed; it is ~250MB bzip2'd. I created the e2image using 'e2image -r /dev/mapper/lun_41 - | bzip2 > outfile.bz2'. How would you like me to get this over to you?

Comment by Peter Jones [ 08/Jul/11 ]

David

Johann is on vacation today, but I will email you privately about how to get the file to us.

Thanks

Peter

Comment by Hongchao Zhang [ 12/Jul/11 ]

David,

There is a problem decompressing the image file: the output of e2image is a sparse file and bzip2 can't handle it
efficiently. Could you please redo it with "e2image -r /dev/mapper/lun_41 | tar -Sj > output.bz2" and upload the file again?

Thanks!

Comment by David Vasil (Inactive) [ 12/Jul/11 ]

Hongchao,
Can this be done while the OST is mounted and in use? Or do I need to do the e2image with the OST unmounted/offline? The OST is currently mounted and marked as inactive by the MDS to prevent it from having objects allocated on it.

Comment by David Vasil (Inactive) [ 12/Jul/11 ]

Hongchao,
Maybe I'm missing something, but I do not believe tar can accept
standard input in the way you are suggesting.

  # e2image -r /dev/mapper/lun_41 - | tar -Sj > lun_49-dm_21.e2i.tar.bz2
    tar: You must specify one of the `-Acdtrux' options
    Try `tar --help' or `tar --usage' for more information.
    e2image 1.41.12.2.ora1.ddn1 (14-Aug-2010)
  # e2image -r /dev/mapper/lun_41 - | tar -cSj > lun_49-dm_21.e2i.tar.bz2
    e2image 1.41.12.2.ora1.ddn1 (14-Aug-2010)
    tar: Cowardly refusing to create an empty archive
    Try `tar --help' or `tar --usage' for more information.

I'll try to do this in a two-step process, assuming I have enough disk
space. Per my previous question, is it acceptable to do this while the
OST is online (but deactivated via lctl)?

Comment by David Vasil (Inactive) [ 12/Jul/11 ]

Hongchao,
'e2image -r /dev/mapper/lun_41 lun_41-dm-21.e2i' failed with:

lseek: Invalid argument
...
lseek: Invalid argument
File size limit exceeded

The resulting file was 2TB. Do you need a raw e2image image?

Comment by Johann Lombardi (Inactive) [ 12/Jul/11 ]

Hongchao, i am running the following command:
$ bzcat ./lun_49-dm_21.e2i.bz2 | cp --sparse=always /dev/stdin ./lun

and it creates a sparse file:
$ ls -lsh lun
417M -rw------- 1 root root 28G Jul 12 21:43 lun

still decompressing ...

Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ]

ok, decompression is done. First of all, could you please confirm that dm-21 on lfs-oss-0-1 is lfs0-OST0018?

The initial error was:

Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521146] LDISKFS-fs error (device dm-21): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 57, block = 1867778

The state of group 57 in the image is the following:

Group 57: (Blocks 1867776-1900543) [ITABLE_ZEROED]
  Checksum 0xa677, unused inodes 8166
  Block bitmap at 1867776 (+0), Inode bitmap at 1867777 (+1)
  Inode table at 1867778-1868289 (+2)
  18915 free blocks, 8190 free inodes, 0 directories, 8166 unused inodes 
  Free blocks: 1868290-1868799, 1869824-1870591, 1870848-1870936, 1870938-1871071, 1871073-1871103, 1871360-1871615, 1871766-1871871, 1875968-1876187, 1876202-1876479, 1876992-1877759, 1877803-1878346, 1878348-1880063, 1880576-1880611, 1880613, 1880615-1880727, 1880731-1881599, 1881899-1885368, 1885370-1885374, 1885376-1885439, 1885696-1887999, 1890304-1892351, 1894400-1894911, 1895424-1896191, 1896198-1896199, 1896201, 1896204-1896208, 1896211-1896212, 1896214-1896225, 1896229-1896231, 1896244-1896246, 1896248, 1896250-1896252, 1896257, 1896265, 1896267-1896270, 1896282-1896300, 1896302-1896309, 1896311-1896320, 1896322-1896323, 1896327-1896331, 1896333-1896342, 1896344-1896351, 1896353-1896383, 1896386-1896387, 1896389, 1896391-1896409, 1896415-1896426, 1896448-1897324, 1897344-1897364, 1897466-1897469, 1897472-1897629, 1897688-1897942, 1897955-1897964, 1897979-1898495, 1899246-1900543
Free inodes: 466946-466954, 466956-475136

So block 1867778 is in the inode table (1867778-1868289). The related piece of code is the following:

        /* check whether the inode table block number is set */
        bitmap_blk = ldiskfs_inode_table(sb, desc);
        offset = bitmap_blk - group_first_block;
        next_zero_bit = ldiskfs_find_next_zero_bit(bh->b_data,
                                offset + LDISKFS_SB(sb)->s_itb_per_group,
                                offset);
        if (next_zero_bit >= offset + LDISKFS_SB(sb)->s_itb_per_group)
                /* good bitmap for inode tables */
                return 1;

So we check that the range 1867778-1868289 is marked as allocated in the block bitmap and it is ...
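
As a cross-check, the same inode-table test can be reproduced in userspace against the raw e2image. This is a minimal, self-contained sketch (not Lustre/ldiskfs code); the image filename is hypothetical, and the constants come from the group 57 analysis above:

    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/types.h>

    #define BLOCK_SIZE     4096UL   /* filesystem block size */
    #define GROUP          57UL     /* block group from the error message */
    #define ITB_PER_GROUP  512UL    /* s_itb_per_group, per the group dump above */
    #define ITABLE_OFFSET  2UL      /* inode table starts 2 blocks into the group */

    /* ext2/3/4 bitmaps use little-endian bit order within each byte */
    static int test_bit(const uint8_t *bitmap, unsigned long nr)
    {
            return (bitmap[nr >> 3] >> (nr & 7)) & 1;
    }

    int main(void)
    {
            uint8_t bitmap[BLOCK_SIZE];
            /* group 57 starts at 57 * 32768 = 1867776; its block bitmap is at +0 */
            unsigned long long first = GROUP * BLOCK_SIZE * 8;
            unsigned long i;
            FILE *f = fopen("lun_41.raw", "rb");   /* hypothetical raw e2image */

            if (!f || fseeko(f, (off_t)(first * BLOCK_SIZE), SEEK_SET) ||
                fread(bitmap, 1, BLOCK_SIZE, f) != BLOCK_SIZE) {
                    perror("read block bitmap");
                    return 1;
            }
            /* same condition as the kernel check: every inode-table block must
             * be marked allocated in the group's block bitmap */
            for (i = ITABLE_OFFSET; i < ITABLE_OFFSET + ITB_PER_GROUP; i++) {
                    if (!test_bit(bitmap, i)) {
                            printf("block %llu is clear -> invalid block bitmap\n",
                                   first + i);
                            return 1;
                    }
            }
            printf("inode table range fully marked in use -> bitmap looks valid\n");
            fclose(f);
            return 0;
    }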

Comment by David Vasil (Inactive) [ 13/Jul/11 ]

Johann,
Thank you for the analysis. Now how does the OST get into this mode and can we change e2fsck to actually fix it?

Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ]

Unfortunately, I still don't know how the OST gets into this situation.
As for e2fsck, the on-disk state is fine so there is nothing to fix. The problem is actually detected in the bitmap in memory before it hits the disk.

Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ]

David, would it be possible to remount the OST with errors=panic (it would cause the OST to call panic when the assertion is hit) and to collect a kernel crash dump?

Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ]

Hongchao, BTW, it still makes sense to continue working on improving the debug messages printed in ldiskfs_valid_block_bitmap() when we detect a problem. What we have today is really not enough.

Comment by Hongchao Zhang [ 15/Jul/11 ]

the debug patch is at http://review.whamcloud.com/#change,1107
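
For reference, the kind of extra detail the patch adds to the error path looks roughly like the fragment below. This is a sketch inferred from the message format seen in the 20/Dec/13 log excerpt further down, not the literal patch; ldiskfs_block_bitmap() and ldiskfs_inode_bitmap() are assumed counterparts of the ldiskfs_inode_table() accessor quoted earlier:

        ldiskfs_error(sb, __func__,
                      "Invalid block bitmap - group_first_block = %llu, "
                      "block_bitmap = %llu, inode_bitmap = %llu "
                      "inode_table_bitmap = %llu, inode_table_block_per_group = %lu, "
                      "block_group = %u, block = %llu",
                      (unsigned long long)group_first_block,
                      (unsigned long long)ldiskfs_block_bitmap(sb, desc),
                      (unsigned long long)ldiskfs_inode_bitmap(sb, desc),
                      (unsigned long long)ldiskfs_inode_table(sb, desc),
                      (unsigned long)LDISKFS_SB(sb)->s_itb_per_group,
                      block_group, (unsigned long long)bitmap_blk);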

Comment by Peter Jones [ 29/Jul/11 ]

David

Has this diagnostic patch been deployed at the affected site?

Peter

Comment by David Vasil (Inactive) [ 29/Jul/11 ]

Peter,
We are in the process of building the server RPMs and enabling crash
dumps on a test system at the site. Once that is completed we will
apply the updated packages during the next downtime. Thanks!

Comment by Peter Jones [ 29/Jul/11 ]

ok thanks for the update David!

Comment by Peter Jones [ 13/Oct/11 ]

Has the diagnostic patch been rolled out at the customer site yet?

Comment by Nathan Dauchy (Inactive) [ 13/Oct/11 ]

Peter,
As of 9/13/2011, we are running with:
lustre-1.8.6.80-2.6.18_238.12.1.el5_lustre.gd70e443_g9d9d86f
...which includes the diagnostic patch and a fix for group quotas.
We have also mounted the OSS's with "errors=panic".
No "Invalid block bitmap" errors since the upgrade, but we will upload the additional diagnostic information if/when we hit one.
Thanks,
Nathan

Comment by Peter Jones [ 13/Oct/11 ]

ok thanks Nathan!

Comment by David Vasil (Inactive) [ 06/Jan/12 ]

Peter,
We saw this again. I've attached the messages file that was produced as a result of the debug patch developed in this bug.

Also, the vmcore that was produced by kdump is incomplete and is only 8.9GB; so I'm not sure if that is useful.

Comment by Hongchao Zhang [ 12/Apr/12 ]

Sorry for the delayed response!
The attached log doesn't contain the info produced by the debug patch; it seems the log is incomplete.

Comment by Robin Humble [ 21/Jun/12 ]

We hit this too. conman, fsck, and tune2fs -l output are attached. Once the journal was replayed by fsck, there was no on-disk corruption found.

We've recently updated from 1.8.5-based to 1.8.7 servers. These OSTs are a few years old, with no flex_bg set.

Also, we turned on async journals at the same time.

Does DDN run with async journals?

Comment by Peter Jones [ 21/Jun/12 ]

Yes I believe they do.

Comment by Shuichi Ihara (Inactive) [ 21/Jun/12 ]

Async journal was enabled by default in 1.8.4ddn3.1, but after that release we rely on the -wc default, so by default it's disabled. I'm interested in Robin's comment that once async is enabled, we can hit this.

Comment by Jason Hill (Inactive) [ 10/Jun/13 ]

ORNL just hit this today – we've hit it in the past, but with an e2fsck fixing the issue we have just pushed through it. This is the second time in as many weeks that we have seen this. Yes, this is Lustre 1.8 (1.8.8); the kernel is 2.6.18_308.4.1; and I think this was originally an ext3 filesystem, so the flex_bg notes from above aren't coming into play – but here's the dumpe2fs output.

[root@widow-oss8b2 ~]# dumpe2fs -h /dev/dm-27
dumpe2fs 1.42.3.wc1 (28-May-2012)
Filesystem volume name: widow2-OST0149
Last mounted on: /
Filesystem UUID: 92ee894a-a7dd-400f-914d-09267134a319
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg

Will post the e2fsck log when it is complete.

Comment by Jason Hill (Inactive) [ 10/Jun/13 ]

Not sure if this log of the e2fsck will help or not, but here it is.

We are running the following e2fsprogs:

  # rpm -qa | grep e2fs
    e2fsprogs-1.42.3.wc1-0redhat
    e2fsprogs-devel-1.42.3.wc1-0redhat
    e2fsprogs-debuginfo-1.42.3.wc1-0redhat

Comment by Bruno Faccini (Inactive) [ 12/Jun/13 ]

Jason, thanks for the e2fsck log, but can you also provide the console/syslog output showing the "Invalid block bitmap"?
Also, was a crash dump taken/forced after the error? Since it is likely to be an in-memory [only?] corruption, it may help.

Comment by Jason Hill (Inactive) [ 21/Jun/13 ]

Bruno:

The node was not dumped – preserving the bulk of the OSTs on the node is pretty important to us – and by the time we get to the node it has spewed a lot of messages, so we are not likely to get the information you are looking for. Would you prefer to get a full system image for this? We do have a propensity to hit this issue after a filesystem downtime – we had two during the week of June 10 (one more beyond the comment I added on 6/10), and none this week. I do have syslog and console logs that I'll work on attaching.

-J

Comment by Jason Hill (Inactive) [ 11/Oct/13 ]

Just a ping on this. We're still running 1.8.9-wc1 here at ORNL on the production system; we've hit this bug 8 times in the last 6 days. Would it be more helpful to dump the node? We will likely be running this software version until decommissioning in February/March 2014. We'd be interested in seeing this one fixed if possible.

Comment by James Nunez (Inactive) [ 11/Oct/13 ]

Jason,

A quick question, are you running with the debug patch at http://review.whamcloud.com/#/c/1107/ ?

Thanks,
James

Comment by Jason Hill (Inactive) [ 11/Oct/13 ]

James,

We are not running with that patch. We will have a chance to reboot the entire cluster this weekend – should we download, integrate and run this in production on all 144 OSS, 4 MDS, and 1 MGS servers?


-Jason

Comment by James Nunez (Inactive) [ 11/Oct/13 ]

That seems like a large task.

Would someone on this ticket please comment on whether the proposed debug patch will give the information we need to debug this issue and, to answer Jason's question, which nodes it should be installed on? Is there something else that ORNL can provide us with to better understand this issue?

Thank you

Comment by Jason Hill (Inactive) [ 11/Oct/13 ]

I will also comment that we've seen the frequency increase as the utilization of the filesystems has gone up – we're over 90% on the two filesystems that have hit this issue most frequently in the last week.

Comment by Jason Hill (Inactive) [ 11/Oct/13 ]

Also – James Simmons has a pretty slick build system, where integrating the patch would only take a few minutes and the RPM build takes about 30 minutes. We have a scheduled power outage on Saturday, so we're going down anyway. The challenge is: when do we take it all down again to remove the debug patch? Hopefully when we get a fix for this issue and a new set of RPMs.

Comment by Hongchao Zhang [ 11/Oct/13 ]

Hi Jason

The debug patch at http://review.whamcloud.com/#/c/1107/ has been updated; could you please apply it when you reboot your system? Thanks!

Comment by Bruno Faccini (Inactive) [ 11/Oct/13 ]

Jason, high filesystem/OST usage is a known situation in which this problem is likely to be encountered.
James, yes, running with the debug patch will help us learn more, but a crash dump at the time the error is detected (i.e. the "errors=panic" option used for OST mounts and kdump installed and configured) will definitely be the best way to debug this problem.
Hongchao, aren't you missing the patch/file in the last patch set you just submitted?

Comment by James A Simmons [ 11/Oct/13 ]

I will produce the RPMs this morning, Jason.

Comment by Jason Hill (Inactive) [ 13/Oct/13 ]

These RPMs are in production as of 02:30 am on 10/13/2013, and all devices are mounted with -o errors=panic.

Comment by Peter Jones [ 13/Oct/13 ]

Thanks for the update Jason

Comment by Blake Caldwell [ 20/Dec/13 ]

We just caught one on this filesystem with the debug patch.

[root@widow-oss13c2 ~]# rpm -qi lustre
Name : lustre Relocations: (not relocatable)
Version : 1.8.9 Vendor: (none)
Release : 2.6.18_348.3.1.el5.widow Build Date: Fri Oct 11 08:22:48 2013
Install Date: Sun Oct 13 01:22:15 2013 Build Host: tick-mgmt.ccs.ornl.gov
Group : Utilities/System Source RPM: lustre-1.8.9-2.6.18_348.3.1.el5.widow.src.rpm
Size : 2613423 License: GPL
Signature : (none)
URL : http://wiki.whamcloud.com/
Summary : Lustre File System
Description :
Userspace tools and files for the Lustre file system.

[4432932.805736] LDISKFS-fs error (device dm-22): ldiskfs_valid_block_bitmap: Invalid block bitmap - group_first_block = 540311552, block_bitmap = 540311552, inode_bitmap = 540311553 inode_table_bitmap = 540311554, inode_table_block_per_group =512, block_group = 16489, block = 540311554
[4432932.856983] Aborting journal on device dm-22-8.
[4432932.872885] LDISKFS-fs error (device dm-22): ldiskfs_journal_start_sb: Detected aborted journal
[4432932.890598] Kernel panic - not syncing: LDISKFS-fs panic from previous error
[4432932.890599]
[4432932.908627] <2>LDISKFS-fs error (device dm-22): ldiskfs_journal_start_sb: Detected aborted journal
Dec 19 21:53:09 widow-oss13c2 kernel: [ 886.650598] ldiskfs created from ext4-2.6-rhel5
Dec 19 21:53:35 widow-oss13c2 kernel: [ 912.178692] LDISKFS-fs warning (device dm-22): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Dec 19 21:53:35 widow-oss13c2 kernel: [ 912.178699] LDISKFS-fs warning (device dm-22): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
Dec 19 21:53:35 widow-oss13c2 kernel: [ 912.208039] LDISKFS-fs (dm-22): warning: mounting fs with errors, running e2fsck is recommended
Dec 19 21:53:35 widow-oss13c2 kernel: [ 912.263354] LDISKFS-fs (dm-22): recovery complete
Dec 19 21:53:35 widow-oss13c2 kernel: [ 912.303995] LDISKFS-fs (dm-22): mounted filesystem with ordered data mode

KERNEL: /usr/gedi/nfsroot/prod_lustre/usr/lib/debug/lib/modules/2.6.18-348.3.1.el5.widow/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Thu Dec 19 21:26:53 2013
UPTIME: 51 days, 10:01:45
LOAD AVERAGE: 28.90, 20.78, 19.66
TASKS: 1279
NODENAME: widow-oss13c2
RELEASE: 2.6.18-348.3.1.el5.widow
VERSION: #1 SMP Thu Mar 28 08:55:37 EDT 2013
MACHINE: x86_64 (2327 Mhz)
MEMORY: 15.8 GB
PANIC: "[4432932.890598] Kernel panic - not syncing: LDISKFS-fs panic from previous error"
PID: 16006
COMMAND: "ll_ost_io_171"
TASK: ffff81027db5f7e0 [THREAD_INFO: ffff81027db62000]
CPU: 5
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 16006 TASK: ffff81027db5f7e0 CPU: 5 COMMAND: "ll_ost_io_171"
#0 [ffff81027db63448] crash_kexec at ffffffff800b80df
#1 [ffff81027db63508] panic at ffffffff80099933
#2 [ffff81027db635f8] ldiskfs_abort at ffffffff88a4b6d5 [ldiskfs]
#3 [ffff81027db636f8] ldiskfs_journal_start_sb at ffffffff88a4b788 [ldiskfs]
#4 [ffff81027db63708] fsfilt_ldiskfs_brw_start at ffffffff88abddca [fsfilt_ldiskfs]
#5 [ffff81027db637a8] filter_commitrw_write at ffffffff88afa00f [obdfilter]
#6 [ffff81027db63988] filter_commitrw at ffffffff88af21d8 [obdfilter]
#7 [ffff81027db63a38] ost_brw_write at ffffffff88a9ccaf [ost]
#8 [ffff81027db63c58] ost_handle at ffffffff88aa0051 [ost]
#9 [ffff81027db63e08] ptlrpc_server_handle_request at ffffffff88815940 [ptlrpc]
#10 [ffff81027db63eb8] ptlrpc_main at ffffffff88816ff6 [ptlrpc]
#11 [ffff81027db63f48] kernel_thread at ffffffff80061fc1
crash>

[root@widow-oss13c2 ~]# e2fsck -f $lun
e2fsck 1.42.3.wc3 (15-Aug-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (122969124, counted=122938260).
Fix<y>? yes

widow3-OST0039: ***** FILE SYSTEM WAS MODIFIED *****
widow3-OST0039: 1524011/471982080 files (6.4% non-contiguous), 1764983916/1887922176 blocks

Comment by Matt Ezell [ 20/Dec/13 ]

So this tells us (as we've seen before) that part of the inode table is marked as unallocated. It's not clear how unallocated it is (of the 512 blocks)... whether there's a single unallocated block or if the whole thing is zeroed. It would be nice to see the value of next_zero_bit.

Ideally, this would have crashed in ldiskfs_valid_block_bitmap() instead of marking the error and then crashing in ldiskfs_journal_start_sb(). Then we could see what function called ldiskfs_read_block_bitmap(), which might be useful for tracing back the in-memory corruption. I guess for consistency's sake, you want to clean up a bit before panicking.

I'm not that familiar with the 'crash' utility. This block bitmap should be in cache somewhere; how do I find it?
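
On the next_zero_bit question: what ldiskfs_find_next_zero_bit() computes over the cached bitmap can be reproduced from the b_data bytes in the dump with a small helper like the one below (a self-contained sketch, not kernel code; ext2/3/4 bitmaps use little-endian bit order within each byte):

    #include <stdint.h>

    /*
     * Return the index of the first zero bit in [offset, size), or size if
     * every bit in that range is set.  For this case size = offset + 512
     * with offset = 2, so a result >= 514 would mean the inode-table range
     * is fully allocated and the bitmap would pass the check.
     */
    static unsigned long find_next_zero_bit_le(const uint8_t *addr,
                                               unsigned long size,
                                               unsigned long offset)
    {
            unsigned long bit;

            for (bit = offset; bit < size; bit++)
                    if (!((addr[bit >> 3] >> (bit & 7)) & 1))
                            return bit;
            return size;
    }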

Comment by Hongchao Zhang [ 23/Dec/13 ]

Is the content of the block bitmap buffer not printed as it was in the debug patch?

        printk(KERN_ERR"block bitmap of block_group %d : \n", block_group);
	for (i = 0; i < (sb->s_blocksize >> 3); i++) {
		printk(KERN_ERR"%016lx ", *(((long int*)bh->b_data) + i));
		if (i &&  ((i % 4) == 0))
			printk(KERN_ERR"\n");
	}

could you please look at the syslog file to check whether this info was contained in it or not? Thanks.

Comment by Matt Ezell [ 23/Dec/13 ]

No, that message is not on the console or syslog. I think ldiskfs_error (ext4_error) aborted the journal and then a different thread noticed the journal was aborted and panicked the node. It might be nice to refresh the patch to printk the bitmap before calling ldiskfs_error(). Unfortunately, I don't think we will have an opportunity to reboot with an updated image on this file system. And our new stuff (Atlas) was formatted with Lustre 2.4, so it should have flex_bg. I worry that whatever is causing this might still be present in newer versions of ext4/Lustre, but flex_bg will prevent us from noticing right away.

Looking in the crash dump, I have the following process that I think hit the error:

PID: 16218 TASK: ffff8102674b1820 CPU: 0 COMMAND: "ll_ost_io_383"
#0 [ffff8102674c2e60] schedule at ffffffff80066fd0
#1 [ffff8102674c2f38] io_schedule at ffffffff8006780f
#2 [ffff8102674c2f58] sync_buffer at ffffffff80015c90
#3 [ffff8102674c2f68] __wait_on_bit at ffffffff80067a68
#4 [ffff8102674c2fa8] out_of_line_wait_on_bit at ffffffff80067b08
#5 [ffff8102674c3018] __wait_on_buffer at ffffffff8004c94f
#6 [ffff8102674c3028] sync_dirty_buffer at ffffffff8003c7db
#7 [ffff8102674c3048] jbd2_journal_update_superblock at ffffffff88a0ea14 [jbd2]
#8 [ffff8102674c3088] __journal_abort_soft at ffffffff88a0ecdb [jbd2]
#9 [ffff8102674c30a8] jbd2_journal_abort at ffffffff88a0ece9 [jbd2]
#10 [ffff8102674c30b8] ldiskfs_handle_error at ffffffff88a4aea5 [ldiskfs]
#11 [ffff8102674c30d8] __ldiskfs_error at ffffffff88a4b032 [ldiskfs]
#12 [ffff8102674c31d8] ldiskfs_read_block_bitmap at ffffffff88a248c6 [ldiskfs]
#13 [ffff8102674c3268] ldiskfs_mb_mark_diskspace_used at ffffffff88a3e813 [ldiskfs]
#14 [ffff8102674c32d8] ldiskfs_mb_new_blocks at ffffffff88a3ee35 [ldiskfs]
#15 [ffff8102674c3388] ldiskfs_ext_new_extent_cb at ffffffff88abe255 [fsfilt_ldiskfs]
#16 [ffff8102674c3488] ldiskfs_ext_walk_space at ffffffff88a27e63 [ldiskfs]
#17 [ffff8102674c3518] fsfilt_map_nblocks at ffffffff88aba42f [fsfilt_ldiskfs]
#18 [ffff8102674c35c8] fsfilt_ldiskfs_map_ext_inode_pages at ffffffff88aba65a [fsfilt_ldiskfs]
#19 [ffff8102674c3688] fsfilt_ldiskfs_map_inode_pages at ffffffff88aba6c1 [fsfilt_ldiskfs]
#20 [ffff8102674c36d8] filter_direct_io at ffffffff88af84f4 [obdfilter]
#21 [ffff8102674c37a8] filter_commitrw_write at ffffffff88afa9f5 [obdfilter]
#22 [ffff8102674c3988] filter_commitrw at ffffffff88af21d8 [obdfilter]
#23 [ffff8102674c3a38] ost_brw_write at ffffffff88a9ccaf [ost]
#24 [ffff8102674c3c58] ost_handle at ffffffff88aa0051 [ost]
#25 [ffff8102674c3e08] ptlrpc_server_handle_request at ffffffff88815940 [ptlrpc]
#26 [ffff8102674c3eb8] ptlrpc_main at ffffffff88816ff6 [ptlrpc]
#27 [ffff8102674c3f48] kernel_thread at ffffffff80061fc1

Comment by Matt Ezell [ 23/Dec/13 ]

I was able to find the buffer_head for block 540311552:

crash> struct buffer_head ffff8103a942c430
struct buffer_head {
b_state = 9191465,
b_this_page = 0xffff8103a942c430,
b_page = 0xffff810101f8d820,
b_blocknr = 540311552,
b_size = 4096,
b_data = 0xffff8100903dc000 "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\003",
b_bdev = 0xffff8104191c1140,
b_end_io = 0xffffffff80033b32 <end_buffer_read_sync>,
b_private = 0xffff8102d91a3370,
b_assoc_buffers = { next = 0xffff8103a942c478, prev = 0xffff8103a942c478 },
b_count = { counter = 3 }
}

Comment by Hongchao Zhang [ 26/Dec/13 ]

As per the b_data, there is no zero bit in the first (512+2) bits?
And the free block count discrepancy is 30864 (122969124 - 122938260), which is about the capacity of one block group (32768).

the patch is updated to print the content of the bitmap block from

Comment by Peter Jones [ 02/May/14 ]

As per DDN, this ticket is no longer relevant. They will open a new ticket if this ever occurs again.

Comment by Gerrit Updater [ 17/Dec/14 ]

Shilong Wang (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/13100
Subject: LU-486 ldiskfs: print debug info for invalid block bitmap
Project: fs/lustre-release
Branch: b2_1
Current Patch Set: 1
Commit: 669b6787b821f769b4112b37620efc9297bc69ea

Generated at Sat Feb 10 01:07:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.