<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:07:31 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-486] ldiskfs_valid_block_bitmap: Invalid block bitmap</title>
                <link>https://jira.whamcloud.com/browse/LU-486</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;OSS throws LDISKFS-fs error stating that it encountered an invalid block bitmap.  This results in the OST being remounted read-only and requiring a reboot of the OSS to recover.  A subsequent &apos;e2fsck -fp &amp;lt;dev&amp;gt;&apos; replays the journal and finds no errors on the OST.&lt;/p&gt;

&lt;p&gt;This issue has been seen spuriously during internal stress testing by Bernd and by some customers in the field.  It has been seen by other Lustre users as well and reported on the lustre-discuss list.  There is a bugzilla ticket open but it has not had any support activity since November 2010.  I&apos;m opening a Jira bug so this can be worked on.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=23959&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=23959&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logs from the start of the invalid block bitmap: &lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521146&amp;#93;&lt;/span&gt; LDISKFS-fs error (device dm-21): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 57, block = 1867778&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521155&amp;#93;&lt;/span&gt; Aborting journal on device dm-21-8.&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521183&amp;#93;&lt;/span&gt; LDISKFS-fs error (device dm-21): ldiskfs_journal_start_sb: Detected aborted journal&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521188&amp;#93;&lt;/span&gt; LDISKFS-fs (dm-21): Remounting filesystem read-only&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521205&amp;#93;&lt;/span&gt; LustreError: 16643:0:(fsfilt-ldiskfs.c:1320:fsfilt_ldiskfs_write_record()) can&apos;t start transaction for 37 blocks (128 bytes)&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521212&amp;#93;&lt;/span&gt; LustreError: 16643:0:(filter.c:192:filter_finish_transno()) wrote trans 21483454236 for client 1279815f-edd6-33ed-a1d2-a6685e1060af at #1606: err = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521219&amp;#93;&lt;/span&gt; LustreError: 16643:0:(filter_io_26.c:520:filter_direct_io()) can&apos;t close transaction: -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521228&amp;#93;&lt;/span&gt; LDISKFS-fs error (device dm-21) in fsfilt_ldiskfs_commit: IO failure&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521303&amp;#93;&lt;/span&gt; LustreError: 16465:0:(fsfilt-ldiskfs.c:496:fsfilt_ldiskfs_brw_start()) can&apos;t get handle for 582 credits: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521314&amp;#93;&lt;/span&gt; LustreError: 16465:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521329&amp;#93;&lt;/span&gt; LustreError: 16618:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521337&amp;#93;&lt;/span&gt; LustreError: 16455:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521347&amp;#93;&lt;/span&gt; LustreError: 16645:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.521687&amp;#93;&lt;/span&gt; LustreError: 16532:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.522014&amp;#93;&lt;/span&gt; LustreError: 16513:0:(fsfilt-ldiskfs.c:496:fsfilt_ldiskfs_brw_start()) can&apos;t get handle for 582 credits: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.522019&amp;#93;&lt;/span&gt; LustreError: 16513:0:(fsfilt-ldiskfs.c:496:fsfilt_ldiskfs_brw_start()) Skipped 4 previous similar messages&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.522026&amp;#93;&lt;/span&gt; LustreError: 16513:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.523770&amp;#93;&lt;/span&gt; LustreError: 16578:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.524759&amp;#93;&lt;/span&gt; LDISKFS-fs (dm-21): Remounting filesystem read-only&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529215&amp;#93;&lt;/span&gt; LDISKFS-fs error (device dm-21) in ldiskfs_ext_new_extent_cb: Journal has aborted&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529237&amp;#93;&lt;/span&gt; LustreError: 16405:0:(fsfilt-ldiskfs.c:1320:fsfilt_ldiskfs_write_record()) can&apos;t start transaction for 37 blocks (128 bytes)&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529244&amp;#93;&lt;/span&gt; LustreError: 16405:0:(filter.c:192:filter_finish_transno()) wrote trans 21483454237 for client f612f805-2ae3-e606-2bc6-074c557919a6 at #1608: err = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529250&amp;#93;&lt;/span&gt; LustreError: 16405:0:(filter_io_26.c:520:filter_direct_io()) can&apos;t close transaction: -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529452&amp;#93;&lt;/span&gt; LustreError: 18162:0:(obd.h:1394:obd_transno_commit_cb()) lfs0-OST0018: transno 21483454237 commit error: 2&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529690&amp;#93;&lt;/span&gt; LustreError: 16435:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;br/&gt;
Jul  2 23:23:20 lfs-oss-0-1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;4424700.529703&amp;#93;&lt;/span&gt; LustreError: 16410:0:(filter_io_26.c:690:filter_commitrw_write()) error starting transaction: rc = -30&lt;/p&gt;</description>
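A note on the repeated rc = -30 / err = -30 values in the log above: -30 is the kernel's -EROFS. Once the journal aborts and ldiskfs remounts the device read-only, every subsequent transaction start fails with that code. A quick way to confirm the errno mapping (standard Linux errno numbering assumed; this snippet is illustrative and not part of Lustre):

```python
import errno

# -30 in the kernel log is -EROFS ("Read-only file system"): after the
# journal abort, ldiskfs remounts read-only and every later transaction
# start in the log fails with this code.
assert errno.EROFS == 30
print(errno.errorcode[30])  # EROFS
```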
                <environment>Lustre 1.8.4ddn3.1&lt;br/&gt;
Kernel 2.6.18-194.32.1.el5&lt;br/&gt;
CentOS 5.x</environment>
        <key id="11279">LU-486</key>
            <summary>ldiskfs_valid_block_bitmap: Invalid block bitmap</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="dvasil@ddn.com">David Vasil</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Tue, 5 Jul 2011 15:04:42 +0000</created>
                <updated>Wed, 17 Dec 2014 13:35:21 +0000</updated>
                            <resolved>Fri, 2 May 2014 20:22:17 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>22</watches>
                                                                            <comments>
                            <comment id="17256" author="pjones" created="Tue, 5 Jul 2011 15:41:27 +0000"  >&lt;p&gt;HongChao&lt;/p&gt;

&lt;p&gt;Can you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="17257" author="johann" created="Tue, 5 Jul 2011 15:44:09 +0000"  >&lt;p&gt;Is the problem persistent across reboot? Also would you have a e2image of the corrupted filesystem?&lt;/p&gt;</comment>
                            <comment id="17258" author="dvasil@ddn.com" created="Tue, 5 Jul 2011 15:58:36 +0000"  >&lt;p&gt;Johann,&lt;br/&gt;
  The problem is not persistent across reboots.  It seems that a reboot is the only way to actually clear the issue as umounts of the device hang indefinitely.  When the OSS is rebooted, an e2fsck -fp is run on the device; e2fsck reports no errors/corruption.  I do not have an e2image of the device, I will request that one be made the next time this issue is encountered.&lt;/p&gt;</comment>
                            <comment id="17287" author="hongchao.zhang" created="Wed, 6 Jul 2011 08:06:36 +0000"  >&lt;p&gt;there are only two kinds of block bitmap corruption: the inode bitmap block or the block bitmap block. But in this case,&lt;br/&gt;
the failed block# (1867778) is the second block of group 57 (57 * 4096 * 8 = 1867776), which should be the block group&lt;br/&gt;
description block, and that&apos;s weird!&lt;/p&gt;

&lt;p&gt;more debug info is needed to pinpoint where the problem is; a debug patch is underway to collect more info about&lt;br/&gt;
the block group description.&lt;/p&gt;</comment>
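Hongchao's arithmetic can be sketched as follows (a minimal illustration, not Lustre code; it assumes a 4 KiB block size, so one bitmap block covers 32768 blocks):

```python
# Block-group arithmetic for a 4 KiB-block ldiskfs filesystem: one block
# bitmap covers BLOCK_SIZE * 8 blocks, so that is also the group size.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = BLOCK_SIZE * 8    # 32768 blocks per group

group = 57
group_first_block = group * BLOCKS_PER_GROUP
print(group_first_block)             # 1867776
# Offset of the failed block inside group 57:
print(1867778 - group_first_block)   # 2
```

The failed block is therefore at offset +2 inside the group, which the dumpe2fs output later in the thread shows to be the start of the inode table.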
                            <comment id="17297" author="dvasil@ddn.com" created="Wed, 6 Jul 2011 13:29:54 +0000"  >&lt;p&gt;I am currently gathering an e2image of the LUN that hit this issue; it has not had the e2fsck run against it yet.  I will provide the e2image when it completes.  Please let me know what debug patch you would like to try and we will work on getting it on the system.&lt;/p&gt;</comment>
                            <comment id="17391" author="ndauchy" created="Thu, 7 Jul 2011 17:25:25 +0000"  >&lt;p&gt;From linux/fs/ext4/balloc.c, ext4_valid_block_bitmap()...&lt;/p&gt;

&lt;pre&gt;        if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
                /* with FLEX_BG, the inode/block bitmaps and itable
                 * blocks may not be in the group at all
                 * so the bitmap validation will be skipped for those groups
                 * or it has to also read the block group where the bitmaps
                 * are located to verify they are set.
                 */
                return 1;
        }&lt;/pre&gt;

&lt;p&gt;So this may explain why we have hit these errors on our older file system, that I think was originally formatted with an ext3-based ldiskfs, but not the more recent ones.  How do I verify whether FLEX_BG is enabled or not?&lt;/p&gt;

&lt;p&gt;Given that the check is skipped for many file systems altogether anyway (apparently without much damage), would it make sense to just put in a short-term patch to always &quot;return 1&quot;, rather than wait for a new check/repair feature to be added to fsck.ext3?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Nathan&lt;/p&gt;</comment>
                            <comment id="17392" author="pjones" created="Thu, 7 Jul 2011 17:30:45 +0000"  >&lt;p&gt;Johann&lt;/p&gt;

&lt;p&gt;Could you please comment?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="17438" author="johann" created="Thu, 7 Jul 2011 19:23:57 +0000"  >&lt;p&gt;&amp;gt; So this may explain why we have hit these errors on our older file system, that I think was originally formatted&lt;br/&gt;
&amp;gt; with an ext3-based ldiskfs, but not the more recent ones. How do I verify whether FLEX_BG is enabled or not?&lt;/p&gt;

&lt;p&gt;You can see this with dumpe2fs -h, e.g.:&lt;br/&gt;
$ dumpe2fs -h /dev/sdd1 | grep &quot;Filesystem features&quot;&lt;br/&gt;
dumpe2fs 1.41.12 (17-May-2010)&lt;br/&gt;
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize&lt;/p&gt;

&lt;p&gt;&amp;gt; Given that the check is skipped for many file systems altogether anyway (apparently without much damage),&lt;/p&gt;

&lt;p&gt;Well, it is only skipped for flex_bg which can use a different layout. Unfortunately, the error message is not really helpful and Hongchao&apos;s debug patch might help to understand what is going on.&lt;/p&gt;

&lt;p&gt;&amp;gt; would it make sense to just put in a short term patch to always &quot;return 1&quot;, rather that wait for a new check/repair feature to be added to fsck.ext3?&lt;/p&gt;

&lt;p&gt;Well, you can disable the checks if this really hurts production, but it would be great to understand what is going on since it &lt;b&gt;might&lt;/b&gt; be a real corruption which can spread into the rest of the filesystem w/o the sanity check.&lt;br/&gt;
Is the e2image available?&lt;/p&gt;</comment>
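A small sketch of checking for flex_bg programmatically, using the feature line from Johann's dumpe2fs example above (the parsing below is illustrative only and not part of e2fsprogs):

```python
# Parse the "Filesystem features" line printed by `dumpe2fs -h` and test
# whether flex_bg is among the enabled features.
features_line = ("Filesystem features:      has_journal ext_attr resize_inode "
                 "dir_index filetype needs_recovery extent flex_bg sparse_super "
                 "large_file huge_file uninit_bg dir_nlink extra_isize")

features = features_line.split(":", 1)[1].split()
print("flex_bg" in features)   # True
```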
                            <comment id="17460" author="dvasil@ddn.com" created="Fri, 8 Jul 2011 13:41:59 +0000"  >&lt;p&gt;Johann,&lt;br/&gt;
  The e2image has completed, it is ~250MB bzip2&apos;d.  I created the e2image using &apos;e2image -r /dev/mapper/lun_41 - | bzip2 &amp;gt; outfile.bz2&apos;.  How would you like me to get this over to you?&lt;/p&gt;</comment>
                            <comment id="17483" author="pjones" created="Fri, 8 Jul 2011 14:23:37 +0000"  >&lt;p&gt;David&lt;/p&gt;

&lt;p&gt;Johann is on vacation today but I will email you privately about how to get the file to us&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="17602" author="hongchao.zhang" created="Tue, 12 Jul 2011 08:26:22 +0000"  >&lt;p&gt;David,&lt;/p&gt;

&lt;p&gt;there is a problem decompressing the image file: the output of e2image is a sparse file and bzip2 can&apos;t handle it&lt;br/&gt;
efficiently. Could you please redo it with &quot;e2image -r /dev/mapper/lun_41 | tar -Sj &amp;gt; output.bz2&quot; and upload the file again?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="17605" author="dvasil@ddn.com" created="Tue, 12 Jul 2011 09:32:14 +0000"  >&lt;p&gt;Hongchao,&lt;br/&gt;
  Can this be done while the OST is mounted and in use?  Or do I need to do the e2image with the OST unmounted/offline?  The OST is currently mounted and marked as inactive by the MDS to prevent it from having objects allocated on it.&lt;/p&gt;</comment>
                            <comment id="17610" author="dvasil@ddn.com" created="Tue, 12 Jul 2011 12:48:30 +0000"  >&lt;p&gt;Hongchao,&lt;br/&gt;
   Maybe I&apos;m missing something, but I do not believe tar can accept &lt;br/&gt;
standard input in the way you are suggesting.&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;e2image -r /dev/mapper/lun_41 - | tar -Sj &amp;gt; lun_49-dm_21.e2i.tar.bz2&lt;br/&gt;
tar: You must specify one of the `-Acdtrux&apos; options&lt;br/&gt;
Try `tar --help&apos; or `tar --usage&apos; for more information.&lt;br/&gt;
e2image 1.41.12.2.ora1.ddn1 (14-Aug-2010)&lt;/li&gt;
&lt;/ol&gt;



&lt;ol&gt;
	&lt;li&gt;e2image -r /dev/mapper/lun_41 - | tar -cSj &amp;gt; lun_49-dm_21.e2i.tar.bz2&lt;br/&gt;
e2image 1.41.12.2.ora1.ddn1 (14-Aug-2010)&lt;br/&gt;
tar: Cowardly refusing to create an empty archive&lt;br/&gt;
Try `tar --help&apos; or `tar --usage&apos; for more information.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;I&apos;ll try to do this in a two-step process, assuming I have enough disk &lt;br/&gt;
space.  Per my previous question, is it acceptable to do this while the &lt;br/&gt;
OST is online (but deactivated via lctl)?&lt;/p&gt;
</comment>
                            <comment id="17612" author="dvasil@ddn.com" created="Tue, 12 Jul 2011 14:26:30 +0000"  >&lt;p&gt;Hongchao,&lt;br/&gt;
   &apos;e2image -r /dev/mapper/lun_41 lun_41-dm-21.e2i&apos; failed with:&lt;/p&gt;

&lt;p&gt;lseek: Invalid argument&lt;br/&gt;
...&lt;br/&gt;
lseek: Invalid argument&lt;br/&gt;
File size limit exceeded&lt;/p&gt;

&lt;p&gt;The resulting file was 2TB.  Do you need a raw e2image image?&lt;/p&gt;
</comment>
                            <comment id="17613" author="johann" created="Tue, 12 Jul 2011 15:44:36 +0000"  >&lt;p&gt;Hongchao, I am running the following command:&lt;br/&gt;
$ bzcat ./lun_49-dm_21.e2i.bz2 | cp --sparse=always /dev/stdin ./lun&lt;/p&gt;

&lt;p&gt;and it creates a sparse file:&lt;br/&gt;
$ ls -lsh lun&lt;br/&gt;
417M -rw------- 1 root root 28G Jul 12 21:43 lun&lt;/p&gt;

&lt;p&gt;still decompressing ...&lt;/p&gt;</comment>
                            <comment id="17626" author="johann" created="Wed, 13 Jul 2011 09:06:48 +0000"  >&lt;p&gt;ok, decompression is done. First of all, could you please confirm that dm-21 on lfs-oss-0-1 is lfs0-OST0018?&lt;/p&gt;

&lt;p&gt;The initial error was:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521146] LDISKFS-fs error (device dm-21): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 57, block = 1867778
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The state of group 57 in the image is the following:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Group 57: (Blocks 1867776-1900543) [ITABLE_ZEROED]
  Checksum 0xa677, unused inodes 8166
  Block bitmap at 1867776 (+0), Inode bitmap at 1867777 (+1)
  Inode table at 1867778-1868289 (+2)
  18915 free blocks, 8190 free inodes, 0 directories, 8166 unused inodes 
  Free blocks: 1868290-1868799, 1869824-1870591, 1870848-1870936, 1870938-1871071, 1871073-1871103, 1871360-1871615, 1871766-1871871, 1875968-1876187, 1876202-1876479, 1876992-1877759, 1877803-1878346, 1878348-1880063, 1880576-1880611, 1880613, 1880615-1880727, 1880731-1881599, 1881899-1885368, 1885370-1885374, 1885376-1885439, 1885696-1887999, 1890304-1892351, 1894400-1894911, 1895424-1896191, 1896198-1896199, 1896201, 1896204-1896208, 1896211-1896212, 1896214-1896225, 1896229-1896231, 1896244-1896246, 1896248, 1896250-1896252, 1896257, 1896265, 1896267-1896270, 1896282-1896300, 1896302-1896309, 1896311-1896320, 1896322-1896323, 1896327-1896331, 1896333-1896342, 1896344-1896351, 1896353-1896383, 1896386-1896387, 1896389, 1896391-1896409, 1896415-1896426, 1896448-1897324, 1897344-1897364, 1897466-1897469, 1897472-1897629, 1897688-1897942, 1897955-1897964, 1897979-1898495, 1899246-1900543
Free inodes: 466946-466954, 466956-475136
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So block 1867778 is in the inode table (1867778-1868289). The related piece of code is the following:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        &lt;span class=&quot;code-comment&quot;&gt;/* check whether the inode table block number is set */&lt;/span&gt;
        bitmap_blk = ldiskfs_inode_table(sb, desc);
        offset = bitmap_blk - group_first_block;
        next_zero_bit = ldiskfs_find_next_zero_bit(bh-&amp;gt;b_data,
                                offset + LDISKFS_SB(sb)-&amp;gt;s_itb_per_group,
                                offset);
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (next_zero_bit &amp;gt;= offset + LDISKFS_SB(sb)-&amp;gt;s_itb_per_group)
                &lt;span class=&quot;code-comment&quot;&gt;/* good bitmap &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; inode tables */&lt;/span&gt;
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; 1;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So we check that the range 1867778-1868289 is marked as allocated in the block bitmap and it is ...&lt;/p&gt;</comment>
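The quoted check can be sketched in Python as follows (a stand-in for ldiskfs_find_next_zero_bit(), not the real bit-level implementation, with offsets taken from the group 57 dump above: inode table at +2, 512 inode-table blocks per group):

```python
def find_next_zero_bit(bitmap, max_bit, start):
    """Return the index of the first 0 bit in [start, max_bit), else max_bit."""
    for i in range(start, max_bit):
        if not bitmap[i]:
            return i
    return max_bit

ITB_PER_GROUP = 512   # inode-table blocks per group (1868289 - 1867778 + 1)
offset = 2            # inode table starts 2 blocks into the group

# Healthy case: all bits covering the inode table are set in the block
# bitmap, so no zero bit is found and the bitmap is declared good.
bitmap = [1] * (offset + ITB_PER_GROUP)
next_zero = find_next_zero_bit(bitmap, offset + ITB_PER_GROUP, offset)
print(next_zero >= offset + ITB_PER_GROUP)   # True, i.e. "good bitmap"

# Clearing any inode-table bit makes the check fail, which is what raises
# the "Invalid block bitmap" error seen in the log.
bitmap[offset] = 0
print(find_next_zero_bit(bitmap, offset + ITB_PER_GROUP, offset))   # 2
```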
                            <comment id="17628" author="dvasil@ddn.com" created="Wed, 13 Jul 2011 11:21:34 +0000"  >&lt;p&gt;Johann,&lt;br/&gt;
  Thank you for the analysis.  Now how does the OST get into this mode and can we change e2fsck to actually fix it?&lt;/p&gt;</comment>
                            <comment id="17629" author="johann" created="Wed, 13 Jul 2011 11:41:36 +0000"  >&lt;p&gt;Unfortunately, I still don&apos;t know how the OST got into this situation &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;br/&gt;
As for e2fsck, the on-disk state is fine, so there is nothing to fix. The problem is actually detected in the in-memory bitmap before it hits the disk.&lt;/p&gt;</comment>
                            <comment id="17630" author="johann" created="Wed, 13 Jul 2011 11:43:25 +0000"  >&lt;p&gt;David, would it be possible to remount the OST with errors=panic (it would cause the OST to call panic when the assertion is hit) and to collect a kernel crash dump?&lt;/p&gt;</comment>
                            <comment id="17631" author="johann" created="Wed, 13 Jul 2011 11:50:52 +0000"  >&lt;p&gt;Hongchao, BTW, it still makes sense to continue working on improving the debug messages printed in ldiskfs_valid_block_bitmap() when we detect a problem. What we have today is really not enough.&lt;/p&gt;</comment>
                            <comment id="17891" author="hongchao.zhang" created="Fri, 15 Jul 2011 09:27:53 +0000"  >&lt;p&gt;the debug patch is at &lt;a href=&quot;http://review.whamcloud.com/#change,1107&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,1107&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="18535" author="pjones" created="Fri, 29 Jul 2011 12:14:53 +0000"  >&lt;p&gt;David&lt;/p&gt;

&lt;p&gt;Has this diagnostic patch been deployed at the affected site?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="18537" author="dvasil@ddn.com" created="Fri, 29 Jul 2011 12:21:30 +0000"  >&lt;p&gt;Peter,&lt;br/&gt;
   We are in the process of building the server RPMs and enabling crash &lt;br/&gt;
dumps on a test system at the site.  Once that is completed we will &lt;br/&gt;
apply the updated packages during the next downtime.  Thanks!&lt;/p&gt;
</comment>
                            <comment id="18538" author="pjones" created="Fri, 29 Jul 2011 12:31:58 +0000"  >&lt;p&gt;ok thanks for the update David!&lt;/p&gt;</comment>
                            <comment id="21199" author="pjones" created="Thu, 13 Oct 2011 10:15:02 +0000"  >&lt;p&gt;Has the diagnostic patch been rolled out at the customer site yet?&lt;/p&gt;</comment>
                            <comment id="21201" author="ndauchy" created="Thu, 13 Oct 2011 10:35:31 +0000"  >&lt;p&gt;Peter,&lt;br/&gt;
As of 9/13/2011, we are running with:&lt;br/&gt;
lustre-1.8.6.80-2.6.18_238.12.1.el5_lustre.gd70e443_g9d9d86f&lt;br/&gt;
...which includes the diagnostic patch and a fix for group quotas.&lt;br/&gt;
We have also mounted the OSS&apos;s with &quot;errors=panic&quot;.&lt;br/&gt;
No &quot;Invalid block bitmap&quot; errors since the upgrade, but we will upload the additional diagnostic information if/when we hit one.&lt;br/&gt;
Thanks,&lt;br/&gt;
Nathan&lt;/p&gt;</comment>
                            <comment id="21202" author="pjones" created="Thu, 13 Oct 2011 10:38:50 +0000"  >&lt;p&gt;ok thanks Nathan!&lt;/p&gt;</comment>
                            <comment id="26019" author="dvasil@ddn.com" created="Fri, 6 Jan 2012 13:31:12 +0000"  >&lt;p&gt;Peter,&lt;br/&gt;
  We saw this again.  I&apos;ve attached the messages file that was produced as a result of the debug patch developed in this bug.&lt;/p&gt;

&lt;p&gt;Also, the vmcore that was produced by kdump is incomplete and is only 8.9GB; so I&apos;m not sure if that is useful.&lt;/p&gt;</comment>
                            <comment id="34590" author="hongchao.zhang" created="Thu, 12 Apr 2012 08:37:06 +0000"  >&lt;p&gt;sorry for the delayed response!&lt;br/&gt;
the attached log doesn&apos;t contain the info produced by the debug patch; it seems the log is incomplete.&lt;/p&gt;</comment>
                            <comment id="40971" author="rjh" created="Thu, 21 Jun 2012 02:52:31 +0000"  >&lt;p&gt;We hit this too. conman, fsck, tune2fs -l attached. Once the journal was replayed by fsck there was no on-disk corruption found.&lt;/p&gt;

&lt;p&gt;We&apos;ve recently updated from 1.8.5-based to 1.8.7 servers. These OSTs are a few years old with no flex_bg set.&lt;/p&gt;

&lt;p&gt;Also we turned on async journals at the same time.&lt;/p&gt;

&lt;p&gt;does DDN run with async journals?&lt;/p&gt;</comment>
                            <comment id="40985" author="pjones" created="Thu, 21 Jun 2012 08:47:50 +0000"  >&lt;p&gt;Yes I believe they do.&lt;/p&gt;</comment>
                            <comment id="41014" author="ihara" created="Thu, 21 Jun 2012 20:41:12 +0000"  >&lt;p&gt;async journal was enabled by default in 1.8.4ddn3.1, but after that release we rely on the -wc default, so by default it&apos;s disabled. I&apos;m interested in Robin&apos;s comment that once async is enabled, we can hit this.&lt;/p&gt;</comment>
                            <comment id="60242" author="hilljjornl" created="Mon, 10 Jun 2013 13:58:27 +0000"  >&lt;p&gt;ORNL just hit this today &amp;#8211; we&apos;ve hit it in the past, but with an e2fsck fixing the issue we have just pushed through it. This is the second time in as many weeks that we have seen this. Yes, this is Lustre 1.8 (1.8.8); the kernel is 2.6.18_308.4.1; and I think this was an original ext3 filesystem, so the flex_bg notes from above aren&apos;t coming into play &amp;#8211; but here&apos;s the dumpe2fs.&lt;/p&gt;</comment>

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@widow-oss8b2 ~&amp;#93;&lt;/span&gt;# dumpe2fs -h /dev/dm-27&lt;br/&gt;
dumpe2fs 1.42.3.wc1 (28-May-2012)&lt;br/&gt;
Filesystem volume name:   widow2-OST0149&lt;br/&gt;
Last mounted on:          /&lt;br/&gt;
Filesystem UUID:          92ee894a-a7dd-400f-914d-09267134a319&lt;br/&gt;
Filesystem magic number:  0xEF53&lt;br/&gt;
Filesystem revision #:    1 (dynamic)&lt;br/&gt;
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg&lt;/p&gt;

&lt;p&gt;Will post the e2fsck log when it is complete.&lt;/p&gt;</comment>
                            <comment id="60247" author="hilljjornl" created="Mon, 10 Jun 2013 15:20:03 +0000"  >&lt;p&gt;Not sure if this log of the e2fsck will help or not, but here it is.&lt;/p&gt;

&lt;p&gt;We are running the following e2fsprogs:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;rpm -qa | grep e2fs&lt;br/&gt;
e2fsprogs-1.42.3.wc1-0redhat&lt;br/&gt;
e2fsprogs-devel-1.42.3.wc1-0redhat&lt;br/&gt;
e2fsprogs-debuginfo-1.42.3.wc1-0redhat&lt;/li&gt;
&lt;/ol&gt;</comment>
                            <comment id="60456" author="bfaccini" created="Wed, 12 Jun 2013 17:04:43 +0000"  >&lt;p&gt;Jason, thanks for the e2fsck log, but can you also provide the console/syslog output showing the &quot;Invalid block bitmap&quot;?&lt;br/&gt;
Also, was a crash dump taken/forced after the error? Since it is likely to be an in-memory &lt;span class=&quot;error&quot;&gt;&amp;#91;only?&amp;#93;&lt;/span&gt; corruption, it may help.&lt;/p&gt;
</comment>
                            <comment id="61033" author="hilljjornl" created="Fri, 21 Jun 2013 19:29:36 +0000"  >&lt;p&gt;Bruno:&lt;/p&gt;

&lt;p&gt;The node was not dumped &amp;#8211; preserving the bulk of the OSTs on the node is pretty important to us &amp;#8211; and by the time we get to the node it has spewed a lot of messages, so we are not likely to get the information you are looking for. Do you prefer to get a full system image on this? We do have a propensity to hit this issue after a filesystem downtime &amp;#8211; we had 2 during the week of June 10 (one additional over the comment I added on 6/10), and none this week. I do have syslog and console logs that I&apos;ll work on attaching.&lt;/p&gt;

&lt;p&gt;-J&lt;/p&gt;</comment>
                            <comment id="68784" author="hilljjornl" created="Fri, 11 Oct 2013 00:41:44 +0000"  >&lt;p&gt;Just a ping on this. We&apos;re still running 1.8.9-wc1 here at ORNL on the production system; we&apos;ve hit this bug 8 times in the last 6 days. Is it more helpful to dump the node? We will likely be running this SW version until decommissioning in February/March 2014. We&apos;d be interested in seeing this one fixed if possible.&lt;/p&gt;</comment>
                            <comment id="68787" author="jamesanunez" created="Fri, 11 Oct 2013 01:20:54 +0000"  >&lt;p&gt;Jason, &lt;/p&gt;

&lt;p&gt;A quick question, are you running with the debug patch at &lt;a href=&quot;http://review.whamcloud.com/#/c/1107/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/1107/&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
James&lt;/p&gt;</comment>
                            <comment id="68789" author="hilljjornl" created="Fri, 11 Oct 2013 01:26:55 +0000"  >&lt;p&gt;James,&lt;/p&gt;

&lt;p&gt;We are not running with that patch. We will have a chance to reboot the entire cluster this weekend &amp;#8211; should we download, integrate and run this in production on all 144 OSS, 4 MDS, and 1 MGS servers?&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</comment>
                            <comment id="68790" author="jamesanunez" created="Fri, 11 Oct 2013 01:38:02 +0000"  >&lt;p&gt;That seems like a large task.&lt;/p&gt;

&lt;p&gt;Would someone on this ticket please comment on whether the proposed debug patch will give the information we need to debug this issue and, answering Jason&apos;s question, which nodes it should be installed on? Is there something else that ORNL can provide to help us better understand this issue?&lt;/p&gt;

&lt;p&gt;Thank you&lt;/p&gt;</comment>
                            <comment id="68792" author="hilljjornl" created="Fri, 11 Oct 2013 01:59:30 +0000"  >&lt;p&gt;I will also note that we&apos;ve seen the frequency increase as utilization on the filesystems has gone up &amp;#8211; we&apos;re over 90% on the two filesystems that have hit this issue most frequently in the last week.&lt;/p&gt;</comment>
                            <comment id="68793" author="hilljjornl" created="Fri, 11 Oct 2013 02:01:05 +0000"  >&lt;p&gt;Also &amp;#8211; James Simmons has a pretty slick build system where integrating the patch would only take a few minutes and the RPM build is 30 mins. We have a scheduled power outage on Saturday so we&apos;re going down anyway. The challenge is when do we take it all down again to remove the debug patch? Hopefully when we get a fix for this issue and a new set of RPM&apos;s.&lt;/p&gt;</comment>
                            <comment id="68801" author="hongchao.zhang" created="Fri, 11 Oct 2013 07:59:45 +0000"  >&lt;p&gt;Hi Jason&lt;/p&gt;

&lt;p&gt;The debug patch at &lt;a href=&quot;http://review.whamcloud.com/#/c/1107/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/1107/&lt;/a&gt; has been updated; could you please apply it when you reboot your system? Thanks!&lt;/p&gt;</comment>
                            <comment id="68802" author="bfaccini" created="Fri, 11 Oct 2013 08:23:47 +0000"  >&lt;p&gt;Jason, high filesystem/OST usage is a known condition that makes this problem more likely.&lt;br/&gt;
James, yes, running with the debug patch will help us learn more, but a crash-dump taken at the moment the error is detected (i.e., the &quot;errors=panic&quot; option used for OST mounts, with Kdump installed and configured) would definitely be the best way to debug this problem.&lt;br/&gt;
Hongchao, isn&apos;t the patch/file missing from the last patch-set you just submitted?&lt;/p&gt;</comment>
                            <comment id="68808" author="simmonsja" created="Fri, 11 Oct 2013 12:08:07 +0000"  >&lt;p&gt;I will produce the rpms this morning Jason.&lt;/p&gt;</comment>
                            <comment id="68888" author="hilljjornl" created="Sun, 13 Oct 2013 14:50:24 +0000"  >&lt;p&gt;These RPM&apos;s are in production as of 02:30am on 10/13/2013 and all devices are mounted with -o errors=panic. &lt;/p&gt;</comment>
                            <comment id="68889" author="pjones" created="Sun, 13 Oct 2013 15:59:14 +0000"  >&lt;p&gt;Thanks for the update Jason&lt;/p&gt;</comment>
                            <comment id="73907" author="blakecaldwell" created="Fri, 20 Dec 2013 04:00:00 +0000"  >&lt;p&gt;We just caught one on this filesystem with the debug patch.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@widow-oss13c2 ~&amp;#93;&lt;/span&gt;# rpm -qi lustre       &lt;br/&gt;
Name        : lustre                       Relocations: (not relocatable)&lt;br/&gt;
Version     : 1.8.9                             Vendor: (none)&lt;br/&gt;
Release     : 2.6.18_348.3.1.el5.widow      Build Date: Fri Oct 11 08:22:48 2013&lt;br/&gt;
Install Date: Sun Oct 13 01:22:15 2013         Build Host: tick-mgmt.ccs.ornl.gov&lt;br/&gt;
Group       : Utilities/System              Source RPM: lustre-1.8.9-2.6.18_348.3.1.el5.widow.src.rpm&lt;br/&gt;
Size        : 2613423                          License: GPL&lt;br/&gt;
Signature   : (none)&lt;br/&gt;
URL         : &lt;a href=&quot;http://wiki.whamcloud.com/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://wiki.whamcloud.com/&lt;/a&gt;&lt;br/&gt;
Summary     : Lustre File System&lt;br/&gt;
Description :&lt;br/&gt;
Userspace tools and files for the Lustre file system.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.805736&amp;#93;&lt;/span&gt; LDISKFS-fs error (device dm-22): ldiskfs_valid_block_bitmap: Invalid block bitmap - group_first_block = 540311552, block_bitmap = 540311552, inode_bitmap = 540311553 inode_table_bitmap = 540311554, inode_table_block_per_group =512, block_group = 16489, block = 540311554&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.856983&amp;#93;&lt;/span&gt; Aborting journal on device dm-22-8.&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.872885&amp;#93;&lt;/span&gt; LDISKFS-fs error (device dm-22): ldiskfs_journal_start_sb: Detected aborted journal&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.890598&amp;#93;&lt;/span&gt; Kernel panic - not syncing: LDISKFS-fs panic from previous error&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.890599&amp;#93;&lt;/span&gt; &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.908627&amp;#93;&lt;/span&gt;  &amp;lt;2&amp;gt;LDISKFS-fs error (device dm-22): ldiskfs_journal_start_sb: Detected aborted journal&lt;br/&gt;
Dec 19 21:53:09 widow-oss13c2 kernel: [  886.650598] ldiskfs created from ext4-2.6-rhel5&lt;br/&gt;
Dec 19 21:53:35 widow-oss13c2 kernel: [  912.178692] LDISKFS-fs warning (device dm-22): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure&lt;br/&gt;
Dec 19 21:53:35 widow-oss13c2 kernel: [  912.178699] LDISKFS-fs warning (device dm-22): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.&lt;br/&gt;
Dec 19 21:53:35 widow-oss13c2 kernel: [  912.208039] LDISKFS-fs (dm-22): warning: mounting fs with errors, running e2fsck is recommended&lt;br/&gt;
Dec 19 21:53:35 widow-oss13c2 kernel: [  912.263354] LDISKFS-fs (dm-22): recovery complete&lt;br/&gt;
Dec 19 21:53:35 widow-oss13c2 kernel: [  912.303995] LDISKFS-fs (dm-22): mounted filesystem with ordered data mode&lt;/p&gt;

&lt;p&gt;      KERNEL: /usr/gedi/nfsroot/prod_lustre/usr/lib/debug/lib/modules/2.6.18-348.3.1.el5.widow/vmlinux&lt;br/&gt;
    DUMPFILE: vmcore  &lt;span class=&quot;error&quot;&gt;&amp;#91;PARTIAL DUMP&amp;#93;&lt;/span&gt;&lt;br/&gt;
        CPUS: 8&lt;br/&gt;
        DATE: Thu Dec 19 21:26:53 2013&lt;br/&gt;
      UPTIME: 51 days, 10:01:45&lt;br/&gt;
LOAD AVERAGE: 28.90, 20.78, 19.66&lt;br/&gt;
       TASKS: 1279&lt;br/&gt;
    NODENAME: widow-oss13c2&lt;br/&gt;
     RELEASE: 2.6.18-348.3.1.el5.widow&lt;br/&gt;
     VERSION: #1 SMP Thu Mar 28 08:55:37 EDT 2013&lt;br/&gt;
     MACHINE: x86_64  (2327 Mhz)&lt;br/&gt;
      MEMORY: 15.8 GB&lt;br/&gt;
       PANIC: &quot;&lt;span class=&quot;error&quot;&gt;&amp;#91;4432932.890598&amp;#93;&lt;/span&gt; Kernel panic - not syncing: LDISKFS-fs panic from previous error&quot;&lt;br/&gt;
         PID: 16006&lt;br/&gt;
     COMMAND: &quot;ll_ost_io_171&quot;&lt;br/&gt;
        TASK: ffff81027db5f7e0  &lt;span class=&quot;error&quot;&gt;&amp;#91;THREAD_INFO: ffff81027db62000&amp;#93;&lt;/span&gt;&lt;br/&gt;
         CPU: 5&lt;br/&gt;
       STATE: TASK_RUNNING (PANIC)&lt;/p&gt;

&lt;p&gt;crash&amp;gt; bt&lt;br/&gt;
PID: 16006  TASK: ffff81027db5f7e0  CPU: 5   COMMAND: &quot;ll_ost_io_171&quot;&lt;br/&gt;
 #0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63448&amp;#93;&lt;/span&gt; crash_kexec at ffffffff800b80df&lt;br/&gt;
 #1 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63508&amp;#93;&lt;/span&gt; panic at ffffffff80099933&lt;br/&gt;
 #2 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db635f8&amp;#93;&lt;/span&gt; ldiskfs_abort at ffffffff88a4b6d5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db636f8&amp;#93;&lt;/span&gt; ldiskfs_journal_start_sb at ffffffff88a4b788 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63708&amp;#93;&lt;/span&gt; fsfilt_ldiskfs_brw_start at ffffffff88abddca &lt;span class=&quot;error&quot;&gt;&amp;#91;fsfilt_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db637a8&amp;#93;&lt;/span&gt; filter_commitrw_write at ffffffff88afa00f &lt;span class=&quot;error&quot;&gt;&amp;#91;obdfilter&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63988&amp;#93;&lt;/span&gt; filter_commitrw at ffffffff88af21d8 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdfilter&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63a38&amp;#93;&lt;/span&gt; ost_brw_write at ffffffff88a9ccaf &lt;span class=&quot;error&quot;&gt;&amp;#91;ost&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #8 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63c58&amp;#93;&lt;/span&gt; ost_handle at ffffffff88aa0051 &lt;span class=&quot;error&quot;&gt;&amp;#91;ost&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #9 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63e08&amp;#93;&lt;/span&gt; ptlrpc_server_handle_request at ffffffff88815940 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
#10 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63eb8&amp;#93;&lt;/span&gt; ptlrpc_main at ffffffff88816ff6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
#11 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff81027db63f48&amp;#93;&lt;/span&gt; kernel_thread at ffffffff80061fc1&lt;br/&gt;
crash&amp;gt; &lt;/p&gt;



&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@widow-oss13c2 ~&amp;#93;&lt;/span&gt;# e2fsck -f $lun&lt;br/&gt;
e2fsck 1.42.3.wc3 (15-Aug-2012)&lt;br/&gt;
Pass 1: Checking inodes, blocks, and sizes&lt;br/&gt;
Pass 2: Checking directory structure&lt;br/&gt;
Pass 3: Checking directory connectivity&lt;br/&gt;
Pass 4: Checking reference counts&lt;br/&gt;
Pass 5: Checking group summary information&lt;br/&gt;
Free blocks count wrong (122969124, counted=122938260).&lt;br/&gt;
Fix&amp;lt;y&amp;gt;? yes&lt;/p&gt;

&lt;p&gt;widow3-OST0039: ***** FILE SYSTEM WAS MODIFIED *****&lt;br/&gt;
widow3-OST0039: 1524011/471982080 files (6.4% non-contiguous), 1764983916/1887922176 blocks&lt;/p&gt;
</comment>
                            <comment id="73974" author="ezell" created="Fri, 20 Dec 2013 23:11:39 +0000"  >&lt;p&gt;So this tells us (as we&apos;ve seen before) that part of the inode table is marked as unallocated.  It&apos;s not clear how much of it is affected (of the 512 blocks): whether a single block is unallocated or the whole table is zeroed.  It would be nice to see the value of next_zero_bit.&lt;/p&gt;

&lt;p&gt;Ideally, this would have crashed in ldiskfs_valid_block_bitmap() instead of marking the error and then crashing in ldiskfs_journal_start_sb().  Then we could see what function called ldiskfs_read_block_bitmap(), which might be useful for tracing back the in-memory corruption.  I guess for consistency&apos;s sake, you want to clean up a bit before panicking.&lt;/p&gt;

&lt;p&gt;I&apos;m not that familiar with the &apos;crash&apos; utility.  This block bitmap should be in cache somewhere; how do I find it?&lt;/p&gt;</comment>
                            <comment id="74021" author="hongchao.zhang" created="Mon, 23 Dec 2013 10:00:20 +0000"  >&lt;p&gt;Was the content of the block bitmap buffer not printed, as it should have been with the debug patch?&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;	printk(KERN_ERR &lt;span class=&quot;code-quote&quot;&gt;&quot;block bitmap of block_group %d : \n&quot;&lt;/span&gt;, block_group);
	&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; (i = 0; i &amp;lt; (sb-&amp;gt;s_blocksize &amp;gt;&amp;gt; 3); i++) {
		printk(KERN_ERR &lt;span class=&quot;code-quote&quot;&gt;&quot;%016lx &quot;&lt;/span&gt;, *(((&lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;*)bh-&amp;gt;b_data) + i));
		&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (i &amp;amp;&amp;amp; ((i % 4) == 0))
			printk(KERN_ERR &lt;span class=&quot;code-quote&quot;&gt;&quot;\n&quot;&lt;/span&gt;);
	}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;could you please look at the syslog file to check whether this info was contained in it or not? Thanks.&lt;/p&gt;</comment>
                            <comment id="74034" author="ezell" created="Mon, 23 Dec 2013 16:25:21 +0000"  >&lt;p&gt;No, that message is not on the console or syslog.  I &lt;b&gt;think&lt;/b&gt; ldiskfs_error (ext4_error) aborted the journal and then a &lt;b&gt;different&lt;/b&gt; thread noticed the journal was aborted and panic()ed the node.  It might be nice to refresh the patch to printk the bitmap before calling ext_error().  Unfortunately, I don&apos;t think we will have an opportunity to reboot with an updated image on this file system.  And our new stuff (Atlas) was formatted with Lustre 2.4, so it should have flex_bg.  I worry that whatever is causing this &lt;b&gt;might&lt;/b&gt; still be present in newer versions of ext4/Lustre, but flex_bg will prevent us from noticing right away.&lt;/p&gt;

&lt;p&gt;Looking in the crash dump, I have the following process that I &lt;b&gt;think&lt;/b&gt; hit the error:&lt;/p&gt;

&lt;p&gt; PID: 16218  TASK: ffff8102674b1820  CPU: 0   COMMAND: &quot;ll_ost_io_383&quot;&lt;br/&gt;
 #0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c2e60&amp;#93;&lt;/span&gt; schedule at ffffffff80066fd0&lt;br/&gt;
 #1 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c2f38&amp;#93;&lt;/span&gt; io_schedule at ffffffff8006780f&lt;br/&gt;
 #2 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c2f58&amp;#93;&lt;/span&gt; sync_buffer at ffffffff80015c90&lt;br/&gt;
 #3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c2f68&amp;#93;&lt;/span&gt; __wait_on_bit at ffffffff80067a68&lt;br/&gt;
 #4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c2fa8&amp;#93;&lt;/span&gt; out_of_line_wait_on_bit at ffffffff80067b08&lt;br/&gt;
 #5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3018&amp;#93;&lt;/span&gt; __wait_on_buffer at ffffffff8004c94f&lt;br/&gt;
 #6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3028&amp;#93;&lt;/span&gt; sync_dirty_buffer at ffffffff8003c7db&lt;br/&gt;
 #7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3048&amp;#93;&lt;/span&gt; jbd2_journal_update_superblock at ffffffff88a0ea14 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #8 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3088&amp;#93;&lt;/span&gt; __journal_abort_soft at ffffffff88a0ecdb &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #9 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c30a8&amp;#93;&lt;/span&gt; jbd2_journal_abort at ffffffff88a0ece9 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;br/&gt;
#10 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c30b8&amp;#93;&lt;/span&gt; ldiskfs_handle_error at ffffffff88a4aea5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#11 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c30d8&amp;#93;&lt;/span&gt; __ldiskfs_error at ffffffff88a4b032 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#12 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c31d8&amp;#93;&lt;/span&gt; ldiskfs_read_block_bitmap at ffffffff88a248c6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#13 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3268&amp;#93;&lt;/span&gt; ldiskfs_mb_mark_diskspace_used at ffffffff88a3e813 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#14 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c32d8&amp;#93;&lt;/span&gt; ldiskfs_mb_new_blocks at ffffffff88a3ee35 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#15 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3388&amp;#93;&lt;/span&gt; ldiskfs_ext_new_extent_cb at ffffffff88abe255 &lt;span class=&quot;error&quot;&gt;&amp;#91;fsfilt_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#16 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3488&amp;#93;&lt;/span&gt; ldiskfs_ext_walk_space at ffffffff88a27e63 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#17 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3518&amp;#93;&lt;/span&gt; fsfilt_map_nblocks at ffffffff88aba42f &lt;span class=&quot;error&quot;&gt;&amp;#91;fsfilt_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#18 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c35c8&amp;#93;&lt;/span&gt; fsfilt_ldiskfs_map_ext_inode_pages at ffffffff88aba65a &lt;span class=&quot;error&quot;&gt;&amp;#91;fsfilt_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#19 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3688&amp;#93;&lt;/span&gt; fsfilt_ldiskfs_map_inode_pages at ffffffff88aba6c1 &lt;span class=&quot;error&quot;&gt;&amp;#91;fsfilt_ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
#20 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c36d8&amp;#93;&lt;/span&gt; filter_direct_io at ffffffff88af84f4 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdfilter&amp;#93;&lt;/span&gt;&lt;br/&gt;
#21 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c37a8&amp;#93;&lt;/span&gt; filter_commitrw_write at ffffffff88afa9f5 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdfilter&amp;#93;&lt;/span&gt;&lt;br/&gt;
#22 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3988&amp;#93;&lt;/span&gt; filter_commitrw at ffffffff88af21d8 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdfilter&amp;#93;&lt;/span&gt;&lt;br/&gt;
#23 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3a38&amp;#93;&lt;/span&gt; ost_brw_write at ffffffff88a9ccaf &lt;span class=&quot;error&quot;&gt;&amp;#91;ost&amp;#93;&lt;/span&gt;&lt;br/&gt;
#24 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3c58&amp;#93;&lt;/span&gt; ost_handle at ffffffff88aa0051 &lt;span class=&quot;error&quot;&gt;&amp;#91;ost&amp;#93;&lt;/span&gt;&lt;br/&gt;
#25 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3e08&amp;#93;&lt;/span&gt; ptlrpc_server_handle_request at ffffffff88815940 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
#26 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3eb8&amp;#93;&lt;/span&gt; ptlrpc_main at ffffffff88816ff6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
#27 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8102674c3f48&amp;#93;&lt;/span&gt; kernel_thread at ffffffff80061fc1&lt;/p&gt;
</comment>
                            <comment id="74038" author="ezell" created="Mon, 23 Dec 2013 16:49:00 +0000"  >&lt;p&gt;I was able to find the buffer_head for block 540311552:&lt;/p&gt;

&lt;p&gt;crash&amp;gt; struct buffer_head ffff8103a942c430&lt;br/&gt;
struct buffer_head {&lt;br/&gt;
  b_state = 9191465, &lt;br/&gt;
  b_this_page = 0xffff8103a942c430, &lt;br/&gt;
  b_page = 0xffff810101f8d820, &lt;br/&gt;
  b_blocknr = 540311552, &lt;br/&gt;
  b_size = 4096, &lt;br/&gt;
  b_data = 0xffff8100903dc000 &quot;\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\003&quot;, &lt;br/&gt;
  b_bdev = 0xffff8104191c1140, &lt;br/&gt;
  b_end_io = 0xffffffff80033b32 &amp;lt;end_buffer_read_sync&amp;gt;, &lt;br/&gt;
  b_private = 0xffff8102d91a3370, &lt;br/&gt;
  b_assoc_buffers = {&lt;br/&gt;
    next = 0xffff8103a942c478, &lt;br/&gt;
    prev = 0xffff8103a942c478&lt;br/&gt;
  }, &lt;br/&gt;
  b_count = {&lt;br/&gt;
    counter = 3&lt;br/&gt;
  }&lt;br/&gt;
}&lt;/p&gt;</comment>
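Reading the first zero bit off a cached bitmap like the b_data dump above can be sketched as follows. This is a hedged illustration: the 128-byte run of 0xff is an assumption, since the crash display truncates the string at the first NUL byte and the exact run length cannot be read off it.

```python
# Find the first clear bit in an LSB-first bitmap, the way next_zero_bit
# would.  N leading 0xff bytes followed by a 0x03 byte put the first zero
# bit at 8*N + 2; the 128-byte run length here is illustrative only.

def first_zero_bit(bitmap, nbits):
    """Index of the first clear bit, or None if all nbits are set."""
    for i in range(nbits):
        byte, off = divmod(i, 8)
        if (bitmap[byte] // (2 ** off)) % 2 == 0:
            return i
    return None

buf = bytearray([0xFF] * 128 + [0x03] + [0x00] * (4096 - 129))
assert first_zero_bit(buf, 4096 * 8) == 128 * 8 + 2   # bit 1026

# Whatever the real run length, the first zero bit in this cached copy lies
# past bit 2 + 512 = 514, i.e. past the inode table blocks, consistent with
# the bitmap looking valid in memory at the time of the dump.
```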
                            <comment id="74082" author="hongchao.zhang" created="Thu, 26 Dec 2013 03:16:38 +0000"  >&lt;p&gt;As per the b_data, there is no zero bit in the first (512+2) bits?&lt;br/&gt;
And the discrepancy in the free blocks count is 30864 (122969124 - 122938260), which is close to the capacity of one block group, 32768.&lt;/p&gt;

&lt;p&gt;The patch is updated to print the content of the bitmap block from &lt;/p&gt;</comment>
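The free-count arithmetic in the comment above can be sanity-checked directly (assuming 4 KiB blocks, so one bitmap block covers 32768 blocks, as the e2fsck output for widow3-OST0039 implies):

```python
# Free-blocks discrepancy that e2fsck repaired, versus the capacity of one
# block group (one bit per block in a 4096-byte bitmap).
expected, counted = 122969124, 122938260
discrepancy = expected - counted
assert discrepancy == 30864

blocks_per_group = 8 * 4096
assert blocks_per_group == 32768

# 30864 is about 94% of a full group: most, but not all, of one group's
# worth of blocks was wrongly counted as free.
assert round(100 * discrepancy / blocks_per_group) == 94
```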
                            <comment id="83101" author="pjones" created="Fri, 2 May 2014 20:22:17 +0000"  >&lt;p&gt;As per DDN this ticket is no longer relevant. They will open a new ticket if this ever occurs again&lt;/p&gt;</comment>
                            <comment id="101792" author="gerrit" created="Wed, 17 Dec 2014 08:15:36 +0000"  >&lt;p&gt;Shilong Wang (wshilong@ddn.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13100&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13100&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-486&quot; title=&quot;ldiskfs_valid_block_bitmap: Invalid block bitmap&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-486&quot;&gt;&lt;del&gt;LU-486&lt;/del&gt;&lt;/a&gt; ldiskfs: print debug info for invalid block bitmap&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_1&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 669b6787b821f769b4112b37620efc9297bc69ea&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11639" name="hamster8.log" size="14570" author="rjh" created="Thu, 21 Jun 2012 02:52:31 +0000"/>
                            <attachment id="10721" name="messages.20120105.bz2" size="141975" author="dvasil@ddn.com" created="Fri, 6 Jan 2012 13:31:12 +0000"/>
                            <attachment id="13016" name="widow2-e2fsck.log.clean.txt" size="82282" author="hilljjornl" created="Mon, 10 Jun 2013 15:20:03 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                    <customfield id="customfield_10020" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Bugzilla ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>23959.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvf87:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6112</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>