<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:05:23 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
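<!--
The field restriction described above can be exercised with a small shell sketch. The issue-xml path below follows JIRA's standard XML issue view and is an assumption not stated in this file; nothing is fetched here.

```shell
#!/bin/sh
# Build the restricted-field URL per the note above.
# BASE is an assumed path (standard JIRA XML issue view); only echoed, not fetched.
BASE="https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-269/LU-269.xml"
URL="$BASE?field=key&field=summary"
echo "$URL"
# To actually fetch: curl -s "$URL"
```
-->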
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-269] LDISKFS-fs error (device md41): ldiskfs_check_descriptors: Block bitmap for group 0 not in group</title>
                <link>https://jira.whamcloud.com/browse/LU-269</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>
&lt;p&gt;MDS1 and MDS2 are running in an active/active cluster configuration: the MGS resides on MDS1 and the MDT resides on MDS2. Is this an acceptable way to do things or not? In the original architecture of the system, Oracle stated this was supported, but the customer just found out from Tyian over at Oracle that this is not a supported configuration for the MDS devices.&lt;/p&gt;

&lt;p&gt;As of last Thursday the customer could see all RAID devices from the OS, but for some reason OST11 simply would not become available. That issue went away with their &quot;bare metal&quot; reboot of the system on Friday morning. A problem that started then, however, we have yet to fix:&lt;/p&gt;

&lt;p&gt;OST15 /dev/md41 resident on OSS4&lt;br/&gt;
From /var/log/messages&lt;br/&gt;
LDISKFS-fs error (device md41): ldiskfs_check_descriptors: Block bitmap for group 0 not in group&lt;/p&gt;

&lt;p&gt;This device no longer mounts on the OSS at the operating-system level. The RAID device can be assembled, but Lustre will not mount it.&lt;/p&gt;

&lt;p&gt;Customer contact for questions is tyler.s.wiegers@lmco.com&lt;/p&gt;
</description>
                <environment>RHEL 5.3 and Lustre 1.8.0.1</environment>
        <key id="10697">LU-269</key>
            <summary>LDISKFS-fs error (device md41): ldiskfs_check_descriptors: Block bitmap for group 0 not in group</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="cliffw">Cliff White</assignee>
                                    <reporter username="dferber">Dan Ferber</reporter>
                        <labels>
                    </labels>
                <created>Tue, 3 May 2011 10:59:32 +0000</created>
                <updated>Tue, 28 Jun 2011 15:01:39 +0000</updated>
                            <resolved>Mon, 9 May 2011 07:48:01 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="13586" author="tyler.s.wiegers@lmco.com" created="Tue, 3 May 2011 12:47:33 +0000"  >&lt;p&gt;Environment should be RHEL 5.3, not 5.5.&lt;/p&gt;

&lt;p&gt;We did a clean system startup, cleared messages, and systematically assembled the bitmaps, mounted them, assembled the RAID devices, and mounted them for each OST. There were no errors for this OST until attempting to mount the RAID device. All other OSTs mounted successfully.&lt;/p&gt;

&lt;p&gt;The unique data point for this OST is that its RAID device (md41) is missing a disk. The disk was reported as unknown after our CAM/firmware upgrades yesterday, so we replaced it, but we did not re-insert it into the RAID. Would that situation cause the errors that we currently see?&lt;/p&gt;

&lt;p&gt;The log output is the following:&lt;/p&gt;

&lt;p&gt;oss3# mount -t lustre /dev/md41 /mnt/lustre_ost15&lt;br/&gt;
mount.lustre: mount /dev/md41 at /mnt/lustre_ost15 failed: Invalid argument&lt;br/&gt;
This may have multiple causes.&lt;br/&gt;
Are the mount options correct?&lt;br/&gt;
Check the syslog for more info.&lt;/p&gt;

&lt;p&gt;/var/log/messages output (trimmed dates/times, I had to re-type this in from hard copy):&lt;/p&gt;

&lt;p&gt;LDISKFS-fs error (device md41): ldiskfs_check_descriptors: Block bitmap for group 0 not in group (block 134217728)!&lt;br/&gt;
LDISKFS-fs: group descriptors corrupted&lt;br/&gt;
LustreError: 26127:0:(obd_mount.c:1278:server_kernel_mount()) premount /dev/md41:0x0 ldiskfs failed: -22, ldiskfs2 failed: -19.  Is the ldiskfs module available?&lt;br/&gt;
LustreError: 26127:0:(obd_mount.c:1278:server_kernel_mount()) Skipped 2 previous similar messages&lt;br/&gt;
LustreError: 26127:0:(obd_mount.c:1590:server_fill_super()) Unable to mount device /dev/md41: -22&lt;br/&gt;
LustreError: 26127:0:(obd_mount.c:1993:lustre_fill_super()) Unable to mount  (-22)&lt;/p&gt;



&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="13602" author="cliffw" created="Tue, 3 May 2011 13:53:51 +0000"  >&lt;p&gt;Well, a missing disk without a spare would definitely mess up the RAID, I would think.&lt;/p&gt;

&lt;p&gt;Was there data on the missing spindle? Has that been recovered? &lt;br/&gt;
First, you need to be sure the md41 device is intact and that the local (ldiskfs) filesystem &lt;br/&gt;
on that device is intact. If that requires re-inserting the disk, you should do so. &lt;/p&gt;

&lt;p&gt;After the md side is healthy, you should run &apos;fsck -fn&apos; on md41 and see what that reports.&lt;br/&gt;
Be sure you have the proper Lustre-aware version of e2fsprogs (you should have downloaded it&lt;br/&gt;
along with the rest of your Lustre RPMs).&lt;/p&gt;

&lt;p&gt;Assuming the md41 device is restored, &apos;fsck -fy&apos; may fix the bitmap issue, but run &apos;-fn&apos; first;&lt;br/&gt;
that is a read-only pass to test for errors.&lt;/p&gt;

&lt;p&gt;If there are other errors beyond the bitmap, you should attach the results here, but if you only&lt;br/&gt;
have the bitmap issue, proceed with the -fy.&lt;/p&gt;</comment>
                            <comment id="13603" author="cliffw" created="Tue, 3 May 2011 13:56:00 +0000"  >&lt;p&gt;Also, your first question was unrelated: if the MGT and MDT are separate partitions, it is okay to have one node active for the MGS and the other active for the MDS in a failover pair. The MGS is really, really lightweight, so after client mount the MGS node should be more or less idle.&lt;/p&gt;</comment>
                            <comment id="13621" author="tyler.s.wiegers@lmco.com" created="Tue, 3 May 2011 15:48:43 +0000"  >&lt;p&gt;We inserted the disk back into the RAID; it is currently rebuilding. Trying to mount the OST while the disk is rebuilding gives the same error. We&apos;ve been able to mount OSTs while disks are rebuilding in the past, so the core issue doesn&apos;t look like it&apos;s resolved.&lt;/p&gt;

&lt;p&gt;No data should have been lost; we are running an 8+2 RAID 6 device, so we can run with 8/10 disks without any data loss.&lt;/p&gt;</comment>
                            <comment id="13623" author="cliffw" created="Tue, 3 May 2011 15:57:48 +0000"  >&lt;p&gt;Okay, after the rebuild, please run &apos;fsck -fn&apos;&lt;/p&gt;</comment>
                            <comment id="13657" author="johann" created="Wed, 4 May 2011 06:53:41 +0000"  >&lt;p&gt;&amp;gt; We inserted the disk back into the raid, it is currently rebuilding. Trying to mount the OST&lt;br/&gt;
&amp;gt; while the disk is rebuilding gives the same error. We&apos;ve been able to mount OST&apos;s while disks&lt;br/&gt;
&amp;gt; are rebuilding in the past, so the core issue doesn&apos;t look like it&apos;s resolved.&lt;/p&gt;

&lt;p&gt;Based on the following comment:&lt;br/&gt;
&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-270?focusedCommentId=13656&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_13656&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-270?focusedCommentId=13656&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_13656&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Could you please tell us what version of the mptsas driver you use?&lt;br/&gt;
I think the bug was fixed in versions &amp;gt;= 4.18.20.02.&lt;/p&gt;

&lt;p&gt;&amp;gt; No data should have been lost, we are running an 8+2 raid 6 device, so we can run with 8/10 disks without any data loss.&lt;/p&gt;

&lt;p&gt;Right.&lt;/p&gt;</comment>
                            <comment id="13683" author="tyler.s.wiegers@lmco.com" created="Wed, 4 May 2011 13:39:28 +0000"  >&lt;p&gt;This disk finished rebuilding.&lt;/p&gt;

&lt;p&gt;Once rebuilt, we attempted to run an e2fsck on the disks, which failed due to the MMP flag being set. We cleared the flag using tune2fs (ref &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-270&quot; title=&quot;LDisk-fs warning (device md30): ldisk_multi_mount_protect: fsck is running on filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-270&quot;&gt;&lt;del&gt;LU-270&lt;/del&gt;&lt;/a&gt;) and tried the e2fsck again, which failed with another error. We ran tune2fs with the uninit-bg option, which allowed us to run e2fsck. The e2fsck ran for about 3.5 hours before completing.&lt;/p&gt;

&lt;p&gt;Once the e2fsck was complete, we were able to successfully mount this OST, which is extremely good news. There were multiple files recovered into lost+found, which we will attempt to restore.&lt;/p&gt;

&lt;p&gt;We are in the process of running e2fsck on all of our OSTs. Once complete, we are planning a complete power down of all OSSs, MDSs, and disk arrays in order to do a fresh clean startup. I will update no later than tomorrow with hopefully a &lt;b&gt;problem resolved&lt;/b&gt; statement.&lt;/p&gt;</comment>
                            <comment id="13737" author="cliffw" created="Thu, 5 May 2011 09:16:25 +0000"  >&lt;p&gt;Great, thanks for keeping us updated&lt;br/&gt;
cliffw&lt;/p&gt;





&lt;p&gt;&amp;#8211; &lt;br/&gt;
cliffw&lt;br/&gt;
Support Guy&lt;br/&gt;
WhamCloud, Inc.&lt;br/&gt;
www.whamcloud.com&lt;/p&gt;</comment>
                            <comment id="14004" author="pjones" created="Mon, 9 May 2011 07:47:54 +0000"  >&lt;p&gt;Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket in the future.&lt;/p&gt;</comment>
                    </comments>
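                    <!--
                    The recovery sequence described in the comments above can be sketched as a dry-run shell script. The fsck and tune2fs options are taken from the comment text plus the Lustre-patched e2fsprogs as I understand it; treat them as assumptions and verify before running against a real device.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps from the ticket comments.
# DEV and MNT come from the ticket text; run() only echoes each
# command, so nothing here touches a real disk.
DEV=/dev/md41
MNT=/mnt/lustre_ost15
run() { echo "would run: $*"; }

run fsck -fn "$DEV"                  # 1. read-only pass: report errors only
run tune2fs -f -E clear-mmp "$DEV"   # 2. clear the MMP flag if it blocks fsck (assumed option)
run fsck -fy "$DEV"                  # 3. repair pass once only the bitmap error remains
run mount -t lustre "$DEV" "$MNT"    # 4. remount the OST

```

Swap run() for direct execution only after the read-only pass shows nothing worse than the bitmap error.
                    -->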
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw2un:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10544</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>