<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:20:46 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1910] OSS kernel panics after upgrade</title>
                <link>https://jira.whamcloud.com/browse/LU-1910</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Since our recent upgrade to 1.8.8, we&apos;ve been experiencing problems with the md subsystem. Our OSTs are built as 8+2 RAID6 metadevices using the mdadm utility. &lt;br/&gt;
Every Sunday morning, cron.weekly runs the raid.check script, which starts a re-sync; if the re-sync hits a medium error, the md subsystem hangs (for example, &quot;cat /proc/mdstat&quot; hangs). The load on the server immediately starts climbing until the server becomes unusable and we have to reboot the OSS. &lt;br/&gt;
What could be causing this, and should we be running raid.check on the OST metadevices at all?&lt;/p&gt;</description>
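For context on what the weekly check actually does: on RHEL-family systems, the raid-check script drives a scrub through the md sysfs interface, roughly along these lines (a hedged sketch; the sysfs paths are the standard Linux md layout, not taken from this report):

```shell
# Minimal sketch of how a weekly raid-check style scrub is started.
# The sysfs layout is the standard Linux md interface; array names vary.
for md in /sys/block/md*/md; do
    echo check > "$md/sync_action"   # ask md to read-verify the whole array
done
cat /proc/mdstat                     # progress shows up as "check = NN.N%"
```

A medium error encountered during this read-verify pass is what triggers the hang described above.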
                <environment>Sun Fire x4540 server, 48 internal 1TB disks, lustre patched kernel - kernel-2.6.18-308.4.1.el5, Lustre 1.8.8</environment>
        <key id="15909">LU-1910</key>
            <summary>OSS kernel panics after upgrade</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="hellenn">Hellen</reporter>
                        <labels>
                    </labels>
                <created>Wed, 12 Sep 2012 11:25:48 +0000</created>
                <updated>Sat, 8 Mar 2014 00:57:24 +0000</updated>
                            <resolved>Sat, 8 Mar 2014 00:57:24 +0000</resolved>
                                    <version>Lustre 1.8.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="44708" author="pjones" created="Wed, 12 Sep 2012 14:58:01 +0000"  >&lt;p&gt;Oleg will help with this one&lt;/p&gt;</comment>
                            <comment id="44710" author="green" created="Wed, 12 Sep 2012 15:02:14 +0000"  >&lt;p&gt;It looks like you have hit two (related) problems at once.&lt;br/&gt;
Problem #1 - your disk at 5:0:0:0 (/dev/sdao, if we believe dmesg) has gone bad (there&apos;s a clear read error visible in the log at the start of it all).&lt;br/&gt;
Problem #2 - due to a controller bug, a driver bug, or a combination of both, the controller wedges on the error and cannot access anything anymore, so you see a system-wide hang. Sadly, this is not all that infrequent; I have numerous semi-high-end nodes under my control with the same problem - the disk controller hangs on disk errors, though not with all kernels.&lt;/p&gt;

&lt;p&gt;I guess the most important thing for you right now is to swap out the bad drive and rebuild your RAID. Having separated out the bad disk, you can plug it into a test node, find out whether any kernel avoids hanging the controller on a read error, and contact the driver maintainers / Red Hat with that information.&lt;/p&gt;</comment>
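The swap-and-rebuild step suggested above can be sketched with mdadm. The member name /dev/sdao comes from the dmesg excerpt in this ticket; the array name /dev/md0 and the spare /dev/sdXX are placeholders for illustration:

```shell
# Hedged sketch of replacing the failed member of the array.
# /dev/md0 and /dev/sdXX are assumed names; /dev/sdao is from the report.
mdadm /dev/md0 --fail /dev/sdao      # mark the bad member as failed
mdadm /dev/md0 --remove /dev/sdao    # detach it from the array
mdadm /dev/md0 --add /dev/sdXX       # add the replacement; rebuild starts
cat /proc/mdstat                     # confirm the recovery is running
```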
                            <comment id="44725" author="hellenn" created="Wed, 12 Sep 2012 20:26:15 +0000"  >&lt;p&gt;Thanks for your response. In the meantime, would you recommend we disable the cron.weekly raid.check? So far, the hangs have only occurred when a disk error is discovered during the check.&lt;/p&gt;</comment>
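Disabling the weekly check can be done either by stopping cron from running the script, or, on RHEL-family systems that ship /etc/sysconfig/raid-check, by telling the stock script to skip the affected array. Both the script path and the array name below are assumptions about this particular install:

```shell
# Option 1: stop cron from running the check at all (path assumed).
chmod -x /etc/cron.weekly/99-raid-check

# Option 2: keep the script but skip the affected array.
# In /etc/sysconfig/raid-check (variable names per the stock RHEL script):
#   ENABLED=yes
#   SKIP_DEVS="md0"
```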
                            <comment id="44729" author="green" created="Wed, 12 Sep 2012 21:10:36 +0000"  >&lt;p&gt;Well, sure - temporarily disabling the raid check on that particular array is a good idea until you can replace or fix the bad drive.&lt;/p&gt;

&lt;p&gt;Just be aware that it does not fix anything; it only papers over the real issue. You have not hit that issue in normal operation yet either because you have not accessed the file stored there, or because the bad block is in unused space - basically luck. If you try to read the entire disk, you will still hit it.&lt;/p&gt;

&lt;p&gt;Seeing as it&apos;s just a read error, it might be a case of &quot;bitrot&quot;, where a spot on the disk develops a read error because some bits change and the CRC no longer matches, even though the underlying medium is still good. Those you can usually fix by simply writing over the bad location. The easiest way is to kick the bad drive out of the array, wipe its RAID superblock, and re-add it; the reconstruction process then rewrites the entire disk, including the problematic area. I have multiple drives that were &quot;healed&quot; by this process.&lt;/p&gt;

&lt;p&gt;But if you actually care about high availability, I&apos;d pull the disk, replace it with a spare, and then experiment with the controller drivers on a test box to see whether a more stable version exists. Otherwise, the next time some other disk goes bad, you&apos;ll have a controller/driver freeze again, which is not a good thing for availability.&lt;/p&gt;</comment>
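The heal-by-rewrite sequence described in the comment above can be sketched as follows. The member name /dev/sdao is from this report; the array name /dev/md0 is assumed:

```shell
# Hedged sketch of forcing a full rewrite of a suspect member disk.
mdadm /dev/md0 --fail /dev/sdao       # kick the drive out of the array
mdadm /dev/md0 --remove /dev/sdao
mdadm --zero-superblock /dev/sdao     # wipe its RAID superblock
mdadm /dev/md0 --add /dev/sdao        # re-add; reconstruction rewrites it
cat /proc/mdstat                      # watch the rebuild overwrite the disk
```

Note this only helps for the bitrot case; a drive with genuinely failing media should be replaced as advised earlier in the thread.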
                            <comment id="78786" author="jfc" created="Sat, 8 Mar 2014 00:57:24 +0000"  >&lt;p&gt;Solution and guidance provided to customer. &lt;br/&gt;
No need to keep this ticket unresolved any longer.&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11844" name="oss06_messages" size="9246148" author="hellenn" created="Wed, 12 Sep 2012 11:25:48 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw3g7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10643</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>