<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:10:13 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-767] OSS takes over all resources and STONITHs its cluster partner without any warning</title>
                <link>https://jira.whamcloud.com/browse/LU-767</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We experienced an issue where an OSS took over the resources of its cluster partner and then initiated a STONITH event and reset the partner without any notification that the event was going to happen. According to our log files for the network, the MDS, and the OSSes, there appeared to be nothing wrong with the OSS that was reset. We are trying to determine why this may have happened and would like to request any assistance you may be able to provide. We have the ha.cf file configured as follows, sending unicast packets only between the cluster partners:&lt;/p&gt;

&lt;p&gt;keepalive 6&lt;br/&gt;
warntime 30&lt;br/&gt;
deadtime 90&lt;br/&gt;
initdead 180&lt;/p&gt;

&lt;p&gt;The keepalives are being sent through two interfaces. One is through an Infiniband switch and the other is a direct ethernet connection using a crossover cable between the two devices.&lt;/p&gt;
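
&lt;p&gt;For reference, a minimal ha.cf combining these timings with two unicast paths might look like the sketch below (node names, interface names, and peer addresses are placeholders, not our actual values):&lt;/p&gt;

```
# /etc/ha.d/ha.cf -- sketch only; names and addresses below are assumptions
keepalive 6
warntime 30
deadtime 90
initdead 180
# two heartbeat paths: IPoIB and the direct crossover ethernet link
ucast ib0 192.168.10.2
ucast eth1 10.0.0.2
node oss1 oss2
logfile /var/log/ha-log
auto_failback off
```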

&lt;p&gt;Would upgrading to the most current Linux-HA version possibly remediate this issue, or do you think it would cause other problems given the Lustre and Linux OS versions we are currently using?&lt;/p&gt;

&lt;p&gt;Any assistance you can provide would be greatly appreciated.  Thank you.&lt;/p&gt;</description>
                <environment>Lustre 1.8.0.1 running on Red Hat Linux Enterprise 5.3 using Linux High Availability Heartbeat version 2.1.4-4.1</environment>
        <key id="12150">LU-767</key>
            <summary>OSS takes over all resources and STONITHs its cluster partner without any warning</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="cliffw">Cliff White</assignee>
                                    <reporter username="martindw1">David Martin</reporter>
                        <labels>
                    </labels>
                <created>Mon, 17 Oct 2011 12:43:50 +0000</created>
                <updated>Thu, 16 Feb 2012 14:48:06 +0000</updated>
                            <resolved>Thu, 16 Feb 2012 14:48:06 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="21354" author="pjones" created="Mon, 17 Oct 2011 12:48:01 +0000"  >&lt;p&gt;Cliff&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="21355" author="martindw1" created="Mon, 17 Oct 2011 12:57:37 +0000"  >&lt;p&gt;Thanks Peter!&lt;/p&gt;
</comment>
                            <comment id="21364" author="cliffw" created="Mon, 17 Oct 2011 14:27:16 +0000"  >&lt;p&gt;The current Linux-HA should be fine. Lustre requires nothing especially fancy from failover; we work with anything that can manage a Filesystem resource.&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Did you have any other monitoring setup beyond the Heartbeat pinger?&lt;/li&gt;
	&lt;li&gt;Did you check the linux-ha logs? There should be an indication there as to what happened.&lt;/li&gt;
&lt;/ul&gt;
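&lt;p&gt;For reference, one quick way to scan the linux-ha log for takeover/STONITH decisions is something like the sketch below (the log path is an assumption; it should match the logfile directive in your ha.cf):&lt;/p&gt;

```shell
# Scan the Heartbeat log for failover-related decisions.
# /var/log/ha-log is an assumed path -- match it to the "logfile" directive in ha.cf.
grep -Ei 'stonith|takeover|dead|transition' /var/log/ha-log | tail -n 50
```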
</comment>
                            <comment id="21462" author="martindw1" created="Tue, 18 Oct 2011 18:54:30 +0000"  >&lt;p&gt;Cliff,&lt;/p&gt;

&lt;p&gt;We do not have any other monitoring setup beyond Heartbeat.&lt;/p&gt;

&lt;p&gt;We did check the linux-ha logs and there was no indication of what happened. However, I&apos;m not sure our logging level is set high enough to produce a sufficiently verbose log. Do you have a recommendation as to what level we should be logging at?&lt;/p&gt;

&lt;p&gt;Dave&lt;/p&gt;</comment>
                            <comment id="21499" author="cliffw" created="Wed, 19 Oct 2011 12:51:57 +0000"  >&lt;p&gt;I used to use level 3 as default for debug logs, basically I would tail -f the log file, and adjust according to how much goop is spewed.&lt;br/&gt;
How often is this failover occurring? Any other events going on at that time? Any indication of a network hiccup from other nodes?&lt;br/&gt;
If the Heartbeat pinger is the only monitoring, then either the pinger failed (which should put something in the logs) or somehow a takeover was ordered (which should also show in the logs).&lt;br/&gt;
Have you checked the system log on the node issuing the STONITH? There should be something there.&lt;/p&gt;</comment>
                            <comment id="27384" author="cliffw" created="Wed, 25 Jan 2012 13:32:58 +0000"  >&lt;p&gt;What is your current state? Is this still an issue?&lt;/p&gt;</comment>
                            <comment id="28898" author="martindw1" created="Thu, 16 Feb 2012 11:03:54 +0000"  >&lt;p&gt;Cliff,&lt;/p&gt;

&lt;p&gt;We have not had a recurrence of this issue. We believe it may be a stability issue with 1.8.0.1 and Red Hat 5.3. We are looking at upgrading to 1.8.6.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;br/&gt;
Dave&lt;/p&gt;
</comment>
                            <comment id="29071" author="cliffw" created="Thu, 16 Feb 2012 14:47:59 +0000"  >&lt;p&gt;Okay, I will close this issue, please re-open if you have further information.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvhuv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6542</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>