<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:17:34 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1543] Lustre Servers - MDS / OSS Died &amp; fail over took over</title>
                <link>https://jira.whamcloud.com/browse/LU-1543</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Dear Support, &lt;br/&gt;
  Over the past weeks we have had problems with the MDS/OSS Lustre servers. Last week one of the OSS nodes died and, fortunately, the failover server took over. Last night both the MDS (see log weiss01.log) and another OSS (see log weiss11.log) died; in that case too, the failover servers took over. &lt;br/&gt;
Yesterday&apos;s failures on the two servers look related: they died at essentially the same time, around 19:20-19:30.  &lt;/p&gt;

&lt;p&gt;As I said, the file system remained up and running because the failover servers took over. With our old Lustre configuration (version 1.8.7 &amp;#8211; 1 MDS + 1 MDS failover &amp;#8211; 4 OSS), however, the MDS and OSS never died, even under heavy stress and with many slowdown messages logged due to heavy I/O load.&lt;/p&gt;

&lt;p&gt;If you need access to our cluster, please let me know (fverzell@cscs.ch) so that we can arrange to create an account.&lt;/p&gt;

&lt;p&gt;We also currently have a list of tickets that might be related to each other in some respects:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-1447&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-1447&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-1455&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-1455&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-1470&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-1470&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-1503&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-1503&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Regards&lt;br/&gt;
Fabio&lt;/p&gt;</description>
                <environment>MDS HW &lt;br/&gt;
---------------------------------------------------------------------------------------------------- &lt;br/&gt;
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64 &lt;br/&gt;
Architecture: x86_64 &lt;br/&gt;
CPU op-mode(s): 32-bit, 64-bit &lt;br/&gt;
Byte Order: Little Endian &lt;br/&gt;
CPU(s): 16 &lt;br/&gt;
Vendor ID: AuthenticAMD &lt;br/&gt;
CPU family: 16 &lt;br/&gt;
64 GB RAM &lt;br/&gt;
Interconnect IB 40Gb/s &lt;br/&gt;
&lt;br/&gt;
MDT LSI 5480 Pikes Peak &lt;br/&gt;
SSDs SLC &lt;br/&gt;
---------------------------------------------------------------------------------------------------- &lt;br/&gt;
&lt;br/&gt;
OSS HW &lt;br/&gt;
---------------------------------------------------------------------------------------------------- &lt;br/&gt;
Architecture: x86_64 &lt;br/&gt;
CPU op-mode(s): 32-bit, 64-bit &lt;br/&gt;
Byte Order: Little Endian &lt;br/&gt;
CPU(s): 32 &lt;br/&gt;
Vendor ID: GenuineIntel &lt;br/&gt;
CPU family: 6 &lt;br/&gt;
64 GB RAM &lt;br/&gt;
Interconnect IB 40Gb/s &lt;br/&gt;
&lt;br/&gt;
OST LSI 7900 &lt;br/&gt;
---------------------------------------------------------------------------------------------------- &lt;br/&gt;
&lt;br/&gt;
Router nodes &lt;br/&gt;
------------------- &lt;br/&gt;
12 router nodes - IB 40Gb/s &lt;br/&gt;
&lt;br/&gt;
Clients &lt;br/&gt;
--------- &lt;br/&gt;
Cray XE6 - Lustre 1.8.6 &lt;br/&gt;
&lt;br/&gt;
&lt;br/&gt;
&lt;br/&gt;
1 MDS + 1 fail over &lt;br/&gt;
12 OSS - 6 OST per OSS</environment>
        <key id="14982">LU-1543</key>
            <summary>Lustre Servers - MDS / OSS Died &amp; fail over took over</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="cliffw">Cliff White</assignee>
                                    <reporter username="fverzell">Fabio Verzelloni</reporter>
                        <labels>
                    </labels>
                <created>Wed, 20 Jun 2012 02:27:49 +0000</created>
                <updated>Mon, 10 Sep 2012 13:28:35 +0000</updated>
                            <resolved>Mon, 10 Sep 2012 13:28:35 +0000</resolved>
                                    <version>Lustre 2.2.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="40933" author="pjones" created="Wed, 20 Jun 2012 11:17:05 +0000"  >&lt;p&gt;Fabio&lt;/p&gt;

&lt;p&gt;We will definitely take an overall view of all your issues when deciding the best approach. Getting remote access to the cluster in question will undoubtedly be useful. I will contact you directly to make those arrangements.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="40960" author="cliffw" created="Wed, 20 Jun 2012 20:23:33 +0000"  >&lt;p&gt;Can we get a list of the addresses of the Lustre servers? &lt;/p&gt;</comment>
                            <comment id="40973" author="liang" created="Thu, 21 Jun 2012 03:42:57 +0000"  >&lt;p&gt;I think it could be a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1280&quot; title=&quot;kernel BUG at .../lustre-2.1.0/lustre/lvfs/fsfilt-ldiskfs.c:978&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1280&quot;&gt;&lt;del&gt;LU-1280&lt;/del&gt;&lt;/a&gt; and fixes:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,2452&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,2452&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#change,2827&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,2827&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="40975" author="fverzell" created="Thu, 21 Jun 2012 04:11:33 +0000"  >&lt;p&gt;The list of the Lustre servers is the following:&lt;/p&gt;

&lt;p&gt;MDS + Failover&lt;br/&gt;
Weisshorn01- 148.187.7.101&lt;br/&gt;
Weisshorn02- 148.187.7.102&lt;/p&gt;

&lt;p&gt;OSS (each pair is a failover pair: weisshorn03-04, 05-06, etc.)&lt;br/&gt;
Weisshorn03- 148.187.7.103&lt;br/&gt;
Weisshorn04- 148.187.7.104&lt;br/&gt;
Weisshorn05- 148.187.7.105&lt;br/&gt;
Weisshorn06- 148.187.7.106&lt;br/&gt;
Weisshorn07- 148.187.7.107&lt;br/&gt;
Weisshorn08- 148.187.7.108&lt;br/&gt;
Weisshorn09- 148.187.7.109&lt;br/&gt;
Weisshorn10- 148.187.7.110&lt;br/&gt;
Weisshorn11- 148.187.7.111&lt;br/&gt;
Weisshorn12- 148.187.7.112&lt;br/&gt;
Weisshorn13- 148.187.7.113&lt;br/&gt;
Weisshorn14- 148.187.7.114&lt;/p&gt;
</comment>
                            <comment id="41501" author="cliffw" created="Thu, 5 Jul 2012 14:28:37 +0000"  >&lt;p&gt;Has there been any word from Cray on access to the gnilnd source? Should we close this issue and revisit it after the planned software version change (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1503&quot; title=&quot;Clients application IO errors and overloaded system messages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1503&quot;&gt;&lt;del&gt;LU-1503&lt;/del&gt;&lt;/a&gt;)?&lt;br/&gt;
Please let us know what more we can do to assist.&lt;/p&gt;</comment>
                            <comment id="41552" author="fverzell" created="Fri, 6 Jul 2012 03:20:17 +0000"  >&lt;p&gt;Cliff, &lt;br/&gt;
  we are passing the request on to Cray to see if we can get access to the code. I&apos;ll let you know ASAP.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Fabio&lt;/p&gt;</comment>
                            <comment id="41567" author="spitzcor" created="Fri, 6 Jul 2012 14:30:20 +0000"  >&lt;p&gt;FYI, Cray is working on pushing up the gnilnd into the Lustre tree.  The tracking ticket is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1419&quot; title=&quot;Tracking ticket for gnilnd push&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1419&quot;&gt;&lt;del&gt;LU-1419&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="43938" author="simmonsja" created="Wed, 29 Aug 2012 09:46:31 +0000"  >&lt;p&gt;Any updates?&lt;/p&gt;</comment>
                            <comment id="43943" author="spitzcor" created="Wed, 29 Aug 2012 10:38:09 +0000"  >&lt;p&gt;Well, the Cray LND code has been pushed to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1419&quot; title=&quot;Tracking ticket for gnilnd push&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1419&quot;&gt;&lt;del&gt;LU-1419&lt;/del&gt;&lt;/a&gt;, but not landed.  I can&apos;t help with any other updates.&lt;/p&gt;</comment>
                            <comment id="43948" author="simmonsja" created="Wed, 29 Aug 2012 10:56:31 +0000"  >&lt;p&gt;I mean, does Fabio still see the problem?&lt;/p&gt;</comment>
                            <comment id="44173" author="cliffw" created="Tue, 4 Sep 2012 17:11:09 +0000"  >&lt;p&gt;What is the current state? Is there anything more we can do on this issue?&lt;/p&gt;</comment>
                            <comment id="44520" author="cliffw" created="Mon, 10 Sep 2012 13:28:35 +0000"  >&lt;p&gt;I am going to close this issue. Please re-open if you have more information or questions.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11633" name="19_jun.log" size="1900234" author="fverzell" created="Wed, 20 Jun 2012 02:27:49 +0000"/>
                            <attachment id="11636" name="20_jun.log" size="939897" author="fverzell" created="Wed, 20 Jun 2012 02:27:49 +0000"/>
                            <attachment id="11634" name="weiss01.log" size="185117" author="fverzell" created="Wed, 20 Jun 2012 02:27:49 +0000"/>
                            <attachment id="11635" name="weiss11.log" size="58118" author="fverzell" created="Wed, 20 Jun 2012 02:27:49 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvguf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6378</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>