<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:29:02 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9765] NMI watchdog - OPA &lt;-&gt; IB LNET router</title>
                <link>https://jira.whamcloud.com/browse/LU-9765</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We now have two routers, soak-14 and soak-15, routing from IB to OPA clients.&lt;br/&gt;
After several hours of running, soak-15 crashed hard, with multiple NMIs on multiple CPUs.&lt;br/&gt;
The output is somewhat messy.&lt;br/&gt;
A crash dump is available on the node. The vmcore-dmesg and console log are attached.&lt;/p&gt;
</description>
                <environment>Soak performance cluster</environment>
        <key id="47205">LU-9765</key>
            <summary>NMI watchdog - OPA &lt;-&gt; IB LNET router</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="cliffw">Cliff White</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Wed, 12 Jul 2017 15:07:19 +0000</created>
                <updated>Thu, 13 Jul 2017 21:19:45 +0000</updated>
                                            <version>Lustre 2.10.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="201847" author="pjones" created="Wed, 12 Jul 2017 16:12:36 +0000"  >&lt;p&gt;Amir is investigating&lt;/p&gt;</comment>
                            <comment id="201861" author="dmiter" created="Wed, 12 Jul 2017 16:52:03 +0000"  >&lt;p&gt;What version of IFS is installed? This looks very similar to an IFS issue in which schedule() is called with a lock held. That issue is fixed in IFS 10.4.&lt;/p&gt;</comment>
                            <comment id="201862" author="ashehata" created="Wed, 12 Jul 2017 16:57:31 +0000"  >&lt;p&gt;How do I find the IFS version installed?&lt;/p&gt;</comment>
                            <comment id="201873" author="dmiter" created="Wed, 12 Jul 2017 18:22:00 +0000"  >&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# cat /etc/opa/version_wrapper
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="202069" author="ashehata" created="Thu, 13 Jul 2017 19:41:22 +0000"  >&lt;p&gt;From the core it appears that cpt 1 net lock has been unlocked one too many times:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;crash&amp;gt; p* the_lnet.ln_net_lock-&amp;gt;pcl_locks[0]
$27 = {
  {
    rlock = {
      raw_lock = {
        {
          head_tail = 171313694, 
          tickets = {
            head = 2590, 
            tail = 2614
          }
        }
      }
    }
  }
}
crash&amp;gt; p* the_lnet.ln_net_lock-&amp;gt;pcl_locks[1]
$28 = {
  {
    rlock = {
      raw_lock = {
        {
          head_tail = 2492241038, 
          tickets = {
            head = 38030, 
            tail = 38028
          }
        }
      }
    }
  }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
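
&lt;p&gt;As a minimal sketch (not part of the original analysis), the two dumps above can be decoded in Python. It assumes, based only on the values shown, that head_tail packs a 16-bit head in the low half and a 16-bit tail in the high half; the function names are hypothetical. A negative distance means head has advanced past tail, i.e. more unlocks than locks. How many unlock calls that represents depends on the kernel&apos;s ticket increment, which is not stated here.&lt;/p&gt;

```python
# Sketch only: decode the head_tail values printed by crash above.
# Assumption: 16-bit head in the low half, 16-bit tail in the high half,
# which matches the head/tail fields shown in both dumps.
TICKET_SPAN = 65536  # 2**16 possible values per 16-bit ticket field


def decode_ticket_lock(head_tail):
    """Split a packed 32-bit head_tail word into (head, tail)."""
    head = head_tail % TICKET_SPAN
    tail = head_tail // TICKET_SPAN
    return head, tail


def ticket_distance(head, tail):
    """Signed tail - head distance, tolerant of 16-bit wraparound.

    Zero means unlocked, positive means held or contended, and negative
    means head has passed tail (more unlocks than locks).
    """
    half = TICKET_SPAN // 2
    return (tail - head + half) % TICKET_SPAN - half


for name, word in (("pcl_locks[0]", 171313694), ("pcl_locks[1]", 2492241038)):
    head, tail = decode_ticket_lock(word)
    print(name, "head", head, "tail", tail, "distance", ticket_distance(head, tail))
```

&lt;p&gt;For the values above this gives a distance of 24 for pcl_locks[0] and -2 for pcl_locks[1].&lt;/p&gt;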

&lt;p&gt;We&apos;re currently suspecting an issue on the routing path, although a code inspection didn&apos;t reveal anything obvious. We&apos;re continuing to investigate.&lt;/p&gt;

&lt;p&gt;In the meantime, Dmitry installed IFS 10.4 on the routers (soak-14 and soak-15) to avoid running into the HFI bug, which leads to a deadlock.&lt;/p&gt;

&lt;p&gt;The next time we run the soak tests using the routers, can we turn on net error logging:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl set_param debug=+neterror
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then turn on the debug daemon to capture any relevant logs during the test run.&lt;/p&gt;</comment>
                            <comment id="202076" author="dmiter" created="Thu, 13 Jul 2017 21:06:18 +0000"  >&lt;p&gt;Maybe&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9769&quot; title=&quot;Exit from function with acquired lock (lost lock).&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9769&quot;&gt;&lt;del&gt;LU-9769&lt;/del&gt;&lt;/a&gt; is related to this.&lt;/p&gt;</comment>
                            <comment id="202077" author="ashehata" created="Thu, 13 Jul 2017 21:19:45 +0000"  >&lt;p&gt;After looking at the core, it appears that all the CPUs are stuck on CPT 0. &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9769&quot; title=&quot;Exit from function with acquired lock (lost lock).&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9769&quot;&gt;&lt;del&gt;LU-9769&lt;/del&gt;&lt;/a&gt; would have an impact if an older lnetctl was used to delete a net but was given the wrong net_id, so it could potentially be an issue in that scenario. It would be good to run with that patch just in case.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="47224">LU-9769</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="27592" name="soak-15.console.gz" size="235266" author="cliffw" created="Wed, 12 Jul 2017 15:06:49 +0000"/>
                            <attachment id="27591" name="vmcore-dmesg.txt" size="1016665" author="cliffw" created="Wed, 12 Jul 2017 15:06:50 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzgj3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>