<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:08:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7362] During our larger-scale testing, DVS was accidentally started on a router, which caused LNet to crash the kernel</title>
                <link>https://jira.whamcloud.com/browse/LU-7362</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;While we were testing the latest Lustre pre-2.8 release on one of our Cray systems, DVS was enabled by mistake on a router, and LNet then crashed with the following backtrace:&lt;/p&gt;

&lt;p&gt;2015-10-27T11:39:24.142188-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker&lt;br/&gt;
2015-10-27T11:39:24.142197-04:00 c0-1c1s1n1 Call Trace:&lt;br/&gt;
2015-10-27T11:39:24.142213-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81006651&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
2015-10-27T11:39:24.142225-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004eb9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x430&lt;br/&gt;
2015-10-27T11:39:24.142240-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa025b897&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x57/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-10-27T11:39:24.142255-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa025bde7&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x47/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-10-27T11:39:24.181560-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02f29b6&amp;gt;&amp;#93;&lt;/span&gt; lnet_router_checker+0x566/0x5a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lnet&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-10-27T11:39:24.181581-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81067ace&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xb0&lt;br/&gt;
2015-10-27T11:39:24.181609-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81490074&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;br/&gt;
2015-10-27T11:39:24.181616-04:00 c0-1c1s1n1 Kernel panic - not syncing: LBUG&lt;br/&gt;
2015-10-27T11:39:24.181627-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker Tainted: P             3.0.101-0.46.1_1.0502.8871-cray_gem_s #1&lt;br/&gt;
2015-10-27T11:39:24.211395-04:00 c0-1c1s1n1 Call Trace:&lt;br/&gt;
2015-10-27T11:39:24.211415-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81006651&amp;gt;&amp;#93;&lt;/span&gt; try_stack_unwind+0x161/0x1a0&lt;br/&gt;
2015-10-27T11:39:24.211422-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81004eb9&amp;gt;&amp;#93;&lt;/span&gt; dump_trace+0x89/0x430&lt;br/&gt;
2015-10-27T11:39:24.211476-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810060bc&amp;gt;&amp;#93;&lt;/span&gt; show_trace_log_lvl+0x5c/0x80&lt;br/&gt;
2015-10-27T11:39:24.211488-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810060f5&amp;gt;&amp;#93;&lt;/span&gt; show_trace+0x15/0x20&lt;br/&gt;
2015-10-27T11:39:24.211515-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8148b31c&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x79/0x84&lt;br/&gt;
2015-10-27T11:39:24.211531-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8148b3bb&amp;gt;&amp;#93;&lt;/span&gt; panic+0x94/0x1da&lt;br/&gt;
2015-10-27T11:39:24.211560-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa025be4b&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0xab/0xc0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-10-27T11:39:24.211579-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa02f29b6&amp;gt;&amp;#93;&lt;/span&gt; lnet_router_checker+0x566/0x5a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lnet&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-10-27T11:39:24.211586-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81067ace&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x9e/0xb0&lt;br/&gt;
2015-10-27T11:39:24.241857-04:00 c0-1c1s1n1 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81490074&amp;gt;&amp;#93;&lt;/span&gt; kernel_thread_helper+0x4/0x10&lt;/p&gt;

&lt;p&gt;While DVS is an external utility on top of LNet, it shouldn&apos;t be able to crash an LNet router.&lt;/p&gt;</description>
                <environment>Cray Routers running latest Lustre pre-2.8</environment>
        <key id="32921">LU-7362</key>
            <summary>During our larger-scale testing, DVS was accidentally started on a router, which caused LNet to crash the kernel</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="doug">Doug Oucharek</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                    </labels>
                <created>Fri, 30 Oct 2015 14:39:20 +0000</created>
                <updated>Thu, 13 Oct 2016 17:07:03 +0000</updated>
                            <resolved>Wed, 18 Nov 2015 01:39:44 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="132187" author="doug" created="Fri, 30 Oct 2015 17:40:29 +0000"  >&lt;p&gt;James: This appears to be from an LASSERT.  Can you tell us which of the two possible asserts in the router checker routine this is coming from: &lt;/p&gt;

&lt;p&gt;LASSERT (the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING);&lt;br/&gt;
or&lt;br/&gt;
LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_STOPPING);&lt;/p&gt;

&lt;p&gt;It seems likely to be the first case (check for RUNNING).  Initial thought: we are starting up and have launched the router checker thread.  Before it can run, DVS stops LNet (for some reason).  That changes the state to STOPPING before the router checker thread starts and hits that first assert.&lt;/p&gt;

&lt;p&gt;Personally, I hate asserts for checks like this.  I&apos;d like to address this ticket by removing that first assert and letting the router checker loop terminate immediately because the state has changed (i.e. neither of these asserts is really protecting us from anything, and neither is a valid reason for crashing a system).&lt;/p&gt;
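
&lt;p&gt;A minimal sketch of that idea, assuming a simplified model rather than the real router checker code.  The names rc_state and rc_checker_passes are hypothetical and do not come from the Lustre tree, and this is not the actual patch.  With no assert on entry, a stop request that arrives before the thread first runs simply means the loop body never executes:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```c
/* Hypothetical stand-ins for the LNET_RC_STATE_* values (assumed names). */
enum rc_state {
	RC_STATE_SHUTDOWN = 0,	/* not started */
	RC_STATE_RUNNING = 1,	/* started up OK */
	RC_STATE_STOPPING = 2	/* telling thread to stop */
};

/*
 * Simplified model of a checker thread body with no entry assertion.
 * "state" is the state seen when the thread first runs, and
 * "stop_after" is the number of check passes after which a stop
 * request is simulated. Returns the number of passes executed.
 */
unsigned int rc_checker_passes(enum rc_state state, unsigned int stop_after)
{
	unsigned int passes = 0;

	/* Any state other than RUNNING skips the loop instead of asserting. */
	while (state == RC_STATE_RUNNING) {
		if (passes == stop_after)
			state = RC_STATE_STOPPING;	/* simulated stop request */
		else
			passes++;			/* one simulated round of checks */
	}
	return passes;
}
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model, rc_checker_passes(RC_STATE_STOPPING, 3) returns 0 because the loop is never entered, while rc_checker_passes(RC_STATE_RUNNING, 3) returns 3.&lt;/p&gt;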

&lt;p&gt;As DVS is a Cray tool, could someone at Cray comment on whether DVS would be stopping LNet immediately upon startup like this?&lt;/p&gt;</comment>
                            <comment id="132189" author="ezell" created="Fri, 30 Oct 2015 17:51:14 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; sym the_lnet
ffffffffa030cb60 (B) the_lnet [lnet]
crash&amp;gt; lnet_t ffffffffa030cb60 | grep ln_rc_state
  ln_rc_state = 2,
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_SHUTDOWN		0	/* not started */
lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_RUNNING		1	/* started up OK */
lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_STOPPING		2	/* telling thread to stop */
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2015-10-27T11:39:22.941791-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni110 [16/8192/0/0]
2015-10-27T11:39:22.941799-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni111 [16/8192/0/0]
2015-10-27T11:39:22.941805-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni112 [16/8192/0/0]
2015-10-27T11:39:22.941812-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib106 [63/2560/0/180]
2015-10-27T11:39:22.941818-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib225 [63/2560/0/180]
2015-10-27T11:39:24.142128-04:00 c0-1c1s1n1 DVS: dvs_lnet_init: No network ID found on configured lnd (gni100)
2015-10-27T11:39:24.142174-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) ASSERTION( the_lnet.ln_rc_state == 1 ) failed: 
2015-10-27T11:39:24.142181-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="132192" author="hornc" created="Fri, 30 Oct 2015 18:04:44 +0000"  >&lt;p&gt;DVS will call LNetNIFini() (stopping LNet) when it can&apos;t find the &quot;network ID...on configured lnd&quot;&lt;/p&gt;</comment>
                            <comment id="132193" author="doug" created="Fri, 30 Oct 2015 18:05:52 +0000"  >&lt;p&gt;Thanks Matt.  That seems to confirm my suspicion.  I&apos;m assuming that dvs_lnet_init() stops LNet because it cannot find the network ID it is expecting.&lt;/p&gt;

&lt;p&gt;So, I&apos;m proposing to remove the offending assert.  Rather than protecting us from anything (which it does not), it has become a problem.&lt;/p&gt;</comment>
                            <comment id="132233" author="gerrit" created="Fri, 30 Oct 2015 21:47:51 +0000"  >&lt;p&gt;Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/17003&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17003&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7362&quot; title=&quot;During our larger-scale testing, DVS was accidentally started on a router, which caused LNet to crash the kernel&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7362&quot;&gt;&lt;del&gt;LU-7362&lt;/del&gt;&lt;/a&gt; lnet: Remove LASSERTS from router checker&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: e6d6f7ab62877c5abc445734f7eaef4a81863796&lt;/p&gt;</comment>
                            <comment id="133474" author="gerrit" created="Fri, 13 Nov 2015 18:26:41 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/17003/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17003/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7362&quot; title=&quot;During our larger-scale testing, DVS was accidentally started on a router, which caused LNet to crash the kernel&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7362&quot;&gt;&lt;del&gt;LU-7362&lt;/del&gt;&lt;/a&gt; lnet: Remove LASSERTS from router checker&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: df6cf859bbb29392064e6ddb701f3357e01b3a13&lt;/p&gt;</comment>
                            <comment id="133802" author="jgmitter" created="Wed, 18 Nov 2015 01:39:44 +0000"  >&lt;p&gt;Landed for 2.8&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxrvr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>