<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:23:06 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2187] Why are we losing messages?</title>
                <link>https://jira.whamcloud.com/browse/LU-2187</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;According to the analysis of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1717&quot; title=&quot;mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid XXX has mismatched opc: new 101 old 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1717&quot;&gt;&lt;del&gt;LU-1717&lt;/del&gt;&lt;/a&gt; we are frequently losing Lustre messages on Sequoia&apos;s IB network.  We have no LNet routers, and IB is a reliable network.  We are not seeing any timeouts or lnet errors that would suggest that we are seeing IB transmission problems.&lt;/p&gt;

&lt;p&gt;Why are messages being lost that can result in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1717&quot; title=&quot;mdt_recovery.c:611:mdt_steal_ack_locks()) Resent req xid XXX has mismatched opc: new 101 old 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1717&quot;&gt;&lt;del&gt;LU-1717&lt;/del&gt;&lt;/a&gt; error messages?  I&apos;m worried that we&apos;re papering over a larger problem by silencing those errors.&lt;/p&gt;</description>
                <environment></environment>
        <key id="16373">LU-2187</key>
            <summary>Why are we losing messages?</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="doug">Doug Oucharek</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>sequoia</label>
                    </labels>
                <created>Mon, 15 Oct 2012 22:56:12 +0000</created>
                <updated>Thu, 13 Feb 2014 01:54:58 +0000</updated>
                            <resolved>Thu, 13 Feb 2014 01:54:58 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="46742" author="doug" created="Thu, 18 Oct 2012 13:42:51 +0000"  >&lt;p&gt;Question: Do we know if the Lustre messages are being &quot;lost&quot; or are just delayed to the point that they are considered lost by the Lustre code in question?&lt;/p&gt;

&lt;p&gt;The IB network is reliable, but can delay messages to the point we take action assuming the message is lost.&lt;/p&gt;</comment>
                            <comment id="46759" author="morrone" created="Thu, 18 Oct 2012 20:23:03 +0000"  >&lt;p&gt;I can&apos;t say definitively, but I strongly suspect that the messages aren&apos;t really lost. Rather, this is more bad Lustre behaviour: making poor AT assumptions, delaying the sending of messages, or something similar.&lt;/p&gt;</comment>
                            <comment id="46817" author="isaac" created="Sun, 21 Oct 2012 01:40:20 +0000"  >&lt;p&gt;There are a few things to do to diagnose it:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;/proc/sys/lnet/* would be very useful. Please gather them once such errors happen again.&lt;/li&gt;
	&lt;li&gt;&quot;lctl --net o2ib0 conn_list (or list_conn)&quot; gives useful data at o2iblnd layer. Please run it once such errors happen again.&lt;/li&gt;
	&lt;li&gt;Many LNet/LND errors don&apos;t go to dmesg by default. On a node where such errors have occurred, please run &apos;echo +neterror &amp;gt; /proc/sys/lnet/printk&apos; after each reboot.&lt;/li&gt;
	&lt;li&gt;Please also check IB errors by running &apos;ibcheckerrors&apos;.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="46844" author="morrone" created="Mon, 22 Oct 2012 14:52:51 +0000"  >&lt;p&gt;Isaac, I think it is very unlikely that we&apos;ll see anything useful in there.  I&apos;ll certainly keep an eye on it, but I do not believe that we&apos;re seeing real network problems here.  We&apos;ve investigated the network pretty thoroughly and it all checks out.&lt;/p&gt;

&lt;p&gt;I think we need to look higher in the stack.  I don&apos;t think we&apos;ve really lost the message.  But my suspicion is that the server sent the reply too slowly, and the client sent the retry before the server ever replied once.&lt;/p&gt;

&lt;p&gt;So how do we investigate that problem?&lt;/p&gt;</comment>
                            <comment id="46935" author="doug" created="Thu, 25 Oct 2012 18:12:45 +0000"  >&lt;p&gt;When this happens, are there any ptlrpc timeout logs in the console?  Theoretically, if there is a resend, there should be a timeout log.&lt;/p&gt;</comment>
                            <comment id="47592" author="green" created="Thu, 8 Nov 2012 11:47:08 +0000"  >&lt;p&gt;I remember a set of patches from llnl that disabled all that &quot;noise&quot; like printing about resent RPCs. I wonder if this is applied by default on their deployments?&lt;/p&gt;</comment>
                            <comment id="47734" author="doug" created="Tue, 13 Nov 2012 13:10:26 +0000"  >&lt;p&gt;Chris: Have you been able to &quot;turn up&quot; the RPC logging to see if there are resend logs?&lt;/p&gt;</comment>
                            <comment id="47787" author="johann" created="Wed, 14 Nov 2012 10:22:23 +0000"  >&lt;p&gt;Chris, could you please advise whether you quiet the message in ptlrpc_expire_one_request() displayed when a timeout happens?&lt;/p&gt;</comment>
                            <comment id="47800" author="morrone" created="Wed, 14 Nov 2012 14:08:19 +0000"  >&lt;p&gt;We do change it from D_WARNING to D_NETERROR.  Our code is in git on github.com/chaos/lustre.  Our most recent branch is &lt;a href=&quot;https://github.com/chaos/lustre/tree/2.3.54-llnl&quot; title=&quot;2.3.54-llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;2.3.54-llnl&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="47943" author="isaac" created="Fri, 16 Nov 2012 13:16:57 +0000"  >&lt;p&gt;Chris, D_NETERROR messages only go to the Lustre debug log without an explicit &apos;echo +neterror &amp;gt; /proc/sys/lnet/printk&apos;.&lt;/p&gt;</comment>
                            <comment id="76929" author="morrone" created="Thu, 13 Feb 2014 01:54:58 +0000"  >&lt;p&gt;We will never get back to this one.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="15430">LU-1717</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvagn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5230</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>