<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:48:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11931] RDMA packets sent from client to MGS are timing out </title>
                <link>https://jira.whamcloud.com/browse/LU-11931</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On a production system we have seen the following errors, which are causing clients to be evicted.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;85895.120239&amp;#93;&lt;/span&gt; LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;85895.130310&amp;#93;&lt;/span&gt; LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;123887.254790&amp;#93;&lt;/span&gt; Lustre: MGS: haven&apos;t heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995&lt;/p&gt;

&lt;p&gt;Our setup has two back-end file systems: F1, which runs a 2.8.2 server stack, and F2, which runs a 2.11 server stack with ZFS (0.7.12). All clients run the Cray 2.11 client. The LNet configuration is:&lt;/p&gt;

&lt;p&gt;F1 file system server backend with 2.8.2 stack, ldiskfs:&lt;/p&gt;

&lt;p&gt;&#160; &#160; map_on_demand:0&lt;/p&gt;

&lt;p&gt;&#160; &#160; concurrent_sends:0&lt;/p&gt;

&lt;p&gt;&#160; &#160; peer_credits:8&lt;/p&gt;

&lt;p&gt;F2 file system server 2.11 (ZFS 0.7.12)&lt;/p&gt;

&lt;p&gt;&#160; &#160; map_on_demand:1&lt;/p&gt;

&lt;p&gt;&#160; &#160; concurrent_sends:63&lt;/p&gt;

&lt;p&gt;&#160; &#160; peer_credits:8&lt;/p&gt;

&lt;p&gt;C3 (cray 2.11 router)&lt;/p&gt;

&lt;p&gt;&#160; &#160;map_on_demand:0&lt;/p&gt;

&lt;p&gt;&#160; &#160;concurrent_sends:16&lt;/p&gt;

&lt;p&gt;&#160; &#160;peer_credits:8 (o2ib)&lt;/p&gt;

&lt;p&gt;&#160; &#160;peer_credits:16 (gni)&lt;/p&gt;

&lt;p&gt;C4 (cray 2.11 router)&lt;/p&gt;

&lt;p&gt;&#160; &#160;map_on_demand:0&lt;/p&gt;

&lt;p&gt;&#160; &#160;concurrent_sends:63&lt;/p&gt;

&lt;p&gt;&#160; &#160;peer_credits:8 (o2ib) &lt;/p&gt;

&lt;p&gt;&#160; &#160;peer_credits:16 (gni)&lt;/p&gt;

&lt;p&gt;Currently the problems are seen only with 2.11 clients on the 2.11 file system. Since F1 is 2.8 and its peer_credits are set to 8, this impacts the rest of the systems.&lt;/p&gt;</description>
                <environment>Cray CLE6 system running 2.11 clients with 2.11 servers.</environment>
        <key id="54787">LU-11931</key>
            <summary>RDMA packets sent from client to MGS are timing out </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                            <label>ORNL</label>
                    </labels>
                <created>Tue, 5 Feb 2019 19:34:05 +0000</created>
                <updated>Wed, 22 May 2019 14:15:51 +0000</updated>
                            <resolved>Sun, 21 Apr 2019 13:21:24 +0000</resolved>
                                    <version>Lustre 2.11.0</version>
                                    <fixVersion>Lustre 2.13.0</fixVersion>
                    <fixVersion>Lustre 2.12.1</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="241455" author="pjones" created="Wed, 6 Feb 2019 14:28:30 +0000"  >&lt;p&gt;James&lt;/p&gt;

&lt;p&gt;So do I understand correctly that F2 is vanilla 2.11.0 code whereas C3 and C4 have patches applied by Cray (some of which may not have landed to master yet)?&lt;/p&gt;

&lt;p&gt;Amir&lt;/p&gt;

&lt;p&gt;What do you advise here?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="241488" author="ashehata" created="Wed, 6 Feb 2019 19:05:08 +0000"  >&lt;p&gt;James and I investigated this issue. We currently suspect it&apos;s due to peer_credits being set to 8. There has been a change in 2.11:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10459&quot; title=&quot;LBUG o2iblnd_cb.c:991:kiblnd_check_sends_locked()) ASSERTION( conn-&amp;gt;ibc_nsends_posted &amp;lt;= conn-&amp;gt;ibc_queue_depth ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10459&quot;&gt;&lt;del&gt;LU-10459&lt;/del&gt;&lt;/a&gt; lnd: throttle tx based on queue depth&lt;/p&gt;

&lt;p&gt;which throttles based on the queue depth rather than on concurrent_sends, which has been removed. In 2.8 the concurrent_sends value would get set to 16. We&apos;re trying to do two things: 1) check whether we can bump the queue depth to 16 (this will require some interop testing with 2.8 on their test cluster); 2) see whether we need to change the throttling code.&lt;/p&gt;</comment>
                            <comment id="241497" author="gerrit" created="Wed, 6 Feb 2019 20:28:00 +0000"  >&lt;p&gt;Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/34200&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34200&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; title=&quot;RDMA packets sent from client to MGS are timing out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11931&quot;&gt;&lt;del&gt;LU-11931&lt;/del&gt;&lt;/a&gt; lnd: relax throttling limit&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 69f5e8e08a19f6fba884cf571de4b9a1307bed9a&lt;/p&gt;</comment>
                            <comment id="243550" author="simmonsja" created="Fri, 8 Mar 2019 14:19:06 +0000"  >&lt;p&gt;In our testing of the patch we did see this:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@f2-util01 14:46:16&amp;#93;&lt;/span&gt;#&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:LNetError: 30311:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) ASSERTION( conn-&amp;gt;ibc_nsends_posted &amp;lt;= conn-&amp;gt;ibc_queue_depth ) failed:&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:LNetError: 30312:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) ASSERTION( conn-&amp;gt;ibc_nsends_posted &amp;lt;= conn-&amp;gt;ibc_queue_depth ) failed:&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:LNetError: 30312:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) LBUG&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:LNetError: 30305:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) ASSERTION( conn-&amp;gt;ibc_nsends_posted &amp;lt;= conn-&amp;gt;ibc_queue_depth ) failed:&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:LNetError: 30305:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) LBUG&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:LNetError: 30311:0:(o2iblnd_cb.c:1019:kiblnd_check_sends_locked()) LBUG&lt;br/&gt;
Message from syslogd@f2-util01.ncrc.gov at Mar 7 15:11:37 ...&lt;br/&gt;
 kernel:Kernel panic - not syncing: LBUG&lt;/p&gt;</comment>
                            <comment id="243563" author="ashehata" created="Fri, 8 Mar 2019 18:37:52 +0000"  >&lt;p&gt;Yes, this assert needs to change. However, I&apos;m now considering that it might be a good idea to bring back concurrent_sends. Initially, I thought it was enough to limit the number of txs by queue depth, but it seems that in order to saturate the link you might want to increase concurrent_sends above the queue depth. This will lead to queued txs, but it might be necessary to make sure that we maximize the bandwidth.&lt;/p&gt;</comment>
                            <comment id="243679" author="gerrit" created="Mon, 11 Mar 2019 19:23:41 +0000"  >&lt;p&gt;Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/34396&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34396&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; title=&quot;RDMA packets sent from client to MGS are timing out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11931&quot;&gt;&lt;del&gt;LU-11931&lt;/del&gt;&lt;/a&gt; lnd: bring back concurrent_sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2dead91d20b77e6c279aa1ca048b53d6f5617b10&lt;/p&gt;</comment>
                            <comment id="245311" author="simmonsja" created="Fri, 5 Apr 2019 17:56:30 +0000"  >&lt;p&gt;We have released this patch into our production server system and it has resolved the peer credit starvation issues on the MGS that were causing client evictions. The workaround before the patch was to remove a batch of client nodes until the evictions stopped. Now, with the patch in production, we have all the clients back in use. Please consider landing this for 2.12 LTS.&lt;/p&gt;</comment>
                            <comment id="245314" author="pjones" created="Fri, 5 Apr 2019 18:22:18 +0000"  >&lt;p&gt;We&apos;ll land it to b2_12 as soon as it&apos;s landed to master. ATM it&apos;s the -1 review from you gating that. Are you willing to reconsider that -1 in light of the success of the patch or did you actually revise the patch as you have suggested before applying it in production?&lt;/p&gt;</comment>
                            <comment id="245315" author="simmonsja" created="Fri, 5 Apr 2019 18:27:30 +0000"  >&lt;p&gt;Actually there are two patches. I gave one patch a -1, but I like the other patch, which restored the concurrent_sends functionality. Also, the other patch is what we ended up running in production.&lt;/p&gt;</comment>
                            <comment id="245324" author="pjones" created="Sat, 6 Apr 2019 01:40:31 +0000"  >&lt;p&gt;Ah I see. So then we&apos;ll press to get 34396 landed and 34200 should probably move to being tracked under a new Jira reference. That way it&apos;ll be simplest to ensure that we get the desired patch into 2.12.1.&lt;/p&gt;</comment>
                            <comment id="245620" author="gerrit" created="Thu, 11 Apr 2019 21:21:02 +0000"  >&lt;p&gt;Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/34646&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34646&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; title=&quot;RDMA packets sent from client to MGS are timing out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11931&quot;&gt;&lt;del&gt;LU-11931&lt;/del&gt;&lt;/a&gt; lnd: bring back concurrent_sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f08724454d919717e8a70555cf1194acddc731ad&lt;/p&gt;</comment>
                            <comment id="246118" author="gerrit" created="Sun, 21 Apr 2019 05:47:56 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/34396/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34396/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; title=&quot;RDMA packets sent from client to MGS are timing out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11931&quot;&gt;&lt;del&gt;LU-11931&lt;/del&gt;&lt;/a&gt; lnd: bring back concurrent_sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 83e45ead69babfb2909a3157f054fcd8fdf33360&lt;/p&gt;</comment>
                            <comment id="246123" author="gerrit" created="Sun, 21 Apr 2019 06:12:07 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/34646/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34646/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; title=&quot;RDMA packets sent from client to MGS are timing out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11931&quot;&gt;&lt;del&gt;LU-11931&lt;/del&gt;&lt;/a&gt; lnd: bring back concurrent_sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 056fe83188f0a24de9e27248a7574c8fae867163&lt;/p&gt;</comment>
                            <comment id="246129" author="pjones" created="Sun, 21 Apr 2019 13:21:24 +0000"  >&lt;p&gt;OK. The main patch has landed for 2.13. 34200 should be tracked under a new Jira ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="49467">LU-10291</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="55600">LU-12279</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00b1r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>