<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:17:58 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8485] workqueue overflows with mlx5 on power8 platforms.</title>
                <link>https://jira.whamcloud.com/browse/LU-8485</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In my testing on the Power8 platform, I occasionally see the following errors on the clients, and Lustre becomes unusable.&lt;/p&gt;

&lt;p&gt;[ 3499.198051] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7712): work queue overflow&lt;br/&gt;
[ 3499.198176] mlx5_warn:mlx5_0:mlx5_ib_post_send:4112:(pid 7712): Failed to prepare WQE&lt;br/&gt;
[ 3499.198209] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7715): work queue overflow&lt;br/&gt;
[ 3499.198240] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000001772778c00&lt;br/&gt;
[ 3499.198428] mlx5_warn:mlx5_0:mlx5_ib_post_send:4112:(pid 7715): Failed to prepare WQE&lt;br/&gt;
[ 3499.198527] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000000788600c00&lt;br/&gt;
[ 3499.199804] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06800&lt;br/&gt;
[ 3499.199928] LustreError: 7714:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000788602200&lt;br/&gt;
[ 3499.200740] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c00000077cec7400&lt;br/&gt;
[ 3499.201667] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c00000039da2f400&lt;br/&gt;
[ 3499.202216] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000780129c00&lt;br/&gt;
[ 3499.202422] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e270c3000&lt;br/&gt;
[ 3499.202642] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000001b98441800&lt;br/&gt;
[ 3499.202864] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000c6d9fd600&lt;br/&gt;
[ 3499.203091] LustreError: 7714:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000dd0309200&lt;br/&gt;
[ 3499.203942] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06200&lt;br/&gt;
[ 3499.558222] LNet: 7659:0:(o2iblnd_cb.c:1360:kiblnd_reconnect_peer()) Abort reconnection of 10.37.248.77@o2ib1: connected&lt;br/&gt;
[ 3499.558317] LNet: 7659:0:(o2iblnd_cb.c:1360:kiblnd_reconnect_peer()) Skipped 4 previous similar messages&lt;/p&gt;</description>
                <environment>Power8 client nodes running RHEL7.2 with Mellanox OFED 3.2-1.04</environment>
        <key id="38686">LU-8485</key>
            <summary>workqueue overflows with mlx5 on power8 platforms.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="doug">Doug Oucharek</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                    </labels>
                <created>Mon, 8 Aug 2016 21:50:24 +0000</created>
                <updated>Fri, 12 Aug 2016 21:32:45 +0000</updated>
                            <resolved>Fri, 12 Aug 2016 21:13:59 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                    <version>Lustre 2.9.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="161216" author="pjones" created="Mon, 8 Aug 2016 22:39:43 +0000"  >&lt;p&gt;Doug is looking into this&lt;/p&gt;</comment>
                            <comment id="161221" author="doug" created="Mon, 8 Aug 2016 23:08:09 +0000"  >&lt;p&gt;James, is this failure with or without your patch: &lt;a href=&quot;http://review.whamcloud.com/21304/?&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21304/?&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="161223" author="simmonsja" created="Mon, 8 Aug 2016 23:15:45 +0000"  >&lt;p&gt;With the patch, since without patch 21304 ko2iblnd doesn&apos;t work on Power8 platforms. A small bug exists in the patch that I submitted, but I have a version locally that appears to work.&lt;/p&gt;</comment>
                            <comment id="161226" author="doug" created="Mon, 8 Aug 2016 23:35:18 +0000"  >&lt;p&gt;Are NETERRORS turned on?  I&apos;m curious to see whether o2iblnd has any messages that could help us.&lt;/p&gt;</comment>
                            <comment id="161228" author="simmonsja" created="Tue, 9 Aug 2016 00:11:43 +0000"  >&lt;p&gt;Yes they are on. It will take me some time to get any lctl debug logs since this problem happens randomly.&lt;/p&gt;</comment>
                            <comment id="161229" author="simmonsja" created="Tue, 9 Aug 2016 00:35:28 +0000"  >&lt;p&gt;Here is an lctl dump from my Power8 client nodes. On the server side we are using standard x86_64 platforms, which is why we are having issues.&lt;/p&gt;</comment>
                            <comment id="161693" author="doug" created="Thu, 11 Aug 2016 23:15:43 +0000"  >&lt;p&gt;Please see my comments on &lt;a href=&quot;http://review.whamcloud.com/21304/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21304/&lt;/a&gt;.  A solution to this ticket may come about when addressing my comments.&lt;/p&gt;</comment>
                            <comment id="161701" author="simmonsja" created="Fri, 12 Aug 2016 02:06:11 +0000"  >&lt;p&gt;I updated the patch and the problem still exists. I will push a new version of the 21304 patch.&lt;/p&gt;</comment>
                            <comment id="161792" author="doug" created="Fri, 12 Aug 2016 21:13:49 +0000"  >&lt;p&gt;After a debugging session, we seem to have tracked down a few problems with the 21304 patch and are close to having a version which works without problems (i.e. overflows).  As such, I&apos;m going to mark this ticket resolved and leave the final changes to 21304 in its ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7650&quot; title=&quot;ko2iblnd map_on_demand can&amp;#39;t negotitate when page sizes are different between nodes.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7650&quot;&gt;&lt;del&gt;LU-7650&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="161794" author="simmonsja" created="Fri, 12 Aug 2016 21:32:45 +0000"  >&lt;p&gt;Worked with Doug to track down the issue reported in this ticket. The main problem was that the IBLND_SEND_WRS macro in o2iblnd did not create a deep enough queue. It was using the local frag size (16), but it needs to assume the worst case of working with an external node (256), so the queue ended up too small and was easily overrun. The latest patch &lt;a href=&quot;http://review.whamcloud.com/21304&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21304&lt;/a&gt; should address these problems.&lt;/p&gt;

&lt;p&gt;Even with the fixes in the 21304 patch, the queues were still too small. The solution was to reduce concurrent_sends from 63 down to 31, after which Lustre started to function in all my test cases.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="29170">LU-6387</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="22502" name="dump.log" size="228422" author="simmonsja" created="Tue, 9 Aug 2016 00:35:28 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyjxb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>