<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:33:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3447] Client RDMA too fragmented: 128/255 src 128/256 dst frags</title>
                <link>https://jira.whamcloud.com/browse/LU-3447</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;During an IOR-like benchmark doing direct I/O from multiple clients (16, 64), clients get disconnected and evicted. The MPI process dies in misery and some of its processes aren&apos;t even killable.&lt;/p&gt;

&lt;p&gt;We&apos;ve seen that there was a similar bug a while ago that was marked as solved; it was occurring on LNet routers (&lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=13607&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=13607&lt;/a&gt;). This one is on clients.&lt;/p&gt;

&lt;p&gt;What can lead to the &quot;RDMA too fragmented&quot; issue? Any hint or suggestion? Client log messages are in the attached file.&lt;/p&gt;

&lt;p&gt;Regards,&lt;br/&gt;
Erich&lt;/p&gt;</description>
                <environment>Lustre servers running 2.1.5, Lustre clients with 1.8.9.</environment>
        <key id="19350">LU-3447</key>
            <summary>Client RDMA too fragmented: 128/255 src 128/256 dst frags</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="efocht">Erich Focht</reporter>
                        <labels>
                            <label>client</label>
                    </labels>
                <created>Mon, 10 Jun 2013 17:08:43 +0000</created>
                <updated>Sat, 15 Mar 2014 01:21:36 +0000</updated>
                            <resolved>Sat, 15 Mar 2014 01:21:36 +0000</resolved>
                                    <version>Lustre 2.1.5</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="60350" author="efocht" created="Tue, 11 Jun 2013 14:42:44 +0000"  >&lt;p&gt;Increasing the MTT size on the client nodes seems to solve the problem. For instructions: &lt;a href=&quot;http://community.mellanox.com/docs/DOC-1120&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://community.mellanox.com/docs/DOC-1120&lt;/a&gt;&lt;br/&gt;
We&apos;ve set log_num_mtt to 24.&lt;/p&gt;

&lt;p&gt;Having a more meaningful error message would be nice.&lt;/p&gt;

&lt;p&gt;This bug can be closed.&lt;/p&gt;</comment>
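<!--
A minimal sketch of why raising log_num_mtt can help, assuming the clients use
the Mellanox mlx4_core driver (the ticket does not name the driver) and the
commonly cited sizing rule: max registrable memory = 2^log_num_mtt *
2^log_mtts_per_seg * page size. The function name, the 4 KiB page size and the
log_mtts_per_seg values are illustrative assumptions, not values from this ticket.

```python
GIB = 2 ** 30


def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on registrable memory implied by the MTT module parameters.

    Assumes the rule 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
    """
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


# log_num_mtt=24 is the value reported in the comment above;
# the log_mtts_per_seg values (0 and 3) are assumptions for illustration.
for mtts_per_seg in (0, 3):
    size = max_registrable_bytes(24, mtts_per_seg)
    print(f"log_num_mtt=24 log_mtts_per_seg={mtts_per_seg}: {size // GIB} GiB")
# prints:
# log_num_mtt=24 log_mtts_per_seg=0: 64 GiB
# log_num_mtt=24 log_mtts_per_seg=3: 512 GiB
```

If the computed bound is smaller than the memory a client has to register for
RDMA, registration can fail. The values in effect on a node can typically be
read from /sys/module/mlx4_core/parameters/ before drawing conclusions.
-->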
                            <comment id="60432" author="bfaccini" created="Wed, 12 Jun 2013 12:30:21 +0000"  >&lt;p&gt;Hello Erich,&lt;br/&gt;
Thanks for the hint that solved the issue on your side.&lt;br/&gt;
For completeness, it would be worth trying the &quot;map_on_demand&quot; dynamic feature (an o2iblnd proc/module parameter, which has to be set on all nodes), as it may also be a way to fix this kind of problem.&lt;/p&gt;
</comment>
                            <comment id="60438" author="pjones" created="Wed, 12 Jun 2013 14:15:54 +0000"  >&lt;p&gt;Bruno&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="60527" author="efocht" created="Thu, 13 Jun 2013 09:00:12 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Is that option available on 1.8.9 as well as on 2.X? Thanks for pointing me to it!&lt;/p&gt;

&lt;p&gt;It is difficult to do that in the customer&apos;s environment if we need to set this on both clients and servers: he has 3-4 Lustre filesystems (not all from us), a mix of versions, and 3.5k clients. But I&apos;ll try to find an opportunity to do it and discuss it with the customer.&lt;/p&gt;

&lt;p&gt;Best regards,&lt;br/&gt;
Erich&lt;/p&gt;</comment>
                            <comment id="60652" author="bfaccini" created="Fri, 14 Jun 2013 13:36:34 +0000"  >&lt;p&gt;Hello Erich,&lt;br/&gt;
Working more on this very infrequent problem, it seems highly possible that it is caused by the upper layer/application doing big and unaligned I/Os. Since you indicated that your customer got it when running an MPI application doing direct I/Os, can you also check on his side whether these I/Os could be unaligned (with respect to page boundaries), and what their size is?&lt;/p&gt;</comment>
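<!--
A minimal sketch of the alignment check asked for above, assuming 4 KiB pages
and a 1 MiB transfer size (neither is stated in this ticket). The helper names
are hypothetical. It shows that a buffer which does not start on a page boundary
touches one more page than an aligned buffer of the same length, which is one
way a transfer could exceed a per-transfer fragment limit.

```python
PAGE_SIZE = 4096   # assumed client page size
MIB = 1024 * 1024  # assumed transfer size for the example


def is_page_aligned(address, length, page_size=PAGE_SIZE):
    """True if both the start address and the length are page multiples."""
    return address % page_size == 0 and length % page_size == 0


def pages_spanned(address, length, page_size=PAGE_SIZE):
    """Number of pages touched by the byte range [address, address + length)."""
    if length == 0:
        return 0
    first_page = address // page_size
    last_page = (address + length - 1) // page_size
    return last_page - first_page + 1


print(is_page_aligned(0, MIB), pages_spanned(0, MIB))      # True 256
print(is_page_aligned(512, MIB), pages_spanned(512, MIB))  # False 257
```

The same check applies to file offsets as well as to buffer addresses; both are
worth collecting from the application when answering the question above.
-->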
                            <comment id="62180" author="bfaccini" created="Fri, 12 Jul 2013 13:23:37 +0000"  >&lt;p&gt;Hello Erich,&lt;br/&gt;
Any news on your side?&lt;/p&gt;</comment>
                            <comment id="62966" author="efocht" created="Thu, 25 Jul 2013 13:50:19 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Unfortunately we cannot use the module option there. It is a huge environment with several Lustre setups, and the customer is not willing to switch that option over everywhere, which we&apos;d need to do (as far as I understand) on clients as well as on servers. So we can&apos;t switch the clients over selectively. But we will test it as soon as we can on another (upcoming) installation.&lt;/p&gt;

&lt;p&gt;Regards,&lt;br/&gt;
Erich&lt;/p&gt;</comment>
                            <comment id="78809" author="jfc" created="Sat, 8 Mar 2014 02:06:58 +0000"  >&lt;p&gt;Erich,&lt;br/&gt;
Do you want us to keep this ticket open?&lt;br/&gt;
Maybe you have had a chance to test the issue on a later installation?&lt;br/&gt;
Thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="79398" author="jfc" created="Sat, 15 Mar 2014 01:21:36 +0000"  >&lt;p&gt;The customer was able to resolve the problem. Nothing more is required here.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="13020" name="client_log_messages" size="10781" author="efocht" created="Mon, 10 Jun 2013 17:08:43 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvsyf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8618</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>