<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:24:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16126] Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using large number of clients</title>
                <link>https://jira.whamcloud.com/browse/LU-16126</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I have been trying to tune the&#160;&lt;tt&gt;max_rpcs_in_progress&lt;/tt&gt;&#160;parameter on the MDTs in my system. Environment:&lt;br/&gt;
64 MDTs, one per MDS&lt;br/&gt;
64 OSTs, one per OSS&lt;br/&gt;
512 clients&lt;br/&gt;
All using 2.15.51&lt;/p&gt;

&lt;p&gt;On the MDTs, the&#160;&lt;tt&gt;osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress&lt;/tt&gt;&#160;parameter defaults to 0. If I change this value to anything higher (even 1) and run mdtest with 16 processes per node (PPN) on all 512 clients, I start to see MPI errors:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;V-1: Rank   0 Line  2565    Operation               Duration              Rate
V-1: Rank   0 Line  2566    ---------               --------              ----
V-1: Rank   0 Line  1957 main: * iteration 1 *
V-2: Rank   0 Line  1966 main (for j loop): making o.testdir, &apos;/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0&apos;
V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0, currDepth = 0...
V-2: Rank   0 Line  1889 Making directory &apos;/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/&apos;
V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/, currDepth = 1...
V-1: Rank   0 Line  2033 V-1: main:   Tree creation     :          2.250 sec,          0.444 ops/sec
delaying 30 seconds . . .
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7328
srun: error: mo0627: tasks 7328-7343: Exited with exit code 255
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7329
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7330
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7331
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7332
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7333
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The mdtest log does not show a root cause, and the client and MDS logs don&apos;t provide any clues either. Interestingly, if I use fewer clients (for example, 16), I can set the parameter to 4096 or higher without any issues.&lt;/p&gt;

&lt;p&gt;It seems like I may have narrowed this down to commit&#160;&lt;tt&gt;23028efcae &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6864&quot; title=&quot;DNE3: Support multiple modify RPCs in flight for MDT-MDT connection&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6864&quot;&gt;&lt;del&gt;LU-6864&lt;/del&gt;&lt;/a&gt; osp: manage number of modify RPCs in flight&lt;/tt&gt;. If I revert this commit, I cannot reproduce the error.&lt;/p&gt;</description>
                <environment></environment>
        <key id="72079">LU-16126</key>
            <summary>Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using large number of clients</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="adilger">Andreas Dilger</assignee>
                                    <reporter username="koutoupis">Petros Koutoupis</reporter>
                        <labels>
                    </labels>
                <created>Tue, 30 Aug 2022 13:04:16 +0000</created>
                <updated>Mon, 23 Jan 2023 15:21:37 +0000</updated>
                            <resolved>Wed, 7 Sep 2022 18:58:48 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="345110" author="ofaaland" created="Tue, 30 Aug 2022 18:47:15 +0000"  >&lt;p&gt;Hi Petros,&lt;br/&gt;
You state you tested with Lustre 2.15, and you mention that the problem seems to be commit:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;23028efcae &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6864&quot; title=&quot;DNE3: Support multiple modify RPCs in flight for MDT-MDT connection&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6864&quot;&gt;&lt;del&gt;LU-6864&lt;/del&gt;&lt;/a&gt; osp: manage number of modify RPCs in flight&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Since this commit landed to master &lt;ins&gt;&lt;em&gt;after&lt;/em&gt;&lt;/ins&gt; 2.15.0, that confused me and may confuse others (when someone says &quot;2.15&quot; I assume they mean something on branch b2_15). If you change the description and the summary of this ticket to say &quot;2.15.50&quot; or &quot;master post-2.15.0&quot; that would help - thanks.&lt;/p&gt;</comment>
                            <comment id="345112" author="simmonsja" created="Tue, 30 Aug 2022 18:51:25 +0000"  >&lt;p&gt;Do you have kernel logs? I do wonder if this is related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16065&quot; title=&quot;replay-single test_81a: rm remote dir failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16065&quot;&gt;LU-16065&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="345121" author="adilger" created="Tue, 30 Aug 2022 19:48:15 +0000"  >&lt;p&gt;Note that we don&apos;t regularly test with 64 MDTs, so there may be some kind of thread starvation happening if all of the MDTs are sending RPCs to each other and the service thread count is not high enough.&lt;/p&gt;

&lt;p&gt;Even before debug logs, getting the console logs from the client and MDS nodes would be useful (e.g. if clients are being evicted, stacks are dumped, etc.)&lt;/p&gt;</comment>
                            <comment id="345928" author="koutoupis" created="Wed, 7 Sep 2022 18:58:48 +0000"  >&lt;p&gt;I went back to collect more data and unfortunately, I am unable to reproduce the original issue. Closing this ticket unless I can observe the problem again.&lt;/p&gt;</comment>
                            <comment id="353188" author="nrutman" created="Wed, 16 Nov 2022 16:55:48 +0000"  >&lt;p&gt;Another problem: you can&apos;t get it back to 0:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[root@cslmo4902 ~]# lctl get_param *.*.max_rpcs_in_progress
osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
osp.testfs-MDT0001-osp-MDT0000.max_rpcs_in_progress=0
[root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
[root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
error: set_param: setting /sys/fs/lustre/osp/testfs-MDT0000-osp-MDT0001/max_rpcs_in_progress=0: Numerical result out of range
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="353189" author="koutoupis" created="Wed, 16 Nov 2022 17:25:42 +0000"  >&lt;p&gt;Nathan, I did notice that but forgot about that quirk until now. Thank you for refreshing my memory.&lt;/p&gt;</comment>
                            <comment id="353193" author="simmonsja" created="Wed, 16 Nov 2022 18:01:00 +0000"  >&lt;p&gt;That is the same behavior for mdc.*.max_rpcs_in_progress. It&apos;s due to the test:&lt;/p&gt;

&lt;p&gt;&#160;&#160;if (max &amp;gt; OBD_MAX_RIF_MAX || max &amp;lt; 1) &lt;br/&gt;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;return -ERANGE;&lt;/p&gt;

&lt;p&gt;in obd_set_max_rpcs_in_flight(). Setting it back to 1 restores the default setting. Zero would mean don&apos;t send anything, which we don&apos;t want.&lt;/p&gt;</comment>
                            <comment id="353207" author="adilger" created="Wed, 16 Nov 2022 18:57:31 +0000"  >&lt;p&gt;So maybe the issue here is that the max_rpcs_in_progress value shouldn&apos;t show &quot;0&quot; initially?&lt;/p&gt;</comment>
                            <comment id="353237" author="simmonsja" created="Wed, 16 Nov 2022 21:45:54 +0000"  >&lt;p&gt;Yeah, the initial value is wrong.&lt;/p&gt;</comment>
                            <comment id="353364" author="gerrit" created="Thu, 17 Nov 2022 17:12:41 +0000"  >&lt;p&gt;&quot;James Simmons &amp;lt;jsimmons@infradead.org&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49182&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49182&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16126&quot; title=&quot;Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using large number of clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16126&quot;&gt;&lt;del&gt;LU-16126&lt;/del&gt;&lt;/a&gt; ldlm: set default rpcs_in_flight to one&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ee2925431a7c2fb0559fd02a4e10a9b4a85f57aa&lt;/p&gt;</comment>
                            <comment id="360052" author="simmonsja" created="Mon, 23 Jan 2023 15:21:37 +0000"  >&lt;p&gt;I took another look at this and in my testing I see:&lt;/p&gt;

&lt;p&gt;/usr/src/lustre-2.15.52/lustre/tests# ../utils/lctl get_param *.*.max_rpcs_* &lt;br/&gt;
mdc.lustre-MDT0000-mdc-ffff997ccd58f000.max_rpcs_in_flight=8 &lt;br/&gt;
osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8 &lt;br/&gt;
osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096 &lt;br/&gt;
osc.lustre-OST0000-osc-ffff997ccd58f000.max_rpcs_in_flight=8 &lt;br/&gt;
osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8 &lt;br/&gt;
osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096 &lt;br/&gt;
osc.lustre-OST0001-osc-ffff997ccd58f000.max_rpcs_in_flight=8 &lt;br/&gt;
osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8 &lt;br/&gt;
osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096 &lt;br/&gt;
osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8 &lt;br/&gt;
osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096&lt;/p&gt;



&lt;p&gt;Nathan, what version of Lustre are you running where you see a setting of zero?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="31120">LU-6864</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02yh3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>