<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:25:17 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16244] after dropping router during ior lst add_group fails with &quot;create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22&quot;, lnetctl ping fails intermittently</title>
                <link>https://jira.whamcloud.com/browse/LU-16244</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;(Update: this ticket was created while investigating another issue which we think we&apos;ve reproduced.)&lt;/p&gt;

&lt;p&gt;After running a test in which a router (mutt2) was powered off during an ior run and then brought back up, we observe problems in the logs: one of the client nodes involved in the ior (mutt18) keeps reporting that its route through mutt2 is going up and down.&lt;/p&gt;

&lt;p&gt;After mutt2 came back up, we observe lctl ping to it failing intermittently. Similarly, we see lnet_selftest fail between mutt2 and mutt18.&lt;/p&gt;

&lt;p&gt;After mutt2 is back up, we start seeing LNet and LNetError console log messages:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;On the router node that was rebooted, mutt2, console log messages like these with the NID of a compute node
	&lt;ul&gt;
		&lt;li&gt;&quot;LNetError.*kiblnd_cm_callback.*DISCONNECTED&quot;&lt;/li&gt;
		&lt;li&gt;&quot;LNetError.*kiblnd_reconnect_peer.*Abort reconnection of&quot;&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;On the router nodes that were not rebooted, console log messages like these
	&lt;ul&gt;
		&lt;li&gt;&quot;LNet:.*kiblnd_handle_rx.*PUT_NACK&quot; with the NID of a garter node (an OSS)&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;On the compute node mutt18, console log messages like these with the NID of the router that was power cycled
	&lt;ul&gt;
		&lt;li&gt;&quot;LNetError.*kiblnd_post_locked.*Error -22 posting transmit&quot;&lt;/li&gt;
		&lt;li&gt;&quot;LNetError.*lnet_set_route_aliveness.*route to ... has gone from up to down&quot;&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;From the mutt18 console logs (192.168.128.2@o2ib44 is mutt2):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2022-10-27 09:40:42 [577896.982939] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
2022-10-27 11:42:44 [585218.762479] LNetError: 201549:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
2022-10-27 11:42:48 [585222.858555] LNetError: 201531:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
2022-10-27 12:27:25 [587899.915999] LNetError: 201515:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
2022-10-27 12:27:48 [587923.166182] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
2022-10-27 12:44:56 [588951.269523] LNetError: 201549:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
2022-10-27 12:44:57 [588952.294216] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Original description:&lt;/p&gt;

&lt;p&gt;LNet selftest session cannot be created because &quot;lst add_group fails&quot;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lst add_group anodes mutt110@o2ib44
create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22
No nodes added successfully, deleting group anodes
Group is deleted

# lctl ping mutt110@o2ib44
12345-0@lo
12345-192.168.128.110@o2ib44

## I had already loaded the lnet_selftest module on mutt110, but the error is the same either way
# pdsh -w emutt110 lsmod | grep lnet_selftest
emutt110: lnet_selftest &#160; &#160; &#160; &#160; 270336&#160; 0
emutt110: lnet&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 704512&#160; 9 osc,ko2iblnd,obdclass,ptlrpc,lnet_selftes ,mgc,lmv,lustre
emutt110: libcfs&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 266240&#160; 13 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lnet_selftest,mgc,lov,mdc,lmv,lustre&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In addition to the lnet_selftest failure we see intermittent ping failures. So the issue is not lnet_selftest itself (as I first believed) but a more general problem.&lt;/p&gt;
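&lt;p&gt;The negative codes in these messages are kernel errno values; lst likely prints &quot;Unknown error&quot; because it hands the negative value straight to strerror(). A standalone sketch (plain Python, not part of any Lustre tool) to decode them:&lt;/p&gt;

```python
import errno
import os

# -22 and -110 seen in the lst/lctl output are negated kernel errno values;
# negating them recovers the standard names and messages.
for rc in (-22, -110):
    code = -rc
    print(f"{rc}: {errno.errorcode[code]} ({os.strerror(code)})")
```

So "Unknown error -22" is EINVAL (Invalid argument) and -110 is ETIMEDOUT, which matches the intermittent-timeout behavior described above.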

&lt;p&gt;For my tracking purposes, our local ticket is TOSS5812&lt;/p&gt;</description>
                <environment>lustre-2.15.1_7.llnl&lt;br/&gt;
4.18.0-372.26.1.1toss.t4.x86_64&lt;br/&gt;
omnipath</environment>
        <key id="72826">LU-16244</key>
            <summary>after dropping router during ior lst add_group fails with &quot;create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22&quot;, lnetctl ping fails intermittently</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="cbordage">Cyril Bordage</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Mon, 17 Oct 2022 22:33:06 +0000</created>
                <updated>Fri, 26 May 2023 21:00:19 +0000</updated>
                            <resolved>Fri, 19 May 2023 21:41:07 +0000</resolved>
                                    <version>Lustre 2.15.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="349943" author="pjones" created="Mon, 17 Oct 2022 23:48:33 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="350032" author="ofaaland" created="Tue, 18 Oct 2022 17:47:48 +0000"  >&lt;p&gt;Gian-Carlo has more information and will update the ticket.&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="350041" author="defazio" created="Tue, 18 Oct 2022 19:13:50 +0000"  >&lt;p&gt;These outputs are from using the wrapper at the end of &lt;a href=&quot;https://wiki.lustre.org/LNET_Selftest&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.lustre.org/LNET_Selftest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These tests run between a router node and a client node over an OPA network.&lt;/p&gt;

&lt;p&gt;First, a node group (of just one node) fails to add:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@mutt29:defazio1]# ./selftest.sh
LST_SESSION = 309315
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.128.29@o2ib44 are added to session
create session RPC failed on 12345-192.168.128.3@o2ib44: Unknown error -110
No nodes added successfully, deleting group lto
Group is deleted
Can&apos;t get count of nodes from lto: No such file or directory
bulk_read is running now
Capturing statistics for 10 secs Invalid nid: lto
Failed to get count of nodes from lto: Success
./selftest.sh: line 55: kill: (309323) - No such processInvalid nid: lto
Failed to get count of nodes from lto: Success
Batch is stopped
session is ended&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I also had a run where the `lto` node was added successfully, but the test still failed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@mutt29:defazio1]# ./selftest.sh
LST_SESSION = 309335
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.128.29@o2ib44 are added to session
192.168.128.3@o2ib44 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 10 secs [LNet Rates of lfrom]
[R] Avg: 20 &#160; &#160; &#160; RPC/s Min: 20 &#160; &#160; &#160; RPC/s Max: 20 &#160; &#160; &#160; RPC/s
[W] Avg: 9 &#160; &#160; &#160; &#160;RPC/s Min: 9 &#160; &#160; &#160; &#160;RPC/s Max: 9 &#160; &#160; &#160; &#160;RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 9.80 &#160; &#160; MiB/s Min: 9.80 &#160; &#160; MiB/s Max: 9.80 &#160; &#160; MiB/s&#160;
[W] Avg: 0.00 &#160; &#160; MiB/s Min: 0.00 &#160; &#160; MiB/s Max: 0.00 &#160; &#160; MiB/s&#160;
[LNet Rates of lto]
[R] Avg: 9 &#160; &#160; &#160; &#160;RPC/s Min: 9 &#160; &#160; &#160; &#160;RPC/s Max: 9 &#160; &#160; &#160; &#160;RPC/s
[W] Avg: 18 &#160; &#160; &#160; RPC/s Min: 18 &#160; &#160; &#160; RPC/s Max: 18 &#160; &#160; &#160; RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00 &#160; &#160; MiB/s Min: 0.00 &#160; &#160; MiB/s Max: 0.00 &#160; &#160; MiB/s&#160;
[W] Avg: 8.80 &#160; &#160; MiB/s Min: 8.80 &#160; &#160; MiB/s Max: 8.80 &#160; &#160; MiB/s lfrom:
12345-192.168.128.29@o2ib44: [Session 6 brw errors, 0 ping errors] [RPC: 17 errors, 0 dropped, 0 expired]
Total 1 error nodes in lfrom
lto:
RPC failure, can&apos;t show error on 12345-192.168.128.3@o2ib44
Total 1 error nodes in lto
1 batch in stopping 
...
(repeat 52 times total)
...
1 batch in stopping
Batch is stopped
session is ended&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And here&apos;s a successful test on a different cluster running the same version of Lustre:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@opal64:~]# ./selftest.sh&#160;
LST_SESSION = 284017
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.128.4@o2ib18 are added to session
192.168.128.125@o2ib18 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 10 secs [LNet Rates of lfrom]
[R] Avg: 13878 &#160; &#160;RPC/s Min: 13878 &#160; &#160;RPC/s Max: 13878 &#160; &#160;RPC/s
[W] Avg: 6938 &#160; &#160; RPC/s Min: 6938 &#160; &#160; RPC/s Max: 6938 &#160; &#160; RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 6940.33 &#160;MiB/s Min: 6940.33 &#160;MiB/s Max: 6940.33 &#160;MiB/s&#160;
[W] Avg: 1.06 &#160; &#160; MiB/s Min: 1.06 &#160; &#160; MiB/s Max: 1.06 &#160; &#160; MiB/s&#160;
[LNet Rates of lto]
[R] Avg: 6939 &#160; &#160; RPC/s Min: 6939 &#160; &#160; RPC/s Max: 6939 &#160; &#160; RPC/s
[W] Avg: 13879 &#160; &#160;RPC/s Min: 13879 &#160; &#160;RPC/s Max: 13879 &#160; &#160;RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.06 &#160; &#160; MiB/s Min: 1.06 &#160; &#160; MiB/s Max: 1.06 &#160; &#160; MiB/s&#160;
[W] Avg: 6940.73 &#160;MiB/s Min: 6940.73 &#160;MiB/s Max: 6940.73 &#160;MiB/s lfrom:
Total 0 error nodes in lfrom
lto:
Total 0 error nodes in lto
1 batch in stopping
Batch is stopped
session is ended
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="350111" author="cbordage" created="Wed, 19 Oct 2022 09:30:59 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;To find out what is going on, I would need debugging logs. For that, could you enable LNet traces with:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-sh&quot;&gt;
lctl set_param debug=+net
lctl clear
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then, run your selftest script and dump the logs with:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-sh&quot;&gt;
lctl dk &amp;gt; logfile.txt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And finally, disable logging with&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-sh&quot;&gt;
lctl set_param debug=-net
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Could you give me the values of the parameters for the selftest script, so I have the context to understand the logs?&lt;/p&gt;

&lt;p&gt;Also, could you do the same for ping and capture an error?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="351020" author="defazio" created="Thu, 27 Oct 2022 20:19:14 +0000"  >&lt;p&gt;I&apos;ve uploaded mutt2-mutt18.tar.gz which has lnet_selftests and lnetctl pings between mutt2 (router) and mutt18 (client).&lt;/p&gt;

&lt;p&gt;Note that these are not the same nodes as in the above posts; they are the nodes mentioned in the updated ticket description.&lt;/p&gt;

&lt;p&gt;The lnet_selftest wrapper used is from &lt;a href=&quot;https://wiki.lustre.org/LNET_Selftest&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.lustre.org/LNET_Selftest&lt;/a&gt; with the NIDs for mutt2 and mutt18.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#!/bin/sh
#
# Simple wrapper script for LNET Selftest
## Parameters are supplied as environment variables
# The defaults are reasonable for quick verification.
# For in-depth benchmarking, increase the time (TM)
# variable to e.g. 60 seconds, and iterate over
# concurrency to find optimal values.
#
# Reference: http://wiki.lustre.org/LNET_Selftest

# Concurrency
CN=${CN:-32}
#Size
SZ=${SZ:-1M}
# Length of time to run test (secs)
TM=${TM:-10}
# Which BRW test to run (read or write)
BRW=${BRW:-&quot;read&quot;}
# Checksum calculation (simple or full)
CKSUM=${CKSUM:-&quot;simple&quot;}

# The LST &quot;from&quot; list -- e.g. Lustre clients. Space separated list of NIDs.
# LFROM=&quot;10.10.2.21@tcp&quot;
LFROM=192.168.128.18@o2ib44
# The LST &quot;to&quot; list -- e.g. Lustre servers. Space separated list of NIDs.
# LTO=&quot;10.10.2.22@tcp&quot;
LTO=192.168.128.2@o2ib44

### End of customisation.

export LST_SESSION=$$
echo LST_SESSION = ${LST_SESSION}
lst new_session lst${BRW}
lst add_group lfrom ${LFROM}
lst add_group lto ${LTO}
lst add_batch bulk_${BRW}
lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
&#160; --concurrency=${CN} check=${CKSUM} size=${SZ}
lst run bulk_${BRW}
echo -n &quot;Capturing statistics for ${TM} secs &quot;
lst stat lfrom lto &amp;amp;
LSTPID=$!
# Delay loop with interval markers displayed every 5 secs.
# Test time is rounded up to the nearest 5 seconds.
i=1
j=$((${TM}/5))
if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
while [ $i -le $j ]; do
&#160; sleep 5
&#160; let i++
done
kill ${LSTPID} &amp;amp;&amp;amp; wait ${LSTPID} &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
echo
lst show_error lfrom lto
lst stop bulk_${BRW}
lst end_session
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
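The delay loop near the end of the wrapper above rounds the test time TM up to a whole number of 5-second sleeps. The interval count it computes is just a ceiling division, sketched here outside the script (the function name is illustrative, not from lst):

```python
def sleep_intervals(tm: int, step: int = 5) -> int:
    """Number of whole `step`-second sleeps the wrapper performs for a
    requested test time of `tm` seconds, i.e. ceil(tm / step)."""
    return -(-tm // step)  # ceiling division via negated floor division

# TM=10 fits exactly; TM=12 is rounded up to the next 5-second boundary.
print(sleep_intervals(10))  # 2
print(sleep_intervals(12))  # 3
```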
                            <comment id="351022" author="defazio" created="Thu, 27 Oct 2022 20:28:52 +0000"  >&lt;p&gt;I&apos;ve uploaded logs for the nodes involved in the stalled ior job; these logs were collected after the job was already stuck.&lt;/p&gt;

&lt;p&gt;garter_1-8_debug_with_net.gz (server)&lt;/p&gt;

&lt;p&gt;mutt_1-4_debug_with_net.gz (router)&lt;/p&gt;

&lt;p&gt;mutt_10-11_17-18_27-28_30-31_debug_with_net.gz (clients)&lt;/p&gt;</comment>
                            <comment id="351033" author="ofaaland" created="Thu, 27 Oct 2022 21:34:53 +0000"  >&lt;p&gt;mutt is running this lustre: tag 2.15.1_7.llnl, &lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="352253" author="cbordage" created="Wed, 9 Nov 2022 00:01:59 +0000"  >&lt;p&gt;Thank you for the updates with all the logs.&lt;/p&gt;

&lt;p&gt;I am not sure I understood everything that was going on&#8230; I will keep working on the logs. Would it be possible to patch your Lustre version to add more debug info?&lt;/p&gt;
                            <comment id="352681" author="ofaaland" created="Thu, 10 Nov 2022 21:44:06 +0000"  >&lt;p&gt;Hi Cyril,&lt;br/&gt;
Yes, we can patch Lustre to get more information.  If you push the patch to gerrit so it goes through the regular test cycle that would be ideal.&lt;br/&gt;
thanks&lt;/p&gt;</comment>
                            <comment id="354872" author="gerrit" created="Fri, 2 Dec 2022 00:40:49 +0000"  >&lt;p&gt;&quot;Cyril Bordage &amp;lt;cbordage@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49297&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49297&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16244&quot; title=&quot;after dropping router during ior lst add_group fails with &amp;quot;create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22&amp;quot;, lnetctl ping fails intermittently&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16244&quot;&gt;&lt;del&gt;LU-16244&lt;/del&gt;&lt;/a&gt; debug: add debug messages&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 4a22255e05d315ebfe140c929257d1e4567cf8a2&lt;/p&gt;</comment>
                            <comment id="356134" author="ofaaland" created="Tue, 13 Dec 2022 04:41:24 +0000"  >&lt;p&gt;Hi Cyril,&lt;/p&gt;

&lt;p&gt;For now, ignore &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/47330/47330_with-debug.try1.tar.bz2&quot; title=&quot;with-debug.try1.tar.bz2 attached to LU-16244&quot;&gt;with-debug.try1.tar.bz2&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; which I uploaded.&#160; I see now that following the same steps I&apos;m getting only a subset of the symptoms reported originally.&#160; The lustre servers have been updated with a newer version of Lustre since then, so I&apos;m going to take a look at the patches and see if that might make sense, and maybe revert the servers and try again.&lt;/p&gt;

&lt;p&gt;Olaf&lt;/p&gt;</comment>
                            <comment id="356592" author="defazio" created="Thu, 15 Dec 2022 20:22:18 +0000"  >&lt;p&gt;Hi Cyril,&lt;/p&gt;

&lt;p&gt;After attempting some additional IOR tests, I do see one of the same major symptoms as before. That is, the ior job starts out with normal bandwidth, then stalls: it has short periods of activity but seems stuck (transferring 0 MB/s) most of the time. The job may or may not time out in this condition. Anyways, it now looks like dropping a router did trigger the same, or at least very similar, symptoms when Olaf did his test Tuesday, so you can go ahead and look at the files Olaf mentioned in the above post &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/47330/47330_with-debug.try1.tar.bz2&quot; title=&quot;with-debug.try1.tar.bz2 attached to LU-16244&quot;&gt;with-debug.try1.tar.bz2&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;.&lt;/p&gt;</comment>
                            <comment id="357061" author="cbordage" created="Wed, 21 Dec 2022 01:08:50 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;I analyzed the new logs but was not able to reach a conclusion. The behavior on the router is different from before (not the same error). Unfortunately, the router does not seem to have had NET debug enabled.&lt;br/&gt;
Could you run your tests with NET debugging enabled on all peers and with the following patch applied: &lt;a href=&quot;https://review.whamcloud.com/#/c/fs/lustre-release/+/47583/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/fs/lustre-release/+/47583/&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="369833" author="ofaaland" created="Wed, 19 Apr 2023 00:14:37 +0000"  >&lt;p&gt;Cyril, adding patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50214/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50214/&lt;/a&gt; from ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; to our 2.15 patch stack seems to have fixed this issue based on early testing.  We have more testing in mind and will update the ticket when that&apos;s done.&lt;/p&gt;</comment>
                            <comment id="369866" author="cbordage" created="Wed, 19 Apr 2023 08:13:11 +0000"  >&lt;p&gt;Hello Olaf,&lt;/p&gt;

&lt;p&gt;thank you for the update.&lt;/p&gt;

&lt;p&gt;The unknown in the bug (here and in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt;) is what triggers it and why only on OPA. Do you have a guess on that?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="369928" author="ofaaland" created="Wed, 19 Apr 2023 16:27:24 +0000"  >&lt;p&gt;&amp;gt; The unknown in the bug (here and in&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt;) is what triggers it and why only on OPA. Do you have guess on that?&lt;/p&gt;

&lt;p&gt;Unfortunately, no.&#160; But we&apos;re continuing to try and understand, and we&apos;ll share anything we learn.&#160; Thanks&lt;/p&gt;</comment>
                            <comment id="370128" author="pjones" created="Fri, 21 Apr 2023 13:58:47 +0000"  >&lt;p&gt;Given that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; &#160;is already merged to b2_15 for the upcoming 2.15.3 release is there anything further to track here or can we close this ticket out?&lt;/p&gt;</comment>
                            <comment id="370157" author="ofaaland" created="Fri, 21 Apr 2023 16:40:07 +0000"  >&lt;p&gt;Peter, we&apos;re going to do a little more testing starting some time next week to satisfy ourselves that this is really fixed by patch 50214 and our successful tests weren&apos;t just &quot;bad luck&quot;.&#160; Once that&apos;s done, then I&apos;ll update the ticket.&lt;/p&gt;</comment>
                            <comment id="370161" author="pjones" created="Fri, 21 Apr 2023 16:53:41 +0000"  >&lt;p&gt;Sounds good - thanks Olaf&lt;/p&gt;</comment>
                            <comment id="372210" author="pjones" created="Sat, 13 May 2023 14:39:33 +0000"  >&lt;p&gt;Hey Olaf&lt;/p&gt;

&lt;p&gt;Just checking in to see whether you&apos;ve managed to complete sufficient testing to allow us to close out this ticket?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="373003" author="ofaaland" created="Fri, 19 May 2023 21:31:34 +0000"  >&lt;p&gt;Hi Peter, yes you can close this.&#160; We do not reproduce the issue with this patch.&lt;/p&gt;</comment>
                            <comment id="373004" author="pjones" created="Fri, 19 May 2023 21:41:07 +0000"  >&lt;p&gt;Very good - thanks Olaf!&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="73420">LU-16349</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="46233" name="garter_1-8_debug_with_net.gz" size="29721166" author="defazio" created="Thu, 27 Oct 2022 20:27:35 +0000"/>
                            <attachment id="46232" name="mutt2-mutt18.tar.gz" size="5172035" author="defazio" created="Thu, 27 Oct 2022 20:15:11 +0000"/>
                            <attachment id="46234" name="mutt_1-4_debug_with_net.gz" size="15942476" author="defazio" created="Thu, 27 Oct 2022 20:27:20 +0000"/>
                            <attachment id="46235" name="mutt_10-11_17-18_27-28_30-31_debug_with_net.gz" size="5362894" author="defazio" created="Thu, 27 Oct 2022 20:27:08 +0000"/>
                            <attachment id="47330" name="with-debug.try1.tar.bz2" size="69803460" author="ofaaland" created="Tue, 13 Dec 2022 03:42:40 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0331z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>