<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:26:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16349] Excessive number of OPA disconnects / LNET network errors in cluster</title>
                <link>https://jira.whamcloud.com/browse/LU-16349</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We find massive unresponsiveness of the Lustre on many clients. Sometimes there are temporary stalls (several minutes) which go away eventually, sometimes only rebooting the client helps.&lt;/p&gt;

&lt;p&gt;We suspected OPA first, but couldn&apos;t find any problems with RDMA when used otherwise (e.g. MPI).&lt;/p&gt;

&lt;p&gt;The problem has been ongoing for a long time and is completely mysterious to us.&lt;/p&gt;

&lt;p&gt;Typically when the issue appears, kernel messages of the kind shown below appear.&lt;/p&gt;

&lt;p&gt;OPA counters do not show any errors and according to the customer they don&apos;t see network problems for compute (MPI, etc.)&lt;/p&gt;

&lt;p&gt;It doesn&apos;t seem to make a meaningful difference which versions of Lustre 2.12.x are installed on servers/clients. Apparently the fix from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; is not sufficient.&lt;/p&gt;

&lt;p&gt;What can we do to resolve the problem?&lt;/p&gt;

&lt;p&gt;Server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
[3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff9115bbdb2050 x1739338816146624/t0(0) o4-&amp;gt;2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
[3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
...
[3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
[3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
[3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
[3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
...
[3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
[3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff91154f9b4850 x1739338816563968/t0(0) o4-&amp;gt;2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
[3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
[3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
[3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Client:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
[5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251]  req@00000000d70c7694 x1739323570965888/t0(0) o3-&amp;gt;work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
[5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
[5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
[5454253.635574] Lustre: Skipped 36 previous similar messages
[5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
[5454253.652508] Lustre: Skipped 37 previous similar messages
[5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
[5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
[5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
[5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
[5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment>Server: Lustre 2.12.6 or 2.12.9 on official CentOS 7 Lustre kernels for the respective versions&lt;br/&gt;
Client: CentOS 8 with various kernels and various Lustre 2.12.x versions (including 2.12.8 and 2.12.9)</environment>
        <key id="73420">LU-16349</key>
            <summary>Excessive number of OPA disconnects / LNET network errors in cluster</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="cbordage">Cyril Bordage</assignee>
                                    <reporter username="omangold">Oliver Mangold</reporter>
                        <labels>
                    </labels>
                <created>Tue, 29 Nov 2022 07:26:01 +0000</created>
                <updated>Mon, 1 May 2023 23:14:06 +0000</updated>
                            <resolved>Tue, 14 Feb 2023 16:10:48 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                    <fixVersion>Lustre 2.15.3</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="358503" author="cbordage" created="Tue, 10 Jan 2023 21:26:17 +0000"  >&lt;p&gt;%% Just a copy of what Holger reported %%&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I will share a theory from Cornelis here, which is not yet confirmed.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;If you are willing, I have a failure theory I&#8217;d like you to test.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;First, the theory:&lt;/p&gt;

&lt;p&gt;-----------------------&lt;/p&gt;

&lt;p&gt;Background:&#160; When an lnet pool mr is created, rdmavt assigns a key to the mr.&#160; When lnet maps a mr, it (1) creates an invalidate wr with the current mr key then increments (changes) the key in the mr, (2) creates a wr that does a fast memory register with the updated key.&#160; A fast memory register will change the key rdmavt remembers for the mr.&#160; The intent is as follows: On first use, the two created wrs are pre-pended to the operation using the mr (e.g. a read or write).&#160; The two wrs, together, will invalidate the original mr and fast register the mr with a new key.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Failure scenario: A pool mr is mapped, then unmapped without being used.&#160; If this is ever done, the keys in the mr and rdmavt for the memory region are forever out of sync because the invalidate and re-register never occurred for rdmavt but the key increment in the mr &lt;b&gt;did&lt;/b&gt; occur.&#160; The next time the mr is mapped, the created invalidate will have a key that does not match rdmavt&apos;s version of the key.&lt;/p&gt;
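To make the bookkeeping in this theory concrete, here is a minimal Python model of it. All names here (MrModel, map, post, mr_key, rdmavt_key) are illustrative inventions, not the real kernel structures; the point is only to show why a single map-unmap without a post poisons every later post:

```python
# Hypothetical model of the failure theory above. o2iblnd increments its
# copy of the mr key on every map; rdmavt's copy only catches up when the
# invalidate and fast-register work requests are actually posted.

EINVAL = 22  # the -22 later seen from ib_post_send() in kiblnd_post_tx_locked()

class MrModel:
    def __init__(self):
        self.mr_key = 0          # key as o2iblnd/lnet sees it
        self.rdmavt_key = 0      # key as rdmavt remembers it
        self.invalidate_key = 0  # key carried by the pending invalidate wr

    def map(self):
        # (1) invalidate wr is built with the current key, then the key is
        # incremented for the fast-register wr; rdmavt is untouched so far
        self.invalidate_key = self.mr_key
        self.mr_key += 1

    def post(self):
        # posting fails if the invalidate wr's key no longer matches rdmavt
        if self.invalidate_key != self.rdmavt_key:
            return -EINVAL
        self.rdmavt_key = self.mr_key  # fast register syncs rdmavt's copy
        return 0

mr = MrModel()
mr.map(); assert mr.post() == 0  # normal map, post, unmap cycle

mr.map()                         # mapped, then unmapped without being used

mr.map()
print(mr.post())                 # -22: the keys are now out of sync
```

Running the sketch prints -22 on the post after the unused map-unmap, and in this model every later post fails the same way, matching the persistent failures described above.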

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Second, the test:&lt;/p&gt;

&lt;p&gt;-----------------------&lt;/p&gt;

&lt;p&gt;Proposed test plan for my theory:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Add debugging:&lt;/li&gt;
	&lt;li&gt;kiblnd_fmr_pool_unmap() :&#160; Just before where fmr_frd is set to false, check its value.&#160; If it is already false, then the mr was mapped and now unmapped without being used.&#160; This mr has entered into the suspect error condition.&#160; Print the mr pointer, frd-&amp;gt;frd_mr, for comparison purposes.&lt;/li&gt;
	&lt;li&gt;kiblnd_post_tx_locked():&#160; If the call to ib_post_send() fails and frd was used, print out the value of the failing mr address, frd-&amp;gt;frd_mr, for comparison purposes.&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;Test steps:&lt;/li&gt;
	&lt;li&gt;On the Lustre client, unload and reload o2iblnd to make sure all pool mrs are good.&lt;/li&gt;
	&lt;li&gt;Add debugging as described above.&#160; One of: (a) compiled into o2iblnd, (b) kprobe, (c) ktrace, (d) systemtap&lt;/li&gt;
	&lt;li&gt;Recreate the failure in kiblnd_post_tx_locked(): ib_post_send() returns an error, -22.&lt;/li&gt;
	&lt;li&gt;Compare the mr address printed in the kiblnd_post_tx_locked() debugging with any mr addresses that were printed in kiblnd_fmr_pool_unmap() due to not being used after being mapped.&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;Notes:&lt;/li&gt;
	&lt;li&gt;If there are any prints from kiblnd_fmr_pool_unmap(), then, IMO, there is a problem.&lt;/li&gt;
	&lt;li&gt;It would be best if the debugging for rdmavt also be used for additional failure key information.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I have attached a patch for o2iblnd that adds debugging prints.&#160; The code is uncompiled and untested.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;-Dean&lt;/p&gt;</comment>
                            <comment id="358504" author="cbordage" created="Tue, 10 Jan 2023 21:31:29 +0000"  >&lt;p&gt;Could someone provide the proposed patch to check it?&lt;/p&gt;</comment>
                            <comment id="358516" author="JIRAUSER18439" created="Tue, 10 Jan 2023 22:09:42 +0000"  >&lt;p&gt;I have attached the debug patch I sent to Holger.&#160; The file is no-post.patch.gz.&lt;/p&gt;</comment>
                            <comment id="359383" author="JIRAUSER18439" created="Tue, 17 Jan 2023 21:11:55 +0000"  >&lt;p&gt;I am unable to force a map-unmap sequence using a 2-node setup and lnetctl.&#160; The result was a kernel BUG.&#160; See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16484&quot; title=&quot;Linux kernel BUG when deleting and adding a peer and using a filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16484&quot;&gt;LU-16484&lt;/a&gt; for details.&lt;/p&gt;

&lt;p&gt;I have added more o2iblnd debugging, including a patch that will perform a pool mr map-unmap after a given number of uses.&lt;/p&gt;

&lt;p&gt;Here is a mr being forced to map-unmap without being posted:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_map : mr 0xffff948ecf043800, rkey 0x00646500, is_rx 1, count 100
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_map_tx: forcing map-unmap on mr 0xffff948ecf043800
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800, rkey 0x00646501, posted 0
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800 not posted &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Line 4 is the debug line that I sent to NEC. The other lines are extra debug lines I added locally to print more information. Line 1 shows that this is the 100th pool mr map - a call to both kiblnd_fmr_map_tx() and kiblnd_fmr_pool_map(). Line 2 is the commentary that I am forcing the map-unmap condition on a pool mr. On line 3, note that the rkey is incremented from its original value in line 1. This is the issue that occurs when the mr is unmapped without being posted - the mr key is changed in o2iblnd, but rdmavt still holds the original key.&lt;/p&gt;

&lt;p&gt;I then had to wait for the same pool mr to be re-used. Here is what happened on re-use:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_map : mr 0xffff948ecf043800, rkey 0x00646501, is_rx 1, count 1124
Jan 12 10:51:11 opahsx121 kernel: kiblnd_post_tx_locked: mr 0xffff948ecf043800, posted 0, err -22, send failure
Jan 12 10:51:11 opahsx121 kernel: LNetError: 10773:0:(o2iblnd_cb.c:1054:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.72.174@o2ib
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800, rkey 0x00646502, posted 0
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800 not posted
Jan 12 10:51:11 opahsx121 kernel: LustreError: 10773:0:(events.c:205:client_bulk_callback()) event type 2, status -22, desc 0000000010390fca
Jan 12 10:51:11 opahsx121 kernel: Lustre: 13263:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1673538671/real 1673538671] req@00000000b9ba836a x1754831869785472/t0(0) o37-&amp;gt;temp-MDT0000-mdc-ffff948000960000@192.168.72.174@o2ib:23/10 lens 448/440 e 0 to 1 dl 1673538678 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jan 12 10:51:11 opahsx121 kernel: Lustre: temp-MDT0000-mdc-ffff948000960000: Connection to temp-MDT0000 (at 192.168.72.174@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete
Jan 12 10:51:11 opahsx121 kernel: Lustre: temp-MDT0000-mdc-ffff948000960000: Connection restored to 192.168.72.174@o2ib (at 192.168.72.174@o2ib) &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Lines 1, 4, and 5 are my debugging prints. On line 1 note that the count is now 1124. I had to wait 1024 pool mr uses before my victim mr was reused. Line 2 is the failure reported by NEC/ULM that was traced to a key mismatch. On line 4 note that the rkey is again incremented. In other words, it stays wrong and will continue to be wrong until the user portion of the key (8 bits) rolls over and re-matches what rdmavt expects. Line 5, again, is the debug print sent to NEC.&lt;/p&gt;
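As a quick check of the 8-bit rollover remark, the following sketch counts how many further increments are needed before the stale keys accidentally re-match. It assumes the low byte of the rkey is the user portion, as suggested by the 0x00646500, 0x00646501, 0x00646502 sequence in the logs; that layout assumption is mine, not stated in the kernel source:

```python
# rkey values taken from the log above: rdmavt is stuck on the old value,
# while the o2iblnd-side key has already been incremented once more.
rdmavt_key = 0x00646501   # stale value rdmavt still expects
mr_key = 0x00646502       # o2iblnd's copy, one increment ahead

increments = 0
while (mr_key % 256) != (rdmavt_key % 256):
    # each further map bumps only the 8-bit user byte, wrapping within it
    mr_key = (mr_key - mr_key % 256) + (mr_key + 1) % 256
    increments += 1

print(increments)  # 255 further increments before the low bytes re-match
```

So under this assumed layout, the mr stays unusable for another 255 map cycles before the user byte wraps back to the value rdmavt holds.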

&lt;p&gt;On the server, this is what I see:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jan 12 10:51:11 opahsx174 kernel: LustreError: 30188:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9d2ab07c6400
Jan 12 10:51:11 opahsx174 kernel: Lustre: temp-MDT0000: Client c4c295b0-1df0-37a4-3ada-e906e22cb42e (at 192.168.72.121@o2ib) reconnecting
Jan 12 10:51:12 opahsx174 kernel: LustreError: 11726:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff9d2ba987d050 x1754831869785472/t0(0) o37-&amp;gt;c4c295b0-1df0-37a4-3ada-e906e22cb42e@192.168.72.121@o2ib:392/0 lens 448/440 e 0 to 0 dl 1673538677 ref 1 fl Interpret:/0/0 rc 0/0
Jan 12 12:10:21 opahsx174 kernel: Lustre: MGS: Client 5188c476-a2eb-f754-b30a-890947140fdf (at 192.168.72.121@o2ib) reconnecting
Jan 12 12:10:21 opahsx174 kernel: Lustre: Skipped 1 previous similar message
Jan 12 12:10:21 opahsx174 kernel: Lustre: MGS: Connection restored to 0be70565-3123-a02c-eb27-972e46a2c761 (at 192.168.72.121@o2ib)
Jan 12 12:10:21 opahsx174 kernel: Lustre: Skipped 4 previous similar messages &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Lines 1-3 match the time of the client failure. Lines 4-7 are about 79 minutes later. I don&#8217;t know if the latter lines are related. Line 1 matches one of the server error lines reported by NEC/ULM.&lt;/p&gt;

&lt;p&gt;Summary:&lt;/p&gt;

&lt;p&gt;I think this demonstrates that there is a bug in o2iblnd when a pool mr is mapped and unmapped without a post. When this occurs, we see messages similar to those seen by NEC/ULM. What this doesn&#8217;t demonstrate is that this is actually happening at NEC/ULM. The demonstration above was forced by an explicit coding change that simulates an error condition that may or may not be possible in practice.&lt;/p&gt;
                            <comment id="359384" author="JIRAUSER18439" created="Tue, 17 Jan 2023 21:18:56 +0000"  >&lt;p&gt;The extra debugging and forced map-unmap are two new patches on top of the original patch sent to NEC/ULM and previously attached in this ticket.&#160; I have saved all as a patch series and attached that here as file o2iblnd-debug.tar.gz.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/47707/47707_o2iblnd-debug.tar.gz&quot; title=&quot;o2iblnd-debug.tar.gz attached to LU-16349&quot;&gt;o2iblnd-debug.tar.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="359452" author="cbordage" created="Wed, 18 Jan 2023 10:41:09 +0000"  >&lt;p&gt;Hello Dean,&lt;/p&gt;

&lt;p&gt;thank you for this demonstration. It is pretty clear.&lt;/p&gt;

&lt;p&gt;You are right, this shows some problem, but we cannot be sure it is the one. I do not see why this issue was happening in this machine again and again, and not in other sites. I will focus on writing a patch, and will get back to you if I cannot manage to test it with your debugging process.&lt;/p&gt;

&lt;p&gt;My concern is how to know whether the patch is responsible for avoiding new crashes. Once the patch is accepted, would it be possible to leave some clients unpatched, to see if the problem happens again on the unpatched nodes?&lt;/p&gt;</comment>
                            <comment id="359489" author="JIRAUSER18439" created="Wed, 18 Jan 2023 15:32:28 +0000"  >&lt;p&gt;I have attached a proposed fix for the issue.&#160; I have tested it and it allows o2iblnd to correctly ride through a map-unmap without a key mismatch.&#160; This change is &quot;minimal&quot; in that it changes only enough to avoid the problem.&#160; This patch sits on top of the debugging and forced map-unmap patches - there may be some hand editing needed when applying without the debug patches.&#160; The changes are simple enough that this should not be a problem.&#160; The patch file is minimal-fix.patch.gz.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/47732/47732_minimal-fix.patch.gz&quot; title=&quot;minimal-fix.patch.gz attached to LU-16349&quot;&gt;minimal-fix.patch.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="359774" author="gerrit" created="Thu, 19 Jan 2023 21:42:01 +0000"  >&lt;p&gt;&quot;Cyril Bordage &amp;lt;cbordage@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49714&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49714&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; o2iblnd: Fix key mismatch issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: eb697083c32d19dbdb0e0f38d2bbef183fc113bf&lt;/p&gt;</comment>
                            <comment id="359775" author="cbordage" created="Thu, 19 Jan 2023 21:44:06 +0000"  >&lt;p&gt;Thank you, Dean. Your patch seemed fine, so I pushed it.&lt;/p&gt;</comment>
                            <comment id="360176" author="cbordage" created="Tue, 24 Jan 2023 13:33:24 +0000"  >&lt;p&gt;Just a quick update.&lt;/p&gt;

&lt;p&gt;Waiting for the patch to be reviewed. It should be done soon. Then, I will backport it.&lt;/p&gt;</comment>
                            <comment id="360622" author="cbordage" created="Fri, 27 Jan 2023 09:47:55 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;could you post the details of the last bug, with all logs, here?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="361088" author="JIRAUSER18439" created="Tue, 31 Jan 2023 17:18:12 +0000"  >&lt;p&gt;At NEC&apos;s request, Ulm has been running a modified version of Lustre 2.12.9 on the client side.&#160; Known changes applied by NEC are attached to ticket as:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;o2iblnd-debug.tar.gz (includes no-post.patch.gz)
minimal-fix.patch.gz
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;NEC sent the journal log of n0802 which includes the debugging output.&#160; A review of the log file shows:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Several cases of a mr being mapped then unmapped without being posted.&#160; These same mr&apos;s were successfully used later.&lt;/li&gt;
	&lt;li&gt;There were no cases of kiblnd_post_tx_locked() returning Error -22 - a sign of the rkey mismatch.&lt;/li&gt;
	&lt;li&gt;There were other Lustre issues seen within the journal log.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;In summary, I think the Cornelis-proposed MR patch fixes the observed rkey mismatch issue.&lt;/p&gt;</comment>
                            <comment id="361092" author="JIRAUSER18439" created="Tue, 31 Jan 2023 17:35:42 +0000"  >&lt;p&gt;Ulm reported node crashes last weekend.&#160; Here is a review of the set of 6 BMC console logs sent by NEC to Cornelis.&lt;/p&gt;

&lt;p&gt;Summary:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;5 of the 6 logs show a watchdog NMI following a LNet error.&#160; The 6th log shows no fatal error - it looks like a normal start and shutdown.&lt;/li&gt;
	&lt;li&gt;All fatal NMI watchdog timeouts are preceded by the same LNet error.&lt;/li&gt;
	&lt;li&gt;4 of the 5 NMI tracebacks are the same.&#160; The call stack includes the modules ptlrpc, lnet, ko2iblnd.&lt;/li&gt;
	&lt;li&gt;1 of the 5 NMI tracebacks is slightly different.&#160; This is on a kernel workqueue thread.&#160; The call stack includes the modules ib_cm, rdma_cm, ko2iblnd.&lt;/li&gt;
	&lt;li&gt;None of the traces involve hfi1 or rdmavt.&lt;/li&gt;
	&lt;li&gt;The NMI watchdog failures indicate deadlock of some form.&lt;/li&gt;
	&lt;li&gt;My tentative conclusion: No evidence of an OPA issue at this time.&lt;/li&gt;
	&lt;li&gt;My guess, with no evidence, is that the NMI is due to a code error path that contains a circular lock dependency.&#160; The cause of the original error is unknown.&lt;/li&gt;
	&lt;li&gt;Alternative theory: There is a rare lockup that eventually triggers the LNet timeout error, then later the NMI watchdog.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Console: n0428-bmc&lt;/p&gt;

&lt;p&gt;There is an LNet error in the log then 6 seconds later a Watchdog NMI.&#160; This is followed 1 second later with a fatal NMI.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[569811.333835] LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[569830.918525] NMI watchdog: Watchdog detected hard LOCKUP on cpu 50
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then 1 second later, another NMI, this one fatal, on the same CPU.&#160;&#160; Same traceback, although obscured by the additional traceback path through the NMI.&lt;/p&gt;

&lt;p&gt;Other observations:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;None of the tracebacks have hfi or rdmavt in them&lt;/li&gt;
	&lt;li&gt;The tracebacks &lt;b&gt;do&lt;/b&gt; have functions in ib_cm, rdma_cm, and ko2iblnd&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This is running on&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[569830.918544] Workqueue: ib_cm cm_work_handler [ib_cm]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;First traceback:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[569830.918552] Call Trace:
[569830.918552]&#160; queued_write_lock_slowpath+0x75/0x80
[569830.918553]&#160; _raw_write_lock_irqsave+0x2b/0x30
[569830.918553]&#160; kiblnd_close_conn+0x1b/0x40 [ko2iblnd]
[569830.918554]&#160; kiblnd_cm_callback+0x9e0/0x2270 [ko2iblnd]
[569830.918554]&#160; cma_cm_event_handler+0x25/0xd0 [rdma_cm]
[569830.918554]&#160; cma_ib_handler+0xa7/0x2e0 [rdma_cm]
[569830.918555]&#160; cm_process_work+0x22/0xf0 [ib_cm]
[569830.918555]&#160; cm_work_handler+0xa77/0x1410 [ib_cm]
[569830.918555]&#160; ? __switch_to_asm+0x41/0x70
[569830.918556]&#160; ? __switch_to_asm+0x35/0x70
[569830.918556]&#160; ? __switch_to_asm+0x41/0x70
[569830.918556]&#160; ? __switch_to_asm+0x35/0x70
[569830.918557]&#160; ? __switch_to_asm+0x41/0x70
[569830.918557]&#160; ? __switch_to_asm+0x35/0x70
[569830.918557]&#160; ? __switch_to_asm+0x41/0x70
[569830.918558]&#160; ? __switch_to+0x167/0x470
[569830.918558]&#160; ? finish_task_switch+0xaa/0x2e0
[569830.918558]&#160; process_one_work+0x1a7/0x360
[569830.918559]&#160; ? create_worker+0x1a0/0x1a0
[569830.918559]&#160; worker_thread+0x30/0x390
[569830.918559]&#160; ? create_worker+0x1a0/0x1a0
[569830.918559]&#160; kthread+0x116/0x130
[569830.918560]&#160; ? kthread_flush_work_fn+0x10/0x10
[569830.918560]&#160; ret_from_fork+0x1f/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Console: n0501-bmc&lt;/p&gt;

&lt;p&gt;There is an LNet error in the log then 6 seconds later a Watchdog NMI.&#160; This is followed 1 second later with a fatal NMI.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[638125.931425] LNetError: 2377:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[638141.798616] NMI watchdog: Watchdog detected hard LOCKUP on cpu 9
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Different traceback than the first node:&lt;/li&gt;
	&lt;li&gt;no connection manager calls&lt;/li&gt;
	&lt;li&gt;not on a worker thread&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;First Traceback:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[638141.798641] Call Trace:
[638141.798641]&#160; queued_read_lock_slowpath+0x6e/0x80
[638141.798642]&#160; _raw_read_lock_irqsave+0x31/0x40
[638141.798642]&#160; kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[638141.798642]&#160; ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[638141.798643]&#160; kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[638141.798643]&#160; lnet_ni_send+0x42/0xd0 [lnet]
[638141.798643]&#160; lnet_send+0x7e/0x1b0 [lnet]
[638141.798644]&#160; LNetPut+0x2b7/0xaf0 [lnet]
[638141.798644]&#160; ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[638141.798644]&#160; ptl_send_buf+0x20b/0x560 [ptlrpc]
[638141.798645]&#160; ptl_send_rpc+0x462/0xda0 [ptlrpc]
[638141.798645]&#160; ? __switch_to_asm+0x41/0x70
[638141.798645]&#160; ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[638141.798646]&#160; ? __switch_to_asm+0x35/0x70
[638141.798646]&#160; ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[638141.798646]&#160; ? __switch_to+0x167/0x470
[638141.798647]&#160; ? finish_task_switch+0xaa/0x2e0
[638141.798647]&#160; ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[638141.798647]&#160; ptlrpcd+0x374/0x4b0 [ptlrpc]
[638141.798648]&#160; ? wake_up_q+0x80/0x80
[638141.798648]&#160; ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[638141.798648]&#160; kthread+0x116/0x130
[638141.798649]&#160; ? kthread_flush_work_fn+0x10/0x10
[638141.798649]&#160; ret_from_fork+0x1f/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Console: n0515-bmc&lt;/p&gt;

&lt;p&gt;There is an LNet error in the log then 6 seconds later a Watchdog NMI.&#160; This is followed 1 second later with a fatal NMI.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[39917.837921] LNetError: 2365:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[39930.022676] NMI watchdog: Watchdog detected hard LOCKUP on cpu 66
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;This traceback matches the second node.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;First traceback:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[39930.022703] Call Trace:
[39930.022703]&#160; queued_read_lock_slowpath+0x6e/0x80
[39930.022704]&#160; _raw_read_lock_irqsave+0x31/0x40
[39930.022704]&#160; kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[39930.022704]&#160; ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[39930.022705]&#160; kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[39930.022705]&#160; lnet_ni_send+0x42/0xd0 [lnet]
[39930.022705]&#160; lnet_send+0x7e/0x1b0 [lnet]
[39930.022706]&#160; LNetPut+0x2b7/0xaf0 [lnet]
[39930.022706]&#160; ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[39930.022706]&#160; ptl_send_buf+0x20b/0x560 [ptlrpc]
[39930.022707]&#160; ptl_send_rpc+0x462/0xda0 [ptlrpc]
[39930.022707]&#160; ? __switch_to_asm+0x41/0x70
[39930.022707]&#160; ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[39930.022708]&#160; ? __switch_to_asm+0x35/0x70
[39930.022708]&#160; ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[39930.022708]&#160; ? __switch_to+0x167/0x470
[39930.022708]&#160; ? finish_task_switch+0xaa/0x2e0
[39930.022709]&#160; ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[39930.022709]&#160; ptlrpcd+0x374/0x4b0 [ptlrpc]
[39930.022709]&#160; ? wake_up_q+0x80/0x80
[39930.022710]&#160; ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[39930.022710]&#160; kthread+0x116/0x130
[39930.022710]&#160; ? kthread_flush_work_fn+0x10/0x10
[39930.022711]&#160; ret_from_fork+0x1f/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Console: n0601-bmc&lt;/p&gt;

&lt;p&gt;There is an LNet error in the log then 24 seconds later a Watchdog NMI.&#160; This is followed 1 second later with a fatal NMI.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[576238.943801] LNetError: 2377:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[576262.045550] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12Modules linked in: mgc(OE) lustre(OE)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Same traceback as nodes #2 and #3 (this is #4).&lt;/li&gt;
	&lt;li&gt;So far, only dump 1 is different (on a worker thread, different call stack).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;First traceback:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[576262.045570] Call Trace:
[576262.045570]&#160; _raw_read_lock_irqsave+0x31/0x40
[576262.045570]&#160; kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[576262.045571]&#160; ? cfs_percpt_lock+0x16/0xf0 [libcfs]
[576262.045571]&#160; kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[576262.045571]&#160; lnet_ni_send+0x42/0xd0 [lnet]
[576262.045571]&#160; lnet_send+0x7e/0x1b0 [lnet]
[576262.045572]&#160; lnet_finalize+0x41d/0xf90 [lnet]
[576262.045572]&#160; kiblnd_tx_done+0x119/0x2f0 [ko2iblnd]
[576262.045572]&#160; ? kiblnd_handle_completion+0xc6/0x1a0 [ko2iblnd]
[576262.045573]&#160; kiblnd_handle_rx+0x38a/0x680 [ko2iblnd]
[576262.045573]&#160; kiblnd_scheduler+0x1036/0x10b0 [ko2iblnd]
[576262.045573]&#160; ? wake_up_q+0x80/0x80
[576262.045573]&#160; ? __switch_to_asm+0x41/0x70
[576262.045574]&#160; ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
[576262.045574]&#160; kthread+0x116/0x130
[576262.045574]&#160; ? kthread_flush_work_fn+0x10/0x10
[576262.045575]&#160; ret_from_fork+0x1f/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Console: n0922-bmc&lt;/p&gt;

&lt;p&gt;There is an LNet error in the log then 16 seconds later a Watchdog NMI.&#160; This is followed 1 second later with a fatal NMI.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[52223.817213] LNetError: 2353:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[52239.190601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 71
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Same traceback as nodes #2, #3, and #4 (this is #5).&lt;/li&gt;
	&lt;li&gt;So far, only dump 1 is different (on a worker thread, different call stack).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;First call trace:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[52239.190627] Call Trace:
[52239.190627]&#160; queued_read_lock_slowpath+0x6e/0x80
[52239.190628]&#160; _raw_read_lock_irqsave+0x31/0x40
[52239.190628]&#160; kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[52239.190628]&#160; ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[52239.190629]&#160; kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[52239.190629]&#160; lnet_ni_send+0x42/0xd0 [lnet]
[52239.190629]&#160; lnet_send+0x7e/0x1b0 [lnet]
[52239.190629]&#160; LNetPut+0x2b7/0xaf0 [lnet]
[52239.190630]&#160; ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[52239.190630]&#160; ptl_send_buf+0x20b/0x560 [ptlrpc]
[52239.190630]&#160; ptl_send_rpc+0x462/0xda0 [ptlrpc]
[52239.190631]&#160; ? __switch_to_asm+0x41/0x70
[52239.190631]&#160; ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[52239.190631]&#160; ? __switch_to_asm+0x35/0x70
[52239.190632]&#160; ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[52239.190632]&#160; ? __switch_to+0x167/0x470
[52239.190633]&#160; ? finish_task_switch+0xaa/0x2e0
[52239.190633]&#160; ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[52239.190633]&#160; ptlrpcd+0x374/0x4b0 [ptlrpc]
[52239.190634]&#160; ? wake_up_q+0x80/0x80
[52239.190634]&#160; ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[52239.190634]&#160; kthread+0x116/0x130
[52239.190635]&#160; ? kthread_flush_work_fn+0x10/0x10
[52239.190635]&#160; ret_from_fork+0x1f/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Console: n1618-bmc&lt;/p&gt;

&lt;p&gt;No NMI or tracebacks.&#160; I see the following:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Normal boot output.&lt;/li&gt;
	&lt;li&gt;One Lustre error, unlike the final errors seen in the other log files examined.&lt;/li&gt;
	&lt;li&gt;Repeated messages for CPU throttling due to temperature above threshold.&lt;/li&gt;
	&lt;li&gt;At the end, a controlled shutdown.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="361229" author="gerrit" created="Wed, 1 Feb 2023 16:58:04 +0000"  >&lt;p&gt;&quot;Cyril Bordage &amp;lt;cbordage@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49864&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49864&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; o2iblnd: Fix key mismatch issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 15eb62f2ef0c24a91fee049ef5d324cc40074a80&lt;/p&gt;</comment>
                            <comment id="361280" author="cbordage" created="Wed, 1 Feb 2023 21:58:42 +0000"  >&lt;p&gt;Which version of Lustre is it? I thought it was 2.12.9 with the 4 given patches.&lt;/p&gt;

&lt;p&gt;In the log, we have &quot;o2iblnd_cb.c:3442&quot;. But from the code, it should be &quot;o2iblnd_cb.c:3443&quot;.&lt;/p&gt;</comment>
                            <comment id="361318" author="omangold" created="Thu, 2 Feb 2023 08:15:56 +0000"  >&lt;p&gt;It was 2.12.9 with the patches from Dean. There were quite a few changes (debug output) before line 3442, so I think an off-by-one is plausible.&lt;/p&gt;</comment>
                            <comment id="361322" author="cbordage" created="Thu, 2 Feb 2023 09:00:11 +0000"  >&lt;p&gt;I applied all of Dean&apos;s patches, so we should have exactly the same line numbers. I wanted to be sure we are working on the same code base. In my code, the &quot;CWARN&quot; call spans lines 3440 to 3443. What lines do you have? Could you provide the patched files for me to compare?&lt;/p&gt;

&lt;p&gt;In the meantime, I will keep trying to find a path that shows a lock problem.&lt;/p&gt;</comment>
                            <comment id="361336" author="omangold" created="Thu, 2 Feb 2023 10:19:42 +0000"  >&lt;p&gt;I just attached to the ticket the patches to be applied on top of 2.12.9.&lt;/p&gt;</comment>
                            <comment id="361427" author="cbordage" created="Thu, 2 Feb 2023 21:01:30 +0000"  >&lt;p&gt;Thank you, Oliver. I did not know Serguei&apos;s patch was also deployed.&lt;/p&gt;

&lt;p&gt;Here is my analysis of the timeout errors.&lt;br/&gt;
The error message from o2iblnd_cb.c:3442 (1) should be followed by another error message from o2iblnd_cb.c:3517 (2). But there is nothing before the watchdog message...&lt;br/&gt;
Between the two messages, there are only function returns and a test of a variable. It seems the machine is blocked somehow.&lt;br/&gt;
When the code runs normally, the mutex is unlocked right away after message 2.&lt;/p&gt;

&lt;p&gt;In n0428-31-01-2023-journal.log, we can see the two messages:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z E! [agent] Error killing process: os: process already finished
Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But they are separated by 19 seconds, with an error in the middle saying that collection for telegraf took too long. That seems to confirm what we saw in the previous error.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;The rate of debug messages can be very high (more than 1000/s). We will see with the new patch whether the problem disappears.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="361470" author="gerrit" created="Fri, 3 Feb 2023 06:48:45 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49714/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49714/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; o2iblnd: Fix key mismatch issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 0c93919f1375ce16d42ea13755ca6ffcc66b9969&lt;/p&gt;</comment>
                            <comment id="361723" author="cbordage" created="Mon, 6 Feb 2023 15:56:09 +0000"  >&lt;p&gt;Hello Oliver,&lt;/p&gt;

&lt;p&gt;Was the patch deployed?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="361841" author="bschaefer" created="Tue, 7 Feb 2023 09:38:52 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;The patched client is currently being rolled out on the cluster.&lt;/p&gt;

&lt;p&gt;~250 nodes are running with the new Lustre client now.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;best regards&lt;/p&gt;

&lt;p&gt;Benedikt&lt;/p&gt;</comment>
                            <comment id="362743" author="bschaefer" created="Tue, 14 Feb 2023 14:25:36 +0000"  >&lt;p&gt;Current state:&lt;/p&gt;

&lt;p&gt;~600 clients running the patched client&lt;/p&gt;

&lt;p&gt;We still see Lustre errors on some clients, but the nodes recovered and were not stuck.&lt;/p&gt;</comment>
                            <comment id="362775" author="pjones" created="Tue, 14 Feb 2023 16:10:48 +0000"  >&lt;p&gt;The fix landed for 2.16, so I am closing the ticket. The fix should be ported to b2_15 for inclusion in a future 2.15.x release.&lt;/p&gt;</comment>
                            <comment id="364934" author="gerrit" created="Mon, 6 Mar 2023 03:01:30 +0000"  >&lt;p&gt;&quot;Xing Huang &amp;lt;hxing@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50214&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50214&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; o2iblnd: Fix key mismatch issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_15&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8811fa2cd9db66d4acdabf17e2cc93ceef5f6752&lt;/p&gt;</comment>
                            <comment id="368731" author="gerrit" created="Thu, 6 Apr 2023 23:55:27 +0000"  >&lt;p&gt;&quot;Gian-Carlo DeFazio &amp;lt;defazio1@llnl.gov&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50564&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50564&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; o2iblnd: Fix key mismatch issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_14&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 329448005cefab9fe13da33c50a51f664a8cce66&lt;/p&gt;</comment>
                            <comment id="368982" author="gerrit" created="Tue, 11 Apr 2023 00:06:52 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/50214/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/50214/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16349&quot; title=&quot;Excessive number of OPA disconnects / LNET network errors in cluster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16349&quot;&gt;&lt;del&gt;LU-16349&lt;/del&gt;&lt;/a&gt; o2iblnd: Fix key mismatch issue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_15&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: da98e1a6f31462ab76ce7c3a48c21eb4c9eda151&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="72826">LU-16244</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="74046">LU-16484</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="47732" name="minimal-fix.patch.gz" size="1660" author="luick" created="Wed, 18 Jan 2023 15:30:48 +0000"/>
                            <attachment id="47621" name="no-post.patch.gz" size="715" author="luick" created="Tue, 10 Jan 2023 22:08:49 +0000"/>
                            <attachment id="47707" name="o2iblnd-debug.tar.gz" size="1947" author="luick" created="Tue, 17 Jan 2023 21:18:48 +0000"/>
                            <attachment id="47964" name="testing-patches-230119.patch" size="29268" author="omangold" created="Thu, 2 Feb 2023 10:18:51 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i036p3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>