<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:31:10 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3123] A client could not communicate with an OSS due to Timed out RDMA</title>
                <link>https://jira.whamcloud.com/browse/LU-3123</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;One of the clients could not communicate with one of the OSSs due to a &quot;Timed out RDMA&quot; error.  &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;kernel: LustreError: 9948:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 2 seconds
kernel: LustreError: 9948:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.8.140@o2ib (32)
kernel: LustreError: 9948:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8104e84f4000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the server side, &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;kernel: LustreError: 19792:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 3 seconds
kernel: LustreError: 19792:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.10.84@o2ib (18)
kernel: LustreError: 19789:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8101b15d4000
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8103505f2b80
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103505f2b80
kernel: LustreError: 21321:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(16384)  req@ffff8101419a3400 x1427389785285422/t0 o4-&amp;gt;bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 448/416 e 0 to 0 dl 1364114285 ref 1 fl Interpret:/0/0 rc 0/0
kernel: Lustre: 21321:0:(ost_handler.c:1224:ost_brw_write()) share3-OST000e: ignoring bulk IO comm error with bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID id 12345-172.26.10.84@o2ib - client will retry
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000f: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
kernel: Lustre: 20014:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000c: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19981:0:(ldlm_lib.c:874:target_handle_connect()) share3-OST000d: refuse reconnection from bfe63770-9dc7-fabc-fd87-625dec42ca0c@172.26.10.84@o2ib to 0xffff81040d190c00; still busy with 1 active RPCs
kernel: LustreError: 19981:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810212b87000 x1427389785285592/t0 o8-&amp;gt;bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 368/264 e 0 to 0 dl 1364114377 ref 1 fl Interpret:/0/0 rc -16/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Ping (IPoIB) and ibping were okay, but &quot;lctl ping&quot; to this OSS failed.  &lt;br/&gt;
&quot;lctl ping&quot; to the other OSSs was okay.  &lt;/p&gt;

&lt;p&gt;This issue was finally resolved by rebooting the client.  &lt;/p&gt;

&lt;p&gt;Would you please check whether this is a network issue or a problem on the Lustre side?  &lt;/p&gt;

&lt;p&gt;Attached are the messages and a debug log collected from the client and the OSS.  &lt;/p&gt;

&lt;p&gt;Regards, &lt;/p&gt;</description>
                <environment></environment>
        <key id="18277">LU-3123</key>
            <summary>A client could not communicate with an OSS due to Timed out RDMA</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="mnishizawa">Mitsuhiro Nishizawa</reporter>
                        <labels>
                    </labels>
                <created>Mon, 8 Apr 2013 09:31:50 +0000</created>
                <updated>Wed, 2 Apr 2014 11:21:22 +0000</updated>
                            <resolved>Wed, 2 Apr 2014 11:21:22 +0000</resolved>
                                    <version>Lustre 1.8.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="55716" author="mnishizawa" created="Mon, 8 Apr 2013 09:34:28 +0000"  >&lt;p&gt;A core dump was captured when this client was rebooted.  &lt;br/&gt;
As the core dump itself is too large, here is simple &quot;crash&quot; command output.  &lt;/p&gt;</comment>
                            <comment id="55726" author="bfaccini" created="Mon, 8 Apr 2013 12:30:31 +0000"  >&lt;p&gt;What is the network/IB topology you use, at least where these clients and OSSs are connected? Also, did you run any pure IB troubleshooting tool to see whether any of the error/stats counters increment on the involved HCAs/switches/boards in the fabric?&lt;/p&gt;
</comment>
                            <comment id="55746" author="bfaccini" created="Mon, 8 Apr 2013 15:14:37 +0000"  >&lt;p&gt;The attached crash-dump extracts show no hung/spinning thread that might indicate an LNET dysfunction, and the attached client+OSS dmesg/Lustre debug logs indicate flaky communications between the OSS and the client, causing multiple recurring reconnections and recoveries since at least Sun Mar 24 03:39:09 PDT 2013, as far as I can tell from the provided logs/infos.&lt;/p&gt;

&lt;p&gt;So we definitely need you to also check whether any problems/errors are reported by your various IB fabric elements. BTW, the OSS reports network errors with at least 2 clients (172.26.10.84@o2ib, 172.26.12.181@o2ib) during the same period of time.&lt;/p&gt;
</comment>
                            <comment id="57222" author="bfaccini" created="Mon, 29 Apr 2013 09:49:01 +0000"  >&lt;p&gt;Hello Mitsuhiro,&lt;br/&gt;
Have you been able to troubleshoot your IB fabric as I suggested in my last update?&lt;br/&gt;
Does the problem/situation still occur?&lt;/p&gt;</comment>
                            <comment id="57283" author="mnishizawa" created="Tue, 30 Apr 2013 00:29:11 +0000"  >&lt;p&gt;Bruno, thanks for your comment.  The same kind of problem occurred again, and our customer has replaced the cables.  They are currently watching whether it re-occurs.  Please proceed to close this ticket at this time.  Thank you.  &lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="12487" name="client_oss_log.tar.gz" size="237" author="mnishizawa" created="Mon, 8 Apr 2013 09:31:50 +0000"/>
                            <attachment id="12488" name="crash.log" size="565435" author="mnishizawa" created="Mon, 8 Apr 2013 09:34:28 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvnav:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7585</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>