<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:10:58 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
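<!--
The 'field' parameter described above can also be appended programmatically. What follows is a minimal sketch, not something JIRA generates: the helper name restrict_fields and the URL in the usage note are hypothetical placeholders, and only the repeated 'field=...' query form comes from the note above.

```python
from urllib.parse import urlencode


def restrict_fields(url, fields):
    """Append one 'field=<name>' query parameter per requested field."""
    # urlencode on a list of pairs keeps repeated keys: field=key&field=summary
    query = urlencode([("field", name) for name in fields])
    # Use '&' when the URL already carries a query string, '?' otherwise.
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{query}"
```

For example, restrict_fields("https://jira.example.com/issue.xml", ["key", "summary"]) yields the placeholder URL followed by "?field=key&field=summary".
-->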
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7676] OSS Servers stuck in connecting/disconnect loop</title>
                <link>https://jira.whamcloud.com/browse/LU-7676</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Several of our OSSes have started getting into a state where they repeatedly disconnect from and reconnect with clients. Sometimes they clear up and then re-enter the same state later. Even after a reboot they enter the same state.&lt;/p&gt;

&lt;p&gt;Attaching Lustre debug dump. Please advise on what additional info is needed for debugging.&lt;/p&gt;</description>
                <environment></environment>
        <key id="34143">LU-7676</key>
            <summary>OSS Servers stuck in connecting/disconnect loop</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="mhanafi">Mahmoud Hanafi</reporter>
                        <labels>
                    </labels>
                <created>Sat, 16 Jan 2016 02:28:22 +0000</created>
                <updated>Thu, 22 Sep 2016 22:38:22 +0000</updated>
                            <resolved>Thu, 22 Sep 2016 22:38:22 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="139113" author="mhanafi" created="Sat, 16 Jan 2016 03:36:37 +0000"  >&lt;p&gt;uploaded the following files to ftp.whamcloud.com/uploads/LU7676/&lt;/p&gt;

&lt;p&gt;out.nbp2-oss20.1452914508.gz&lt;br/&gt;
out.nbp2-oss20.1452913997.gz&lt;br/&gt;
out.nbp2-oss18.1452914592.gz&lt;br/&gt;
out.nbp2-oss18.1452914592.gz&lt;br/&gt;
varlogmessages.gz&lt;/p&gt;</comment>
                            <comment id="139116" author="mhanafi" created="Sat, 16 Jan 2016 04:16:17 +0000"  >&lt;p&gt;Not sure why some of the fields were blank.&lt;br/&gt;
Affected version: 2.5.3 server.&lt;/p&gt;</comment>
                            <comment id="139117" author="green" created="Sat, 16 Jan 2016 05:49:56 +0000"  >&lt;p&gt;Hm, this is a strange message in the servers:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 15 19:13:24 nbp2-oss18 kernel: LustreError: 13505:0:(events.c:452:server_bulk_callback()) event type 5, status -103, desc ffff881b1a708000
Jan 15 19:13:24 nbp2-oss18 kernel: Lustre: nbp2-OST0075: Bulk IO read error with 9a6a5394-9d0c-107d-b924-82de647f4613 (at 10.151.27.95@o2ib), client will retry: rc -110
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So something causes those bulk transfers to get aborted. rc -110 is ETIMEDOUT (timed out), and -103 is ECONNABORTED (connection aborted).&lt;/p&gt;

&lt;p&gt;And then this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 15 19:13:24 nbp2-oss18 kernel: Lustre: 21888:0:(service.c:2050:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (192:4547s); client may timeout.  req@ffff881b110a2400 x1522760681448372/t0(0) o3-&amp;gt;9a6a5394-9d0c-107d-b924-82de647f4613@10.151.27.95@o2ib:0/0 lens 488/432 e 0 to 0 dl 1452909414 ref 1 fl Complete:/0/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So this means server threads are spending a lot of time processing this request. o3 is READ, so all those bulk timeouts are probably causing read RPCs to fail, and to take a long time doing so. When that happens, the client whose request got stuck like that will complain about server unresponsiveness and reconnect.&lt;/p&gt;

&lt;p&gt;So the root cause is somewhere in the bulk IO errors.&lt;/p&gt;

&lt;p&gt;In the log attached to this ticket we can see stuff like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;00000800:00000100:4.0F:1452910553.468033:0:2802:0:(o2iblnd_cb.c:2903:kiblnd_cm_callback()) 10.151.54.85@o2ib: UNREACHABLE -110
00000800:00000100:4.0:1452910553.563037:0:2802:0:(o2iblnd_cb.c:2903:kiblnd_cm_callback()) 10.151.0.196@o2ib: UNREACHABLE -110
00000800:00000100:4.0:1452910553.563045:0:2802:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 10.151.0.196@o2ib: connection failed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Assuming this is one of the clients, I imagine you are just having some sort of a network problem where some of the messages cannot get through?&lt;/p&gt;</comment>
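<!--
The error codes quoted in the comment above (status -103, rc -110, UNREACHABLE -110) can be decoded mechanically. What follows is a minimal sketch and not part of the original thread: it assumes the usual Linux errno numbering (103 is ECONNABORTED, 110 is ETIMEDOUT), and the names LINUX_ERRNO, CODE_PATTERN and decode_codes are hypothetical helpers introduced for illustration.

```python
import re

# Assumption: standard Linux errno numbering, limited to the two codes
# that appear in the quoted log lines.
LINUX_ERRNO = {103: "ECONNABORTED", 110: "ETIMEDOUT"}

# Matches a negative code that directly follows "status", "rc" or "UNREACHABLE".
CODE_PATTERN = re.compile(r"\b(?:status|rc|UNREACHABLE) (-\d+)")


def decode_codes(line):
    """Return (raw_code, errno_name) pairs for each negative code in a log line."""
    results = []
    for match in CODE_PATTERN.finditer(line):
        code = int(match.group(1))
        # Kernel code returns negated errno values, so flip the sign to look up.
        results.append((code, LINUX_ERRNO.get(-code, "UNKNOWN")))
    return results
```

Running decode_codes over the server_bulk_callback() line gives the ECONNABORTED pairing, and over the "client will retry" and kiblnd_cm_callback() lines gives ETIMEDOUT. Codes outside the small table are reported as UNKNOWN rather than guessed.
-->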
                            <comment id="139118" author="mhanafi" created="Sat, 16 Jan 2016 05:57:14 +0000"  >&lt;p&gt;We haven&apos;t been able to identify any network issues. As far as we can tell the network is fine.&lt;/p&gt;

&lt;p&gt;What do you make of these messages? The downward slide of the servers is preceded by these:&lt;/p&gt;

&lt;p&gt;Jan 15 19:10:09 nbp2-oss20 kernel: Lustre: 22081:0:(niobuf.c:285:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff880494ed2000&lt;br/&gt;
Jan 15 19:10:09 nbp2-oss20 kernel: Lustre: 22081:0:(niobuf.c:285:ptlrpc_abort_bulk()) Skipped 5 previous similar messages&lt;br/&gt;
Jan 15 19:10:22 nbp2-oss18 kernel: Lustre: 21874:0:(niobuf.c:285:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff881b124e4000&lt;br/&gt;
Jan 15 19:10:22 nbp2-oss18 kernel: Lustre: 21874:0:(niobuf.c:285:ptlrpc_abort_bulk()) Skipped 9 previous similar messages&lt;/p&gt;
</comment>
                            <comment id="139119" author="green" created="Sat, 16 Jan 2016 06:31:56 +0000"  >&lt;p&gt;That unexpectedly long timeout is more of the same.&lt;br/&gt;
The network, network driver, or network card is slow to unregister the buffers we are trying to unregister. Slow as in it takes over 300 seconds to unregister such buffers (this is what triggers the message).&lt;/p&gt;

&lt;p&gt;I think this is another sign of an unhealthy network/card/driver. It&apos;s not normal for a connection to a peer to fail with ETIMEDOUT (-110)/UNREACHABLE as seen in the last snippet in my previous comment.&lt;/p&gt;</comment>
                            <comment id="139121" author="bob.c" created="Sat, 16 Jan 2016 06:59:26 +0000"  >&lt;p&gt;We also see many messages like this:&lt;br/&gt;
out.nbp2-oss18.1452913951.gz.denum:&lt;br/&gt;
00000800:00000200:15.0:1452913946.993806:0:21340:0:(o2iblnd.c:1898:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting 1024 HZs for her to complete.trips = 83498830&lt;/p&gt;

&lt;p&gt;This was part of a patch generated in &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-7054&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-7054&lt;/a&gt; &lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/16470/2/lnet/klnds/o2iblnd/o2iblnd.c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/16470/2/lnet/klnds/o2iblnd/o2iblnd.c&lt;/a&gt;&lt;br/&gt;
but we still see a large number of &quot;complete.trips&quot;. I had assumed that the &quot;waiting HZs&quot; of 1024 would slow this down. Or does it simply schedule other threads if one is waiting, rather than sleeping? (That is unclear to me.) In the traces I&apos;ve looked at, I don&apos;t see any new pools being successfully created (nor the indication of how long pool creation took to complete).&lt;/p&gt;

&lt;p&gt;You must forgive me, I am grasping a little from memory. I seem to recall that there was some competition between the freeing (unregister) and pool allocation. Is it possible that something slow in the deallocation prevents new pools from being created?&lt;/p&gt;

&lt;p&gt;Also, I&apos;m not familiar with this code (and I&apos;m looking at this on my apple watch), and &lt;br/&gt;
the &quot;schedule_timeout(interval)&quot; mapped to an inline null function, so I couldn&apos;t decipher it yet.&lt;/p&gt;</comment>
                            <comment id="139123" author="bob.c" created="Sat, 16 Jan 2016 07:55:20 +0000"  >&lt;p&gt;We still have two production filesystems down. This is a critical problem.&lt;/p&gt;

&lt;p&gt;We are going to try to run jobs on the remaining filesystems, but there were issues doing this earlier, so it is risky.&lt;/p&gt;

&lt;p&gt;We are going to investigate network issues. We have found no HW problems. &lt;/p&gt;

&lt;p&gt;Assuming that it&apos;s not a network problem, do you have any suggestions as to where we should look? Debug settings? Other information we can provide to you? Mahmoud said that the uploaded traces cover the period from boot to encountering the issue.&lt;/p&gt;</comment>
                            <comment id="139133" author="ashehata" created="Sat, 16 Jan 2016 19:41:55 +0000"  >&lt;p&gt;I looked through the attached log file and I see 442 instances of connection races, which occur when two nodes are attempting to reconnect. This could result in a flurry of reconnects, which could consume memory. A prototype patch has been written to address the same issue at another site. I&apos;m in the process of porting it to NASA&apos;s branch and I&apos;ll push it later today for you to try.&lt;/p&gt;</comment>
                            <comment id="139135" author="ashehata" created="Sun, 17 Jan 2016 07:30:58 +0000"  >&lt;p&gt;Ported the patch here:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/18025/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/18025/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="139140" author="liang" created="Sun, 17 Jan 2016 14:56:14 +0000"  >&lt;p&gt;I checked my original patch, and it seems I forgot to call set_current_state() before schedule_timeout(), so the patch can&apos;t really help because the current thread wouldn&apos;t sleep. I have updated the patch uploaded by Amir (&lt;a href=&quot;http://review.whamcloud.com/#/c/16470/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/16470/&lt;/a&gt;), and I also ported it to 2_5_fe (&lt;a href=&quot;http://review.whamcloud.com/18026&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18026&lt;/a&gt;).&lt;/p&gt;</comment>
                            <comment id="139190" author="doug" created="Mon, 18 Jan 2016 17:41:48 +0000"  >&lt;p&gt;Liang: is your patch in addition to the one Amir ported or is it a replacement for it?&lt;/p&gt;</comment>
                            <comment id="139193" author="doug" created="Mon, 18 Jan 2016 18:29:48 +0000"  >&lt;p&gt;NASA: Please apply both patches being discussed here: &lt;a href=&quot;http://review.whamcloud.com/#/c/18025/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/18025/&lt;/a&gt; and &lt;a href=&quot;http://review.whamcloud.com/18026&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18026&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="139200" author="jaylan" created="Mon, 18 Jan 2016 19:54:25 +0000"  >&lt;p&gt;Patch 18026 missed a newline at line 1909.&lt;/p&gt;</comment>
                            <comment id="139203" author="jaylan" created="Mon, 18 Jan 2016 20:53:49 +0000"  >&lt;p&gt;It looks like &lt;a href=&quot;http://review.whamcloud.com/#/c/18025/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/18025/&lt;/a&gt; is a backport of patch &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt; &lt;a href=&quot;http://review.whamcloud.com/#/c/17892/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/17892/&lt;/a&gt; that I have been asking for. Thanks.&lt;/p&gt;</comment>
                            <comment id="148183" author="jaylan" created="Thu, 7 Apr 2016 21:32:01 +0000"  >&lt;p&gt;We need a b2_7_fe back port also. ATM we plan to stop running 2.7.1 until we receive the back port.&lt;/p&gt;</comment>
                            <comment id="166931" author="mhanafi" created="Thu, 22 Sep 2016 16:20:17 +0000"  >&lt;p&gt;Need to add NASA label.&lt;/p&gt;</comment>
                            <comment id="166984" author="pjones" created="Thu, 22 Sep 2016 22:38:22 +0000"  >&lt;p&gt;Actually, the fix landed under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7569&quot; title=&quot;IB leaf switch caused LNet routers to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7569&quot;&gt;&lt;del&gt;LU-7569&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="31763">LU-7054</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="33736">LU-7569</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="20147" name="out.1452910586.gz" size="231" author="mhanafi" created="Sat, 16 Jan 2016 02:28:22 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 29 Apr 2016 02:28:22 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxydr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10020"><![CDATA[1]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 18 Jan 2016 02:28:22 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>