<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:07:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-436] client refused reconnection, still busy with 1 active RPCs</title>
                <link>https://jira.whamcloud.com/browse/LU-436</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;We are intermittently seeing problems with our scratch file system where any of our 24 OSS nodes becomes congested and essentially makes the file system unusable. We are trying to nail down the rogue user code or codes that seem to trigger it, but we believe the cause is a large number of small reads or writes to an OST.&lt;/p&gt;

&lt;p&gt;Looking at dmesg we see a lot of &quot;...refused reconnection, still busy with 1 active RPCs&quot; messages, and the load on the system goes through the roof, typically with load averages greater than 400-500. Trying to do some forensics, we parsed the client nodes reported in dmesg, went to those nodes, and tried doing an lsof on the file system, which basically hung. Thinking these were good candidates, we powered them down, but it did not change the server conditions. As in the past, our only resolution was to power-cycle the OSS, which cleared everything.&lt;/p&gt;

&lt;p&gt;I was looking through bugs and this seems very similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7&quot; title=&quot;Reconnect server-&amp;gt;client connection&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7&quot;&gt;&lt;del&gt;LU-7&lt;/del&gt;&lt;/a&gt;, which Chris reported in November, with the exception that in this case we are not going through a router.&lt;/p&gt;

&lt;p&gt;I guess what I am looking for, or hoping for, is some type of diagnostic that can help determine the source of the congestion, versus the sledgehammer approach of rebooting the server and causing a wider disruption.&lt;/p&gt;
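
&lt;p&gt;For reference, the first-pass triage we have been doing by hand amounts to something like the sketch below: tally which client NIDs appear in the &quot;refused reconnection&quot; messages so the noisiest clients can be examined first. The log path and the exact message format are assumptions here; adjust for your syslog setup.&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Triage sketch: count how often each client NID appears in
# &quot;refused reconnection, still busy with N active RPCs&quot; messages.
import re
from collections import Counter

# Lustre NIDs look like 172.16.0.5@o2ib or 10.1.1.9@tcp
NID_RE = re.compile(r&quot;\d+\.\d+\.\d+\.\d+@\w+&quot;)

counts = Counter()
with open(&quot;/var/log/messages&quot;) as log:   # assumed path; a saved dmesg capture also works
    for line in log:
        if &quot;refused reconnection, still busy with&quot; in line:
            match = NID_RE.search(line)
            if match:
                counts[match.group(0)] += 1

for nid, hits in counts.most_common(10):
    print(nid, hits)
&lt;/pre&gt;

&lt;p&gt;Cross-referencing the top offenders against the batch scheduler then points at candidate user codes.&lt;/p&gt;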

&lt;p&gt;I realize this issue is very general, but I just wanted to get a dialog going. I have plenty of log data and Lustre dumps if that would be helpful.&lt;/p&gt;</description>
                <environment>Lustre 1.8.3 on LLNL chaos release 1.3.4 (2.6.18-93.2redsky_chaos). Redsky 2QoS torus IB network cluster, software raid on Oracle J4400 JBODs - RAID6 8+2 w/external journal &amp;amp; bitmap</environment>
        <key id="11202">LU-436</key>
            <summary>client refused reconnection, still busy with 1 active RPCs</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="jamervi">Joe Mervini</reporter>
                        <labels>
                    </labels>
                <created>Mon, 20 Jun 2011 19:23:23 +0000</created>
                <updated>Mon, 4 Jun 2012 05:38:04 +0000</updated>
                            <resolved>Mon, 4 Jun 2012 05:38:04 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="16645" author="pjones" created="Mon, 20 Jun 2011 22:02:24 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16664" author="m.magrys" created="Tue, 21 Jun 2011 05:10:34 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;We also observed a similar problem some time ago, so I can confirm that such a problem exists. But it&apos;s not easily reproducible; one of our users is probably causing it, but it&apos;s hard to identify which one, at least for now.&lt;/p&gt;

&lt;p&gt;Marek&lt;/p&gt;</comment>
                            <comment id="16737" author="laisiyao" created="Wed, 22 Jun 2011 03:46:55 +0000"  >&lt;p&gt;Hi Joe,&lt;/p&gt;

&lt;p&gt;Could you check out &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=22423&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;bz22423&lt;/a&gt;? Could you verify that the fix for 1.8.3 is included in your code?&lt;/p&gt;

&lt;p&gt;IMHO, with that fix the OSS won&apos;t be throttled by reconnects, though the client may still be wrongly evicted due to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7&quot; title=&quot;Reconnect server-&amp;gt;client connection&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7&quot;&gt;&lt;del&gt;LU-7&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Lai&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="16972" author="jamervi" created="Sat, 25 Jun 2011 12:01:22 +0000"  >&lt;p&gt;I can confirm that the 1.8.3 patch reference in bz22423 has been applied to our build of lustre. &lt;/p&gt;

&lt;p&gt;We had the problem reappear 3 times yesterday, which hung the file system and required server reboots to clear. We believe we have narrowed the cause down to 2 different codes and have had the users move their working directories to another Lustre file system that is &lt;em&gt;not&lt;/em&gt; using software RAID.&lt;/p&gt;

&lt;p&gt;In talking with one of the users, he characterized his IO as basically 1000 threads writing 512k chunks of data to a file. In my mind that, coupled with the overhead of software RAID, could possibly cause the overload on the server. Does that sound reasonable? Regardless of this IO pattern, I think Lustre should be able to handle this type of event more gracefully.&lt;/p&gt;
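
&lt;p&gt;For reference, the pattern as he described it boils down to something like the sketch below. This is &lt;em&gt;not&lt;/em&gt; his actual code; the path, the write count, and the striding are my assumptions.&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Hypothetical reproduction of the reported workload: many threads
# each writing 512k chunks into one shared file.
import os
import threading

CHUNK = 512 * 1024          # 512k writes, per the user&apos;s description
WRITES_PER_THREAD = 64      # arbitrary; his real runs are much longer
NUM_THREADS = 1000          # thread count, per the user&apos;s description

fd = os.open(&quot;/scratch/shared_file&quot;, os.O_CREAT | os.O_WRONLY, 0o644)
buf = b&quot;x&quot; * CHUNK

def writer(tid):
    # Each thread writes its own strided region of the shared file.
    for i in range(WRITES_PER_THREAD):
        offset = (i * NUM_THREADS + tid) * CHUNK
        os.pwrite(fd, buf, offset)

threads = [threading.Thread(target=writer, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)
&lt;/pre&gt;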

&lt;p&gt;With regard to the file system, we are running Lustre with pretty much default settings (i.e., we haven&apos;t done any real tuning beyond best practices, because the IO patterns on our clusters vary widely). The default maximum thread count on our OSS servers is 512. In the past it was suggested that we reduce that number, before it was determined that the problems were bugs that were resolved in patches, so we have never made that boot-time adjustment. But if there is a way to minimize the potential for file system hangs via boot- or run-time adjustments, a slower file system is better than an unusable one. I would just need some guidance on which tunables to adjust and how they should be set.&lt;/p&gt;</comment>
                            <comment id="16978" author="adilger" created="Mon, 27 Jun 2011 01:10:19 +0000"  >&lt;p&gt;I recall several times during testing of Snowbird OSS nodes that the optimum thread count was around 32 per OST, though it isn&apos;t possible to limit the threads to be accessing on a single OST if the striping is imbalanced.&lt;/p&gt;</comment>
                            <comment id="34938" author="cliffw" created="Tue, 17 Apr 2012 12:19:25 +0000"  >&lt;p&gt;Do we need anything further on this bug, or can this issue be closed?&lt;/p&gt;</comment>
                            <comment id="34951" author="jamervi" created="Tue, 17 Apr 2012 14:19:35 +0000"  >&lt;p&gt;Cliff - yes we should close this. We are running 1.8.5 on these particular file systems and even though we continue to see load issues, we have plans to decommission our software raid lustre file systems in the near future. &lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>hang</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvzyn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10072</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>