<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:20:58 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1934] still busy with active RPCs for days</title>
                <link>https://jira.whamcloud.com/browse/LU-1934</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;OST refuses reconnection from client with &apos;still busy with 10 active RPCs&apos;.  The OST is failing to time out the active RPCs and has been in this state for over a day.  Client continually attempts to reconnect.  We have a few client nodes in this state that can be used to investigate further.&lt;/p&gt;

&lt;p&gt;LLNL-bugzilla-ID: 1495&lt;/p&gt;</description>
                <environment>&lt;a href=&quot;https://github.com/chaos/lustre/commits/2.1.2-3chaos&quot;&gt;https://github.com/chaos/lustre/commits/2.1.2-3chaos&lt;/a&gt;</environment>
        <key id="15952">LU-1934</key>
            <summary>still busy with active RPCs for days</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="nedbass">Ned Bass</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Thu, 13 Sep 2012 14:11:18 +0000</created>
                <updated>Mon, 29 May 2017 05:40:01 +0000</updated>
                            <resolved>Mon, 29 May 2017 05:40:01 +0000</resolved>
                                                    <fixVersion>Lustre 2.4.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="44829" author="nedbass" created="Thu, 13 Sep 2012 19:08:20 +0000"  >&lt;p&gt;Grepping a stuck client NID in /proc/fs/lustre/ost/OSS/ost/req_history shows&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;1269062278:172.18.102.20@tcp:12345-172.18.110.123@tcp:x1405404304029454:296:Interpret:1347406306:-1347406306s(-1347407432s) opc 101
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I believe this is an LDLM lock enqueue.&lt;/p&gt;</comment>
                            <comment id="44833" author="nedbass" created="Thu, 13 Sep 2012 20:19:17 +0000"  >&lt;p&gt;A couple of service threads on the OSS just completed after about 48 hours and the client reconnected. I dumped all thread backtraces while they were hung, and both looked like this (hand-copied):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;? ptl_send_rpc
schedule timeout
? process_timeout
cfs_waitq_timedwait
ptlrpc_set_wait
? default_wake_function
ptlrpc_queue_wait
ldlm_server_glimpse_ast
filter_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
ldlm_handle_enqueue
? ldlm_server_completion_ast
? ost_blocking_ast
? ldlm_server_glimpse_ast
ost_handle
? lustre_msg_get_transno
ptlrpc_main
child_rip
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="44883" author="green" created="Fri, 14 Sep 2012 13:16:45 +0000"  >&lt;p&gt;When the threads completed I bet there were some messages about how they took too long, and then before completion probably something about failed glimpse AST or some such. Can you please show an example of those messages?&lt;/p&gt;</comment>
                            <comment id="44890" author="nedbass" created="Fri, 14 Sep 2012 14:16:02 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;dusk23 2012-09-13 16:47:29 ... ptlrpc_server_handle_request() @@@ Request took longer than estimated (940:172446s); client may timeout.
req@... x.../t0(0) o101-&amp;gt;...@172.18.110.123@tcp:0/0 lens 296/352 e 1 to 0 dl 134707603 ref 1 fl Complete:/0/0 rc 301/301
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I think I need to paint a more complete picture here.  There seems to have been some interaction between multiple clients and OSTs:&lt;/p&gt;

&lt;p&gt;OSS/OST: dusk23/OST0143 (where the above message was logged)&lt;br/&gt;
OSS/OST: dusk20/OST0166&lt;br/&gt;
client: graph555 (172.18.110.123 from above message)&lt;br/&gt;
client: cslic5&lt;/p&gt;

&lt;p&gt;graph555 was unable to reconnect to dusk20/OST0143 (busy with 1 active RPCs) and to dusk23/OST0166 (busy with 4 active RPCs)&lt;br/&gt;
cslic5 was unable to reconnect to dusk23/OST0166 (busy with 10 active RPCs)&lt;/p&gt;

&lt;p&gt;This had been going on for some days.&lt;/p&gt;

&lt;p&gt;At 2012-09-13 16:42:30 dusk20 was powered off by its failover partner (it seems to have locked up after I did a sysrq-t).&lt;/p&gt;

&lt;p&gt;Exactly 5 minutes later (recovery window for dusk20&apos;s failover partner) the above message was logged, and client connections were restored to all OSTs.  So it appears that the processing of cslic5&apos;s RPCs by dusk23 was blocked somehow by graph555.  I confess I don&apos;t quite understand how such a dependency comes about.  But I guess the question is, why did dusk20 fail to handle or time out the active RPC from graph555?&lt;/p&gt;</comment>
                            <comment id="44892" author="nedbass" created="Fri, 14 Sep 2012 14:31:31 +0000"  >&lt;p&gt;The req_history entry in my first comment was the active RPC from graph555 on dusk20.  I find an entry for that xid in dusk20&apos;s logs (preceded several minutes earlier by a watchdog stack trace):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2012-09-11 16:41:46 Lustre: Service thread pid 6655 was inactive for 600.00s ... dumping stack trace
Pid: 6655, comm: ll_ost_186
schedule_timeout
cfs_waitq_timedwait
ptlrpc_set_wait
ptlrpc_queue_wait
ldlm_server_glimpse_ast
filter_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
ldlm_handle_enqueue
ost_handle
ptlrpc_main
child_rip

2012-09-11 16:50:22 Lustre: 6866:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (10/-516), not sending early reply
    req@... x1405404304029454/t0(0) o101-&amp;gt;...@172.18.110.123@tcp:0/0 lens 296/352 e 4 to 0 dl 134707432 ref 2 fl Interpret:/0/0 rc 0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Later, when I dumped stack traces at 2012-09-13 16:41:46, thread ll_ost_186 had basically the exact same stack trace.&lt;/p&gt;</comment>
                            <comment id="44893" author="pjones" created="Fri, 14 Sep 2012 14:37:43 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="197434" author="adilger" created="Mon, 29 May 2017 05:40:01 +0000"  >&lt;p&gt;Client reconnection has been fixed.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="15992">LU-1949</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 4 Jul 2014 14:11:18 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv5fj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4414</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 13 Sep 2012 14:11:18 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>