<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:02:51 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10] Client stuck in ptlrpc_invalidate_import() after eviction</title>
                <link>https://jira.whamcloud.com/browse/LU-10</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We had a client get stuck in ptlrpc_invalidate_import() after it was evicted.  Info will be limited since it was on the secure network.&lt;/p&gt;

&lt;p&gt;On the console, the client is printing this every ten minutes:&lt;/p&gt;

&lt;p&gt;  ptlrpc_invalidate_import()) ls3-OST01c4_UUID: rc = -110 waiting for callback&lt;br/&gt;
(1 != 0)&lt;br/&gt;
  ptlrpc_invalidate_import()) Skipped 5 previous similar messages&lt;br/&gt;
  ptlrpc_invalidate_import()) @@@ still on sending list  req@&amp;lt;hex&amp;gt; x&amp;lt;xid&amp;gt;/t0&lt;br/&gt;
o4-&amp;gt;ls3-OST01c4_UUID@&amp;lt;ip&amp;gt;@tcp:6/4 len 448/608 e 5 to 1 dl &amp;lt;time&amp;gt; ref 2 fl&lt;br/&gt;
Unregistering:ES/0/0 rc -4/0&lt;br/&gt;
  ptlrpc_invalidate_import()) Skipped 5 previous similar messages&lt;br/&gt;
  ptlrpc_invalidate_import()) ls3-OST01c4_UUID: RPCs in &quot;Unregistering&quot; phase&lt;br/&gt;
found (1). Network is sluggish? Waiting them to error out.&lt;br/&gt;
  ptlrpc_invalidate_import()) Skipped 5 previous similar messages&lt;/p&gt;

&lt;p&gt;and it is the ll_imp_inval thread that appears to be looping indefinitely (it was printing that for well over a month before I was alerted to the problem).&lt;/p&gt;

&lt;p&gt;The thread &quot;ldlm_bl_11&quot; was stuck in sync_page(), with the following backtrace:&lt;/p&gt;

&lt;p&gt;schedule&lt;br/&gt;
io_schedule&lt;br/&gt;
sync_page&lt;br/&gt;
__wait_on_bit_lock&lt;br/&gt;
__lock_page&lt;br/&gt;
ll_page_removal_cb&lt;br/&gt;
cache_remove_lock&lt;br/&gt;
lock_handle_addref&lt;br/&gt;
class_handle2object&lt;br/&gt;
ldlm_cli_cancel_local&lt;br/&gt;
ldlm_cli_cancel&lt;br/&gt;
osc_extent_blocking_cb&lt;br/&gt;
ldlm_handle_bl_callback&lt;br/&gt;
ldlm_bl_thread_main&lt;/p&gt;

&lt;p&gt;Whether that is symptom or cause for the hung import invalidate, I do not know.&lt;/p&gt;</description>
                <environment>Lustre 1.8.3.0-5chaos</environment>
        <key id="10085">LU-10</key>
            <summary>Client stuck in ptlrpc_invalidate_import() after eviction</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                    </labels>
                <created>Tue, 26 Oct 2010 15:06:18 +0000</created>
                <updated>Tue, 28 Jun 2011 15:01:40 +0000</updated>
                            <resolved>Mon, 13 Jun 2011 14:35:49 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="10082" author="samb" created="Tue, 26 Oct 2010 15:30:04 +0000"  >&lt;p&gt;FYI, the problem is being actively looked at now.&lt;/p&gt;</comment>
                            <comment id="10084" author="rread" created="Wed, 27 Oct 2010 00:37:13 +0000"  >&lt;p&gt;Lai, please look into this.&lt;/p&gt;


&lt;p&gt;Chris, where can we get the source tree for the version being used in production?&lt;/p&gt;</comment>
                            <comment id="10086" author="laisiyao" created="Wed, 27 Oct 2010 05:24:38 +0000"  >&lt;p&gt;Chris, could you get the backtrace of all processes on that machine? I want to know which process may have locked the page to be removed by ldlm_bl_11.&lt;/p&gt;</comment>
                            <comment id="10092" author="morrone" created="Wed, 27 Oct 2010 13:17:11 +0000"  >&lt;p&gt;The source is available here:&lt;/p&gt;

&lt;p&gt;  &lt;a href=&quot;http://github.com/morrone/lustre&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://github.com/morrone/lustre&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I typed in the only backtrace that was interesting.  No other processes have a backtrace that explains why the lock is held.  Everything else was pretty much in normal idle state.&lt;/p&gt;</comment>
                            <comment id="10093" author="rread" created="Wed, 27 Oct 2010 13:34:20 +0000"  >&lt;p&gt;I suspect ldlm_bl_11 is waiting for the same rpc that the invalidate thread is waiting for, so this is probably a symptom. &lt;/p&gt;</comment>
                            <comment id="10115" author="laisiyao" created="Thu, 28 Oct 2010 09:20:50 +0000"  >&lt;p&gt;What I can tell from the messages is:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;import is still in EVICTED state.&lt;/li&gt;
	&lt;li&gt;the req in sending_list is an OST_WRITE. Is it always in RQ_PHASE_UNREGISTERING? If so, it means ptlrpc_bulk_desc-&amp;gt;bd_network_rw is 1.&lt;/li&gt;
	&lt;li&gt;ldlm_bl_11 is stuck in sync_page() because the above request is not complete, and page is still locked by that request.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So, as the log message suggests (&quot;Network is sluggish?&quot;), I will keep checking the code to find out why ptlrpc_bulk_desc-&amp;gt;bd_network_rw is 1 under this condition.&lt;/p&gt;</comment>
                            <comment id="10123" author="laisiyao" created="Sun, 31 Oct 2010 19:46:49 +0000"  >&lt;p&gt;I believe this is the same bug as &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=21760&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/show_bug.cgi?id=21760&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And &lt;a href=&quot;https://bugzilla.lustre.org/attachment.cgi?id=30963&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/attachment.cgi?id=30963&lt;/a&gt; is the patch; it appears to work, but has not landed yet.&lt;/p&gt;</comment>
                            <comment id="10135" author="rread" created="Tue, 2 Nov 2010 14:17:59 +0000"  >&lt;p&gt;Lai, that patch has been backed out of the tree (that&apos;s what the - flags are for). However, the new attachment looks promising: &lt;a href=&quot;https://bugzilla.lustre.org/attachment.cgi?id=32032&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.lustre.org/attachment.cgi?id=32032&lt;/a&gt;  &lt;/p&gt;</comment>
                            <comment id="10137" author="laisiyao" created="Tue, 2 Nov 2010 19:21:10 +0000"  >&lt;p&gt;Robert, though the original patch is reverted, I think it&apos;s correct, and Dmitry will continue discussing with Johann. As for the patch you mentioned, it&apos;s needless and has been discarded according to the latest update on bugzilla.&lt;/p&gt;</comment>
                            <comment id="10164" author="dferber" created="Tue, 9 Nov 2010 16:08:25 +0000"  >&lt;p&gt;Lai, can you post your test results and any other thoughts to the BZ bug, as that would help Dmitry, Oleg, and Cory, and maybe note in this bug that they&apos;ve been posted there? Do you still think, as Dmitry does, that the patch in the bug will fix this problem?&lt;/p&gt;</comment>
                            <comment id="10167" author="laisiyao" created="Wed, 10 Nov 2010 06:13:58 +0000"  >&lt;p&gt;I think the root cause of this bug is not that we forget to unregister the bulk, but that the reply&lt;br/&gt;
unregistering and bulk unregistering phases are mixed together. Dmitry&apos;s patch may cause the bulk to be unregistered&lt;br/&gt;
mistakenly (see the code near after_reply() in ptlrpc_check_set(): it only wants to unregister the reply,&lt;br/&gt;
not the bulk).&lt;/p&gt;

&lt;p&gt;This patch &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/10033/10033_b_21760.diff&quot; title=&quot;b_21760.diff attached to LU-10&quot;&gt;b_21760.diff&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; adds a new phase, REQ_PHASE_BULK_UNREGISTERING. A request in REQ_PHASE_UNREGISTERING&lt;br/&gt;
will wait only for the reply to be unregistered, while a request in REQ_PHASE_BULK_UNREGISTERING waits for the bulk&lt;br/&gt;
to be unregistered.&lt;/p&gt;</comment>
                            <comment id="10176" author="dferber" created="Fri, 12 Nov 2010 14:32:12 +0000"  >&lt;p&gt;Lai, are you ready for Chris to test your attached patch?&lt;/p&gt;</comment>
                            <comment id="10178" author="laisiyao" created="Fri, 12 Nov 2010 17:21:10 +0000"  >&lt;p&gt;Yes, this patch should be able to fix the symptom listed above. Bug 21760 may involve other bugs as well; I will continue looking into that.&lt;/p&gt;</comment>
                            <comment id="10181" author="laisiyao" created="Tue, 16 Nov 2010 17:16:37 +0000"  >&lt;p&gt;This patch has a problem handling expired requests, and Johann thinks it&apos;s too big a change and maybe too intrusive; he will propose a patch later.&lt;/p&gt;</comment>
                            <comment id="10184" author="laisiyao" created="Wed, 17 Nov 2010 06:00:13 +0000"  >&lt;p&gt;The patch I proposed will cause problems upon network errors, and Johann said he will provide a less intrusive patch. It&apos;s better to wait for Johann&apos;s fix and then start testing.&lt;/p&gt;</comment>
                            <comment id="10194" author="laisiyao" created="Fri, 19 Nov 2010 01:35:33 +0000"  >&lt;p&gt;Johann provided a &lt;a href=&quot;https://bugzilla.lustre.org/attachment.cgi?id=32248&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;patch&lt;/a&gt;. I think it may be incomplete (he said he will rethink it), but it can fix the symptom described above. So it&apos;s okay to start testing with Johann&apos;s patch now.&lt;/p&gt;</comment>
                            <comment id="10265" author="morrone" created="Wed, 1 Dec 2010 14:15:13 +0000"  >&lt;p&gt;Johann landed the &lt;a href=&quot;https://bugzilla.lustre.org/attachment.cgi?id=32248&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;patch&lt;/a&gt; on b1_8 for 1.8.6.  I will pull it into the llnl branch.&lt;/p&gt;</comment>
                            <comment id="10501" author="laisiyao" created="Thu, 27 Jan 2011 23:34:15 +0000"  >&lt;p&gt;Hi Chris, did you see this failure again after landing? If not, can we close this issue?&lt;/p&gt;</comment>
                            <comment id="10503" author="morrone" created="Fri, 28 Jan 2011 11:28:24 +0000"  >&lt;p&gt;It was landed, but the code hasn&apos;t made it onto production clusters yet.  It is rolling out with a release now.&lt;/p&gt;</comment>
                            <comment id="16098" author="pjones" created="Mon, 13 Jun 2011 14:35:49 +0000"  >&lt;p&gt;This has been running in production for a while, so I think that it is safe to mark it as resolved. Please reopen if this reoccurs.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10033" name="b_21760.diff" size="5739" author="laisiyao" created="Wed, 10 Nov 2010 06:13:58 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw11r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10250</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10023"><![CDATA[4]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>