<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:18:12 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8511] mdc stuck in EVICTED state</title>
                <link>https://jira.whamcloud.com/browse/LU-8511</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On our 2.8 DNE testbed, we fairly often see MDCs that get stuck in the EVICTED state.  The clients are running Lustre 2.8.0_0.0.llnlpreview.18 (see the lustre-release-fe-llnl repo).&lt;/p&gt;

&lt;p&gt;The MDC seems to be permanently stuck.  See the following example:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# cat state
current_state: EVICTED
state_history:
 - [ 1470865592, DISCONN ]
 - [ 1470865612, CONNECTING ]
 - [ 1470865667, DISCONN ]
 - [ 1470865687, CONNECTING ]
 - [ 1470865742, DISCONN ]
 - [ 1470865762, CONNECTING ]
 - [ 1470865762, DISCONN ]
 - [ 1470865771, CONNECTING ]
 - [ 1470865771, REPLAY ]
 - [ 1470865771, REPLAY_LOCKS ]
 - [ 1470865771, REPLAY_WAIT ]
 - [ 1470865831, RECOVER ]
 - [ 1470865831, FULL ]
 - [ 1470950043, DISCONN ]
 - [ 1470950043, CONNECTING ]
 - [ 1470950043, EVICTED ]
[root@opal70:lquake-MDT000a-mdc-ffff88201e65e000]# date +%s
1471481367
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
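
&lt;p&gt;(Aside: the same state file can be read through lctl rather than from the proc directory; a minimal sketch, assuming the standard mdc &quot;state&quot; parameter shown above is reachable via lctl get_param:)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Dump current_state and state_history for every MDC import on this
# client; a stuck import reports current_state: EVICTED and never
# records a later transition.
lctl get_param mdc.*.state
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;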

&lt;p&gt;Note that it appears to have stopped trying to connect after the eviction, and that was apparently over six days ago.&lt;/p&gt;
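
&lt;p&gt;(Spelling out the &quot;over six days&quot; arithmetic with the epoch values captured above, just for reference:)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Time between the EVICTED transition (1470950043) and the
# &quot;date +%s&quot; output on the client (1471481367):
echo $(( 1471481367 - 1470950043 ))           # 531324 seconds
echo $(( (1471481367 - 1470950043) / 86400 )) # 6 (full days; ~3.6 h remainder)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;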

&lt;p&gt;On the client&apos;s console I see:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-08-17 17:24:02 [705378.725578] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:24:02 [705378.741582] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.754003] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37-&amp;gt;lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:24:02 [705378.786766] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:24:02 [705378.799218] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in &quot;Unregistering&quot; phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:24:02 [705378.819955] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.145608] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:34:02 [705979.161601] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.174037] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37-&amp;gt;lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:34:02 [705979.206710] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:34:02 [705979.219194] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in &quot;Unregistering&quot; phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:34:02 [705979.239870] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:35:08 [706044.254083] hsi0: can&apos;t use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:39:41 [706317.503378] hsi0: can&apos;t use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-08-17 17:44:02 [706579.565744] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: rc = -110 waiting for callback (1 != 0)
2016-08-17 17:44:03 [706579.581674] LustreError: 27987:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.594086] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880f415d8000 x1542220409542908/t0(0) o37-&amp;gt;lquake-MDT000a-mdc-ffff88201e65e000@172.19.1.121@o2ib100:23/10 lens 568/440 e 0 to 0 dl 1470949229 ref 2 fl Unregistering:RE/0/ffffffff rc -5/-1
2016-08-17 17:44:03 [706579.626684] LustreError: 27987:0:(import.c:364:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
2016-08-17 17:44:03 [706579.639068] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) lquake-MDT000a_UUID: RPCs in &quot;Unregistering&quot; phase found (1). Network is sluggish? Waiting them to error out.
2016-08-17 17:44:03 [706579.659692] LustreError: 27987:0:(import.c:379:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &quot;lquake-MDT000a_UUID: rc = -110 waiting for callback&quot; message repeats every ten minutes.&lt;/p&gt;

&lt;p&gt;Meanwhile, a statahead (SA) thread on this client is stuck waiting for a reply in mdc_getpage():&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 27781  TASK: ffff880f6be4e780  CPU: 5   COMMAND: &quot;ll_sa_25078&quot;
 #0 [ffff88202318b700] __schedule+0x295 at ffffffff81651975
 #1 [ffff88202318b768] schedule+0x29 at ffffffff81652049
 #2 [ffff88202318b778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff88202318b820] ptlrpc_set_wait+0x4c0 at ffffffffa0dafda0 [ptlrpc]
 #4 [ffff88202318b8c8] ptlrpc_queue_wait+0x7d at ffffffffa0db025d [ptlrpc]
 #5 [ffff88202318b8e8] mdc_getpage+0x1e1 at ffffffffa0fadf61 [mdc]
 #6 [ffff88202318b9c8] mdc_read_page_remote+0x135 at ffffffffa0fae535 [mdc]
 #7 [ffff88202318ba48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff88202318ba90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff88202318baa0] mdc_read_page+0x1b4 at ffffffffa0fab314 [mdc]
#10 [ffff88202318bb90] lmv_read_striped_page+0x5f8 at ffffffffa0ff14a7 [lmv]
#11 [ffff88202318bca8] lmv_read_page+0x521 at ffffffffa0fe34e1 [lmv]
#12 [ffff88202318bd00] ll_get_dir_page+0xc8 at ffffffffa1015178 [lustre]
#13 [ffff88202318bd40] ll_statahead_thread+0x2bc at ffffffffa10691cc [lustre]
#14 [ffff88202318bec8] kthread+0xcf at ffffffff810a997f
#15 [ffff88202318bf50] ret_from_fork+0x58 at ffffffff8165d658
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Another client node that has had a single MDC (out of 16) stuck in EVICTED state for nearly 7 days also has a single SA thread stuck:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; bt -sx 166895
PID: 166895  TASK: ffff880fdcd6a280  CPU: 1   COMMAND: &quot;ll_sa_166412&quot;
 #0 [ffff880f0f213700] __schedule+0x295 at ffffffff81651975
 #1 [ffff880f0f213768] schedule+0x29 at ffffffff81652049
 #2 [ffff880f0f213778] schedule_timeout+0x175 at ffffffff8164fa75
 #3 [ffff880f0f213820] ptlrpc_set_wait+0x4c0 at ffffffffa0dc4da0 [ptlrpc]
 #4 [ffff880f0f2138c8] ptlrpc_queue_wait+0x7d at ffffffffa0dc525d [ptlrpc]
 #5 [ffff880f0f2138e8] mdc_getpage+0x1e1 at ffffffffa0fc2f61 [mdc]
 #6 [ffff880f0f2139c8] mdc_read_page_remote+0x135 at ffffffffa0fc3535 [mdc]
 #7 [ffff880f0f213a48] do_read_cache_page+0x7f at ffffffff81170cbf
 #8 [ffff880f0f213a90] read_cache_page+0x1c at ffffffff81170e1c
 #9 [ffff880f0f213aa0] mdc_read_page+0x1b4 at ffffffffa0fc0314 [mdc]
#10 [ffff880f0f213b90] lmv_read_striped_page+0x5f8 at ffffffffa10064a7 [lmv]
#11 [ffff880f0f213ca8] lmv_read_page+0x521 at ffffffffa0ff84e1 [lmv]
#12 [ffff880f0f213d00] ll_get_dir_page+0xc8 at ffffffffa102a178 [lustre]
#13 [ffff880f0f213d40] ll_statahead_thread+0x2bc at ffffffffa107e1cc [lustre]
#14 [ffff880f0f213ec8] kthread+0xcf at ffffffff810a997f
#15 [ffff880f0f213f50] ret_from_fork+0x58 at ffffffff8165d658
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
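
&lt;p&gt;(For anyone checking other nodes: a rough sketch of how to spot such a thread without crash; nothing here is Lustre-specific beyond the ll_sa_ thread naming taken from the traces above:)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# List statahead kernel threads along with the function they sleep
# in; one that shows the same wait channel across repeated samples
# is likely stuck.
ps -eo pid,comm,wchan | grep ll_sa_
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;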

&lt;p&gt;I can&apos;t necessarily say that statahead is implicated, though.  It could simply be that the statahead thread is hanging because someone discovered the problem by running &quot;ls&quot;.&lt;/p&gt;
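
&lt;p&gt;(One way to rule statahead in or out on a future occurrence would be to disable it client-wide; a sketch, assuming the stock llite statahead_max tunable:)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Setting statahead_max to 0 turns statahead off for all Lustre
# mounts on this client; the default is nonzero and can be restored
# afterwards with the same command.
lctl set_param llite.*.statahead_max=0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>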
        <environment></environment>
        <key id="38938">LU-8511</key>
        <summary>mdc stuck in EVICTED state</summary>
        <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
        <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
        <statusCategory id="3" key="done" colorName="success"/>
        <resolution id="3">Duplicate</resolution>
        <assignee username="bobijam">Zhenyu Xu</assignee>
        <reporter username="morrone">Christopher Morrone</reporter>
        <labels>
            <label>llnl</label>
        </labels>
        <created>Thu, 18 Aug 2016 01:05:45 +0000</created>
        <updated>Mon, 31 Oct 2016 15:03:51 +0000</updated>
        <resolved>Mon, 31 Oct 2016 15:03:21 +0000</resolved>
        <version>Lustre 2.8.0</version>
        <due></due>
        <votes>0</votes>
        <watches>3</watches>
        <comments>
                            <comment id="162442" author="pjones" created="Thu, 18 Aug 2016 20:47:01 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please advise on this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="162502" author="bobijam" created="Fri, 19 Aug 2016 08:35:45 +0000"  >&lt;p&gt;I think it relates to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7434&quot; title=&quot;lost bulk leads to a hang&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7434&quot;&gt;&lt;del&gt;LU-7434&lt;/del&gt;&lt;/a&gt;, and the relevant patch port has been pushed at &lt;a href=&quot;http://review.whamcloud.com/20230&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/20230&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="168595" author="morrone" created="Thu, 6 Oct 2016 23:46:14 +0000"  >&lt;p&gt;When can we expect the port to be reviewed?&lt;/p&gt;</comment>
                            <comment id="171754" author="pjones" created="Mon, 31 Oct 2016 15:03:21 +0000"  >&lt;p&gt;Duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7434&quot; title=&quot;lost bulk leads to a hang&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7434&quot;&gt;&lt;del&gt;LU-7434&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
        <issuelinks>
            <issuelinktype id="10011">
                <name>Related</name>
                <outwardlinks description="is related to ">
                    <issuelink>
                        <issuekey id="33160">LU-7434</issuekey>
                    </issuelink>
                </outwardlinks>
            </issuelinktype>
        </issuelinks>
        <attachments></attachments>
        <subtasks></subtasks>
        <customfields>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyl53:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>