<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:06:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-416] Many processes hung consuming a lot of CPU in Lustre-Client page-cache lookups</title>
                <link>https://jira.whamcloud.com/browse/LU-416</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;At CEA they quite often see a problem on Lustre clients where processes are stuck consuming a lot of CPU time in the Lustre layers. Unfortunately, the only way to really fix this for now is to reboot the impacted nodes (after waiting for them for several hours), since the involved processes are not killable.&lt;/p&gt;

&lt;p&gt;Crash dump analysis shows processes stuck with the following stack traces (crash dumps can be analyzed only on the customer site):&lt;/p&gt;

&lt;p&gt;=========================================================&lt;br/&gt;
_spin_lock()&lt;br/&gt;
cl_page_gang_lookup()&lt;br/&gt;
cl_lock_page_out()&lt;br/&gt;
osc_lock_flush()&lt;br/&gt;
osc_lock_cancel()&lt;br/&gt;
cl_lock_cancel0()&lt;br/&gt;
.....&lt;br/&gt;
=========================================================&lt;/p&gt;

&lt;p&gt;and/or &lt;br/&gt;
=========================================================&lt;br/&gt;
__cond_resched()&lt;br/&gt;
_cond_resched()&lt;br/&gt;
cfs_cond_resched()&lt;br/&gt;
cl_lock_page_out()&lt;br/&gt;
osc_lock_flush()&lt;br/&gt;
osc_lock_cancel()&lt;br/&gt;
cl_lock_cancel0()&lt;br/&gt;
.....&lt;br/&gt;
=========================================================&lt;/p&gt;

&lt;p&gt;Attached you will find 3 files:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;node1330_dmesg is the dmesg of the faulty client;&lt;/li&gt;
	&lt;li&gt;node1330_lctl_dk is the &apos;lctl dk&apos; output from the faulty client;&lt;/li&gt;
	&lt;li&gt;cmds.txt is the sequence of commands run to get the &apos;lctl dk&apos; output.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;There are also &quot;ll_imp_inval&quot; threads stuck due to this problem, leaving OSCs &quot;IN&quot;active for far too long, eventually causing time-outs and EIOs for client processes.&lt;br/&gt;
The data structures involved are &quot;cl_object_header.[coh_page_guard,coh_tree]&quot;, respectively the lock and radix-tree used to manage the page-cache associated with a Lustre-Client object.&lt;/p&gt;
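
&lt;p&gt;To illustrate the contention point, the lookup pattern is roughly the following (a simplified sketch, not the actual Lustre code; &apos;hdr&apos;, &apos;start&apos;, &apos;end&apos; and &apos;pvec&apos; are illustrative names):&lt;/p&gt;

&lt;pre&gt;
/* Simplified sketch, not the actual Lustre code: every competitor
 * (flush, invalidation, OOM, concurrent I/O) serializes on the single
 * per-object spin-lock while scanning the radix-tree in batches. */
struct cl_page *pvec[16];
unsigned int n;

spin_lock(&amp;hdr-&gt;coh_page_guard);
while (start &lt;= end) {
        n = radix_tree_gang_lookup(&amp;hdr-&gt;coh_tree, (void **)pvec,
                                   start, 16);
        if (n == 0)
                break;
        /* ... examine the batch; every CPU doing flush/invalidation/OOM
         * must queue on the same coh_page_guard to get here ... */
        start = pvec[n - 1]-&gt;cp_index + 1;   /* advance past the batch */
}
spin_unlock(&amp;hdr-&gt;coh_page_guard);
&lt;/pre&gt;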

&lt;p&gt;It seems to be a race around the OSC object&apos;s page lock/radix-tree when concurrent accesses occur (OOM, flush, invalidation, concurrent I/O). The problem seems to occur when, on the same&lt;br/&gt;
Lustre-Client, there are concurrent accesses to the same Lustre objects, inducing competition for the associated lock and radix-tree from multiple CPUs.&lt;br/&gt;
To reproduce this issue, CEA is using one of their proprietary benchmarks. Basically, a single node runs as many processes as it has cores, each process mapping a lot of memory. The processes write this memory to Lustre, preferably to the same OST, to reproduce the problem. CEA noticed that the OSC inactivation performed during client eviction can be involved in&lt;br/&gt;
reproducing the issue, so part of the reproducer can be to manually force client eviction on the OSS side by using either:&lt;br/&gt;
lctl set_param obdfilter.&amp;lt;fs_name&amp;gt;-&amp;lt;OST_name&amp;gt;.evict_client=nid:&amp;lt;ipoib_clnt_addr&amp;gt;@&amp;lt;portal_name&amp;gt;&lt;br/&gt;
or:&lt;br/&gt;
echo &apos;nid:&amp;lt;ipoib_clnt_addr&amp;gt;@&amp;lt;portal_name&amp;gt;&apos; &amp;gt; /proc/fs/lustre/obdfilter/&amp;lt;fs_name&amp;gt;/&amp;lt;OST_name&amp;gt;/evict_client&lt;/p&gt;

&lt;p&gt;In order to cope with production imperatives, CEA has set up a work-around that consists of freeing the page cache with &quot;echo 1 &amp;gt; /proc/sys/vm/drop_caches&quot;. Doing so, clients are able to reconnect. On the contrary, and it is interesting to note, clearing the LRU with &quot;lctl set_param ldlm.namespaces.*.lru_size=clear&quot; will hang the node!&lt;/p&gt;

&lt;p&gt;Does this issue sound familiar?&lt;br/&gt;
Of course CEA really needs a fix for this, as soon as possible.&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</description>
                <environment></environment>
        <key id="11171">LU-416</key>
            <summary>Many processes hung consuming a lot of CPU in Lustre-Client page-cache lookups</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="sebastien.buisson">Sebastien Buisson</reporter>
                        <labels>
                    </labels>
                <created>Wed, 15 Jun 2011 03:55:23 +0000</created>
                <updated>Fri, 5 Aug 2011 09:54:12 +0000</updated>
                            <resolved>Sat, 30 Jul 2011 22:57:34 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="16380" author="pjones" created="Wed, 15 Jun 2011 07:46:58 +0000"  >&lt;p&gt;Oleg&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16397" author="jay" created="Wed, 15 Jun 2011 14:31:05 +0000"  >&lt;p&gt;It looks like those processes are busy discarding pages. During this process, they all need to grab a global spin lock to do this. So if possible, it would be interesting to verify it by running oprofile.&lt;/p&gt;</comment>
                            <comment id="16462" author="sebastien.buisson" created="Thu, 16 Jun 2011 12:07:41 +0000"  >&lt;p&gt;CEA confirms that processes are waiting for ages on this global spin lock.&lt;/p&gt;</comment>
                            <comment id="16472" author="jay" created="Thu, 16 Jun 2011 15:19:23 +0000"  >&lt;p&gt;what kernel r you using? If you;re using rhel6, we can use lockless radix tree to fix this problem; otherwise, I will try to work out a workaround to mitigate it. &lt;/p&gt;</comment>
                            <comment id="16484" author="pjones" created="Thu, 16 Jun 2011 16:31:04 +0000"  >&lt;p&gt;Jinshan&lt;/p&gt;

&lt;p&gt;Yes they are using RHEL6. Do you need anything more precise than that?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16492" author="jay" created="Thu, 16 Jun 2011 18:33:38 +0000"  >&lt;p&gt;Hi Seba,&lt;/p&gt;

&lt;p&gt;If you have a testing system, you may try the patch at &lt;a href=&quot;http://review.whamcloud.com/#change,911&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,911&lt;/a&gt;. That patch is for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-394&quot; title=&quot;LND failure casued by discontiguous KIOV pages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-394&quot;&gt;&lt;del&gt;LU-394&lt;/del&gt;&lt;/a&gt;, but I think it can mitigate the contention on coh_page_guard a little bit.&lt;/p&gt;

&lt;p&gt;I&apos;m working on using an RCU radix tree to solve the problem.&lt;/p&gt;

&lt;p&gt;Jinshan&lt;/p&gt;</comment>
                            <comment id="16493" author="bfaccini" created="Thu, 16 Jun 2011 18:35:08 +0000"  >&lt;p&gt;I would like to precise/correct Sebastien&apos;s &quot;CEA confirms that processes are waiting for ages on this global spin lock.&quot; comment.&lt;/p&gt;

&lt;p&gt;In fact, all involved threads (as the spin-lock counter evolving indicates!) one at a time acquire the spin-lock, then go thru the radix-tree and last release the spin-lock.&lt;/p&gt;

&lt;p&gt;And this pseudo-hang situation could be aggravated by the race on the spin-lock, the radix-tree search, and may be also &quot;false/unnecessary&quot; trips for the same pages ...&lt;/p&gt;</comment>
                            <comment id="16516" author="jay" created="Fri, 17 Jun 2011 00:16:36 +0000"  >&lt;p&gt;Indeed. From the dmesg in the attachment, though only a few CPUs(cpu 3, 4 and 10) were busy discarding pages, they were stuck at grabbing spin_lock. This is why I&apos;m thinking the contention on object&apos;s radix tree lock would be a problem. Another thing I&apos;m quite sure is that there must be tons of pages caching at the client side(This is because there is no cache limit as what we did in b18), so that the client had to take a lot of time to drain them. &lt;/p&gt;

&lt;p&gt;It seems that there is a lot of work to use lockless pagecache in clio as linux kernel does. Maybe we can limit # of caching pages at the client side so that a fast recovery is possible after an OST runs into problem.&lt;/p&gt;

&lt;p&gt;Also, it may be interesting to see what&apos;s going on at the OST side. The client lost connections to a couple of OSTs in a short time.&lt;/p&gt;</comment>
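
&lt;p&gt;One possible shape for such a limit (a hypothetical sketch of the idea only, not a patch; all names are invented for illustration):&lt;/p&gt;

&lt;pre&gt;
/* Hypothetical sketch, not a patch: cap the number of cached pages per
 * client so that draining them on eviction/recovery stays bounded. */
#define CLIENT_MAX_CACHED_PAGES  (256 &lt;&lt; 10)   /* illustrative cap */

static atomic_t client_cached_pages = ATOMIC_INIT(0);

/* Before inserting a page into the object&apos;s radix tree: */
static int client_reserve_cache_page(void)
{
        if (atomic_inc_return(&amp;client_cached_pages) &gt; CLIENT_MAX_CACHED_PAGES) {
                atomic_dec(&amp;client_cached_pages);
                return -ENOMEM;   /* caller writes pages out, then retries */
        }
        return 0;
}

/* When a page leaves the cache: */
static void client_unreserve_cache_page(void)
{
        atomic_dec(&amp;client_cached_pages);
}
&lt;/pre&gt;</comment>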
                            <comment id="16519" author="louveta" created="Fri, 17 Jun 2011 04:54:14 +0000"  >&lt;p&gt;About cache size at client side, I have trace where the client &apos;only&apos; have 3GB of cached data (not that much).&lt;br/&gt;
CPU time consumed by ldlm_bl &amp;amp; ll_imp_inval thread indicate that this situation was there for more that 13 hours, time that match also the difference beetwen now and the time the recall were issued. I didn&apos;t got traces to see if the &apos;cached&apos; size was moving in time, but at least after 30 minutes, the cleaning was not completed.&lt;/p&gt;

&lt;p&gt;Regarding time spend in various kernel code, we got time to run oprofile. 96% was spend into cl_page_gang_lookup, 0.7% in radix_tree_gang_lookup.&lt;/p&gt;</comment>
                            <comment id="16553" author="jay" created="Fri, 17 Jun 2011 15:50:49 +0000"  >&lt;p&gt;Can you please help me get those data while the client node is hung?&lt;/p&gt;

&lt;p&gt;lctl get_param osc.*.rpc_stats&lt;br/&gt;
lctl get_param ldlm.namespaces.*.lock_count&lt;br/&gt;
lctl get_param ldlm.namespaces.*.lock_unused_count&lt;br/&gt;
cat /proc/slabinfo | grep cl_page&lt;/p&gt;

&lt;p&gt;Also, if it&apos;s possible, I&apos;d like to know the state of all processes on the node (echo t &amp;gt; /proc/sysrq-trigger).&lt;/p&gt;

&lt;p&gt;I&apos;d like to see these outputs sampled over time (see watch(1)) so I can know what the processes are doing. I realize it will take a lot of time to implement a lockless pagecache, so I&apos;d like to work out a workaround patch.&lt;/p&gt;

&lt;p&gt;Thank you so much.&lt;/p&gt;</comment>
                            <comment id="16581" author="sebastien.buisson" created="Mon, 20 Jun 2011 04:45:31 +0000"  >&lt;p&gt;Hi Jay,&lt;/p&gt;

&lt;p&gt;I have requested the data you are asking for to our on-site Support team.&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</comment>
                            <comment id="16705" author="jay" created="Tue, 21 Jun 2011 15:45:15 +0000"  >&lt;p&gt;It looks like there is an infinite loop problem in cl_lock_page_out(). I&apos;m going to work out a patch to fix it.&lt;/p&gt;</comment>
                            <comment id="16744" author="sebastien.buisson" created="Wed, 22 Jun 2011 07:56:13 +0000"  >&lt;p&gt;OK thank you Jinshan, we are looking forward to your patch.&lt;br/&gt;
BTW, do you still need all the traces you asked for on July, 17th? because this is very complicated to take that sort of traces out of CEA.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="16775" author="jay" created="Wed, 22 Jun 2011 14:07:15 +0000"  >&lt;p&gt;Hi Sebastien,&lt;/p&gt;

&lt;p&gt;I&apos;m sorry, I still haven&apos;t figured out the root cause of this issue. There is a similar stack trace in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-437&quot; title=&quot;Client hang with spinning ldlm_bl_* and ll_imp_inval threads&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-437&quot;&gt;&lt;del&gt;LU-437&lt;/del&gt;&lt;/a&gt;, which LLNL hit with IOR, so we&apos;re reproducing it in our lab. Meanwhile, I suspect there may be a problem in cl_page_gang_lookup() which could cause an infinite loop; this is why I&apos;d like you guys to try that patch, and maybe we can find something new with it.&lt;/p&gt;

&lt;p&gt;It would be great if I could get that data, because I&apos;d like to know whether the system is in a livelock state or keeps making progress. Anyway, it will be all right if we can reproduce it in our lab.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Jinshan&lt;/p&gt;</comment>
                            <comment id="16787" author="jay" created="Wed, 22 Jun 2011 14:18:53 +0000"  >&lt;p&gt;Can you please try patch at &lt;a href=&quot;http://review.whamcloud.com/#change,911&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,911&lt;/a&gt; if you have a test system?&lt;/p&gt;</comment>
                            <comment id="16822" author="louveta" created="Thu, 23 Jun 2011 03:05:17 +0000"  >&lt;p&gt;Jinshan,&lt;/p&gt;

&lt;p&gt;Waiting long enough gives the system time to make progress and complete, but it takes hours (even days). It doesn&apos;t look to be a livelock (at least some complete). On Jun 16th, I got some numbers, in particular the number of locks assigned to the &apos;slow&apos; client. Only 8 OSCs had locks, and none of them had more than 78 locks. The amount of buffer cache at that time was around 3GB.&lt;/p&gt;

&lt;p&gt;Alex.&lt;/p&gt;</comment>
                            <comment id="16853" author="jay" created="Thu, 23 Jun 2011 13:41:22 +0000"  >&lt;p&gt;So when this problem occurs, it takes too much time for the osc to write out all caching pages. This may be due to the deficiency in the implementation of cl_page_gang_lookup(), definitely it can worsen contention of -&amp;gt;coh_page_guard and slow things down.&lt;/p&gt;</comment>
                            <comment id="17231" author="bfaccini" created="Mon, 4 Jul 2011 18:23:40 +0000"  >&lt;p&gt;Just one more comment which may demonstrate the coh_page_guard/coh_tree (ie, respectivelly spin-lock/radix-tree data structures to manage pages on a Client) current ineficiency when dealing with concurent access and with a huge number of pages, &quot;lctl set_param&lt;br/&gt;
ldlm_namespaces.*.lru_size=clear&quot; pseudo-hangs the same way like the other radix-tree competitors when &quot;echo 1 &amp;gt; /proc/sys/vm/drop_caches&quot; succeeds to flush the pages (i assume via traditional Kernel algorithms) and unblocks the situation !!!&lt;/p&gt;</comment>
                            <comment id="17252" author="jay" created="Tue, 5 Jul 2011 12:32:09 +0000"  >&lt;p&gt;indeed, lru_size=clear will drop all of caching locks at the client side, which has the same effect of echo 1 &amp;gt; drop_caches and evicts a client node.&lt;/p&gt;

&lt;p&gt;Actually I&apos;m working on this issue under LU-437; can you please try the latest patch at &lt;a href=&quot;http://review.whamcloud.com/#change,911&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,911&lt;/a&gt; to see if it works?&lt;/p&gt;</comment>
                            <comment id="18746" author="pjones" created="Fri, 5 Aug 2011 09:54:12 +0000"  >&lt;p&gt;Bull\CEA confirm that this issue was resolved by the LU394 patch&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10264" name="cmds.txt" size="441" author="sebastien.buisson" created="Wed, 15 Jun 2011 03:55:23 +0000"/>
                            <attachment id="10262" name="node1330_dmesg" size="253421" author="sebastien.buisson" created="Wed, 15 Jun 2011 03:55:23 +0000"/>
                            <attachment id="10263" name="node1330_lctl_dk" size="486849" author="sebastien.buisson" created="Wed, 15 Jun 2011 03:55:23 +0000"/>
                            <attachment id="10270" name="radix-intro.pdf" size="43621" author="jay" created="Thu, 16 Jun 2011 15:26:11 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                    <customfield id="customfield_10020" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Bugzilla ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>23398.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvsnr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8547</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>