<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:22:24 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15917] Memory pressure prevents debug daemon / debug buffer from proper operation</title>
                <link>https://jira.whamcloud.com/browse/LU-15917</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Attempting to collect detailed data for an investigation, I noticed frequent reports from the debug daemon that the buffer is overflowing, but then the actual statistic caught my eye:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[1243871.266883] debug daemon buffer overflowed; discarding 10% of pages (1 of 1)
[1243871.270173] debug daemon buffer overflowed; discarding 10% of pages (1 of 0) &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So this is in effect telling us that tcd-&amp;gt;tcd_cur_pages is 0 (or 1), while I know tcd-&amp;gt;tcd_max_pages cannot be any less than 1500.&lt;/p&gt;

&lt;p&gt;Which in turn means we are having an allocation failure:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&#160; &#160; &#160; &#160; if (tcd-&amp;gt;tcd_cur_pages &amp;lt; tcd-&amp;gt;tcd_max_pages) {
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; if (tcd-&amp;gt;tcd_cur_stock_pages &amp;gt; 0) {
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; tage = cfs_tage_from_list(tcd-&amp;gt;tcd_stock_pages.prev);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; --tcd-&amp;gt;tcd_cur_stock_pages;
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; list_del_init(&amp;amp;tage-&amp;gt;linkage);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; } else {
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; tage = cfs_tage_alloc(GFP_ATOMIC);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; if (unlikely(tage == NULL)) {
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; if ((!memory_pressure_get() ||
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;in_interrupt()) &amp;amp;&amp;amp; printk_ratelimit())
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; printk(KERN_WARNING
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;&quot;cannot allocate a tage (%ld)\n&quot;,
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;tcd-&amp;gt;tcd_cur_pages);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; return NULL;
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; }
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; } &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but there&apos;s no printk, which is a bit puzzling. Perhaps just due to memory_pressure_get() returning 1? And that would be due to increased memory pressure from caching on the OSSes?&lt;/p&gt;

&lt;p&gt;There were other anecdotal reports and observations that, when the debug daemon is not running, lctl dk tends to contain a lot of old data and very few new messages; that could be explained by the same effect.&lt;/p&gt;

&lt;p&gt;This also goes hand in hand with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15916&quot; title=&quot;stock pages for debug buffer use are never filled&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15916&quot;&gt;LU-15916&lt;/a&gt;, where the supposed reserve pages for debug buffer use are never filled.&lt;/p&gt;

&lt;p&gt;Sounds like we need to reinstate some sort of preallocated pages list(s).&lt;/p&gt;

&lt;p&gt;Basically the way I see it is:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;We will retain the current &quot;allocate first&quot; logic, but if the allocation fails we will use a page from an emergency buffer that&apos;s always kept preallocated at some level.&lt;/li&gt;
	&lt;li&gt;We reinstate the stock pages mechanism; TCD_STOCK_PAGES will likely need to be updated to something smaller, as it is currently hardcoded to the equivalent of 5 megabytes.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Both approaches would need to ensure we call the buffer refill from somewhere once the pages are actually consumed.&lt;/p&gt;

&lt;p&gt;Additionally, I guess we can try to divine when a non-atomic allocation is possible and actually perform that instead? That would have a potential performance impact, though, and is less desirable. All in all, having preallocated pages in some form seems the most efficient approach.&lt;/p&gt;

&lt;p&gt;Also, while we are looking into this dusty corner, perhaps we could finally do something with the arbitrary 80/10/10 split of pages for debug buffers. Restoring close to full LRU behavior seems desirable, as long as we can actually achieve it without too much locking.&lt;/p&gt;

&lt;p&gt;Something along the lines of &quot;allocate pages with abandon until we hit the debug_mb value, then discard the oldest 10% once the limit is met&quot;. Of course we need to figure out how to actually find the oldest 10% of pages efficiently. Having a single list with corresponding locking is likely going to be pretty expensive and negates the whole per-cpu arrangement in place.&lt;/p&gt;

&lt;p&gt;Alternatively, we can just iterate all TCDs from a separate task as we are getting close and discard pages there, but that has its own complications, like comparing the oldest pages in different TCDs to see which ones are to be dropped, and so on.&lt;/p&gt;

&lt;p&gt;I wonder if &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=neilb&quot; class=&quot;user-hover&quot; rel=&quot;neilb&quot;&gt;neilb&lt;/a&gt; has any smart ideas here by any chance?&lt;/p&gt;

&lt;p&gt;Of course the radical alternative is to get rid of all of this and actually convert to tracepoints, but the problem is that we were not able to get the functionality we wanted from that in the past, despite several attempts by &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=simmonsja&quot; class=&quot;user-hover&quot; rel=&quot;simmonsja&quot;&gt;simmonsja&lt;/a&gt;.&lt;/p&gt;</description>
                <environment></environment>
        <key id="70647">LU-15917</key>
            <summary>Memory pressure prevents debug daemon / debug buffer from proper operation</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="green">Oleg Drokin</reporter>
                        <labels>
                    </labels>
                <created>Mon, 6 Jun 2022 22:50:07 +0000</created>
                <updated>Thu, 16 Jun 2022 17:40:22 +0000</updated>
                                            <version>Lustre 2.16.0</version>
                    <version>Lustre 2.12.9</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="336877" author="neilb" created="Tue, 7 Jun 2022 01:51:41 +0000"  >&lt;p&gt;cfs_tcd_shrink() will never cause tcd_cur_pages to shrink below 10: once there are fewer than 10 pages, pgcount would be zero and the loop aborts immediately.&lt;/p&gt;

&lt;p&gt;But your first two log messages show tcd_cur_pages as 1, then 0.&#160; I can only see that happening if collect_pages() runs.&#160; So presumably the log was extracted between these two points?&lt;/p&gt;

&lt;p&gt;The broader implication is that all logging is happening in threads where PF_MEMALLOC is set.&#160; That seems a little unlikely - unless you have only enabled tracing for things that happen during writeback...&#160; Even then it is a bit of a stretch.&lt;/p&gt;

&lt;p&gt;A fairly simple change that might help would be for cfs_tage_free() to move the page to tcd_stock_pages unless tcd_shutting_down were set or tcd_cur_stock_pages were too large.&#160; That would make complete exhaustion less likely.&lt;/p&gt;</comment>
                            <comment id="336910" author="green" created="Tue, 7 Jun 2022 13:57:37 +0000"  >&lt;p&gt;I am not sure why you think cfs_tcd_shrink() does not shrink pages below ten? My reading of it is that if there are 10 pages or fewer it would just remove one page at all times?&lt;/p&gt;

&lt;p&gt;The log message goes to dmesg, and I don&apos;t know if it&apos;s even for the same tcd; it is only emitted when a new log page cannot be allocated. I noticed the messages were coming in batches.&lt;/p&gt;

&lt;p&gt;I did think about moving free pages to stock pages and might try experimenting with it, but that would only help in some of the scenarios; granted, probably the majority we care about anyway.&lt;/p&gt;

&lt;p&gt;The unhandled case would be if there are no stock pages because we have not reached max pages yet, but we are doing an allocation and it fails.&lt;/p&gt;

&lt;p&gt;The system where I experienced this had 150G of RAM, and I tried to set the debug buffer size to 50G after 1, 5, 10 and 20G did not make the messages disappear. The amount of actually free RAM (less buffers) was kinda low, 3-4G. This means that even if there are allocations from all sorts of places, some are from a context with PF_MEMALLOC set, and those are the ones failing and emitting that message? I guess I need to try and add more debug, though since it&apos;s a customer site that&apos;s not as easy as it is on my dev nodes.&lt;/p&gt;</comment>
                            <comment id="337118" author="neilb" created="Thu, 9 Jun 2022 05:11:59 +0000"  >&lt;p&gt;cfs_tcd_shrink() initialises &quot;pgcount = tcd-&amp;gt;tcd_cur_pages / 10;&quot; which will be zero when there are fewer than 10 pages.&lt;/p&gt;

&lt;p&gt;The loop that frees pages starts with &quot;if (pgcount-- == 0) break;&quot;.&#160; As pgcount (before being decremented) is 0, this will break at the start of the first iteration, so nothing will be freed.&#160; At least, that is how I read it.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="337200" author="green" created="Thu, 9 Jun 2022 19:26:16 +0000"  >&lt;p&gt;pgcount-- == 0 is true for the value of pgcount of 0 because of the post-decrement, right?&lt;/p&gt;

&lt;p&gt;I had to doublecheck this just to make sure my understanding of it is actually correct:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[green@fatbox3 ~]$ cat /tmp/test.c&#160;
#include &amp;lt;stdio.h&amp;gt;
void main(void)
{
&#160; &#160; int test = 0;
&#160; &#160; printf(&quot;value1 is %d\n&quot;, test-- == 0);
&#160; &#160; test = 0;
&#160; &#160; printf(&quot;value2 is %d\n&quot;, --test == 0);
&#160; &#160; return;
}
[green@fatbox3 ~]$ gcc -o /tmp/test /tmp/test.c&#160;
[green@fatbox3 ~]$ /tmp/test&#160;
value1 is 1
value2 is 0
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Anyway looking at this code I think the easiest way to address this might be to actually revert your &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14428&quot; title=&quot;Convert tracefile to use ring_buffer  from linux&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14428&quot;&gt;LU-14428&lt;/a&gt; &lt;a href=&quot;https://review.whamcloud.com/41493&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/41493&lt;/a&gt; patch.&lt;/p&gt;

&lt;p&gt;Instead we&apos;ll treat the daemon list as the stock pages list (we&apos;ll remove the stock pages list itself as not needed) with a pretty trivial patch like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;--- a/libcfs/libcfs/tracefile.c
+++ b/libcfs/libcfs/tracefile.c
@@ -155,6 +155,11 @@ cfs_trace_get_tage_try(struct cfs_trace_cpu_data *tcd, unsigned long len)
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; tage = cfs_tage_from_list(tcd-&amp;gt;tcd_stock_pages.prev);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; --tcd-&amp;gt;tcd_cur_stock_pages;
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; list_del_init(&amp;amp;tage-&amp;gt;linkage);
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; } else if (!list_empty(&amp;amp;tcd-&amp;gt;tcd_daemon_pages)) {
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; /* If we have written daemon pages, grab oldest one */
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; tage = cfs_tage_from_list(tcd-&amp;gt;tcd_daemon_pages.next);
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; --tcd-&amp;gt;tcd_cur_daemon_pages;
+ &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; list_del_init(&amp;amp;tage-&amp;gt;linkage);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; } else {
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; tage = cfs_tage_alloc(GFP_ATOMIC);
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; if (unlikely(tage == NULL)) { &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This way, when we have the daemon running, the already-written pages would serve as the stock buffer (I had no idea we stored the already written pages!), and when we do not have the daemon running we will cannibalize the oldest page if we fail to allocate one.&lt;/p&gt;

&lt;p&gt;Then just change cfs_tcd_shrink() to an actual pre-decrement to avoid discarding pages at all if there are fewer than 10 pages, because we will use them up in the next step anyway, and we should be ok to go, I guess?&lt;/p&gt;

&lt;p&gt;The other thing I thought about was to mark the just-written pages as &quot;flush and discard&quot; so they don&apos;t hog memory in the buffer cache needlessly. But I am not sure we have a way to do this asynchronously? I think the last time I looked at it there was only a synchronous discard?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;What do you think?&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02riv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>