<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:35:39 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3641] Dropping caches on SLES11SP2 hangs in shrink_slab()/ldlm_pools_cli_shrink() path</title>
                <link>https://jira.whamcloud.com/browse/LU-3641</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Dropping caches (echo 3 &amp;gt; /proc/sys/vm/drop_caches) occasionally does not complete on a client running SLES11SP2 when a Lustre file system is mounted. The kernel shrink_slab() function gets stuck in an infinite loop calling ldlm_pools_cli_shrink() because Lustre does not initialize the new batch field of the shrinker struct when it registers shrinkers.&lt;/p&gt;

&lt;p&gt;cfs_set_shrinker() kmallocs a shrinker struct; it neither zero fills the struct nor explicitly sets the batch field. Occasionally the uninitialized batch value is negative. The kernel shrink_slab() function uses the batch value to control its loop around the calls to each shrinker. If the batch value is negative, the loop never terminates.&lt;/p&gt;

&lt;p&gt;The interesting code bits are:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
static inline
struct cfs_shrinker *cfs_set_shrinker(int seek, cfs_shrinker_t func)
{
        struct shrinker *s; 

        s = kmalloc(sizeof(*s), GFP_KERNEL);
        if (s == NULL)
                return (NULL);

        s-&amp;gt;shrink = func;
        s-&amp;gt;seeks = seek;

        register_shrinker(s);

        return s;
}

kernel/include/linux/mm.h
struct shrinker {
        int (*shrink)(struct shrinker *, struct shrink_control *sc);
        int seeks;      /* seeks to recreate an obj */
        long batch;     /* reclaim batch size, 0 = default */

        /* These are for internal use */
        struct list_head list;
        long nr;        /* objs pending delete */
};

unsigned long shrink_slab(struct shrink_control *shrink,
                          unsigned long nr_pages_scanned,
                          unsigned long lru_pages)
{
[skip] 
                long batch_size = shrinker-&amp;gt;batch ? shrinker-&amp;gt;batch
                                                  : SHRINK_BATCH;
[skip]
                /* total_scan initialized to something positive */
                while (total_scan &amp;gt;= batch_size) {
[skip]
                        shrink_ret = do_shrinker_shrink(shrinker, shrink,
                                                        batch_size);
[skip]
                        total_scan -= batch_size;
[skip] 
                }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When this problem occurs, stack traces of the hanging process look similar to the following, although the exact location along the ldlm_pools_cli_shrink() path varies.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; bt 8994
PID: 8994   TASK: ffff8808334747b0  CPU: 24  COMMAND: &quot;apinit&quot;
 #0 [ffff8807f0a29c58] schedule at ffffffff81362e57
 #1 [ffff8807f0a29ca0] ldlm_bl_to_thread_list at ffffffffa0470d40 [ptlrpc]
 #2 [ffff8807f0a29cb0] ldlm_cancel_lru at ffffffffa046c385 [ptlrpc]
 #3 [ffff8807f0a29d00] ldlm_cli_pool_shrink at ffffffffa047896d [ptlrpc]
 #4 [ffff8807f0a29d40] ldlm_pool_shrink at ffffffffa0476568 [ptlrpc]
 #5 [ffff8807f0a29d70] ldlm_pools_shrink at ffffffffa0477ebc [ptlrpc]
 #6 [ffff8807f0a29dc0] ldlm_pools_cli_shrink at ffffffffa0477f5b [ptlrpc]
 #7 [ffff8807f0a29dd0] shrink_slab at ffffffff810fec7a
 #8 [ffff8807f0a29e70] drop_caches_sysctl_handler at ffffffff81164eb2
 #9 [ffff8807f0a29ea0] proc_sys_call_handler at ffffffff811a08a0
#10 [ffff8807f0a29f00] proc_sys_write at ffffffff811a08c4
#11 [ffff8807f0a29f10] vfs_write at ffffffff8113cf1b
#12 [ffff8807f0a29f40] sys_write at ffffffff8113d0c5
#13 [ffff8807f0a29f80] system_call_fastpath at ffffffff8136cc2b

&amp;gt; crash-7.0.0&amp;gt; shrinker 0xffff8803fb004340
&amp;gt; struct shrinker {
&amp;gt;   shrink = 0xffffffffa03c1130 &amp;lt;ldlm_pools_cli_shrink&amp;gt;, 
&amp;gt;   seeks = 2, 
&amp;gt;   batch = -4, 
&amp;gt;   list = {
&amp;gt;     next = 0xffffffff815a6e40 &amp;lt;shrinker_list&amp;gt;, 
&amp;gt;     prev = 0xffff8803fb004398
&amp;gt;   }, 
&amp;gt;   nr = 0
&amp;gt; }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Linux change that triggered this problem is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;author	Dave Chinner &amp;lt;dchinner@redhat.com&amp;gt;	2011-07-08 04:14:37 (GMT)&lt;br/&gt;
committer	Al Viro &amp;lt;viro@zeniv.linux.org.uk&amp;gt;	2011-07-20 05:44:32 (GMT)&lt;br/&gt;
commit	e9299f5058595a655c3b207cda9635e28b9197e6 (patch)&lt;br/&gt;
tree	b31a4dc5cab98ee1701313f45e92e583c2d76f63&lt;br/&gt;
parent	3567b59aa80ac4417002bf58e35dce5c777d4164 (diff)&lt;br/&gt;
vmscan: add customisable shrinker batch size&lt;br/&gt;
For shrinkers that have their own cond_resched* calls, having shrink_slab break the work down into small batches is not particularly efficient. Add a custom batchsize field to the struct shrinker so that shrinkers can use a larger batch size if they desire. A value of zero (uninitialised) means &quot;use the default&quot;, so behaviour is unchanged by this patch.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Note: ldlm_pools_srv_shrink() does not exhibit this problem because it always returns -1, which causes shrink_slab to break out of its loop.&lt;/p&gt;</description>
                <environment></environment>
        <key id="20009">LU-3641</key>
            <summary>Dropping caches on SLES11SP2 hangs in shrink_slab()/ldlm_pools_cli_shrink() path</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="amk">Ann Koehler</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Thu, 25 Jul 2013 20:52:17 +0000</created>
                <updated>Fri, 13 Sep 2013 04:01:50 +0000</updated>
                            <resolved>Wed, 31 Jul 2013 19:16:45 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                    <fixVersion>Lustre 2.4.1</fixVersion>
                    <fixVersion>Lustre 2.5.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="63010" author="amk" created="Thu, 25 Jul 2013 22:23:36 +0000"  >&lt;p&gt;Submitted patch that zero fills the shrinker struct when cfs_set_shrinker allocates it. An alternative would be to explicitly set the batch field, but that would require conditional compilation, since not all Linux kernels support the shrinker batch size feature. Furthermore, initializing just the batch field is only part of the job. It may make sense to add full support in the future and allow each Lustre shrinker to specify its own batch size.&lt;/p&gt;
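A userspace sketch of the zero-fill approach (hypothetical names; calloc() plays the role of the kernel's kzalloc(), and struct shrinker_demo stands in for struct shrinker):

```c
#include "stdlib.h"

/* Hypothetical stand-in for the batch-relevant fields of struct shrinker. */
struct shrinker_demo {
        int seeks;
        long batch;     /* 0 means "use the default batch size" */
};

struct shrinker_demo *set_shrinker_demo(int seeks)
{
        /* Zero-filling the allocation means batch (and any field added by
         * future kernels) starts at 0, which shrink_slab() treats as "use
         * the default" -- no per-field, per-kernel-version init needed. */
        struct shrinker_demo *s = calloc(1, sizeof(*s));

        if (s == NULL)
                return NULL;
        s->seeks = seeks;       /* only the known fields are set explicitly */
        return s;
}
```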

&lt;p&gt;Patch: &lt;a href=&quot;http://review.whamcloud.com/7122&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7122&lt;/a&gt;&lt;/p&gt;
</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvw8n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9375</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>