<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:09:29 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14408] very large lustre_inode_cache</title>
                <link>https://jira.whamcloud.com/browse/LU-14408</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The ptlrpc_cache repeatedly grows very, very large on a node running Starfish (a policy engine similar to Robinhood).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@solfish2:~]# cat /tmp/t4
 Active / Total Objects (% used)&#160; &#160; : 508941033 / 523041216 (97.3%)
 Active / Total Slabs (% used)&#160; &#160; &#160; : 11219941 / 11219941 (100.0%)
 Active / Total Caches (% used) &#160; &#160; : 87 / 122 (71.3%)
 Active / Total Size (% used) &#160; &#160; &#160; : 112878003.58K / 114522983.04K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

OBJS&#160; &#160; &#160; ACTIVE&#160; &#160; USE &#160; OBJ_SIZE&#160; SLABS&#160; &#160; OBJ/SLAB&#160; CACHE_SIZE&#160; NAME
30545252&#160; 30067595&#160; 98% &#160; 1.12K &#160; &#160; 1092909&#160; 28&#160; &#160; &#160; &#160; 34973088K &#160; ptlrpc_cache
92347047&#160; 92347047&#160; 99% &#160; 0.31K &#160; &#160; 1810744&#160; 51&#160; &#160; &#160; &#160; 28971904K &#160; bio-3
92346672&#160; 92346672&#160; 100%&#160; 0.16K &#160; &#160; 1923889&#160; 48&#160; &#160; &#160; &#160; 15391112K &#160; xfs_icr
92409312&#160; 92409312&#160; 100%&#160; 0.12K &#160; &#160; 2887791&#160; 32&#160; &#160; &#160; &#160; 11551164K &#160; kmalloc-128
25717818&#160; 23912628&#160; 92% &#160; 0.19K &#160; &#160; 612329 &#160; 42&#160; &#160; &#160; &#160; 4898632K&#160; &#160; kmalloc-192
25236420&#160; 24708346&#160; 97% &#160; 0.18K &#160; &#160; 573555 &#160; 44&#160; &#160; &#160; &#160; 4588440K&#160; &#160; xfs_log_ticket
25286568&#160; 24717197&#160; 97% &#160; 0.17K &#160; &#160; 549708 &#160; 46&#160; &#160; &#160; &#160; 4397664K&#160; &#160; xfs_ili
14103054&#160; 13252206&#160; 93% &#160; 0.19K &#160; &#160; 335787 &#160; 42&#160; &#160; &#160; &#160; 2686296K&#160; &#160; dentry
...
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The ptlrpc_cache shrinks from GB to MB in size upon&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;echo 2 &amp;gt; /proc/sys/vm/drop_caches&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This particular node has 128GB of RAM, so this represents a very large portion of the total.&lt;/p&gt;

&lt;p&gt;After a suggestion by Oleg (see below) the node was rebooted with kernel command line parameters slab_nomerge and slub_nomerge.&#160; After doing that, it was found that the cache actually taking up all the space was the lustre_inode_cache.&lt;/p&gt;

&lt;p&gt;At the same time, I saw kthread_run() and fork() failures reported in the console log. &#160;Those failures turned out to be a result of sysctl kernel.pid_max being too low, and were not related to the amount of memory that was in use or free.&lt;/p&gt;</description>
                <environment>3.10.0-1160.4.1.1chaos.ch6.x86_64&lt;br/&gt;
Server: lustre-2.12.6_3.llnl-1.ch6.x86_64&lt;br/&gt;
Client: lustre-2.14.0-something&lt;br/&gt;
starfish &amp;quot;agent&amp;quot;</environment>
        <key id="62800">LU-14408</key>
            <summary>very large lustre_inode_cache</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 10 Feb 2021 01:44:41 +0000</created>
                <updated>Thu, 25 Jan 2024 17:18:54 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="291584" author="ofaaland" created="Wed, 10 Feb 2021 01:46:29 +0000"  >&lt;p&gt;I wonder if it&apos;s related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13909&quot; title=&quot;release invalid dentries proactively on client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13909&quot;&gt;&lt;del&gt;LU-13909&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="291596" author="green" created="Wed, 10 Feb 2021 04:52:39 +0000"  >&lt;p&gt;Modern kernels seem to combine unrelated slub caches of same-size items, so it could be something else (likely something lock/inode related?)&lt;/p&gt;

&lt;p&gt;there&apos;s an option to separate them to know for sure - set slab_nomerge/slub_nomerge on the kernel command line.&lt;/p&gt;

&lt;p&gt;I guess if we assume it&apos;s the same issue as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13909&quot; title=&quot;release invalid dentries proactively on client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13909&quot;&gt;&lt;del&gt;LU-13909&lt;/del&gt;&lt;/a&gt; that might be the case too, though I do see the dentry count is about 1/3 of the ptlrpc cache object count, so there might be some other factor as well.&lt;/p&gt;</comment>
                            <comment id="291636" author="ofaaland" created="Wed, 10 Feb 2021 17:44:08 +0000"  >&lt;p&gt;Thanks Oleg, that&apos;s really helpful.  I&apos;ll set those commandline options and see what things look like.&lt;/p&gt;</comment>
                            <comment id="291796" author="ofaaland" created="Thu, 11 Feb 2021 20:37:42 +0000"  >&lt;p&gt;After using slab_nomerge / slub_nomerge&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@solfish2:~]# free -g
&#160; &#160; &#160; &#160; &#160; &#160; &#160; total&#160; &#160; &#160; &#160; used&#160; &#160; &#160; &#160; free&#160; &#160; &#160; shared&#160; buff/cache &#160; available
Mem:&#160; &#160; &#160; &#160; &#160; &#160; 125 &#160; &#160; &#160; &#160; 121 &#160; &#160; &#160; &#160; &#160; 1 &#160; &#160; &#160; &#160; &#160; 0 &#160; &#160; &#160; &#160; &#160; 3 &#160; &#160; &#160; &#160; &#160; 3
Swap: &#160; &#160; &#160; &#160; &#160; 127 &#160; &#160; &#160; &#160; &#160; 0 &#160; &#160; &#160; &#160; 127

[root@solfish2:~]#  slabtop --sort c --once | head -n 20
 Active / Total Objects (% used)    : 501176259 / 520226233 (96.3%)
 Active / Total Slabs (% used)      : 10235735 / 10235735 (100.0%)
 Active / Total Caches (% used)     : 170 / 251 (67.7%)
 Active / Total Size (% used)       : 111121653.05K / 112324794.23K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

OBJS      ACTIVE    USE   OBJ_SIZE  SLABS    OBJ/SLAB  CACHE_SIZE  NAME
45514686  45514686  100%  1.12K     1626337  28        52042784K   lustre_inode_cache
44019375  44019375  99%   0.31K     863125   51        13810000K   osc_object_kmem
42476616  42476616  100%  0.19K     1011348  42        8090784K    kmalloc-192
42464092  42464092  100%  0.18K     965093   44        7720744K    lov_object_kmem
42464118  42464118  100%  0.17K     923133   46        7385064K    vvp_object_kmem
44019360  44019360  100%  0.16K     917070   48        7336560K    lovsub_object_kmem
44019200  44019200  100%  0.12K     1375600  32        5502400K    lov_oinfo
42729162  42720524  99%   0.09K     1017361  42        4069444K    kmalloc-96
59263424  43529019  73%   0.06K     925991   64        3703964K    kmalloc-64
11839254  11771513  99%   0.19K     281887   42        2255096K    dentry
84897280  84897280  100%  0.01K     165815   512       663260K     kmalloc-8
14558592  11710836  80%   0.03K     113739   128       454956K     kmalloc-32
170612    169231    99%   0.94K     5018     34        160576K     xfs_inode
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="291800" author="ofaaland" created="Thu, 11 Feb 2021 21:11:25 +0000"  >&lt;p&gt;So it seems the lustre_inode_cache is what is really taking up the space, and as you said had been merged with the ptlrpc_cache.&#160; I learned about vfs_cache_pressure today, which is currently set to the default value of 100.&#160; I&apos;ll try increasing that to some larger value.&lt;/p&gt;

&lt;p&gt;Are the objects in lustre_inode_cache there because VFS has cached corresponding objects, and so the really high memory usage is just a side-effect of something the kernel is doing, like caching dentries?&lt;/p&gt;</comment>
                            <comment id="291814" author="ofaaland" created="Thu, 11 Feb 2021 22:45:13 +0000"  >&lt;p&gt;Setting vfs_cache_pressure to 200 and then to 1000 seemed to make no difference.&#160; But the system seems stable even though free memory is low, so maybe the problem is a bug related to cache merging.&lt;/p&gt;</comment>
                            <comment id="291931" author="adilger" created="Sat, 13 Feb 2021 10:13:44 +0000"  >&lt;p&gt;What is strange here is that the number of cached inodes looks to be far more than the number of locks held by the client, since &lt;tt&gt;ldlm_lock&lt;/tt&gt; and &lt;tt&gt;ldlm_resource&lt;/tt&gt; do not even appear in this list.  The one patch in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13909&quot; title=&quot;release invalid dentries proactively on client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13909&quot;&gt;&lt;del&gt;LU-13909&lt;/del&gt;&lt;/a&gt; should help free inodes that are no longer under a lock, but in testing that patch we still found that directory inodes were kept in cache a long time because they have &lt;em&gt;subdirectories&lt;/em&gt; that are also in cache and keep them pinned.&lt;/p&gt;

&lt;p&gt;Maybe the issue is just that Starfish + Lustre is accessing a filesystem with many millions of directories (about 45M it would appear?), and the VFS is not doing very well at dealing with that case when it has so much RAM?  It may be that we need to do some extra work in Lustre to try and free the parent directory if it no longer has any child dentries on it, and does not have a lock?   I see several network filesystems are calling &lt;tt&gt;shrink_dcache_parent()&lt;/tt&gt;, but I haven&apos;t looked into the exact details of where this is called.&lt;/p&gt;</comment>
                            <comment id="292101" author="ofaaland" created="Tue, 16 Feb 2021 20:50:36 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;That&apos;s a good point.&#160; I&apos;ll find out how many directories there are in this file system.&#160; 45M is entirely possible.&lt;/p&gt;</comment>
                            <comment id="292104" author="simmonsja" created="Tue, 16 Feb 2021 21:11:50 +0000"  >&lt;p&gt;There is a hook in newer kernels to shrink the inode cache. See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13833&quot; title=&quot;hook llite to inode cache shrinker&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13833&quot;&gt;LU-13833&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It appears to be supported in RHEL7 as well.&lt;/p&gt;</comment>
                            <comment id="292105" author="ofaaland" created="Tue, 16 Feb 2021 21:55:03 +0000"  >&lt;p&gt;Andreas, there are very likely around 45M directories on the file system being scanned.  It had 47M last May.&lt;/p&gt;</comment>
                            <comment id="292139" author="adilger" created="Wed, 17 Feb 2021 08:49:17 +0000"  >&lt;p&gt;Olaf, in the meantime, for clients like this that run filesystem scanners like Starfish, RobinHood, etc. you can run &quot;&lt;tt&gt;echo 2 &amp;gt; /proc/sys/vm/drop_caches&lt;/tt&gt;&quot; periodically (e.g. every 10 minutes from cron) to drop the dentries and inodes out of the cache. &lt;/p&gt;</comment>
                            <comment id="292189" author="ofaaland" created="Wed, 17 Feb 2021 16:53:04 +0000"  >&lt;p&gt;Thanks, Andreas.&lt;/p&gt;</comment>
                            <comment id="292686" author="ofaaland" created="Mon, 22 Feb 2021 21:50:18 +0000"  >&lt;p&gt;The other issue occurring on this node, kthread_create() failures, seems not (at least not obviously) related to this.  The kthread_create() failures &quot;went away&quot; even under low memory.  Removing the topllnl label.&lt;/p&gt;</comment>
                            <comment id="314827" author="ofaaland" created="Wed, 6 Oct 2021 16:20:46 +0000"  >&lt;p&gt;I&apos;ve updated this node to Lustre 2.14 and added a crontab to drop caches every 10 minutes.&#160; This improves the situation, so it happens less often, but I still see intermittent failures to create threads etc. with ENOMEM.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[Mon Sep 13 09:26:52 2021] bash (10120): drop_caches: 2
[Mon Sep 13 09:36:49 2021] bash (28252): drop_caches: 2
[Mon Sep 13 09:46:33 2021] bash (4675): drop_caches: 2
[Mon Sep 13 09:56:30 2021] LustreError: 5311:0:(statahead.c:1614:start_statahead_thread()) can&apos;t start ll_sa thread, rc: -12
[Mon Sep 13 09:56:30 2021] LustreError: 5311:0:(statahead.c:1614:start_statahead_thread()) Skipped 3 previous similar messages
[Mon Sep 13 09:56:31 2021] LustreError: 5293:0:(statahead.c:991:ll_start_agl()) can&apos;t start ll_agl thread, rc: -12
[Mon Sep 13 09:56:31 2021] LustreError: 5293:0:(statahead.c:991:ll_start_agl()) Skipped 2 previous similar messages
[Mon Sep 13 09:56:31 2021] LustreError: 5295:0:(statahead.c:1614:start_statahead_thread()) can&apos;t start ll_sa thread, rc: -12 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="314858" author="simmonsja" created="Wed, 6 Oct 2021 19:25:55 +0000"  >&lt;p&gt;The two struct super_operations hooks you need to implement are:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;long (*nr_cached_objects)(struct super_block *, struct shrink_control *);
long (*free_cached_objects)(struct super_block *, struct shrink_control *);&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;nr_cached_objects() returns the number of inodes you can free, and free_cached_objects() does the actual freeing of the inodes.&#160; xfs is currently the only filesystem that implements this.&#160;&lt;/p&gt;

&lt;p&gt;This can be done under ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13833&quot; title=&quot;hook llite to inode cache shrinker&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13833&quot;&gt;LU-13833&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="317490" author="ofaaland" created="Thu, 4 Nov 2021 18:37:45 +0000"  >&lt;p&gt;This problem is affecting us more as we try to scan/monitor more file systems.  The fork() failures in particular cause other processes (i.e. those trying to fetch and process changelogs) to fail, making the system quite fragile.  I need some help finding a better workaround or a fix.&lt;/p&gt;

&lt;p&gt;Note that I&apos;m running the 2.14 client on this system, so it has the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13909&quot; title=&quot;release invalid dentries proactively on client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13909&quot;&gt;&lt;del&gt;LU-13909&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;James, I don&apos;t understand why a per-superblock shrinker will help with this.  I would think that the shrinker either isn&apos;t running or is running and being told nothing can be freed, or I wouldn&apos;t be seeing this issue.  What am I missing?&lt;/p&gt;

&lt;p&gt;Is there some evidence I can gather to evaluate Andreas&apos; theory about subdirectories keeping their parent cached?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="317782" author="adilger" created="Tue, 9 Nov 2021 22:57:00 +0000"  >&lt;p&gt;There is patch &lt;a href=&quot;https://review.whamcloud.com/40011&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40011&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13983&quot; title=&quot;rmdir should release inode on Lustre client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13983&quot;&gt;&lt;del&gt;LU-13983&lt;/del&gt;&lt;/a&gt; llite: rmdir releases inode on client&lt;/tt&gt;&quot; that was included in 2.14.0 but not backported to 2.12.  It may help reduce the number of cached inodes on the client, but is specifically for when a directory is removed on a client.&lt;/p&gt;

&lt;p&gt;Also, as a short-term workaround, the patch &lt;a href=&quot;https://review.whamcloud.com/39973&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39973&lt;/a&gt; &quot;&lt;tt&gt;Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13970&quot; title=&quot;add an option to disable inode cache on Lustre client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13970&quot;&gt;&lt;del&gt;LU-13970&lt;/del&gt;&lt;/a&gt; llite: add option to disable inode cache&lt;/tt&gt;&quot; adds an option &quot;&lt;tt&gt;llite.&amp;#42;.inode_cache=off&lt;/tt&gt;&quot; on the client that can be used to disable the Lustre inode cache on the few clients that are doing full-filesystem scans, because they are unlikely to re-use the inode cache anyway, due to age and/or memory pressure.   &lt;/p&gt;</comment>
                            <comment id="317787" author="ofaaland" created="Tue, 9 Nov 2021 23:21:33 +0000"  >&lt;p&gt;Andreas, thanks for those ideas.  This node where we see this problem is running 2.14, so it must have patch 40011 for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13983&quot; title=&quot;rmdir should release inode on Lustre client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13983&quot;&gt;&lt;del&gt;LU-13983&lt;/del&gt;&lt;/a&gt;.  Also, this client isn&apos;t modifying the file system at all - in fact, Lustre is mounted R/O at the moment.&lt;/p&gt;

&lt;p&gt;I&apos;ll add patch 39973 for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13970&quot; title=&quot;add an option to disable inode cache on Lustre client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13970&quot;&gt;&lt;del&gt;LU-13970&lt;/del&gt;&lt;/a&gt; to our build and disable the Lustre inode cache.&lt;/p&gt;</comment>
                            <comment id="318914" author="ofaaland" created="Mon, 22 Nov 2021 22:25:13 +0000"  >&lt;p&gt;Hi Andreas, patch 39973 applied cleanly and looks reasonable to me, but the inode_cache procfile isn&apos;t created when I mount a client.  I&apos;m off the rest of the week, so I&apos;ll figure out what&apos;s wrong next week and let you know if I need help with it.&lt;/p&gt;</comment>
                            <comment id="319910" author="ofaaland" created="Thu, 2 Dec 2021 17:00:52 +0000"  >&lt;p&gt;My mistake, patch 39973 did create the inode_cache sysfs file on the node.  I&apos;ll test with inode_cache=0 over the next week.&lt;/p&gt;</comment>
                            <comment id="322621" author="ofaaland" created="Thu, 13 Jan 2022 19:27:06 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;Patch 39973 was pulled into our stack and setting inode_cache=0 has improved the situation.&#160; Is this acceptable for pulling into master, or do you want to investigate other options?&lt;/p&gt;</comment>
                            <comment id="322664" author="adilger" created="Fri, 14 Jan 2022 02:55:02 +0000"  >&lt;p&gt;Olaf, 39973 is in reasonable shape to land, and could potentially still make it into 2.15 if it is refreshed and reviewed quickly. My preference is to fix the root of the problem (inodes not being flushed from cache by the normal VM mechanisms), but 39973 is at least an option for specific workloads until the solution is found. &lt;/p&gt;</comment>
                            <comment id="322846" author="ofaaland" created="Sat, 15 Jan 2022 00:13:29 +0000"  >&lt;p&gt;I agree finding the root cause would be best.&#160; I can run debug patches and reproduce the issue.&#160; I&apos;m not sure whether I&apos;ll have time to learn enough to find the problem myself.&lt;/p&gt;</comment>
                            <comment id="337109" author="ofaaland" created="Wed, 8 Jun 2022 23:37:01 +0000"  >&lt;p&gt;On Nov 4th I wrote:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;This problem is affecting us more as we try to scan/monitor more file systems. The fork() failures in particular cause other processes (i.e. those trying to fetch and process changelogs) to fail, making the system quite fragile. I need some help finding a better workaround or a fix.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It turned out fork() and kthread_create() were failing not because of insufficient memory, but because sysctl kernel.pid_max was too small (this node is running RHEL 7 x86_64 with 32 cores and /sys/devices/system/cpu/possible reports 0-31, so pid_max was 32K by default).  When the kernel fails to create a new process because it can&apos;t allocate a PID, errno is set to ENOMEM, which is easy to misinterpret as memory exhaustion.&lt;/p&gt;</comment>
                            <comment id="337243" author="green" created="Fri, 10 Jun 2022 03:50:08 +0000"  >&lt;p&gt;hm, that&apos;s certainly an interesting twist of events.&lt;/p&gt;

&lt;p&gt;Is it ok to close this now or is there anything we could do better here?&lt;/p&gt;</comment>
                            <comment id="337360" author="ofaaland" created="Sat, 11 Jun 2022 00:28:06 +0000"  >&lt;p&gt;&amp;gt; hm, that&apos;s certainly an interesting twist of events.&lt;/p&gt;

&lt;p&gt;Yes! Thanks for the help eliminating memory usage from the list of potential root causes.&lt;/p&gt;

&lt;p&gt;&amp;gt; Is it ok to close this now or is there anything we could do better here?&lt;/p&gt;

&lt;p&gt;For a workload like Starfish where it&apos;s just reading directories and stat-ing files, the memory usage for cached inodes and dentries seems excessive. &#160;But it&apos;s not actually causing a problem that I&apos;m aware of, and so I think closing this is reasonable.&lt;/p&gt;

&lt;p&gt;I&apos;ll remove my incorrect speculation from the description so it doesn&apos;t mislead someone who comes across this ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="60188">LU-13833</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="60417">LU-13909</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="60914">LU-13983</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="60851">LU-13970</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01m53:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>