[LU-14408] very large lustre_inode_cache Created: 10/Feb/21  Updated: 25/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Oleg Drokin
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

3.10.0-1160.4.1.1chaos.ch6.x86_64
Server: lustre-2.12.6_3.llnl-1.ch6.x86_64
Client: lustre-2.14.0-something
starfish "agent"


Issue Links:
Related
is related to LU-13833 hook llite to inode cache shrinker Open
is related to LU-13909 release invalid dentries proactively ... Resolved
is related to LU-13983 rmdir should release inode on Lustre ... Resolved
is related to LU-13970 add an option to disable inode cache ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The ptlrpc_cache repeatedly grows very large on a node running Starfish (a policy engine similar to RobinHood).

[root@solfish2:~]# cat /tmp/t4
 Active / Total Objects (% used)    : 508941033 / 523041216 (97.3%)
 Active / Total Slabs (% used)      : 11219941 / 11219941 (100.0%)
 Active / Total Caches (% used)     : 87 / 122 (71.3%)
 Active / Total Size (% used)       : 112878003.58K / 114522983.04K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

OBJS      ACTIVE    USE   OBJ_SIZE  SLABS    OBJ/SLAB  CACHE_SIZE  NAME
30545252  30067595  98%   1.12K     1092909  28        34973088K   ptlrpc_cache
92347047  92347047  99%   0.31K     1810744  51        28971904K   bio-3
92346672  92346672  100%  0.16K     1923889  48        15391112K   xfs_icr
92409312  92409312  100%  0.12K     2887791  32        11551164K   kmalloc-128
25717818  23912628  92%   0.19K     612329   42        4898632K    kmalloc-192
25236420  24708346  97%   0.18K     573555   44        4588440K    xfs_log_ticket
25286568  24717197  97%   0.17K     549708   46        4397664K    xfs_ili
14103054  13252206  93%   0.19K     335787   42        2686296K    dentry
...
 

The ptlrpc_cache shrinks from GB to MB in size upon

echo 2 > /proc/sys/vm/drop_caches

This particular node has 128GB of RAM, so this represents a very large portion of the total.

After a suggestion by Oleg (see below), the node was rebooted with the kernel command line parameters slab_nomerge and slub_nomerge.  After doing that, it was found that the cache actually taking up all the space was the lustre_inode_cache.

At the same time, I saw kthread_run() and fork() failures reported in the console log.  Those failures turned out to be a result of sysctl kernel.pid_max being too low, and were not related to the amount of memory in use or free.



 Comments   
Comment by Olaf Faaland [ 10/Feb/21 ]

I wonder if it's related to LU-13909

Comment by Oleg Drokin [ 10/Feb/21 ]

Modern kernels seem to combine unrelated slub caches of same-size items together, so it could be something else (likely something lock/inode related?)

there's an option to separate them to know for sure - set slab_nomerge/slub_nomerge in kernel commandline.

I guess if we assume it's the same issue as LU-13909 that might be the case too, though I do see the dentry count is about 1/3 of the ptlrpc cache object count, so there might be some other factor as well.
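
For reference, on a RHEL 7-style node like this one the flags can be added to the kernel command line roughly as follows (the grub paths shown are the usual BIOS-boot defaults and may differ locally):

# append "slab_nomerge slub_nomerge" to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
cat /proc/cmdline    # verify the flags are present after the reboot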

Comment by Olaf Faaland [ 10/Feb/21 ]

Thanks Oleg, that's really helpful. I'll set those commandline options and see what things look like.

Comment by Olaf Faaland [ 11/Feb/21 ]

After using slab_nomerge / slub_nomerge

[root@solfish2:~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            125         121           1           0           3           3
Swap:           127           0         127

[root@solfish2:~]#  slabtop --sort c --once | head -n 20
 Active / Total Objects (% used)    : 501176259 / 520226233 (96.3%)
 Active / Total Slabs (% used)      : 10235735 / 10235735 (100.0%)
 Active / Total Caches (% used)     : 170 / 251 (67.7%)
 Active / Total Size (% used)       : 111121653.05K / 112324794.23K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

OBJS      ACTIVE    USE   OBJ_SIZE  SLABS    OBJ/SLAB  CACHE_SIZE  NAME
45514686  45514686  100%  1.12K     1626337  28        52042784K   lustre_inode_cache
44019375  44019375  99%   0.31K     863125   51        13810000K   osc_object_kmem
42476616  42476616  100%  0.19K     1011348  42        8090784K    kmalloc-192
42464092  42464092  100%  0.18K     965093   44        7720744K    lov_object_kmem
42464118  42464118  100%  0.17K     923133   46        7385064K    vvp_object_kmem
44019360  44019360  100%  0.16K     917070   48        7336560K    lovsub_object_kmem
44019200  44019200  100%  0.12K     1375600  32        5502400K    lov_oinfo
42729162  42720524  99%   0.09K     1017361  42        4069444K    kmalloc-96
59263424  43529019  73%   0.06K     925991   64        3703964K    kmalloc-64
11839254  11771513  99%   0.19K     281887   42        2255096K    dentry
84897280  84897280  100%  0.01K     165815   512       663260K     kmalloc-8
14558592  11710836  80%   0.03K     113739   128       454956K     kmalloc-32
170612    169231    99%   0.94K     5018     34        160576K     xfs_inode

Comment by Olaf Faaland [ 11/Feb/21 ]

So it seems the lustre_inode_cache is what is really taking up the space, and as you said had been merged with the ptlrpc_cache.  I learned about vfs_cache_pressure today, which is currently set to the default value of 100.  I'll try increasing that to some larger value.

Are the objects in lustre_inode_cache there because VFS has cached corresponding objects, and so the really high memory usage is just a side-effect of something the kernel is doing, like caching dentries?
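
For reference, vfs_cache_pressure is a runtime sysctl, so experimenting with it needs no reboot (the value below is only an example):

sysctl -w vm.vfs_cache_pressure=200    # or: echo 200 > /proc/sys/vm/vfs_cache_pressure
sysctl vm.vfs_cache_pressure           # confirm the value currently in effect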

Comment by Olaf Faaland [ 11/Feb/21 ]

Setting vfs_cache_pressure to 200 and then to 1000 seemed to make no difference.  But the system seems stable even though free memory is low, so maybe the problem is a bug related to cache merging.

Comment by Andreas Dilger [ 13/Feb/21 ]

What is strange here is that the number of cached inodes looks to be far more than the number of locks held by the client, since ldlm_lock and ldlm_resource do not even appear in this list. The one patch in LU-13909 should help free inodes that are no longer under a lock, but in testing that patch we still found that directory inodes were kept in cache for a long time because they have subdirectories that are also in cache, which keeps them pinned.

Maybe the issue is just that Starfish + Lustre is accessing a filesystem with many millions of directories (about 45M, it would appear?), and the VFS is not dealing very well with that case when it has so much RAM? It may be that we need to do some extra work in Lustre to try to free the parent directory if it no longer has any child dentries on it and does not have a lock? I see several network filesystems are calling shrink_dcache_parent(), but I haven't looked into the exact details of where this is called.

Comment by Olaf Faaland [ 16/Feb/21 ]

Andreas,

That's a good point.  I'll find out how many directories there are in this file system.  45M is entirely possible.

Comment by James A Simmons [ 16/Feb/21 ]

There does exist a hook in newer kernels to shrink the inode cache. See LU-13833.

It appears to be supported in RHEL7 as well.

Comment by Olaf Faaland [ 16/Feb/21 ]

Andreas, there are very likely around 45M directories on the file system being scanned. It had 47M last May.

Comment by Andreas Dilger [ 17/Feb/21 ]

Olaf, in the meantime, for clients like this that run filesystem scanners like Starfish, RobinHood, etc., you can run "echo 2 > /proc/sys/vm/drop_caches" periodically (e.g. every 10 minutes from cron) to drop the dentries and inodes out of the cache.
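
A minimal cron.d entry along those lines (the file name and interval are just examples) would be:

# /etc/cron.d/drop-lustre-caches: drop dentries and inodes every 10 minutes
*/10 * * * * root echo 2 > /proc/sys/vm/drop_caches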

Comment by Olaf Faaland [ 17/Feb/21 ]

Thanks, Andreas.

Comment by Olaf Faaland [ 22/Feb/21 ]

The other issue occurring on this node, kthread_create() failures, seems not (at least not obviously) related to this. kthread_create() failures "went away" even under low memory. Removing topllnl.

Comment by Olaf Faaland [ 06/Oct/21 ]

I've updated this node to Lustre 2.14 and added a crontab entry to drop caches every 10 minutes.  This improves the situation, so the problem happens less often, but I still see intermittent failures to create threads etc. with ENOMEM.

[Mon Sep 13 09:26:52 2021] bash (10120): drop_caches: 2
[Mon Sep 13 09:36:49 2021] bash (28252): drop_caches: 2
[Mon Sep 13 09:46:33 2021] bash (4675): drop_caches: 2
[Mon Sep 13 09:56:30 2021] LustreError: 5311:0:(statahead.c:1614:start_statahead_thread()) can't start ll_sa thread, rc: -12
[Mon Sep 13 09:56:30 2021] LustreError: 5311:0:(statahead.c:1614:start_statahead_thread()) Skipped 3 previous similar messages
[Mon Sep 13 09:56:31 2021] LustreError: 5293:0:(statahead.c:991:ll_start_agl()) can't start ll_agl thread, rc: -12
[Mon Sep 13 09:56:31 2021] LustreError: 5293:0:(statahead.c:991:ll_start_agl()) Skipped 2 previous similar messages
[Mon Sep 13 09:56:31 2021] LustreError: 5295:0:(statahead.c:1614:start_statahead_thread()) can't start ll_sa thread, rc: -12

Comment by James A Simmons [ 06/Oct/21 ]

The two struct super_operations you need to implement are:

long (*nr_cached_objects)(struct super_block *, struct shrink_control *);

long (*free_cached_objects)(struct super_block *, struct shrink_control *);

nr_cached_objects returns the number of inodes you can free, and free_cached_objects does the actual freeing of the inodes.  XFS is currently the only filesystem that implements this.

This can be done under ticket LU-13833

Comment by Olaf Faaland [ 04/Nov/21 ]

This problem is affecting us more as we try to scan/monitor more file systems. The fork() failures in particular cause other processes (i.e. those trying to fetch and process changelogs) to fail, making the system quite fragile. I need some help finding a better workaround or a fix.

Note that I'm running the 2.14 client on this system, so it has the patch from LU-13909.

James, I don't understand why a per-superblock shrinker will help with this. I would think that the shrinker either isn't running or is running and being told nothing can be freed, or I wouldn't be seeing this issue. What am I missing?

Is there some evidence I can gather to evaluate Andreas' theory about subdirectories keeping their parent cached?
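
As a starting point I can sample the VFS cache counters alongside slabtop while the scan runs, to see whether the dentry and inode counts track each other (a sketch; these counters only show totals, not what is pinning what):

cat /proc/sys/fs/dentry-state    # nr_dentry, nr_unused, age_limit, want_pages, ...
cat /proc/sys/fs/inode-nr        # nr_inodes, nr_free_inodes
slabtop --sort c --once | head -n 20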

thanks

Comment by Andreas Dilger [ 09/Nov/21 ]

There is patch https://review.whamcloud.com/40011 "LU-13983 llite: rmdir releases inode on client" that was included in 2.14.0 but not backported to 2.12. It may help reduce the number of cached inodes on the client, but is specifically for when a directory is removed on a client.

Also, as a short-term workaround, the patch https://review.whamcloud.com/39973 "LU-13970 llite: add option to disable inode cache" adds an option "llite.*.inode_cache=off" on the client that can be used to disable the Lustre inode cache on the few clients that are doing full-filesystem scans, because they are unlikely to re-use the inode cache anyway, due to age and/or memory pressure.
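
With that patch applied the knob is an llite parameter toggled with lctl, roughly as follows; the exact accepted values (0/1 vs. on/off) are whatever the patch defines:

lctl set_param llite.*.inode_cache=0    # disable caching of Lustre inodes on this client
lctl get_param llite.*.inode_cache      # check the current setting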

Comment by Olaf Faaland [ 09/Nov/21 ]

Andreas, thanks for those ideas. This node where we see this problem is running 2.14, so it must have patch 40011 for LU-13983. Also, this client isn't modifying the file system at all - in fact, Lustre is mounted R/O at the moment.

I'll add patch 39973 for LU-13970 to our build and disable the Lustre inode cache.

Comment by Olaf Faaland [ 22/Nov/21 ]

Hi Andreas, patch 39973 applied cleanly and looks reasonable to me, but the inode_cache procfile isn't created when I mount a client. I'm off the rest of the week, so I'll figure out what's wrong next week and let you know if I need help with it.

Comment by Olaf Faaland [ 02/Dec/21 ]

My mistake, patch 39973 did create the inode_cache sysfs file on the node. I'll test with inode_cache=0 over the next week.

Comment by Olaf Faaland [ 13/Jan/22 ]

Andreas,

Patch 39973 was pulled into our stack and setting inode_cache=0 has improved the situation.  Is this acceptable for pulling into master, or do you want to investigate other options?

Comment by Andreas Dilger [ 14/Jan/22 ]

Olaf, 39973 is in reasonable shape to land, and could potentially still make it into 2.15 if it is refreshed and reviewed quickly. My preference is to fix the root of the problem (inodes not being flushed from cache by the normal VM mechanisms), but 39973 is at least an option for specific workloads until the solution is found.

Comment by Olaf Faaland [ 15/Jan/22 ]

I agree finding the root cause would be best.  I can run debug patches and reproduce the issue.  I'm not sure whether I'll have time to learn enough to find the problem myself.

Comment by Olaf Faaland [ 08/Jun/22 ]

On Nov 4th I wrote:

This problem is affecting us more as we try to scan/monitor more file systems. The fork() failures in particular cause other processes (i.e. those trying to fetch and process changelogs) to fail, making the system quite fragile. I need some help finding a better workaround or a fix.

It turned out fork() and kthread_create() were failing not because of insufficient memory, but because sysctl kernel.pid_max was too small (this node is running RHEL 7 x86_64 with 32 cores and /sys/devices/system/cpu/possible reports 0-31, so pid_max was 32K by default). When the kernel fails to create a new process because it can't allocate a PID, errno is set to ENOMEM, which is easy to misinterpret.
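
For reference, both the check and the fix are one-liners (the target value and the sysctl.d file name below are just examples; the 64-bit maximum is 4194304):

cat /proc/sys/kernel/pid_max                    # 32768 on this node by default
sysctl -w kernel.pid_max=4194304                # raise it immediately
echo "kernel.pid_max = 4194304" > /etc/sysctl.d/90-pid-max.conf    # persist across reboots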

Comment by Oleg Drokin [ 10/Jun/22 ]

hm, that's certainly an interesting twist of events.

Is it ok to close this now or is there anything we could do better here?

Comment by Olaf Faaland [ 11/Jun/22 ]

> hm, that's certainly an interesting twist of events.

Yes! Thanks for the help eliminating memory usage from the list of potential root causes.

> Is it ok to close this now or is there anything we could do better here?

For a workload like Starfish, where it's just reading directories and stat-ing files, the memory usage for cached inodes and dentries seems excessive.  But it's not actually causing a problem that I'm aware of, so I think closing this is reasonable.

I'll remove my incorrect speculation from the description so it doesn't mislead someone who comes across this ticket.
