[LU-14408] very large lustre_inode_cache Created: 10/Feb/21 Updated: 25/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | 3.10.0-1160.4.1.1chaos.ch6.x86_64 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The ptlrpc_cache repeatedly grows very, very large on a node running Starfish (a policy engine similar to Robinhood).

[root@solfish2:~]# cat /tmp/t4
 Active / Total Objects (% used)    : 508941033 / 523041216 (97.3%)
 Active / Total Slabs (% used)      : 11219941 / 11219941 (100.0%)
 Active / Total Caches (% used)     : 87 / 122 (71.3%)
 Active / Total Size (% used)       : 112878003.58K / 114522983.04K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

    OBJS   ACTIVE  USE OBJ_SIZE  SLABS OBJ/SLAB CACHE_SIZE NAME
30545252 30067595  98%    1.12K 1092909       28  34973088K ptlrpc_cache
92347047 92347047  99%    0.31K 1810744       51  28971904K bio-3
92346672 92346672 100%    0.16K 1923889       48  15391112K xfs_icr
92409312 92409312 100%    0.12K 2887791       32  11551164K kmalloc-128
25717818 23912628  92%    0.19K  612329       42   4898632K kmalloc-192
25236420 24708346  97%    0.18K  573555       44   4588440K xfs_log_ticket
25286568 24717197  97%    0.17K  549708       46   4397664K xfs_ili
14103054 13252206  93%    0.19K  335787       42   2686296K dentry
...

The ptlrpc_cache shrinks from GB to MB in size upon

echo 2 > /proc/sys/vm/drop_caches

This particular node has 128GB of RAM, so this cache represents a very large portion of the total.

After a suggestion by Oleg (see below), the node was rebooted with the kernel command line parameters slab_nomerge and slub_nomerge. After doing that, it was found that the cache actually taking up all the space was the lustre_inode_cache.

At the same time, I saw kthread_run() and fork() failures reported in the console log. Those failures turned out to be a result of sysctl kernel.pid_max being too low, and were not related to the amount of memory that was in use or free. |
| Comments |
| Comment by Olaf Faaland [ 10/Feb/21 ] |
|
I wonder if it's related to |
| Comment by Oleg Drokin [ 10/Feb/21 ] |
|
Modern kernels seem to combine unrelated slub caches of same-size items together, so it could be something else (likely something lock/inode related?). There's an option to separate them to know for sure: set slab_nomerge/slub_nomerge on the kernel command line. I guess if we assume it's the same issue as |
| Comment by Olaf Faaland [ 10/Feb/21 ] |
|
Thanks Oleg, that's really helpful. I'll set those commandline options and see what things look like. |
| Comment by Olaf Faaland [ 11/Feb/21 ] |
|
After using slab_nomerge / slub_nomerge:

[root@solfish2:~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            125         121           1           0           3           3
Swap:           127           0         127

[root@solfish2:~]# slabtop --sort c --once | head -n 20
 Active / Total Objects (% used)    : 501176259 / 520226233 (96.3%)
 Active / Total Slabs (% used)      : 10235735 / 10235735 (100.0%)
 Active / Total Caches (% used)     : 170 / 251 (67.7%)
 Active / Total Size (% used)       : 111121653.05K / 112324794.23K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

    OBJS   ACTIVE  USE OBJ_SIZE  SLABS OBJ/SLAB CACHE_SIZE NAME
45514686 45514686 100%    1.12K 1626337       28  52042784K lustre_inode_cache
44019375 44019375  99%    0.31K  863125       51  13810000K osc_object_kmem
42476616 42476616 100%    0.19K 1011348       42   8090784K kmalloc-192
42464092 42464092 100%    0.18K  965093       44   7720744K lov_object_kmem
42464118 42464118 100%    0.17K  923133       46   7385064K vvp_object_kmem
44019360 44019360 100%    0.16K  917070       48   7336560K lovsub_object_kmem
44019200 44019200 100%    0.12K 1375600       32   5502400K lov_oinfo
42729162 42720524  99%    0.09K 1017361       42   4069444K kmalloc-96
59263424 43529019  73%    0.06K  925991       64   3703964K kmalloc-64
11839254 11771513  99%    0.19K  281887       42   2255096K dentry
84897280 84897280 100%    0.01K  165815      512    663260K kmalloc-8
14558592 11710836  80%    0.03K  113739      128    454956K kmalloc-32
  170612   169231  99%    0.94K    5018       34    160576K xfs_inode |
| Comment by Olaf Faaland [ 11/Feb/21 ] |
|
So it seems the lustre_inode_cache is what is really taking up the space, and, as you said, it had been merged with the ptlrpc_cache. I learned about vfs_cache_pressure today, which is currently set to the default value of 100. I'll try increasing that to some larger value. Are the objects in lustre_inode_cache there because the VFS has cached corresponding objects, so the really high memory usage is just a side effect of something the kernel is doing, like caching dentries? |
| Comment by Olaf Faaland [ 11/Feb/21 ] |
|
Setting vfs_cache_pressure to 200 and then to 1000 seemed to make no difference. But the system seems stable even though free memory is low, so maybe the problem is a bug related to cache merging. |
| Comment by Andreas Dilger [ 13/Feb/21 ] |
|
What is strange here is that the number of cached inodes looks to be far more than the number of locks held by the client, since ldlm_lock and ldlm_resource do not even appear in this list. The one patch in

Maybe the issue is just that Starfish + Lustre is accessing a filesystem with many millions of directories (about 45M, it would appear?), and the VFS is not doing very well at dealing with that case when it has so much RAM?

It may be that we need to do some extra work in Lustre to try to free the parent directory if it no longer has any child dentries on it and does not have a lock? I see several network filesystems calling shrink_dcache_parent(), but I haven't looked into the exact details of where this is called. |
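To illustrate the idea (a hypothetical sketch, not existing Lustre code): a client-side hook could prune a directory's unused child dentries so the parent itself becomes reclaimable. The function name my_prune_dir_dentries() and its call site are assumptions; d_find_alias(), shrink_dcache_parent(), and dput() are standard VFS interfaces.

/* Hedged sketch, not existing Lustre code: prune the unused child dentries
 * of a directory (for example, once its DLM lock is gone) so the parent
 * dentry and inode can be evicted.  my_prune_dir_dentries() is hypothetical. */
#include <linux/fs.h>
#include <linux/dcache.h>

static void my_prune_dir_dentries(struct inode *dir)
{
	struct dentry *parent;

	if (!S_ISDIR(dir->i_mode))
		return;

	parent = d_find_alias(dir);	/* takes a reference on one alias */
	if (!parent)
		return;

	shrink_dcache_parent(parent);	/* drop unused dentries beneath it */
	dput(parent);			/* release our reference */
}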
| Comment by Olaf Faaland [ 16/Feb/21 ] |
|
Andreas, that's a good point. I'll find out how many directories there are in this file system; 45M is entirely possible. |
| Comment by James A Simmons [ 16/Feb/21 ] |
|
There does exist a hook in newer kernels to shrink the inode cache. See LU-13833. It appears to be supported in RHEL7 as well. |
| Comment by Olaf Faaland [ 16/Feb/21 ] |
|
Andreas, there are very likely around 45M directories on the file system being scanned. It had 47M last May. |
| Comment by Andreas Dilger [ 17/Feb/21 ] |
|
Olaf, in the meantime, for clients like this that run filesystem scanners like Starfish, RobinHood, etc., you can run "echo 2 > /proc/sys/vm/drop_caches" periodically (e.g. every 10 minutes from cron) to drop the dentries and inodes out of the cache. |
| Comment by Olaf Faaland [ 17/Feb/21 ] |
|
Thanks, Andreas. |
| Comment by Olaf Faaland [ 22/Feb/21 ] |
|
The other issue occurring on this node, kthread_create() failures, seems not (at least not obviously) related to this. The kthread_create() failures "went away" even under low memory. Removing the topllnl label. |
| Comment by Olaf Faaland [ 06/Oct/21 ] |
|
I've updated this node to Lustre 2.14 and added a crontab to drop caches every 10 minutes. This improves the situation, so it happens less often, but I still see intermittent failures to create threads etc. with ENOMEM.

[Mon Sep 13 09:26:52 2021] bash (10120): drop_caches: 2
[Mon Sep 13 09:36:49 2021] bash (28252): drop_caches: 2
[Mon Sep 13 09:46:33 2021] bash (4675): drop_caches: 2
[Mon Sep 13 09:56:30 2021] LustreError: 5311:0:(statahead.c:1614:start_statahead_thread()) can't start ll_sa thread, rc: -12
[Mon Sep 13 09:56:30 2021] LustreError: 5311:0:(statahead.c:1614:start_statahead_thread()) Skipped 3 previous similar messages
[Mon Sep 13 09:56:31 2021] LustreError: 5293:0:(statahead.c:991:ll_start_agl()) can't start ll_agl thread, rc: -12
[Mon Sep 13 09:56:31 2021] LustreError: 5293:0:(statahead.c:991:ll_start_agl()) Skipped 2 previous similar messages
[Mon Sep 13 09:56:31 2021] LustreError: 5295:0:(statahead.c:1614:start_statahead_thread()) can't start ll_sa thread, rc: -12 |
| Comment by James A Simmons [ 06/Oct/21 ] |
|
The two struct super_operations methods you need to implement are:

long (*nr_cached_objects)(struct super_block *, struct shrink_control *);
long (*free_cached_objects)(struct super_block *, struct shrink_control *);

nr_cached_objects returns the number of inodes you can free, and free_cached_objects does the actual freeing of the inodes. XFS is currently the only filesystem that implements this. This can be done under ticket LU-13833. |
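A minimal sketch of how a filesystem might wire these two methods into its super_operations so the generic per-superblock shrinker can reclaim its cached inodes. This is not the actual LU-13833 patch; the myfs_* names, struct myfs_sb_info, and the reclaim walk are placeholder assumptions following the prototypes quoted above.

/* Hedged sketch (assumed names, not the real Lustre change): let the VFS
 * superblock shrinker ask the filesystem to count and free its own cached
 * inodes under memory pressure. */
#include <linux/fs.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>

struct myfs_sb_info {
	atomic_long_t reclaimable_inodes;	/* unreferenced, clean inodes */
};

static inline struct myfs_sb_info *MYFS_SB(struct super_block *sb)
{
	return sb->s_fs_info;
}

/* Shrinker asks: how many objects could be freed right now? */
static long myfs_nr_cached_objects(struct super_block *sb,
				   struct shrink_control *sc)
{
	return atomic_long_read(&MYFS_SB(sb)->reclaimable_inodes);
}

/* Shrinker asks to free up to sc->nr_to_scan objects; return how many were
 * actually freed.  A real filesystem would walk its per-sb inode LRU here. */
static long myfs_free_cached_objects(struct super_block *sb,
				     struct shrink_control *sc)
{
	long freed = 0;

	/* ... evict up to sc->nr_to_scan unreferenced inodes, count them ... */
	return freed;
}

static const struct super_operations myfs_super_ops = {
	/* .alloc_inode, .destroy_inode, .statfs, ... */
	.nr_cached_objects	= myfs_nr_cached_objects,
	.free_cached_objects	= myfs_free_cached_objects,
};

With these hooks in place, the kernel's superblock shrinker (super_cache_count()/super_cache_scan() in newer kernels) calls into the filesystem whenever it runs, so cached inodes shrink under normal memory pressure rather than only on drop_caches.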
| Comment by Olaf Faaland [ 04/Nov/21 ] |
|
This problem is affecting us more as we try to scan/monitor more file systems. The fork() failures in particular cause other processes (i.e. those trying to fetch and process changelogs) to fail, making the system quite fragile. I need some help finding a better workaround or a fix.

Note that I'm running the 2.14 client on this system, so it has the patch from James. I don't understand why a per-superblock shrinker would help with this; I would think that the shrinker either isn't running, or is running and being told nothing can be freed, or I wouldn't be seeing this issue. What am I missing? Is there some evidence I can gather to evaluate Andreas' theory about subdirectories keeping their parent cached?

thanks |
| Comment by Andreas Dilger [ 09/Nov/21 ] |
|
There is patch https://review.whamcloud.com/40011 " Also, as a short-term workaround, the patch https://review.whamcloud.com/39973 "Subject: |
| Comment by Olaf Faaland [ 09/Nov/21 ] |
|
Andreas, thanks for those ideas. This node where we see this problem is running 2.14, so it must have patch 40011 for

I'll add patch 39973 for |
| Comment by Olaf Faaland [ 22/Nov/21 ] |
|
Hi Andreas, patch 39973 applied cleanly and looks reasonable to me, but the inode_cache procfile isn't created when I mount a client. I'm off the rest of the week, so I'll figure out what's wrong next week and let you know if I need help with it. |
| Comment by Olaf Faaland [ 02/Dec/21 ] |
|
My mistake, patch 39973 did create the inode_cache sysfs file on the node. I'll test with inode_cache=0 over the next week. |
| Comment by Olaf Faaland [ 13/Jan/22 ] |
|
Andreas, patch 39973 was pulled into our stack, and setting inode_cache=0 has improved the situation. Is this acceptable for pulling into master, or do you want to investigate other options? |
| Comment by Andreas Dilger [ 14/Jan/22 ] |
|
Olaf, 39973 is in reasonable shape to land, and could potentially still make it into 2.15 if it is refreshed and reviewed quickly. My preference is to fix the root of the problem (inodes not being flushed from cache by the normal VM mechanisms), but 39973 is at least an option for specific workloads until the solution is found. |
| Comment by Olaf Faaland [ 15/Jan/22 ] |
|
I agree finding the root cause would be best. I can run debug patches and reproduce the issue. I'm not sure whether I'll have time to learn enough to find the problem myself. |
| Comment by Olaf Faaland [ 08/Jun/22 ] |
|
On Nov 4th I wrote:
It turned out fork() and kthread_create() were failing not because of insufficient memory, but because sysctl kernel.pid_max was too small (this node is running RHEL 7 x86_64 with 32 cores, and /sys/devices/system/cpu/possible reports 0-31, so pid_max was 32K by default). When the kernel fails to create a new process because it can't allocate a PID, errno is set to ENOMEM, which is easy to misinterpret as a memory shortage. |
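As a small illustration of why that ENOMEM is easy to misread (a hypothetical userspace check, not something from this ticket), one can compare kernel.pid_max against the current thread count when fork() fails; the heuristic and messages below are assumptions.

/* Hedged userspace sketch, not from the ticket: when fork() fails with
 * ENOMEM, check whether the PID space rather than memory is the real limit.
 * On the 3.10-era kernel described above, a failed PID allocation is
 * reported as ENOMEM. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		_exit(0);                       /* child: nothing to do */
	} else if (pid > 0) {
		waitpid(pid, NULL, 0);          /* reap the child */
		printf("fork() succeeded\n");
	} else if (errno == ENOMEM || errno == EAGAIN) {
		long pid_max = -1;
		FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

		if (f) {
			if (fscanf(f, "%ld", &pid_max) != 1)
				pid_max = -1;
			fclose(f);
		}
		fprintf(stderr, "fork: %s; kernel.pid_max=%ld -- compare with "
			"the thread count (e.g. ps -eLf | wc -l) before "
			"blaming memory\n", strerror(errno), pid_max);
	} else {
		perror("fork");
	}
	return 0;
}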
| Comment by Oleg Drokin [ 10/Jun/22 ] |
|
hm, that's certainly an interesting twist of events. Is it ok to close this now or is there anything we could do better here? |
| Comment by Olaf Faaland [ 11/Jun/22 ] |
|
> hm, that's certainly an interesting twist of events.

Yes! Thanks for the help eliminating memory usage from the list of potential root causes.

> Is it ok to close this now or is there anything we could do better here?

For a workload like Starfish, where it's just reading directories and stat-ing files, the memory usage for cached inodes and dentries seems excessive. But it's not actually causing a problem that I'm aware of, so I think closing this is reasonable. I'll remove my incorrect speculation from the description so it doesn't mislead someone who comes across this ticket. |