Details

Type: Bug
Resolution: Unresolved
Priority: Major
Environment:
    Kernel: 3.10.0-1160.4.1.1chaos.ch6.x86_64
    Server: lustre-2.12.6_3.llnl-1.ch6.x86_64
    Client: lustre-2.14.0-something
    starfish "agent"
Severity: 3

Description
The ptlrpc_cache slab repeatedly grows very large (tens of GB, as seen in the slabtop output below) on a node running starfish (a policy engine similar to robinhood).
[root@solfish2:~]# cat /tmp/t4
 Active / Total Objects (% used)    : 508941033 / 523041216 (97.3%)
 Active / Total Slabs (% used)      : 11219941 / 11219941 (100.0%)
 Active / Total Caches (% used)     : 87 / 122 (71.3%)
 Active / Total Size (% used)       : 112878003.58K / 114522983.04K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

     OBJS    ACTIVE  USE OBJ_SIZE   SLABS OBJ/SLAB CACHE_SIZE NAME
 30545252  30067595  98%    1.12K 1092909       28  34973088K ptlrpc_cache
 92347047  92347047  99%    0.31K 1810744       51  28971904K bio-3
 92346672  92346672 100%    0.16K 1923889       48  15391112K xfs_icr
 92409312  92409312 100%    0.12K 2887791       32  11551164K kmalloc-128
 25717818  23912628  92%    0.19K  612329       42   4898632K kmalloc-192
 25236420  24708346  97%    0.18K  573555       44   4588440K xfs_log_ticket
 25286568  24717197  97%    0.17K  549708       46   4397664K xfs_ili
 14103054  13252206  93%    0.19K  335787       42   2686296K dentry
 ...
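For reference, growth of a single slab can be tracked over time with a small loop over /proc/slabinfo. This is only an illustrative sketch (the one-minute interval and output format are arbitrary, not from the original report):

    # Illustrative: sample the ptlrpc_cache slab once a minute and report
    # object count and approximate size (num_objs * objsize).
    while true; do
        date
        awk '$1 == "ptlrpc_cache" { printf "objs=%d size=%.1f MB\n", $3, $3 * $4 / 1048576 }' /proc/slabinfo
        sleep 60
    done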
The ptlrpc_cache shrinks from GB to MB in size after running:
echo 2 > /proc/sys/vm/drop_caches
This particular node has 128GB of RAM, so this represents a very large portion of the total.
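For completeness, the before/after effect can be seen with something like the following (illustrative commands; echo 2 frees reclaimable slab objects such as dentries and inodes):

    grep ptlrpc_cache /proc/slabinfo     # note num_objs / size before
    sync
    echo 2 > /proc/sys/vm/drop_caches    # reclaim slab objects (dentries and inodes)
    grep ptlrpc_cache /proc/slabinfo     # the cache should now be back down to MBs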
After a suggestion by Oleg (see below), the node was rebooted with the kernel command line parameters slab_nomerge and slub_nomerge. With slab merging disabled, it was found that the cache actually taking up all the space was the lustre_inode_cache, which had evidently been merged into the slab reported as ptlrpc_cache.
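As a rough sketch of how to confirm this (the exact commands are not from the original report), check that the boot parameters took effect and then look at the now-unmerged caches:

    cat /proc/cmdline                                          # should include slab_nomerge / slub_nomerge
    grep -E 'lustre_inode_cache|ptlrpc_cache' /proc/slabinfo   # caches now reported separately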
Around the same time, kthread_run() and fork() failures were reported in the console log. Those failures turned out to be caused by the sysctl kernel.pid_max being set too low, and were not related to the amount of memory in use or free.
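For anyone hitting the same fork()/kthread_run() symptom, the pid_max limit can be checked and raised independently of the slab issue (the value below is just an example):

    sysctl kernel.pid_max                  # current limit
    sysctl -w kernel.pid_max=4194304       # raise it on the running system (example value)
    # to persist, e.g. in /etc/sysctl.d/99-pid-max.conf:
    #   kernel.pid_max = 4194304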