Details

Type: Bug
Resolution: Unresolved
Priority: Major
Environment:
    Kernel: 3.10.0-1160.4.1.1chaos.ch6.x86_64
    Server: lustre-2.12.6_3.llnl-1.ch6.x86_64
    Client: lustre-2.14.0-something
    starfish "agent"
Severity: 3

Description
The ptlrpc_cache slab repeatedly grows very large (tens of GB, as seen in the slabtop output below) on a node running starfish (a policy engine similar to robinhood).
[root@solfish2:~]# cat /tmp/t4
 Active / Total Objects (% used)    : 508941033 / 523041216 (97.3%)
 Active / Total Slabs (% used)      : 11219941 / 11219941 (100.0%)
 Active / Total Caches (% used)     : 87 / 122 (71.3%)
 Active / Total Size (% used)       : 112878003.58K / 114522983.04K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K

     OBJS    ACTIVE  USE OBJ_SIZE   SLABS OBJ/SLAB CACHE_SIZE NAME
 30545252  30067595  98%    1.12K 1092909       28  34973088K ptlrpc_cache
 92347047  92347047  99%    0.31K 1810744       51  28971904K bio-3
 92346672  92346672 100%    0.16K 1923889       48  15391112K xfs_icr
 92409312  92409312 100%    0.12K 2887791       32  11551164K kmalloc-128
 25717818  23912628  92%    0.19K  612329       42   4898632K kmalloc-192
 25236420  24708346  97%    0.18K  573555       44   4588440K xfs_log_ticket
 25286568  24717197  97%    0.17K  549708       46   4397664K xfs_ili
 14103054  13252206  93%    0.19K  335787       42   2686296K dentry
 ...
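For reference, growth of a single slab can be tracked over time with a small loop over /proc/slabinfo. This is only an illustrative sketch (the one-minute interval and output format are arbitrary, not from the original report):

    # Illustrative: sample the ptlrpc_cache slab once a minute and report
    # object count and approximate size (num_objs * objsize).
    while true; do
        date
        awk '$1 == "ptlrpc_cache" { printf "objs=%d size=%.1f MB\n", $3, $3 * $4 / 1048576 }' /proc/slabinfo
        sleep 60
    done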
The ptlrpc_cache shrinks from GB to MB in size after running:
echo 2 > /proc/sys/vm/drop_caches
This particular node has 128GB of RAM, so this represents a very large portion of the total.
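For completeness, the before/after effect can be seen with something like the following (illustrative commands; echo 2 frees reclaimable slab objects such as dentries and inodes):

    grep ptlrpc_cache /proc/slabinfo     # note num_objs / size before
    sync
    echo 2 > /proc/sys/vm/drop_caches    # reclaim slab objects (dentries and inodes)
    grep ptlrpc_cache /proc/slabinfo     # the cache should now be back down to MBs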
After a suggestion by Oleg (see below), the node was rebooted with the kernel command line parameters slab_nomerge and slub_nomerge. With slab merging disabled, it was found that the cache actually taking up all the space was the lustre_inode_cache, which had evidently been merged into the slab reported as ptlrpc_cache.
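As a rough sketch of how to confirm this (the exact commands are not from the original report), check that the boot parameters took effect and then look at the now-unmerged caches:

    cat /proc/cmdline                                          # should include slab_nomerge / slub_nomerge
    grep -E 'lustre_inode_cache|ptlrpc_cache' /proc/slabinfo   # caches now reported separately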
Around the same time, kthread_run() and fork() failures were reported in the console log. Those failures turned out to be caused by the sysctl kernel.pid_max being set too low, and were not related to the amount of memory in use or free.
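For anyone hitting the same fork()/kthread_run() symptom, the pid_max limit can be checked and raised independently of the slab issue (the value below is just an example):

    sysctl kernel.pid_max                  # current limit
    sysctl -w kernel.pid_max=4194304       # raise it on the running system (example value)
    # to persist, e.g. in /etc/sysctl.d/99-pid-max.conf:
    #   kernel.pid_max = 4194304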