[LU-4641] ldiskfs_inode_cache slab high usage Created: 17/Feb/14 Updated: 03/Jul/14 Resolved: 03/Jul/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jason Hill (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.4, kernel 2.6.32-358.23.2.el6, including patch 9127 from |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12692 |
| Description |
|
We are seeing high usage from the ldiskfs_inode_cache slab. This filesystem has ~430M files in it, and we are currently using ~90GB of ldiskfs_inode_cache. We are running a test to create 500M files in a test filesystem; we wanted to break this out from |
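For reference, a minimal sketch of how the ldiskfs_inode_cache footprint can be read from /proc/slabinfo; the awk field positions assume the RHEL 6 slabinfo layout, and the commands are illustrative rather than taken from the ticket:

# Approximate memory held by the cache: total objects x object size
grep ldiskfs_inode_cache /proc/slabinfo | \
    awk '{printf "%s: %.1f GB\n", $1, $3*$4/1024/1024/1024}'

# Or watch the largest slab caches interactively, sorted by cache size
slabtop -s c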
| Comments |
| Comment by Peter Jones [ 17/Feb/14 ] |
|
Niu, could you please advise on this ticket as the data comes in? Thanks, Peter |
| Comment by Jason Hill (Inactive) [ 17/Feb/14 ] |
|
Update: echo 2 > /proc/sys/vm/drop_caches has helped immensely. Slab usage is down to ~20GB and is not growing quickly. On the test system where we are trying to create a large number of files we see the slab size growing at 1MB/s; we currently have 8M files created, and the target is 500M. We are watching slab usage on the production MDS via collectl, and trying to determine whether something within the center is running mlocate via cron and causing a full filesystem walk. After the MDS was rebooted on Friday the cache grew and grew until we started having issues allocating memory. Stay tuned. Thanks. |
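A small sketch of the flush-and-watch cycle described above, using only standard /proc interfaces; the 10-second interval is arbitrary:

sync
echo 2 > /proc/sys/vm/drop_caches        # 2 = reclaim dentries and inodes only
watch -n 10 'grep -E "^(Slab|SReclaimable|SUnreclaim)" /proc/meminfo'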
| Comment by Jason Hill (Inactive) [ 17/Feb/14 ] |
|
We think we have the culprit. Found 4 Cray nodes that had updatedb enabled and the last update to the db was 2/16. I will ask to resolve this once we've verified this keeps the inode_cache slab usage down. |
| Comment by Matt Ezell [ 17/Feb/14 ] |
|
Do you have any suggestions to prevent this from happening in the future? Obviously, we want to keep updatedb from running against Lustre mounts, but will the MDS eventually cache enough inodes from normal usage? Should we set vm.zone_reclaim_mode or vm.min_free_kbytes? To my knowledge (Jason, please confirm), we leave these at the default. |
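As a quick reference, the current values of the parameters mentioned here can be read in one command; this is just a sketch, and the variable list simply mirrors the knobs discussed in this comment:

sysctl vm.zone_reclaim_mode vm.min_free_kbytes vm.vfs_cache_pressure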
| Comment by Niu Yawei (Inactive) [ 18/Feb/14 ] |
|
What's the value of vfs_cache_pressure? I think you should tune it to a higher value to make the kernel reclaim the inode cache more aggressively. The following is copied from the kernel documentation:

vfs_cache_pressure
------------------
Controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes. |
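A minimal sketch of applying this suggestion; the value 200 is only an illustrative starting point, not a recommendation from this ticket:

sysctl -w vm.vfs_cache_pressure=200
# to persist across reboots, add to /etc/sysctl.conf:
#   vm.vfs_cache_pressure = 200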
| Comment by Jason Hill (Inactive) [ 18/Feb/14 ] |
|
Matt, correct, we do not set these parameters specifically:

[root@atlas-mds1 ~]# cat /proc/sys/vm/zone_reclaim_mode

Niu:

[root@atlas-mds1 ~]# cat /proc/sys/vm/vfs_cache_pressure |
| Comment by Niu Yawei (Inactive) [ 18/Feb/14 ] |
|
100 is the default value; I think you should increase it to a larger value to see if it helps. |
| Comment by John Fuchs-Chesney (Inactive) [ 14/Mar/14 ] |
|
Jason or Matt, |
| Comment by John Fuchs-Chesney (Inactive) [ 29/Mar/14 ] |
|
Looks like the initial issue was resolved, and suggestions were made to prevent a recurrence. |
| Comment by James Nunez (Inactive) [ 09/Apr/14 ] |
|
I'd like to get feedback from ORNL on whether the larger vfs_cache_pressure value solves their problem before we close this ticket. |
| Comment by James Nunez (Inactive) [ 01/May/14 ] |
|
Per ORNL, they made some configuration changes and this issue has not been seen since. Thus, they are not able to test if changes to vfs_cache_pressure help this issue. Please reopen the ticket if this problem is seen again. |
| Comment by Blake Caldwell [ 29/May/14 ] |
|
My original question when discussing this with James and Peter Jones was about a reasonable value for vfs_cache_pressure. However, I believe our current issue may not be a case for tuning vfs_cache_pressure and is more suitable for min_free_kbytes. We are still seeing page allocation failures, but not from high ldiskfs_inode_cache usage. The catalyst for the page allocation failures is a process that reads inodes from the MDT device, so I would expect the buffer cache is being used now. Lustre processes are unable to allocate memory in these circumstances. Sample error messages on the MDS:

May 29 10:27:43 atlas-mds1 kernel: [1451338.151860] LustreError: 12106:0:(lvfs_lib.c:151:lprocfs_stats_alloc_one()) LNET: out of memory at /data/buildsystem/jsimmons-atlas/rpmbuild/BUILD/lustre-2.4.3/lustre/lvfs/lvfs_lib.c:151 (tried to alloc '(stats->ls_percpu[cpuid])' = 4224)

order:2, mode:0x20 is a GFP_ATOMIC allocation, so it can be satisfied from reserved pages, so I believe increasing vm.min_free_kbytes would be better? If I'm off base here, please let me know; otherwise, we will plan on increasing vm.min_free_kbytes to 131072 from its current value of 90112.

Sar output showing the increase in the number of pages stolen from the caches:

08:40:01     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
07:10:01       186.65   9765.13   1416.66      0.00   2098.86      0.00      0.00      0.00      0.00 |
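A sketch of the change proposed above; the 131072 and 90112 values come from this comment, and persisting the setting is standard sysctl usage:

# raise the reserved-page watermark from 90112 to 131072 KB
sysctl -w vm.min_free_kbytes=131072
# persist in /etc/sysctl.conf:
#   vm.min_free_kbytes = 131072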
| Comment by Peter Jones [ 05/Jun/14 ] |
|
Reopening to track the follow-on question |
| Comment by Niu Yawei (Inactive) [ 06/Jun/14 ] |
|
I don't have any objection to this. If the current problem is a shortage of lowmem, we could consider increasing the value of min_free_kbytes and decreasing the value of lowmem_reserve_ratio. |
| Comment by Blake Caldwell [ 12/Jun/14 ] |
|
Increasing min_free_kbytes to reserve 128MB did not prevent these allocation failures; we just saw more. The next step is to decrease lowmem_reserve_ratio. Do you have a recommended value? The current value is |
| Comment by Niu Yawei (Inactive) [ 13/Jun/14 ] |
|
The middle value (256) is for the normal zone; if you decrease this value, the kernel will defend this zone more aggressively. However, I'm not sure what value is proper.
I think reading inodes will populate ldiskfs_inode_cache; won't tuning vfs_cache_pressure help? |
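For context, a sketch of how lowmem_reserve_ratio is read and adjusted: the file holds one ratio per memory zone (typically "256 256 32" on x86_64), and a smaller ratio means a larger per-zone reserve. The "256 128 32" example below is purely illustrative and not a recommendation from this ticket:

cat /proc/sys/vm/lowmem_reserve_ratio
# lower the middle (Normal-zone) ratio to defend that zone more aggressively
echo "256 128 32" > /proc/sys/vm/lowmem_reserve_ratio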
| Comment by Blake Caldwell [ 16/Jun/14 ] |
|
I captured some more stats while the issue was present this morning:

Jun 16 10:00:08 atlas-mds1 kernel: [515727.290363] ptlrpcd_18: page allocation failure. order:1, mode:0x20

I'm attaching the kernel logs with page stats that were dumped at the time of the failed allocations above. The Normal zone looks to have plenty of pages available. It appears the DMA zones are much tighter on pages. Could these be the source of the allocation failures? Since the process reads directly from the block device (not through Lustre), the page cache is used rather than the filesystem (buffer) caches. Since the inode/dentry cache is not of concern, I don't think tuning vfs_cache_pressure will help.

System-wide usage: the userspace program has an RSS of 38.6G. Out of 51G of slab usage, 28G is from size-512. Only 4G is used by ldiskfs_inode_cache.

/proc/zoneinfo currently:
Node 0, zone DMA32
Node 0, zone Normal
Node 1, zone Normal |
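Two quick checks that speak to the order:1 failures above, assuming only standard /proc files (these commands are not taken from the ticket):

# free blocks per zone, by order (columns are order 0..10); an order:1 failure
# means no free 2-page contiguous block was available in an eligible zone
cat /proc/buddyinfo

# per-zone free page counts and min/low/high watermarks
grep -E "Node|pages free|min|low|high" /proc/zoneinfo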
| Comment by Blake Caldwell [ 16/Jun/14 ] |
|
page allocation failure log messages |
| Comment by Niu Yawei (Inactive) [ 19/Jun/14 ] |
|
When reclaiming inodes/dentries, the pagecache associated with the inode is reclaimed too. Reading the block device would consume a lot of pagecache, so I think it's worth trying to tune vfs_cache_pressure. (BTW: will drop_caches relieve the situation immediately?) |
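For completeness, the drop_caches values (standard kernel behavior): 1 drops the pagecache only, 2 drops dentries and inodes, 3 drops both, and none of them touch dirty data. For pagecache built up by raw block-device reads, 1 (or 3) would be the relevant one, e.g.:

sync
echo 1 > /proc/sys/vm/drop_caches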
| Comment by James Nunez (Inactive) [ 03/Jul/14 ] |
|
Per a conversation with ORNL, we can close this ticket. Please reopen if more work or information is needed. |