[LU-4641] ldiskfs_inode_cache slab high usage Created: 17/Feb/14  Updated: 03/Jul/14  Resolved: 03/Jul/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jason Hill (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

RHEL 6.4, kernel 2.6.32_358.23.2.el6, including patch 9127 from LU-4579 and LU-4006.


Attachments: HTML File atlas-mds1_page_allocation_failures    
Severity: 3
Rank (Obsolete): 12692

 Description   

We are seeing high usage of the ldiskfs_inode_cache slab. This filesystem has ~430M files in it, and we are currently using ~90GB of ldiskfs_inode_cache.

We are currently running tests to create 500M files in a test filesystem; we wanted to break this out from LU-4570. More data to come.



 Comments   
Comment by Peter Jones [ 17/Feb/14 ]

Niu

Could you please advise on this ticket as the data comes in?

Thanks

Peter

Comment by Jason Hill (Inactive) [ 17/Feb/14 ]

Update:

echo 2 > /proc/sys/vm/drop_caches has helped immensely. Slab usage is down to ~20GB and is no longer growing rapidly.

On the test system where we are trying to create a large number of files, we see the slab size growing at 1MB/s. We currently have 8M files created; the target is 500M. We are watching slab usage on the production MDS via collectl.
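
For anyone reproducing this, a minimal alternative to collectl for watching just this one cache is to poll /proc/slabinfo (a sketch; it assumes the stock slabinfo layout where field 3 is num_objs and field 4 is objsize in bytes, and it needs root to read the file on RHEL 6):

# sample the approximate ldiskfs_inode_cache footprint every 10 seconds
while sleep 10; do
    echo -n "$(date +%T)  "
    awk '/^ldiskfs_inode_cache /{printf "%d objs x %d B ~= %.1f GB\n", $3, $4, $3*$4/2^30}' /proc/slabinfo
done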

We are trying to determine if something within the center is running mlocate via cron and causing a full filesystem walk. After the MDS was rebooted on Friday, the cache grew steadily until we started having issues allocating memory.

Stay tuned. Thanks.

Comment by Jason Hill (Inactive) [ 17/Feb/14 ]

We think we have found the culprit: 4 Cray nodes had updatedb enabled, and the last update to the db was 2/16. I will ask to resolve this once we've verified that this keeps the inode_cache slab usage down.

Comment by Matt Ezell [ 17/Feb/14 ]

Do you have any suggestions to prevent this from happening in the future? Obviously, we want to keep updatedb from running against Lustre mounts, but will the MDS eventually cache enough inodes from normal usage?

Should we set vm.zone_reclaim_mode or vm.min_free_kbytes? To my knowledge (Jason, please confirm), we leave these at the default.
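
On the updatedb question, one common way to keep it off Lustre mounts on the clients is the mlocate config (a sketch of /etc/updatedb.conf; the default prune lists vary by distro and the mount path below is only a placeholder):

# prune by filesystem type so any Lustre mount is skipped
PRUNEFS = "lustre nfs nfs4"
# and/or prune by path (placeholder mount point)
PRUNEPATHS = "/tmp /var/spool /media /lustre"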

Comment by Niu Yawei (Inactive) [ 18/Feb/14 ]

What's the value of vfs_cache_pressure? I think you should tune it to a higher value so the kernel reclaims the inode cache more aggressively. The following is copied from the kernel documentation:

vfs_cache_pressure
------------------

Controls the tendency of the kernel to reclaim the memory which is used for
caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.
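
Purely as an illustration of the knob (200 below is an arbitrary starting point, not a tested recommendation), the value can be raised at runtime and made persistent like this:

# reclaim dentries/inodes more aggressively than the default of 100
sysctl -w vm.vfs_cache_pressure=200
# persist across reboots (one common way on RHEL 6)
echo 'vm.vfs_cache_pressure = 200' >> /etc/sysctl.conf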

Comment by Jason Hill (Inactive) [ 18/Feb/14 ]

Matt, correct: we do not set these parameters explicitly:

[root@atlas-mds1 ~]# cat /proc/sys/vm/zone_reclaim_mode
0
[root@atlas-mds1 ~]# cat /proc/sys/vm/min_free_kbytes
90112

Niu:

[root@atlas-mds1 ~]# cat /proc/sys/vm/vfs_cache_pressure
100

Comment by Niu Yawei (Inactive) [ 18/Feb/14 ]

100 is the default value; I think you should increase it to a larger value to see if that helps.

Comment by John Fuchs-Chesney (Inactive) [ 14/Mar/14 ]

Jason or Matt,
Any further progress/action on this issue?
Thanks,
~ jfc

Comment by John Fuchs-Chesney (Inactive) [ 29/Mar/14 ]

It looks like the initial issue was resolved, and suggestions were made to prevent a recurrence.
~ jfc.

Comment by James Nunez (Inactive) [ 09/Apr/14 ]

I'd like to get feedback from ORNL on whether the larger vfs_cache_pressure value solves their problem before we close this ticket.

Comment by James Nunez (Inactive) [ 01/May/14 ]

Per ORNL, they made some configuration changes and this issue has not been seen since. Thus, they are not able to test if changes to vfs_cache_pressure help this issue.

Please reopen the ticket if this problem is seen again.

Comment by Blake Caldwell [ 29/May/14 ]

My original question when discussing this with James and Peter Jones was about a reasonable value for vfs_cache_pressure. However, I believe our current issue may not be a good fit for tuning vfs_cache_pressure and is more suited to min_free_kbytes. We are still seeing page allocation failures, but not from high ldiskfs_inode_cache usage. The catalyst for the page allocation failures is a process that reads inodes from the MDT device, so I would expect the buffer cache is being used now. Lustre processes are unable to allocate memory in these circumstances. Sample error messages on the MDS:

May 29 10:27:43 atlas-mds1 kernel: [1451338.151860] LustreError: 12106:0:(lvfs_lib.c:151:lprocfs_stats_alloc_one()) LNET: out of memory at /data/buildsystem/jsimmons-atlas/rpmbuild/BUILD/lustre-2.4.3/lustre/lvfs/lvfs_lib.c:151 (tried to alloc '(stats->ls_percpu[cpuid])' = 4224)
May 29 10:27:43 atlas-mds1 kernel: [1451338.187813] LustreError: 12106:0:(lvfs_lib.c:151:lprocfs_stats_alloc_one()) LNET: 1493692160 total bytes allocated by lnet
May 29 10:30:01 atlas-mds1 kernel: [1451476.211959] swapper: page allocation failure. order:2, mode:0x20

order:2, mode:0x20 is a GFP_ATOMIC allocation, which can be satisfied from the reserved pages, so I believe increasing vm.min_free_kbytes would be the better option. If I'm off base here, please let me know; otherwise, we plan on increasing vm.min_free_kbytes to 131072 from its current value of 90112.
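
A minimal sketch of the change being discussed (131072 kB = 128MB of reserved pages; the sysctl.conf entry is just one way to make it persistent):

# raise the free-page reserve from 90112 kB to 131072 kB (128MB)
sysctl -w vm.min_free_kbytes=131072
echo 'vm.min_free_kbytes = 131072' >> /etc/sysctl.conf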

sar output showing the increase in the number of pages stolen from the caches:

08:40:01 pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff

07:10:01 186.65 9765.13 1416.66 0.00 2098.86 0.00 0.00 0.00 0.00
07:20:01 210.79 10055.52 1397.76 0.00 2861.71 0.00 0.00 0.00 0.00
07:30:01 131.14 10974.77 1235.35 0.00 1652.99 0.00 0.00 0.00 0.00
07:40:01 426522.51 6263.64 2756.01 0.00 478545.36 0.00 0.00 0.00 0.00
07:50:01 361767.26 8035.57 14631.19 0.00 48198.88 0.00 0.00 0.00 0.00
08:00:01 204183.64 9792.66 1260.39 0.00 1829.13 0.00 0.00 0.00 0.00
08:10:01 285855.99 9190.99 891.64 0.00 1629.84 0.00 0.00 0.00 0.00
08:20:01 605935.81 8329.74 2458.08 0.00 42271.84 0.00 0.00 0.00 0.00
08:30:01 350884.42 9249.91 882.08 0.00 1874.32 0.00 0.00 0.00 0.00
08:40:01 182957.52 12157.11 881.74 0.00 1645.76 0.00 0.00 0.00 0.00
08:50:01 116249.49 10314.71 869.47 0.00 1584.80 25.73 0.00 22.22 86.37
09:00:01 162919.34 9482.70 877.63 0.13 30862.31 9237.61 0.00 9176.72 99.34
09:10:01 192086.91 9473.00 910.78 0.00 6749.49 193.89 7.57 122.82 60.96
09:20:01 163713.12 10507.92 872.73 0.00 20656.07 6495.97 3.82 6459.04 99.37
09:30:01 143792.88 10704.30 873.22 0.00 6633.74 1056.53 200.97 1242.53 98.81
09:40:01 104059.86 10166.70 886.59 0.12 102152.89 6556.49 359.32 6890.73 99.64
09:50:02 104577.50 10842.55 886.01 0.02 14661.68 2384.46 353.41 2734.60 99.88
10:00:01 180071.92 10456.90 805.04 0.00 62215.50 4995.73 223.59 5219.43 100.00
10:10:01 136018.15 11376.03 1009.02 0.00 19493.03 7271.04 552.33 7823.36 100.00
10:20:01 150911.00 10693.11 872.92 0.12 15671.27 8858.82 778.53 9637.33 100.00
10:30:01 117841.52 13424.69 898.84 0.00 8208.61 4252.03 638.23 4890.13 100.00
10:40:01 143435.88 10189.68 900.52 0.01 39258.78 4963.34 1007.84 5971.28 100.00
10:50:01 124.18 10771.05 849.31 0.00 42637.89 0.00 0.00 0.00 0.00
11:00:01 432.96 12335.04 899.87 0.00 911.12 0.00 0.00 0.00 0.00

Comment by Peter Jones [ 05/Jun/14 ]

Reopening to track the follow-on question.

Comment by Niu Yawei (Inactive) [ 06/Jun/14 ]

I don't have any objection to this. If the current problem is a shortage of lowmem, we could consider increasing the value of min_free_kbytes and decreasing the value of lowmem_reserve_ratio.

Comment by Blake Caldwell [ 12/Jun/14 ]

Increasing min_free_kbytes to reserve 128MB did not prevent these allocation failures; we just saw more of them. The next step is to decrease lowmem_reserve_ratio. Do you have a recommended value? The current setting is:
vm.lowmem_reserve_ratio = 256 256 32
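
For reference, the ratios are divisors (a smaller value makes the corresponding zone hold back more pages from allocations that could have been served from a higher zone), so a change would look something like the following; the 64 is only a placeholder to show the syntax, not a recommendation:

# example only: lower one divisor so that zone keeps a larger reserve
sysctl -w vm.lowmem_reserve_ratio="256 64 32"
cat /proc/sys/vm/lowmem_reserve_ratio   # verify the new values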

Comment by Niu Yawei (Inactive) [ 13/Jun/14 ]

The middle value (256) is for the normal zone. If you decrease this value, the kernel will defend this zone more aggressively; however, I'm not sure what value would be appropriate.

We are still seeing page allocation failures, but not from high ldiskfs_inode_cache usage. The catalyst for the page allocation failures is a process that reads inodes from the MDT device, so I would expect the buffer cache is being used now.

I think reading inodes will populate ldiskfs_inode_cache; wouldn't tuning vfs_cache_pressure help?

Comment by Blake Caldwell [ 16/Jun/14 ]

I captured some more stats while the issue was present this morning.

Jun 16 10:00:08 atlas-mds1 kernel: [515727.290363] ptlrpcd_18: page allocation failure. order:1, mode:0x20
Jun 16 10:00:08 atlas-mds1 kernel: [515727.290537] ptlrpcd_4: page allocation failure. order:1, mode:0x20
Jun 16 10:00:08 atlas-mds1 kernel: [515727.290567] ptlrpcd_12: page allocation failure. order:1, mode:0x20

I'm attaching the kernel logs with the page stats that were dumped at the time of the failed allocations above. The Normal zone looks to have plenty of pages available, but the DMA zones appear much tighter on pages. Could these be the source of the allocation failures?
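
One quick way to see whether the low zones can still satisfy the failing order-1/order-2 requests is to look at the per-order free lists and the zone watermarks (a sketch; the buddyinfo columns after the zone name are free blocks of order 0, 1, 2, ...):

cat /proc/buddyinfo            # free blocks per order, per zone
grep -A4 zone /proc/zoneinfo   # free pages vs. min/low/high watermarks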

Since it is reading directly from the block device (not through Lustre), the page cache is used rather than the filesystem (buffer) caches. Since the inode/dentry cache is not the concern here, I don't think tuning vfs_cache_pressure will help.

System wide usage:
MEM | tot 252.2G | free 672.5M | cache 107.7G | dirty 5.2M | buff 50.4G | slab 51.4G |

The userspace program has an RSS of 38.6G. Out of 51G of slab usage, 28G is from the size-512 cache. Only 4G is used by ldiskfs_inode_cache.
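
To rank the slab consumers (the generic size-512 kmalloc cache does not identify its callers, but the per-cache totals can at least be sorted), something like this can be used; a sketch assuming the stock /proc/slabinfo columns and that slabtop from procps is installed:

# top caches by approximate footprint (num_objs x objsize)
awk 'NR>2 {printf "%10.1f MB  %s\n", $3*$4/2^20, $1}' /proc/slabinfo | sort -rn | head
slabtop -o -s c   # one-shot listing sorted by cache size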

/proc/zoneinfo currently:
Node 0, zone DMA
  pages free     3935
        min      1
        low      1
        high     1
        protection: (0, 1931, 129191, 129191)

Node 0, zone DMA32
  pages free     96690
        min      244
        low      305
        high     366
        protection: (0, 0, 127260, 127260)

Node 0, zone Normal
  pages free     25763
        min      16132
        low      20165
        high     24198
        protection: (0, 0, 0, 0)

Node 1, zone Normal
  pages free     23517
        min      16388
        low      20485
        high     24582
        protection: (0, 0, 0, 0)

Comment by Blake Caldwell [ 16/Jun/14 ]

page allocation failure log messages

Comment by Niu Yawei (Inactive) [ 19/Jun/14 ]

Since it is reading directly from the block device (not through Lustre), the page cache is used rather than the filesystem (buffer) caches. Since the inode/dentry cache is not the concern here, I don't think tuning vfs_cache_pressure will help.

When reclaiming an inode/dentry, the pagecache associated with the inode will be reclaimed too. Reading the block device would consume lots of pagecache, so I think it's worth trying to tune vfs_cache_pressure. (BTW: does drop_caches relieve the situation immediately?)
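
For reference, the drop_caches values (from the kernel documentation; this is non-destructive but blunt, so it is only a one-time relief rather than a fix):

echo 1 > /proc/sys/vm/drop_caches   # free pagecache only
echo 2 > /proc/sys/vm/drop_caches   # free reclaimable slab objects (dentries, inodes)
echo 3 > /proc/sys/vm/drop_caches   # free both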

Comment by James Nunez (Inactive) [ 03/Jul/14 ]

Per a conversation with ORNL, we can close this ticket.

Please reopen if more work or information is needed.
