Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Environment: VMs + 2.12.8 + 3.10.0-1160.59.1
      robinhood v3 + 2.12.8 + 3.10.0-1062

    Description

      This issue was observed with robinhood clients:

      • robinhood becomes slower over time at syncing the filesystem from the changelog
      • robinhood becomes slower when the reader falls behind (more negative entries are generated)

      "strace" on reader threads reveal that the FID stats could take several seconds.
      drop_cache 2 or 3 fixes temporary the issue.
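
      A minimal sketch of how that latency can be observed with strace (the reader PID is a placeholder; the exact options used originally are not recorded in this ticket):

      # attach to the reader, follow its threads, and print time spent in each file syscall
      strace -f -T -e trace=file -p <reader_pid>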

      Reproducer
      I was able to reproduce the issue with a "dumb" executable that generates a lot of "negative entries" with parallel stats on "<fs>/.lustre/fid/<non_existent_fid>" (a shell sketch is given below).
      The attached perf_fid_cont.svg is a flamegraph of the threads of the test process (fid_rand).
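
      A minimal shell sketch of that kind of reproducer (the mount point, FID sequence, process count and iteration count are placeholders; the original fid_rand binary is not reproduced here):

      #!/bin/bash
      # Hammer .lustre/fid with stats of FIDs that (almost certainly) do not exist,
      # from many processes in parallel, so the client accumulates negative dentries.
      MNT=/mnt/lustre        # placeholder mount point
      NPROC=100              # parallel "threads"
      ITER=100000            # lookups per process

      for p in $(seq "$NPROC"); do
          (
              for i in $(seq "$ITER"); do
                  # random oid under a placeholder sequence -> negative lookup
                  fid="[0x200000bd0:0x$(printf '%x' "$RANDOM$RANDOM"):0x0]"
                  stat "$MNT/.lustre/fid/$fid" >/dev/null 2>&1
              done
          ) &
      done
      wait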

      Most of the threads of fid_rand wait for the i_mutex of the ".lustre/fid" inode in:

      static int lookup_slow(struct nameidata *nd, struct path *path)
      {
              struct dentry *dentry, *parent;
              int err;

              parent = nd->path.dentry;
              BUG_ON(nd->inode != parent->d_inode);

              mutex_lock(&parent->d_inode->i_mutex);          /* <--- contention here */
              dentry = __lookup_hash(&nd->last, parent, nd->flags);
              mutex_unlock(&parent->d_inode->i_mutex);
              ...

      Workaround

      • a crontab entry running "echo 2 > /proc/sys/vm/drop_caches" periodically (see the sketch below)
      • setting "/proc/sys/fs/negative-dentry-limit" on the 3.10.0-1160 kernel
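
      A hedged example of both workarounds (the schedule and the limit value are placeholders; the semantics of the negative-dentry-limit value are kernel-specific, see the RHEL kernel documentation):

      # /etc/cron.d entry: drop dentries/inodes every minute on kernels without
      # the negative-dentry-limit tunable (e.g. 3.10.0-1062)
      * * * * * root echo 2 > /proc/sys/vm/drop_caches

      # on 3.10.0-1160.x, cap negative dentries instead (placeholder value)
      sysctl fs.negative-dentry-limit=1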

      Attachments

        Activity

          [LU-16047] cache contention in ".lustre/fid"
          eaujames Etienne Aujames added a comment - - edited

          The robinhood node has the 3.10.0-1062 kernel, and the negative-dentry-limit parameter does not exist for that kernel. So we run "echo 2 > /proc/sys/vm/drop_caches" every minute to keep the number of changelog records dequeued by robinhood up to 10k.

          robinhood already has a de-duplication mechanism on the changelog to limit the number of "stat" calls on the filesystem, so it does not need the negative dentry cache.
          For now, I was not able to reproduce this with Lustre 2.15 (at the same scale, dentries are freed regularly), so maybe https://review.whamcloud.com/39685/ (LU-13909 llite: prune invalid dentries) could help.

          The CEA will try to upgrade the robinhood nodes to a RHEL8 kernel to benefit from the cache improvements.


          adilger Andreas Dilger added a comment

          Do you have any stats on how many negative dentries need to accumulate in the .lustre/fid directory for this to become a problem, and how long it takes for that many negative dentries to accumulate?  That would allow setting negative-dentry-limit to a reasonable default value (e.g. via /usr/lib/sysctl.d/lustre.conf) at startup.
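
          A minimal sketch of such a persistent default, assuming the el7.9-era tunable (the value is only a placeholder and its semantics are kernel-specific):

          # hypothetical /usr/lib/sysctl.d/lustre.conf entry
          fs.negative-dentry-limit = 1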

          One possible fix is to not cache negative dentries for the .lustre/fid directory at all, but this might increase loading on the MDS due to repeated negative FID lookup RPCs being sent to the MDS. That depends heavily on how often the same non-existent FID is being looked up multiple times, and unfortunately I have no idea whether that is common or not. It likely also depends heavily on whether the Changelog reader itself will discard repeated records for the same FID (essentially implementing its own negative FID cache in userspace).
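
          As a toy illustration of that kind of user-space de-duplication (the mount point and MDT name are placeholders; a real reader would use the changelog API and a bounded cache rather than a pipeline):

          # stat each target FID at most once per changelog batch
          lfs changelog lustre-MDT0000 |
              grep -o 't=\[[^]]*\]' | sed 's/^t=//' | sort -u |
          while read -r fid; do
              stat "/mnt/lustre/.lustre/fid/$fid" >/dev/null 2>&1
          done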

          Alternatively, having some kind of periodic purging of old negative dentries on this directory would be possible. Something like dropping negative dentries after 30s (tunable?) of inactivity seems like a reasonable starting point. I couldn't find the /proc/sys/fs/negative-dentry-limit tunable on my RHEL8 server, but it appears that the newer kernel handles negative dentries better and does not need this tunable.

          However, if the negative-dentry-limit parameter is working reasonably well for el7.9 kernels, and el8.x kernels don't have a problem, then maybe there isn't a need for a Lustre-specific patch?  I do recall a number of patches being sent to linux-fsdevel related to limiting the negative dentry count, but I don't know if any of those patches landed.  It seems highly unlikely that they landed upstream in time for the el8.x kernel, but maybe one of those patches was backported to el8.x while they were still making up their minds about the upstream solution.

          eaujames Etienne Aujames added a comment - - edited

          With:

          [root@client ~]# cat /proc/sys/fs/dentry-state
          8499096 8483804 45      0       8470127 0     
          

          And with 100 threads doing stat on non-existent FIDs and 20 threads doing stat on existing FIDs, I get the following latencies:

          [root@client ~]#  for i in {1..10}; do  time stat /media/lustrefs/client/.lustre/fid/[0x200000402:0x66:0x0] ; done |& grep real
          real    0m0.333s
          real    0m0.352s
          real    0m0.370s
          real    0m0.003s
          real    0m0.311s
          real    0m0.296s
          real    0m0.172s
          real    0m0.345s
          real    0m0.383s
          real    0m0.330s
          

          People

            Assignee: eaujames Etienne Aujames
            Reporter: eaujames Etienne Aujames
            Votes: 0
            Watchers: 5
