Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version: Lustre 2.6.0
- Severity: 3
- Rank: 13548
Description
On 2.5.57-79-ge7f99a5 I see that rm -rf on a directory with n regular files causes O(n) MDS_READPAGE requests to be sent. I ran the following on 2.4.3 and master:
export MOUNT_2=y
llmount.sh
cd /mnt/lustre
echo clear | tee /proc/fs/lustre/mdc/*/stats
mkdir d
touch d/{0..255}
cat /proc/fs/lustre/mdc/*/stats
cd /mnt/lustre2
echo clear | tee /proc/fs/lustre/mdc/*/stats
rm -rf d
cat /proc/fs/lustre/mdc/*/stats
On 2.4.3:
## mkdir and touch
req_waittime              773 samples [usec] 65 55988 339933 3295555639
req_active                773 samples [reqs] 1 1 773 773
mds_close                 256 samples [usec] 65 486 58958 16078566
mds_reint                 257 samples [usec] 93 826 74764 29250026
ldlm_enqueue              259 samples [usec] 191 1423 150223 115570903
seq_query                 1 samples [usec] 55988 55988 55988 3134656144
## rm -rf
snapshot_time             1397509895.872324 secs.usecs
req_waittime              258 samples [usec] 50 1054 32830 6165600
req_active                258 samples [reqs] 1 1 258 258
ldlm_cancel               258 samples [usec] 50 1054 32830 6165600
snapshot_time             1397509895.872354 secs.usecs
req_waittime              524 samples [usec] 45 9854 212336 225452482
req_active                524 samples [reqs] 1 4 876 1660
mds_close                 1 samples [usec] 390 390 390 152100
mds_reint                 257 samples [usec] 331 9854 151751 209416267
mds_readpage              3 samples [usec] 271 323 902 272634
ldlm_enqueue              261 samples [usec] 45 736 59120 15595504
ldlm_cancel               2 samples [usec] 64 109 173 15977
On master:
## mkdir and touch
snapshot_time             1397507941.992796 secs.usecs
snapshot_time             1397507941.992828 secs.usecs
req_waittime              1282 samples [usec] 50 2674 364172 203043372
req_active                1282 samples [reqs] 1 1 1282 1282
mds_close                 256 samples [usec] 61 640 50251 15984775
mds_reint                 257 samples [usec] 89 1045 60223 21230981
mds_getxattr              256 samples [usec] 50 658 38276 9345158
ldlm_enqueue              513 samples [usec] 82 2674 215422 156482458
## rm -rf
snapshot_time             1397507954.948995 secs.usecs
req_waittime              991 samples [usec] 31 5949 371017 322109413
req_active                991 samples [reqs] 1 9 2132 6404
mds_close                 1 samples [usec] 126 126 126 15876
mds_reint                 257 samples [usec] 168 5949 173654 221727790
mds_readpage              132 samples [usec] 158 2173 44232 21316906
mds_getxattr              60 samples [usec] 31 345 5769 828911
ldlm_enqueue              423 samples [usec] 44 2496 123479 70146809
ldlm_cancel               118 samples [usec] 65 891 23757 8073121
snapshot_time             1397507954.949096 secs.usecs
req_waittime              1 samples [usec] 108 108 108 11664
req_active                1 samples [reqs] 1 1 1 1
ldlm_cancel               1 samples [usec] 108 108 108 11664
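To compare the two runs without reading the whole table, it is enough to pull the mds_readpage lines out of the MDC stats. This is a minimal sketch, not part of the original report; it assumes the stats format shown above, where the second field of each line is the sample (request) count:
# Print the number of MDS_READPAGE requests recorded in each MDC stats file
# since the last "echo clear"; field 2 is the request count.
awk '/^mds_readpage/ { print FILENAME ": " $2 " readpage requests" }' /proc/fs/lustre/mdc/*/stats
For the 256-file directory above this reports 3 requests on 2.4.3 and 132 on master, which is where the O(n) behaviour shows up.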
(If you noticed that ldlm_enqueue and mds_reint are present, it's because I used http://review.whamcloud.com/#/c/6223/1 which is awesome and has been landable for nearly one year (still on patch set 1), but nobody ever reviews it.)
Attachments
Issue Links
- duplicates
  - LU-3308 large readdir chunk size slows unlink/"rm -r" performance (Reopened)
- is related to
  - LU-4902 Do not require rpc_lock for readdir (Resolved)
  - LU-5232 cache directory contents on file descriptor on lock revocation (Resolved)
- is related to
  - LU-4367 unlink performance regression on lustre-2.5.52 client (Resolved)
The statahead behaviour changed after the directory page cache was moved from LLITE to MDC to support striped directories.
Originally the directory page cache lived in LLITE, and the directory ldlm lock was only held while fetching a directory page from the MDS; after that the lock was released. The statahead thread then traversed the directory page without holding the directory ldlm lock: even if someone was running "rm -rf" and cancelling (or ELC-ing) the directory ldlm lock on the client, statahead still held its reference on the page. Hitting -ENOENT because a name entry had already been removed was not a serious problem, since that is an internal statahead failure, invisible to applications, and the statahead thread could still go ahead and pre-fetch as much as possible. So there was not much directory ldlm lock ping-pong.
The situation changed after the directory pages were moved to MDC. In the current implementation, the statahead thread does not hold a page reference while it traverses the directory; it has to re-verify the directory ldlm lock for every name entry. That means the directory ldlm lock ping-pongs between the statahead thread and the "rm -rf" thread for every name entry.
In summary, the original directory ldlm lock ping-pong happened per page, while the current one happens per name entry. That is why the performance is so bad.
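A quick way to check this analysis from the client side is to take statahead out of the picture and repeat the unlink: if the ping-pong is between the statahead thread and the unlinking thread, mds_readpage should drop back to a few requests per directory. This is only a diagnostic sketch, not part of the fix; it assumes the usual llite.*.statahead_max tunable and the same llmount.sh setup as in the description:
# Re-create the test directory on the first mount.
cd /mnt/lustre
mkdir d
touch d/{0..255}
# Disable statahead, repeat the unlink, and look at the mds_readpage count.
lctl set_param llite.*.statahead_max=0
cd /mnt/lustre2
echo clear | tee /proc/fs/lustre/mdc/*/stats
rm -rf d
cat /proc/fs/lustre/mdc/*/stats
# Restore statahead (32 is the usual default; adjust if your setup differs).
lctl set_param llite.*.statahead_max=32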