[LU-3308] large readdir chunk size slows unlink/"rm -r" performance Created: 10/May/13 Updated: 01/Feb/24 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Nathan Rutman | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | lug23dd, medium, performance | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8195 |
| Description |
|
Shared-directory unlinks using metabench seem significantly slower with Lustre 2.x clients than with Lustre 1.8.x clients. With 512 processes (8 ppn) against a 2.1 server, create rates are comparable at 28k creates/s. Mdtest shows no such regression. Digging into metabench a little: when deleting files, metabench processes (I am told) readdir, select "the next" file, and delete it, effectively racing each other, whereas mdtest deterministically chooses the files to delete based on process id. This seems to imply directory locking contention on the MDT. On the clients, the majority of the time difference is spent in ptlrpc_queue_wait (not sure of the units). On the MDT, the big difference is in mdt_object_find_lock. Also, per the ldlm stats, the 2.3 clients cause twice as many ldlm_bl_callbacks as the 1.8.8 clients, so apparently the 2.3 client is holding directory locks differently than the 1.8 clients. We're still looking into this, but if anyone has thoughts we'd love to hear them. |
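The racing deletion pattern described in the report can be sketched in plain C. This is an illustrative sketch, not the metabench source: each process repeatedly re-reads the shared directory, takes the first entry it sees, and unlinks it, so with many processes every unlink forces the others to re-fetch directory pages under the same directory lock.

```c
/*
 * Illustrative sketch (NOT the metabench source) of the racing deletion
 * pattern: one readdir scan of the directory per unlink.
 */
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Delete every entry in 'path', one readdir scan per unlink.
 * Returns the number of files removed, or -1 on error. */
int racing_unlink_all(const char *path)
{
    int removed = 0;

    for (;;) {
        DIR *d = opendir(path);
        struct dirent *de;
        char victim[4096] = "";

        if (d == NULL)
            return -1;
        while ((de = readdir(d)) != NULL) {
            if (strcmp(de->d_name, ".") != 0 &&
                strcmp(de->d_name, "..") != 0) {
                snprintf(victim, sizeof(victim), "%s/%s",
                         path, de->d_name);
                break;  /* take "the next" file this process sees */
            }
        }
        closedir(d);
        if (victim[0] == '\0')
            return removed;  /* directory is empty */
        if (unlink(victim) == 0)
            removed++;
    }
}
```

Run concurrently from many clients, each iteration of this loop invalidates the other clients' view of the directory, which is the contention pattern under discussion.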
| Comments |
| Comment by Nathan Rutman [ 10/May/13 ] | ||||||||||||||||||
|
Xyratex-bug-id: LELUS-138 | ||||||||||||||||||
| Comment by Keith Mannthey (Inactive) [ 10/May/13 ] | ||||||||||||||||||
|
Is this the metabench you are referring to (http://www.7byte.com/?page=metabench)? Have you had a chance to spin master lately? | ||||||||||||||||||
| Comment by Cory Spitz [ 10/May/13 ] | ||||||||||||||||||
|
Keith, the Metabench code we're using is attached to LU-1167; it is not the version from 7byte.com. Yes, and master (well, 2.3.63) isn't any better at all, maybe even a little worse. | ||||||||||||||||||
| Comment by Andreas Dilger [ 10/May/13 ] | ||||||||||||||||||
|
It would be useful to get the performance of 2.1 clients as well, to see how it stacks up. The client ptlrpc_queue_wait() is just a side effect of the client waiting for the server to complete the requests. It might be useful to gather stats (strace?) to see which operations it is waiting on - unlink() or readdir(). One area to check is that the 2.3+ clients do readdir in large chunks compared to earlier versions. If metabench clients are continually doing readdir and unlinking files, then this might cause a larger amount of data to be sent to all of the clients. An application might do the readdir() once at the beginning, then delete entries (N % num_clients + client_no) so that it only has to do readdir once. Looking at the metabench code, it depends on how large the cache in glibc is, since it is calling readdir() for every entry. It also appears to be doing lstat(filename) on every inode before deleting it, definitely a case of lock ping-pong. | ||||||||||||||||||
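The "(N % num_clients + client_no)" partitioning suggested above can be sketched as follows (function names are mine, not from any benchmark): after a single readdir() pass yields a stable index i for each entry, client client_no of num_clients deletes only the entries it owns, so readdir runs once per client and no two clients ever contend for the same entry.

```c
/*
 * Sketch of deterministic entry partitioning across clients.
 * Names are hypothetical; this is an illustration, not benchmark code.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True iff entry i belongs to this client's slice. */
bool owns_entry(size_t i, size_t num_clients, size_t client_no)
{
    return i % num_clients == client_no;
}

/* How many of 'nentries' this client would delete. */
size_t entries_owned(size_t nentries, size_t num_clients, size_t client_no)
{
    size_t owned = 0;

    for (size_t i = 0; i < nentries; i++)
        if (owns_entry(i, num_clients, client_no))
            owned++;
    return owned;
}
```

Every entry has exactly one owner, so the slices across all clients sum to the full directory and the directory lock never ping-pongs over entry selection.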
| Comment by Oleg Drokin [ 10/May/13 ] | ||||||||||||||||||
|
I wonder if ELC broke, and that leads to the bigger number of blocking callbacks in 2.3? | ||||||||||||||||||
| Comment by Cory Spitz [ 12/May/13 ] | ||||||||||||||||||
|
Yes, we do have traces and analysis to share, which will be coming shortly. I don't have 2.1 client performance handy, but I can tell you that 2.2 gives similar results; in fact, 2.2, 2.3, and master/2.4 (2.3.63) are all in the same neighborhood. Interestingly enough, removing the lstat() completely didn't change the rates at all. I don't know if that necessarily throws out any theories. Also, Nathan noted that the single-dir unlink scenario in mdtest shows no such regression, but it is also worth noting that the same Metabench behavior of readdir()+lstat()+unlink() doesn't show any regression for the multi-dir case. Would that kind of information rule out the theory of any ELC breakage? | ||||||||||||||||||
| Comment by Cory Spitz [ 13/May/13 ] | ||||||||||||||||||
|
We have verified that removing the readdir() + lstat() from Metabench produces mdtest-like results. Using a 2.2 server (with pdirops) did not correct the regression [it may have boosted performance, but we didn't gather a baseline on the test HW/SW config]. Performance of the readdir&lstat-less Metabench version on 2.3 was on par with even the Metabench multi-dir case (server @ 2.2). Removing the readdir()+lstat() from Metabench had little effect on 1.8.x client runs. Previous tests with only lstat() removed had little to no effect on 2.x clients, but we plan to quickly introduce lstat() into the readdir-less version to further validate that the regression is caused by the interference of readdir with unlink. | ||||||||||||||||||
| Comment by Alex Zhuravlev [ 13/May/13 ] | ||||||||||||||||||
|
Can statahead be involved somehow? I'd suggest looking at the RPC stats on all sides. | ||||||||||||||||||
| Comment by Oleg Drokin [ 13/May/13 ] | ||||||||||||||||||
|
Cory: | ||||||||||||||||||
| Comment by Alexander Boyko [ 15/May/13 ] | ||||||||||||||||||
|
Xyratex has tested a patch with a tunable number of readdir pages per transfer. Here is the result of running 8 clients, 8 ppn, and 64k files in total:
The big readdir chunk size in the 2.x client does cause the lock ping-pong and impacts overall unlink performance for a shared directory. | ||||||||||||||||||
| Comment by Andreas Dilger [ 15/May/13 ] | ||||||||||||||||||
|
Alexander, thanks for the data. In this case it makes sense for the client or MDS to auto-tune the readdir chunk size based on usage. If there are readahead directory pages that are being dropped from cache without being accessed, then it makes sense to reduce the readdir chunk size automatically. Unfortunately, I don't think there is any way for the kernel to know that "rm -r" or metabench is the one calling readdir() the first time, so it would still make sense to read a full 1MB of readdir data on first access, and only drop the readdir() chunk size on later access. | ||||||||||||||||||
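The auto-tuning idea above can be sketched minimally; the names and constants here are hypothetical, not Lustre code. Keep the full chunk for the first access (so a plain "ls" of a large directory stays fast), then halve the chunk size whenever more than half of the previously fetched pages were dropped from cache without ever being accessed.

```c
/*
 * Hypothetical readdir chunk auto-tuning heuristic; names and
 * thresholds are illustrative, not from Lustre.
 */
#include <assert.h>
#include <stddef.h>

#define RDIR_CHUNK_MAX_PAGES 256  /* 1MB of 4KB pages on first access */
#define RDIR_CHUNK_MIN_PAGES 2

/* Halve the chunk whenever most prefetched pages were evicted unused;
 * otherwise leave it alone. Never shrink below the minimum. */
size_t rdir_tune_chunk(size_t cur_pages, size_t fetched_pages,
                       size_t dropped_unused)
{
    if (fetched_pages != 0 &&
        dropped_unused * 2 > fetched_pages &&
        cur_pages > RDIR_CHUNK_MIN_PAGES)
        return cur_pages / 2;
    return cur_pages;
}
```

A mixed readdir+unlink workload that keeps evicting prefetched pages would walk the chunk down from 256 pages toward the minimum, while a sequential scan that consumes its pages would stay at the full chunk.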
| Comment by Oleg Drokin [ 15/May/13 ] | ||||||||||||||||||
|
OK, so I think the correct approach here is not to reduce the transfer size. Instead, we should use the other idea we dismissed in the past: POSIX allows us to retain cached pages for readdir purposes as long as the filehandle remains open. If we do that now, we'll still have all the benefits of fast rm (even faster, due to no need to refresh the pages all the time) and at the same time avoid needless page ping-pong for readdir. In the past this was deemed unimportant, since we did not do statahead and the extra ping-pong was minimal, but now I think it becomes more important. | ||||||||||||||||||
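A back-of-envelope model of this trade-off (my numbers, not measured): if every cross-client unlink cancels the directory lock so the next readdir() must re-fetch a whole chunk, removing nfiles moves on the order of nfiles * chunk_pages pages over the network, whereas retaining the cached pages for the open filehandle until close()/rewinddir() costs a single pass over the directory.

```c
/*
 * Toy cost model (illustrative only): pages transferred under the two
 * behaviours being compared in this thread.
 */
#include <assert.h>
#include <stddef.h>

/* Re-fetch a full chunk after each of 'nfiles' unlinks. */
size_t pages_refetch_per_unlink(size_t nfiles, size_t chunk_pages)
{
    return nfiles * chunk_pages;
}

/* Read the directory once and keep the pages for the open fd. */
size_t pages_cached_until_close(size_t dir_pages)
{
    return dir_pages;
}
```

For 64k files and a 256-page (1MB) chunk, the re-fetch behaviour moves millions of pages while the cached-until-close behaviour moves only the directory itself, which is why keeping the pages looks attractive.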
| Comment by Nathan Rutman [ 15/May/13 ] | ||||||||||||||||||
|
@Oleg - but don't the cached readdir pages have to be dropped if other clients are simultaneously deleting files? Are you saying POSIX allows for operation on stale directory entries? | ||||||||||||||||||
| Comment by Oleg Drokin [ 15/May/13 ] | ||||||||||||||||||
|
What I mean is that POSIX allows pages to remain cached for the application that has the filehandle open, for as long as the filehandle remains open. But right now, when locks get cancelled, we just drop the pages from cache completely. | ||||||||||||||||||
| Comment by Andreas Dilger [ 15/May/13 ] | ||||||||||||||||||
|
The relevant section of the readdir() definition in SUSv2, http://pubs.opengroup.org/onlinepubs/007908799/xsh/readdir.html: "If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified."
So, readdir() processing does NOT need to be kept coherent w.r.t. entries being added to or removed from the directory until rewinddir() is called on that stream. | ||||||||||||||||||
| Comment by Andreas Dilger [ 19/Aug/15 ] | ||||||||||||||||||
|
If this is still an area of interest for someone to fix: to start with, the mdc_obd_max_pages_per_rpc lproc entry needs to be fixed to allow it to be written (or at least shrunk) so that lctl set_param mdc.*.max_pages_per_rpc=N is allowed. See the comment in mdc/lproc_mdc.c:
/*
* FIXME: below proc entry is provided, but not in used, instead
* sbi->sb_md_brw_size is used, the per obd variable should be used
* when CMD is enabled, and dir pages are managed in MDC layer.
* Remember to enable proc write function.
*/
{ .name = "max_pages_per_rpc",
.fops = &mdc_obd_max_pages_per_rpc_fops },
At that point it would be easy to do a few benchmarks on a decent sized filesystem (a few million files with a variety of directory sizes, including some larger dirs @ 10K+ entries) to see what effect the readdir BRW size has on real-world readdir performance. Clearly there will be some advantage for readdir of larger directories, but if this is outweighed by the performance loss for unlinks in a shared directory then it might be possible to have a reduced readdir size by default (e.g. 32KB or 64KB) that still gives decent readdir performance but reduces overhead for unlinks. | ||||||||||||||||||
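For reasoning about such a benchmark, a tiny helper like this (assuming 4KB pages; the sizes are illustrative, not measured) shows how the readdir RPC count scales with the max_pages_per_rpc setting:

```c
/*
 * Illustrative helper: number of readdir RPCs needed to stream a
 * directory of 'dir_bytes' at a chunk of 'chunk_pages' 4KB pages.
 */
#include <assert.h>
#include <stddef.h>

size_t readdir_rpcs(size_t dir_bytes, size_t chunk_pages)
{
    size_t chunk_bytes = chunk_pages * 4096;

    /* Round up: a partial chunk still costs one RPC. */
    return (dir_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

A 1MB directory needs one RPC at the full 256-page chunk but sixteen RPCs at 16 pages, which is the readdir cost that any smaller default would have to trade against the shared-directory unlink gains.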
| Comment by Gerrit Updater [ 20/Mar/17 ] | ||||||||||||||||||
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26088 | ||||||||||||||||||
| Comment by Gerrit Updater [ 27/Mar/17 ] | ||||||||||||||||||
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26212 | ||||||||||||||||||
| Comment by Gerrit Updater [ 02/May/17 ] | ||||||||||||||||||
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26088/ | ||||||||||||||||||
| Comment by Gerrit Updater [ 21/Sep/17 ] | ||||||||||||||||||
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26212/ | ||||||||||||||||||
| Comment by Peter Jones [ 21/Sep/17 ] | ||||||||||||||||||
|
Landed for 2.11 | ||||||||||||||||||
| Comment by Andreas Dilger [ 21/Sep/17 ] | ||||||||||||||||||
|
This is not actually fixed yet, just minor cleanup patches landed. | ||||||||||||||||||
| Comment by Alex Zhuravlev [ 22/Sep/17 ] | ||||||||||||||||||
|
With c75aa6c74cd86c69de893395735a6725571f11f4 landed, I'm getting the following in local testing:
== sanity test 102d: tar restore stripe info from tarfile,not keep osts == 08:27:52 (1506058072)
setup_test102() creates d102d.sanity, then test_102d() tries to create it too. | ||||||||||||||||||
| Comment by Andreas Dilger [ 19/Apr/18 ] | ||||||||||||||||||
|
nrutman, aboyko, now that patch https://review.whamcloud.com/26088/ "mdc: allow setting readdir RPC size parameter" has landed, could someone at Cray run some testing to see what a better readdir RPC size is for the original workload? From comment-58547 it looks to be somewhere between 2 and 32 pages by default. We could start by reducing the default RPC size to something that gives good performance out of the box for most workloads. Separately, a patch to continue caching readdir() data on the client fd after lock cancellation (per comment-58611 and a few replies) until close() or rewinddir() (I believe this ends up as seek(0) in the kernel) would allow even better client performance, since the client would not need to re-fetch the directory for every deleted file. | ||||||||||||||||||
| Comment by Cory Spitz [ 16/May/18 ] | ||||||||||||||||||
|
I will ask Eugene if he can take a look. | ||||||||||||||||||
| Comment by Nathan Rutman [ 06/Jun/19 ] | ||||||||||||||||||
|
See also | ||||||||||||||||||
| Comment by Andreas Dilger [ 23/Sep/19 ] | ||||||||||||||||||
|
Zam, this is the ticket discussed today that describes possible optimizations for single-threaded "rm -r" and other workloads that are doing mixed readdir() + unlink(). |