[LU-3308] large readdir chunk size slows unlink/"rm -r" performance Created: 10/May/13  Updated: 01/Feb/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Nathan Rutman Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: lug23dd, medium, performance

Issue Links:
Duplicate
duplicates LU-5232 cache directory contents on file desc... Resolved
is duplicated by LU-4906 rm -rf triggers too much MDS_READPAGE Resolved
is duplicated by LU-4096 Do not allocate large buffer for read... Resolved
Related
is related to LU-10225 sanity test 1 error: apply rmdir/rm o... Open
is related to LU-1431 Support for larger than 1MB sequentia... Resolved
is related to LU-5232 cache directory contents on file desc... Resolved
is related to LU-15535 deadlock on lli->lli_lsm_sem Open
is related to LU-1167 Poor mdtest unlink performance with m... Open
is related to LU-9458 LustreError: 12764:0:(sec_bulk.c:188:... Resolved
is related to LU-10999 Use readdir cache for lookup when ava... Open
is related to LU-11000 Retain cached dentries under director... Open
is related to LU-17493 restore LDLM cancel on blocking callback Open
is related to LU-8641 speedup run_metabech () : make cleanu... Resolved
Severity: 3
Rank (Obsolete): 8195

 Description   

Shared directory unlinks using metabench seem significantly slower with Lustre 2.x clients versus Lustre 1.8.x

512 processes, 8 ppn, 2.1 server
1.8.6 clients: 18k unlinks/s
2.3 clients: 7k unlinks/s

Creates are comparable at 28k creates/s.

Mdtest shows no such regression. Digging into metabench a little, it seems that when deleting files, metabench processes (I am told) do a readdir, select "the next" file, and delete it, effectively racing each other, whereas mdtest deterministically chooses the files to delete based on process id.
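To make the contrast concrete, here is a minimal sketch (hypothetical helper, not actual mdtest or metabench code; `rank`/`nprocs` stand in for the MPI process id and count) of the deterministic ownership test that mdtest effectively uses, which avoids the readdir race entirely:

```c
/* mdtest-style deterministic selection: each process "owns" every
 * nprocs-th entry, so no two processes ever try to delete the same file
 * and nobody needs to re-read the directory to find the next victim.
 * metabench, by contrast, has every process readdir and grab "the next"
 * live entry, so all processes contend for the same directory pages. */
static int owns_entry(int index, int nprocs, int rank)
{
	return index % nprocs == rank;
}
```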

This seems to imply it's directory locking contention on the MDT.

On the clients, the majority of the time difference is spent in ptlrpc_queue_wait (not sure of units):
1.8.8: 347
2.3: 923

On the MDT, the big difference is mdt_object_find_lock:
1.8.8: 51us
2.3: 1110us

Also, using ldlm stats, it seems the 2.3 clients cause twice as many ldlm_bl_callbacks as 1.8.8 clients.

So apparently the 2.3 client is holding directory locks differently than the 1.8 clients. We're still looking into this, but if anyone has thoughts we'd love to hear them.



 Comments   
Comment by Nathan Rutman [ 10/May/13 ]

Xyratex-bug-id: LELUS-138

Comment by Keith Mannthey (Inactive) [ 10/May/13 ]

Is this the metabench you are referring to? http://www.7byte.com/?page=metabench

Have you had a chance to spin master lately?

Comment by Cory Spitz [ 10/May/13 ]

Keith, the Metabench code we're using is attached to LU-1167. It is not the version from 7byte.com. Yes, master (well, 2.3.63) isn't any better at all. Maybe even a little worse.

Comment by Andreas Dilger [ 10/May/13 ]

It would be useful to get the performance of 2.1 clients as well, to see how it stacks up.

The client ptlrpc_queue_wait() is just a side-effect of the client waiting for the server to complete the requests. It might be useful to gather stats (strace?) to see which operations it is waiting on - unlink() or readdir()

One area to check is that the 2.3+ clients will do readdir in large chunks compared to earlier versions. If metabench clients are continually doing readdir and unlinking files, then this might cause a larger amount of data to be sent to all of the clients. An application might do the readdir() once at the beginning, then delete entries (N % num_clients + client_no) so that it only has to do readdir once. Looking at the metabench code, it depends on how large the cache in glibc is, since it is calling readdir() for every entry. It also appears to be doing lstat(filename) on every inode before deleting it, definitely a case of lock ping-pong.
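The readdir-once pattern suggested above can be sketched like this (hypothetical helper, not metabench code; `rank`/`nprocs` are assumed to come from MPI, and sizes and error handling are kept minimal for brevity):

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_ENTRIES 1024

/* Scan the directory once into a local list, close it, and only then
 * unlink the entries this process "owns" (index % nprocs == rank).
 * The directory is never re-read to pick the next victim, so the dir
 * pages are not fetched over and over while entries disappear. */
static int delete_owned(const char *path, int nprocs, int rank)
{
	char names[MAX_ENTRIES][256];
	char full[512];
	struct dirent *de;
	DIR *d;
	int count = 0, deleted = 0, i;

	d = opendir(path);
	if (d == NULL)
		return -1;
	while ((de = readdir(d)) != NULL && count < MAX_ENTRIES) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(names[count], sizeof(names[count]), "%s", de->d_name);
		count++;
	}
	closedir(d);	/* one readdir pass; the stream is done with */

	for (i = 0; i < count; i++) {
		if (i % nprocs != rank)
			continue;	/* another process owns this entry */
		snprintf(full, sizeof(full), "%s/%s", path, names[i]);
		if (unlink(full) == 0)
			deleted++;
	}
	return deleted;
}
```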

Comment by Oleg Drokin [ 10/May/13 ]

I wonder if ELC broke and that leads to bigger number of blocking callbacks in 2.3?
Do you guys have some traces or what sort of analysis did you perform already?

Comment by Cory Spitz [ 12/May/13 ]

Yes we do have traces and analysis to share, which will be coming shortly.

I don't have 2.1 client performance handy, but I can tell you that 2.2 gives similar results. In fact, 2.2, 2.3, and master/2.4 (2.3.63) are all in the same neighborhood. Interestingly enough, removing the lstat() completely didn't change the rates at all. I don't know if that necessarily throws out any theories.

Also, Nathan noted that single dir unlink scenario in mdtest shows no such regression. But it is also worth noting that the same Metabench behavior of readdir()+lstat()+unlink() doesn't show any regression for the multi-dir case. Would that kind of information rule out the theory on any ELC breakage?

Comment by Cory Spitz [ 13/May/13 ]

We have verified that removing the readdir() + lstat() from Metabench creates mdtest-like results. Using a 2.2 server (with pdirops) did not correct the regression. [It may have boosted the performance, but we didn't gather a baseline on the test HW/SW config].

Performance of the readdir&lstat-less Metabench version on 2.3 was on-par to even the Metabench multidir case (server @ 2.2). Removing the readdir()+lstat() from Metabench had little effect on 1.8.x client runs.

Previous tests with only lstat removed had little to no effect on 2.x clients, but we plan to quickly introduce it to the readdir-less version to further validate that the regression is caused by the interference of readdir with unlink.

Comment by Alex Zhuravlev [ 13/May/13 ]

can statahead be involved somehow? I'd suggest looking at the RPC stats on all sides.

Comment by Oleg Drokin [ 13/May/13 ]

Cory:
Can you please disable stat-ahead and then run two metabench tests: one like the original with everything in place and one with just lstat() removed?
I suspect that in this configuration removing lstat will be helpful and further highlight ELC breakage of some sort.

Comment by Alexander Boyko [ 15/May/13 ]

Xyratex has tested a patch with a tunable number of readdir pages per data transfer. Here are the results of running 8 clients, 8 ppn, and 64k files in total:

                                                   create  unlink
2.3 client, 1-page readdir chunk size               14380    8938
2.3 client, 2-page readdir chunk size               14339    8419
2.3 client, 32-page readdir chunk size              14296    6176
2.3 client, 256-page readdir chunk size (default)   14247    3739
1.8.8 client                                        15026    9819

The big readdir chunk size in 2.x clients does cause the lock ping-pong and impacts overall unlink performance for shared directories.

Comment by Andreas Dilger [ 15/May/13 ]

Alexander, thanks for the data.

In this case it makes sense for the client or MDS to auto-tune the readdir chunk size based on usage. If there are readahead directory pages that are being dropped from cache without being accessed, then it makes sense to reduce the readdir chunk size automatically. Unfortunately, I don't think there is any way for the kernel to know that "rm -r" or metabench is the one calling readdir() the first time, so it would still make sense to read a full 1MB of readdir data on first access, and only drop the readdir() chunk size on later access.
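One possible shape for such a heuristic (purely a sketch of the idea above, not anything in Lustre; the 256-page full chunk matches the 2.x default, i.e. 1MB of 4KB directory pages):

```c
/* Hypothetical auto-tuning sketch: read a full chunk on the first access
 * to a directory, then halve the chunk size whenever readahead pages were
 * evicted without ever being read, down to a one-page floor. */
static int next_chunk_pages(int cur_pages, int first_access,
			    int unused_pages_dropped)
{
	const int full_chunk = 256;	/* 1MB of dir pages, the 2.x default */
	const int floor = 1;

	if (first_access)
		return full_chunk;	/* can't tell "rm -r" from plain ls here */
	if (unused_pages_dropped && cur_pages > floor)
		return cur_pages / 2;	/* readahead was wasted; back off */
	return cur_pages;
}
```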

Comment by Oleg Drokin [ 15/May/13 ]

Ok, so I think the correct approach here is not to reduce the transfer size, but instead we need to use the other idea we dismissed in the past:

POSIX allows us to retain cached pages for the readdir purposes as long as the filehandle remains open, so if we do that now, we'll still have all the benefits of fast rm (even faster due to no need to refresh the pages all the time) and at the same time avoid needless page pingpong for readdir.

In the past this was deemed unimportant since we did not do statahead so extra pingpong was pretty minimal, but now it becomes more important I think.

Comment by Nathan Rutman [ 15/May/13 ]

@Oleg - but don't the cached readdir pages have to be dropped if other clients are simultaneously deleting files? Are you saying POSIX allows for operation on stale directory entries?

Comment by Oleg Drokin [ 15/May/13 ]

What I mean is posix allows for pages to remain cached for the application that has the filehandle open, for as long as the filehandle remains open.
All other users should get their own pages that are fresh at the time of getting.

So basically now, when locks get cancelled, we just drop the pages from cache completely.
What we'll need to do instead is remove the reference to them from the generic mapping (truncate), but not drop them completely; instead, reference them from all the open filehandles. Then, when doing readdir, we'll check whether we have such pages referenced from the filehandle before doing a lookup in the main cache, and if both miss, do an RPC.
Pages would get dereferenced on close, so when the last opener of the file goes away they'd be freed (obviously we will need some other form of truncation in case of tight memory, too).
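A toy model of that lookup order (entirely hypothetical structures, not the llite implementation) might look like:

```c
#include <stddef.h>

struct dir_page { int index; };

/* Pages pinned by one open filehandle: per POSIX these may stay visible
 * to that reader even after the DLM lock (and the shared copy) is gone. */
struct dir_handle { struct dir_page *private_pages[16]; };

/* The shared mapping, emptied whenever the directory lock is cancelled. */
struct dir_mapping { struct dir_page *shared_pages[16]; };

enum page_src { FROM_HANDLE, FROM_MAPPING, NEED_RPC };

static enum page_src lookup_page(struct dir_handle *fh,
				 struct dir_mapping *map, int index)
{
	if (index < 16 && fh->private_pages[index])
		return FROM_HANDLE;	/* stale-but-POSIX-legal fd copy */
	if (index < 16 && map->shared_pages[index])
		return FROM_MAPPING;	/* fresh shared copy */
	return NEED_RPC;		/* must fetch from the MDS */
}
```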

Comment by Andreas Dilger [ 15/May/13 ]

The relevant sections in the readdir(2) definition in SUSv2 http://pubs.opengroup.org/onlinepubs/007908799/xsh/readdir.html:

... Directory entries represent files; files may be removed from a directory or added to a directory asynchronously to the operation of readdir().

If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified.

The readdir() function may buffer several directory entries per actual read operation;

So, the readdir() processing does NOT need to be kept coherent w.r.t. entries being added or removed to the directory until rewinddir() is called on that stream.

Comment by Andreas Dilger [ 19/Aug/15 ]

If this is still an area of interest for someone to fix, to start with the mdc_obd_max_pages_per_rpc lproc entry needs to be fixed to allow it to be written (shrunk at least) so that lctl set_param mdc.*.max_pages_per_rpc=N is allowed. See comment in mdc/lproc_mdc.c:

        /*                           
         * FIXME: below proc entry is provided, but not in used, instead
         * sbi->sb_md_brw_size is used, the per obd variable should be used
         * when CMD is enabled, and dir pages are managed in MDC layer.
         * Remember to enable proc write function.
         */
        { .name =       "max_pages_per_rpc",
          .fops =       &mdc_obd_max_pages_per_rpc_fops },

At that point it would be easy to do a few benchmarks on a decent sized filesystem (a few million files with a variety of directory sizes, including some larger dirs @ 10K+ entries) to see what effect the readdir BRW size has on real-world readdir performance. Clearly there will be some advantage for readdir of larger directories, but if this is outweighed by the performance loss for unlinks in a shared directory then it might be possible to have a reduced readdir size by default (e.g. 32KB or 64KB) that still gives decent readdir performance but reduces overhead for unlinks.

Comment by Gerrit Updater [ 20/Mar/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26088
Subject: LU-3308 mdc: allow setting readdir RPC size parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dd8181ad51af55f7317e7087adb1cad33338158f

Comment by Gerrit Updater [ 27/Mar/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26212
Subject: LU-3308 tests: fix sanity/sanityn test_mkdir() usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6c5270804141c1c4b48be8d357a4686f42c70e18

Comment by Gerrit Updater [ 02/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26088/
Subject: LU-3308 mdc: allow setting readdir RPC size parameter
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 664bad91b5c0e2a5c9423d4b524d9906db9ef9b5

Comment by Gerrit Updater [ 21/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26212/
Subject: LU-3308 tests: fix sanity/sanityn test_mkdir() usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c75aa6c74cd86c69de893395735a6725571f11f4

Comment by Peter Jones [ 21/Sep/17 ]

Landed for 2.11

Comment by Andreas Dilger [ 21/Sep/17 ]

This is not actually fixed yet, just minor cleanup patches landed.

Comment by Alex Zhuravlev [ 22/Sep/17 ]

with c75aa6c74cd86c69de893395735a6725571f11f4 landed I'm getting the following in the local testing:

== sanity test 102d: tar restore stripe info from tarfile,not keep osts == 08:27:52 (1506058072)
mkdir: cannot create directory `/mnt/lustre/d102d.sanity': File exists
sanity test_102d: @@@@@@ FAIL: mkdir '/mnt/lustre/d102d.sanity' failed

setup_test102() creates d102d.sanity, then test_102d() tries to create it too

Comment by Andreas Dilger [ 19/Apr/18 ]

nrutman, aboyko, now that patch https://review.whamcloud.com/26088/ "mdc: allow setting readdir RPC size parameter" is landed, could someone at Cray run some testing to see what a better readdir RPC size is for the original workload? From comment-58547 it looks to be somewhere between 2 and 32 pages by default. We could start by reducing the default RPC size to something that gives good performance out of the box for most workloads.

Separately, getting a patch to continue caching readdir() data on the client fd after lock cancellation (per comment-58611 and a few replies) until close() or rewinddir() (I believe this ends up as seek(0) in the kernel) would allow even better client performance, since it would not need to re-fetch the directory for every deleted file.

Comment by Cory Spitz [ 16/May/18 ]

I will ask Eugene if he can take a look.

Comment by Nathan Rutman [ 06/Jun/19 ]

See also LU-3240, esp. patch https://review.whamcloud.com/#/c/7909/

Comment by Andreas Dilger [ 23/Sep/19 ]

Zam, this is the ticket discussed today that describes possible optimizations for single-threaded "rm -r" and other workloads that are doing mixed readdir() + unlink().

Generated at Sat Feb 10 01:32:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.