[LU-1167] Poor mdtest unlink performance with multiple processes per node
| Created: | 02/Mar/12 | Updated: | 10/May/13 |
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nathan Rutman | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Environment: | SL6.1 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Epic: | metadata, performance |
| Rank (Obsolete): | 7703 |
| Description |
|
We have noticed in testing that running multiple mdtest processes per node severely degrades unlink performance.

| nodes | total jobs | unlink kops |
| 8 | 8 | 11.4 |
| 8 | 16 | 9.7 |
| 8 | 32 | 8.6 |
| 8 | 64 | 7 |

We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers. |
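For reference, the shape of this workload (several unlinking processes per client node, all operating in one shared directory) can be sketched with a minimal fork-based program like the one below. This is not the mdtest invocation used for the table above; mdtest runs under MPI and barriers between its create and unlink phases, and the process count, file count, and mount path here are placeholders.

{code:c}
/*
 * Hypothetical reproduction sketch (not the original mdtest runs):
 * fork NPROCS workers on one node; each creates NFILES files in a
 * single shared directory and then times its unlink phase.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>

#define NPROCS  8       /* "jobs per node" in the table above */
#define NFILES  1000    /* files per worker, illustrative only */
#define TESTDIR "/mnt/lustre/shared_dir"   /* assumed mount point */

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

static void worker(int rank)
{
        char path[256];
        int i;

        /* create phase */
        for (i = 0; i < NFILES; i++) {
                snprintf(path, sizeof(path), TESTDIR "/f.%d.%d", rank, i);
                close(open(path, O_CREAT | O_WRONLY, 0644));
        }

        /* unlink phase: this is the rate that degrades with NPROCS */
        double t0 = now();
        for (i = 0; i < NFILES; i++) {
                snprintf(path, sizeof(path), TESTDIR "/f.%d.%d", rank, i);
                unlink(path);
        }
        printf("rank %d: %.0f unlinks/sec\n", rank, NFILES / (now() - t0));
}

int main(void)
{
        int p;

        mkdir(TESTDIR, 0755);   /* ignore EEXIST for the sketch */
        for (p = 0; p < NPROCS; p++)
                if (fork() == 0) {
                        worker(p);
                        _exit(0);
                }
        while (wait(NULL) > 0)
                ;
        return 0;
}
{code}

Built with cc and pointed at a Lustre mount, the per-worker unlink rate in the shared-directory case should fall as NPROCS grows; giving each worker its own subdirectory (as mdtest -u does) removes the client-side parent-directory contention discussed in the comments below. |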
| Comments |
| Comment by Nathan Rutman [ 02/Mar/12 ] |
|
tabs didn't come through; the columns are nodes, total jobs, and unlink kops. Anyone else seen this behavior? |
| Comment by Andreas Dilger [ 04/Mar/12 ] |
|
Nathan,
|
| Comment by Nathan Rutman [ 04/Mar/12 ] |
There definitely is directory contention (I get much better rates with -u), but I'm still wondering why it should change so dramatically depending on the number of threads per client, and not with the number of clients. Why should having more than one thread on a client have any effect on the overall rate, assuming there are enough clients to saturate the MDS?
Right. |
| Comment by Nathan Rutman [ 13/Mar/12 ] |
|
Here are some Hyperion results for 8 ppn on 100 nodes (unique dir); performance is also pretty poor. |
| Comment by Nathan Rutman [ 04/Apr/12 ] |
|
Continuing to pursue this, there are four identified bottlenecks for unlink performance:

a. Parent directory mutex in the Linux kernel VFS (in do_unlinkat). This greatly affects shared-directory operations within a single client (avg latency: 1 ppn = 0 microseconds, 4 ppn = 6460, while the Lustre unlink RPC is 250, constant for 1 and 4 ppn). Measured with dir-per-client, to remove the MDT shared-directory ldiskfs lock and shared-lock ldlm callbacks.

b. Single MDC RPC-in-flight (rpc_lock) serializing RPCs. This also greatly affects multiple-ppn operations (1 ppn = 0, 2 ppn = 180, 4 ppn = 1080, 8 ppn = 3530). Measured with dir-per-process to avoid the parent mutex, but sharing the same dirs between clients to include ldiskfs and ldlm effects. While it may be possible to remove this restriction (MRP-59), doing so may be very complex due to ordering issues. (Note …)

c. Shared-dir MDT ldlm lock. Lock callback time increases slowly with increased ppn (1 ppn = 130, 8 ppn = 240), possibly due to increasing client context-switching time. Measured as in b above. Not expected to increase with client count. Possibly could be eliminated by having the MDS act as a proxy lock holder for multiple shared-dir clients, but not much gain is possible here.

d. Shared-dir ldiskfs lock. Contention for the same lock on the MDT directory, mostly independent of ppn but will increase with client count (subtracting out lock callback latency, 1 ppn = 90, 8 ppn = 130, measured as in b with 8 clients). Pdirops (…)

Cliff, I was wondering if you have any 1 ppn mdtest results on Hyperion for different client counts? I can't find any on Maloo (and indeed the links above seem to have lost their logs as well). Also, can you tell me if the results above included pdirops (LU-50, Lustre 2.2.0)? |
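As an illustration of bottleneck (b): a single client-wide lock around a fixed-cost operation makes the extra per-operation wait grow with the number of local threads competing for it, which is the shape of the per-ppn latencies above. The toy pthread model below reproduces that behaviour; it is not Lustre's mdc rpc_lock code, and apart from the ~250 microsecond RPC time quoted above, all parameters are illustrative.

{code:c}
/*
 * Toy model of bottleneck (b): one client-wide lock serializing
 * modifying RPCs.  Each thread repeatedly takes the shared lock,
 * holds it for RPC_US microseconds (the modeled unlink RPC time),
 * and records how long it waited to acquire the lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define RPC_US 250      /* modeled unlink RPC service time */
#define NOPS   200      /* "unlinks" per thread, illustrative */
#define MAXTHR 64

static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;

static long long now_us(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000000LL + tv.tv_usec;
}

static void *thread_fn(void *arg)
{
        long long *total_wait = arg;
        int i;

        for (i = 0; i < NOPS; i++) {
                long long t0 = now_us();
                pthread_mutex_lock(&rpc_lock);   /* one RPC in flight */
                *total_wait += now_us() - t0;    /* queueing delay */
                usleep(RPC_US);                  /* the RPC itself */
                pthread_mutex_unlock(&rpc_lock);
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int nthreads = argc > 1 ? atoi(argv[1]) : 4;   /* "ppn" */
        pthread_t tid[MAXTHR];
        long long wait[MAXTHR] = { 0 };
        long long sum = 0;
        int i;

        if (nthreads < 1)
                nthreads = 1;
        if (nthreads > MAXTHR)
                nthreads = MAXTHR;

        for (i = 0; i < nthreads; i++)
                pthread_create(&tid[i], NULL, thread_fn, &wait[i]);
        for (i = 0; i < nthreads; i++)
                pthread_join(tid[i], NULL);

        for (i = 0; i < nthreads; i++)
                sum += wait[i];
        printf("%d threads: avg extra wait per op = %lld us\n",
               nthreads, sum / (nthreads * NOPS));
        return 0;
}
{code}

Running the model with 1, 2, 4, and 8 threads shows the average extra wait per operation climbing with the thread count even though the "RPC" cost stays fixed, which is why per-node unlink throughput can flatten out well before the MDS itself is saturated. |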
| Comment by Cory Spitz [ 04/Apr/12 ] |
|
Nathan, I agree with the bottlenecks that you've identified, but I don't think that any of them are regressions (though maybe I'm wrong about c?). In the description you wrote, "We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers." Shouldn't we first focus on the regression from 1.8.6 to 2.x? |
| Comment by Nathan Rutman [ 04/Apr/12 ] |
|
Hmm, that's a good point Cory. It gets a little obfuscated with the number of variations in client count, server type, and storage layout. Our 1.8.6 5-client test didn't show the drop with increasing ppn; it remained constant at 5kops. Our 2.1 8-client test did show the decrease, but started from 13kops and went down to 9kops. So it's a little hard to call this a clear regression, when the 2.1 numbers are all above the 1.8 numbers. |
| Comment by Cory Spitz [ 04/Apr/12 ] |
|
If the HW was fixed for the comparison then that doesn't sound like a regression, just a major improvement with different dynamics. |
| Comment by Nathan Rutman [ 11/Apr/12 ] |
|
Cory: yes, but it's a little unclear. |
| Comment by Cory Spitz [ 15/Apr/12 ] |
|
We could run some apples-to-apples numbers on Cray gear between 1.8.6 and 2.1.1. What kind of scale do you need? |
| Comment by Mark Mansk [ 12/Apr/13 ] |
|
Excel sheet with comparisons of 1.8.6 to 2.3 & 2.4 using metabench. It shows the performance drop from 1.8.6 for deletes in the same directory on 2.x. |
| Comment by Mark Mansk [ 12/Apr/13 ] |
|
Data from metabench runs for 1.8.6, 2.3, and 2.4. |
| Comment by Cory Spitz [ 08/May/13 ] |
|
This bug should be marked as affecting all 2.x versions. |
| Comment by Mark Mansk [ 08/May/13 ] |
|
metabench source |
| Comment by Cory Spitz [ 10/May/13 ] |
|
It looks as though the comments that were made on 04/Apr/12 were about right. LU-3308 was opened to look into the regression aspect (which is seemingly unrelated to multiple processes per node). |