[LU-1167] Poor mdtest unlink performance with multiple processes per node Created: 02/Mar/12  Updated: 10/May/13

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Nathan Rutman Assignee: WC Triage
Resolution: Unresolved Votes: 1
Labels: None
Environment:

SL6.1,
2.6.32-131.12.1.el6.lustre.20.x86_64


Attachments: Text File license.txt     Microsoft Excel metabench-compare.xlsx     Text File metabench-comparison.txt     File metabench.tar
Issue Links:
Related
is related to LU-3308 large readdir chunk size slows unlink... Reopened
is related to LU-1695 Demonstrate MDS performance with incr... Resolved
Severity: 3
Epic: metadata, performance
Rank (Obsolete): 7703

 Description   

We have noticed in testing that running multiple mdtest processes per node severely degrades unlink performance.

  • Lustre mounted once per client; not multimount.
  • Shared directory case.

This can be seen at a wide range of node counts (5-128) and backends, to varying degrees. Interestingly, scaling the client count up does not seem to have nearly the same negative performance impact; only the number of processes per node (ppn) seems to matter.

nodes   total jobs   unlink kops
8       8            11.4
8       16           9.7
8       32           8.6
8       64           7.0
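
As a back-of-the-envelope check (a rough sketch that assumes the "unlink kops" column is the aggregate rate across all jobs), the per-process rate falls off much faster than the aggregate as ppn goes up:

# Per-node and per-process unlink rates derived from the table above.
# Assumes "unlink kops" is the aggregate rate across all jobs.
rows = [
    # (nodes, total jobs, aggregate kops)
    (8, 8, 11.4),
    (8, 16, 9.7),
    (8, 32, 8.6),
    (8, 64, 7.0),
]

for nodes, jobs, kops in rows:
    ppn = jobs // nodes
    per_node = kops * 1000 / nodes   # unlinks/sec per node
    per_proc = kops * 1000 / jobs    # unlinks/sec per process
    print(f"{ppn} ppn: {kops:5.1f} kops aggregate, "
          f"{per_node:7.1f}/s per node, {per_proc:6.1f}/s per process")

Going from 1 to 8 ppn on the same 8 nodes costs about 39% of the aggregate rate and roughly 13x of the per-process rate, even though no extra clients were added.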

We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers.



 Comments   
Comment by Nathan Rutman [ 02/Mar/12 ]

To clarify the table: the columns are nodes, total jobs, and unlink kops (thousands of unlink operations per second).

Anyone else seen this behavior?

Comment by Andreas Dilger [ 04/Mar/12 ]

Nathan,
there are a couple of things worth trying out here:

  • Lustre 2.2 has pdirops on the MDS, so if there is directory contention at the server this would be reduced or eliminated. Presumably this is not a regression from 2.1.0 server performance (i.e. you only compared 1.8.6 and 2.1.1, right)?
  • The patch that Liang made in LU-933 (http://review.whamcloud.com/2084) allows testing concurrent modifying metadata RPCs from the same client (it breaks recovery, so it is NOT suitable for real-world usage). But since you indicate the same problem happens with both 1.8 and 2.1 clients against 2.1.1 servers, I suspect that the problem is on the server side.

Comment by Nathan Rutman [ 04/Mar/12 ]

- Lustre 2.2 has pdirops on the MDS, so if there is directory contention at the server this would be reduced or eliminated.

There definitely is directory contention (I get much better rates with -u, i.e. a unique working directory per task), but I'm still wondering why it should change so dramatically with the number of threads per client and not with the number of clients. Why should having more than one thread on a client have any effect on the overall rate, assuming there are enough clients to saturate the MDS?
The pdirops patch and LU-933 are definitely something I will investigate, but I'd still like to understand the current behavior.

- Presumably this is not a regression from 2.1.0 server performance (i.e. you only compared 1.8.6 and 2.1.1, right)?

Right.

Comment by Nathan Rutman [ 13/Mar/12 ]

Here are some Hyperion results for 8 ppn on 100 nodes; performance is also pretty poor.
shared dir
https://maloo.whamcloud.com/sub_tests/9bda4762-6740-11e1-a671-5254004bbbd3
000: File creation : 10373.125 7962.209 9318.090 1006.979
000: File removal : 2532.009 2325.865 2402.659 91.998

unique dir
https://maloo.whamcloud.com/sub_tests/9bf3f5f4-6740-11e1-a671-5254004bbbd3
000: File creation : 10543.279 10210.804 10392.223 137.420
000: File removal : 4178.868 3979.381 4093.730 84.019
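
Reading the mdtest columns as Max / Min / Mean / Std Dev in aggregate operations per second (the usual mdtest summary layout; assumed here since the header row is not shown), the mean removal rates work out to very low per-process numbers at 100 nodes x 8 ppn:

# Rough per-process removal rates for the 100-node, 8-ppn Hyperion runs.
# Assumes the four mdtest columns are Max/Min/Mean/StdDev aggregate ops/sec.
procs = 100 * 8   # 100 nodes x 8 processes per node

for label, mean_removal in (("shared dir", 2402.659), ("unique dir", 4093.730)):
    print(f"{label}: {mean_removal:8.1f} ops/s aggregate -> "
          f"{mean_removal / procs:4.1f} ops/s per process")

Shared-directory removal runs at roughly 59% of the unique-directory rate on the same configuration.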

Comment by Nathan Rutman [ 04/Apr/12 ]

Continuing to pursue this, I have identified four bottlenecks for unlink performance:

a. Parent directory mutex in the Linux kernel VFS (taken in do_unlinkat). This greatly affects shared-directory operations within a single client (average latency: 1 ppn = 0 microseconds, 4 ppn = 6460, while the Lustre unlink RPC stays constant at 250 for 1 and 4 ppn). Measured with a directory per client, to remove the MDT shared-directory ldiskfs lock and the shared-lock ldlm callbacks.

b. Single MDC RPC-in-flight limit (rpc_lock) serializing modifying RPCs; a toy model of this serialization is sketched after this list. This also greatly affects multi-ppn operations (1 ppn = 0, 2 ppn = 180, 4 ppn = 1080, 8 ppn = 3530). Measured with a directory per process to avoid the parent mutex, but sharing the same directories between clients to include the ldiskfs and ldlm effects. While it may be possible to remove this restriction (MRP-59), doing so may be very complex due to ordering issues. (Note that LU-933 has a patch that removes it in an unsafe way, for testing.)

c. Shared-dir MDT ldlm lock. Lock callback time increases slowly with increased ppn (1ppn=130, 8ppn=240), possibly due to increasing client context switching time. Measured as in b above. Not expected to increase with client count. Possibly could be eliminated by having the MDS act as a proxy lock holder for multiple shared-dir clients, but not much gain possible here.

d. Shared-dir ldiskfs lock. Contention for the same lock on the MDT directory, mostly independent of ppn but will increase with client count (subtracting out lock callback latency, 1ppn=90, 8ppn=130, measured as in b with 8 clients). Pdirops (LU-50) in Lustre 2.2.0 should help with this.
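
To make bottleneck (b) concrete, here is a toy model, not Lustre code: N unlink threads on one client all funnel through a single per-mount "modifying RPC" lock, standing in for the MDC rpc_lock, with an assumed 250 us RPC service time. The per-client aggregate rate plateaus at roughly 1/latency (minus scheduling overhead) no matter how many processes per node are added, consistent with extra ppn buying no extra throughput, while each extra client brings its own rpc_lock:

# Toy model of the single-modifying-RPC-in-flight limit on one client.
# Not Lustre source: rpc_lock below merely stands in for the per-mount MDC
# rpc_lock, and RPC_LATENCY is an assumed 250 us unlink RPC service time.
import threading
import time

RPC_LATENCY = 0.00025        # assumed 250 us per unlink RPC
RUNTIME = 2.0                # seconds to run each trial

rpc_lock = threading.Lock()  # one per client mount, shared by all "processes"
count_lock = threading.Lock()
done = [0]

def worker(stop_at):
    while time.monotonic() < stop_at:
        with rpc_lock:               # only one modifying RPC in flight
            time.sleep(RPC_LATENCY)  # the "server" handles the unlink
        with count_lock:
            done[0] += 1

for ppn in (1, 2, 4, 8):
    done[0] = 0
    stop_at = time.monotonic() + RUNTIME
    threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(ppn)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{ppn} ppn: {done[0] / RUNTIME:7.1f} unlinks/s per client")

In the real system the rate does not just plateau but drops with ppn, because the parent i_mutex in (a) and the lock callbacks in (c) add per-operation latency as more local threads contend.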

Cliff, I was wondering if you have any 1 ppn mdtest results on Hyperion for different client counts? I can't find any on Maloo (and indeed the links above seem to have lost their logs as well). Also, can you tell me if the results above included pdirops (LU-50, Lustre 2.2.0)?

Comment by Cory Spitz [ 04/Apr/12 ]

Nathan, I agree with the bottlenecks that you've identified, but I don't think that any of them are regressions. But maybe I'm wrong with respect to (c)? In the description you wrote, "We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers." Shouldn't we first focus on the regression from 1.8.6 to 2.x?

Comment by Nathan Rutman [ 04/Apr/12 ]

Hmm, that's a good point, Cory. It gets a little obfuscated by the number of variations in client count, server type, and storage layout. Our 1.8.6 5-client test didn't show the drop with increasing ppn; it remained constant at 5 kops. Our 2.1 8-client test did show the decrease, but it started from 13 kops and went down to 9 kops. So it's a little hard to call this a clear regression when the 2.1 numbers are all above the 1.8 numbers.

Comment by Cory Spitz [ 04/Apr/12 ]

If the HW was fixed for the comparison then that doesn't sound like a regression, just a major improvement with different dynamics.

Comment by Nathan Rutman [ 11/Apr/12 ]

Cory: yes, but it's a little unclear.
I'm having trouble finding old performance numbers at scale; I was really hoping Cliff had some older Hyperion numbers using 1.8.x.

Comment by Cory Spitz [ 15/Apr/12 ]

We could run some apples-to-apples numbers on Cray gear between 1.8.6 and 2.1.1. What kind of scale do you need?

Comment by Mark Mansk [ 12/Apr/13 ]

Excel sheet with comparisons of 1.8.6 to 2.3 and 2.4 using metabench. It shows a performance drop from 1.8.6 for deletes in the same directory on 2.x.

Comment by Mark Mansk [ 12/Apr/13 ]

Data from metabench runs for 1.8.6, 2.3, and 2.4.

Comment by Cory Spitz [ 08/May/13 ]

This bug should be marked as affecting all 2.x versions.

Comment by Mark Mansk [ 08/May/13 ]

Metabench source attached.

Comment by Cory Spitz [ 10/May/13 ]

It looks as though the comments that were made on 04/Apr/12 were about right. LU-3308 was opened to look into the regression aspect (which is seemingly unrelated to multiple processes per node).
