LU-1167: Poor mdtest unlink performance with multiple processes per node

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1
    • Labels: None
    • Environment: SL6.1, 2.6.32-131.12.1.el6.lustre.20.x86_64

    Description

      We have noticed in testing that running multiple mdtest processes per node severely degrades unlink performance.

      • Lustre is mounted once per client (not multi-mounted).
      • Shared-directory case.

      This can be seen at a wide range of node counts (5-128) and with different backends, to varying degrees. Interestingly, scaling up the client count does not have nearly the same negative impact on performance; only the number of processes per node (ppn) seems to matter.
      nodes    jobs (total)    unlink kops
      8        8               11.4
      8        16              9.7
      8        32              8.6
      8        64              7.0
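
      For scale (a rough back-of-the-envelope, assuming "jobs" is the total process count across the 8 nodes): at jobs=8 (1 ppn), 11.4 kops aggregate works out to roughly 1400 unlinks/s per process (~0.7 ms each), while at jobs=64 (8 ppn), 7 kops aggregate is only ~110 unlinks/s per process (~9 ms each). Per-process unlink latency grows by more than 10x as ppn increases even though the node count stays fixed.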
      

      We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers.

      Attachments

        1. license.txt
          2 kB
        2. metabench.tar
          4.39 MB
        3. metabench-compare.xlsx
          10 kB
        4. metabench-comparison.txt
          24 kB


          Activity

            [LU-1167] Poor mdtest unlink performance with multiple processes per node
            spitzcor Cory Spitz added a comment -

            It looks as though the comments that were made on 04/Apr/12 were about right. LU-3308 was opened to look into the regression aspect (which is seemingly unrelated to multiple processes per node).

            mmansk Mark Mansk added a comment -

            metabench source

            spitzcor Cory Spitz added a comment -

            This bug should be marked as affecting all 2.x versions.

            mmansk Mark Mansk added a comment -

            Data from metabench runs for 1.8.6, 2.3 & 2.4.

            mmansk Mark Mansk added a comment -

            Excel sheet comparing 1.8.6 to 2.3 & 2.4 using metabench; it shows the performance drop from 1.8.6 for deletes in the same directory on 2.x.

            spitzcor Cory Spitz added a comment -

            We could run some apples-to-apples numbers on Cray gear between 1.8.6 and 2.1.1. What kind of scale do you need?


            nrutman Nathan Rutman added a comment -

            Cory: yes, but it's a little unclear. I'm having trouble finding some old performance numbers at scale; I was really hoping Cliff had some older Hyperion numbers using 1.8.x.

            spitzcor Cory Spitz added a comment -

            If the HW was fixed for the comparison then that doesn't sound like a regression, just a major improvement with different dynamics.


            nrutman Nathan Rutman added a comment -

            Hmm, that's a good point, Cory. It gets a little obfuscated by the number of variations in client count, server type, and storage layout. Our 1.8.6 5-client test didn't show the drop with increasing ppn; it stayed constant at 5 kops. Our 2.1 8-client test did show the decrease, but it started from 13 kops and went down to 9 kops. So it's hard to call this a clear regression when the 2.1 numbers are all above the 1.8 numbers.

            spitzcor Cory Spitz added a comment -

            Nathan, I agree with the bottlenecks that you've identified, but I don't think that any of them are regressions. But, maybe I'm wrong wrt c.? In the description you wrote, "We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers." Shouldn't we first focus on the regression from 1.8.6 to 2.x?


            nrutman Nathan Rutman added a comment -

            Continuing to pursue this, there are four identified bottlenecks for unlink performance:

            a. Parent directory mutex in the Linux kernel VFS (taken in do_unlinkat). This greatly affects shared-directory operations within a single client (avg latency: 1 ppn = 0 µs, 4 ppn = 6460 µs, while the Lustre unlink RPC stays at ~250 µs for both 1 and 4 ppn). Measured with a dir-per-client layout, to remove the MDT shared-directory ldiskfs lock and shared-lock ldlm callbacks. (A condensed sketch of this code path follows the list, after item d.)

            b. Single MDC RPC-in-flight lock (rpc_lock) serializing RPCs. This also greatly affects multi-ppn operations (1 ppn = 0 µs, 2 ppn = 180 µs, 4 ppn = 1080 µs, 8 ppn = 3530 µs). Measured with a dir-per-process layout to avoid the parent mutex, but sharing the same dirs between clients to include the ldiskfs and ldlm effects. While it may be possible to remove this restriction (MRP-59), doing so may be very complex due to ordering issues. (Note that LU-933 has a patch to remove it in an unsafe way, for testing.) A sketch of this serialization pattern is at the end of this comment.

            c. Shared-dir MDT ldlm lock. Lock callback time increases slowly with increasing ppn (1 ppn = 130 µs, 8 ppn = 240 µs), possibly due to increasing client context-switch time. Measured as in b above. Not expected to increase with client count. It could possibly be eliminated by having the MDS act as a proxy lock holder for multiple shared-dir clients, but there is not much gain to be had here.

            d. Shared-dir ldiskfs lock. Contention for the same lock on the MDT directory; mostly independent of ppn, but it will increase with client count (subtracting out the lock callback latency: 1 ppn = 90 µs, 8 ppn = 130 µs, measured as in b with 8 clients). Pdirops (LU-50), in Lustre 2.2.0, should help with this.
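
            To make bottleneck (a) concrete, here is a condensed, paraphrased sketch of the 2.6.32-era VFS unlink path (trimmed from fs/namei.c; error handling and the path walk are omitted, so treat it as an illustration rather than an exact excerpt):

            /* Every unlink in a given directory takes the parent inode's
             * i_mutex and holds it across the child lookup and vfs_unlink(),
             * so N processes on one node unlinking in a shared directory are
             * serialized in the VFS before Lustre ever sees the request. */
            static long do_unlinkat(int dfd, const char __user *pathname)
            {
                    struct nameidata nd;
                    struct dentry *dentry;
                    struct inode *dir;
                    int error;

                    /* ... resolve the parent directory into nd ... */

                    dir = nd.path.dentry->d_inode;
                    mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);   /* bottleneck (a) */

                    dentry = lookup_hash(&nd);        /* child lookup under the parent lock */
                    error = vfs_unlink(dir, dentry);  /* Lustre's unlink RPC is issued in here */
                    dput(dentry);

                    mutex_unlock(&dir->i_mutex);
                    /* ... path/name release ... */
                    return error;
            }

            A dir-per-process layout sidesteps this mutex entirely, which is why the measurement in (b) used dir-per-process.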

            Cliff, I was wondering if you have any 1 ppn mdtest results on Hyperion for different client counts? I can't find any on Maloo (and indeed the links above seem to have lost their logs as well). Also, can you tell me whether the results above included pdirops (LU-50, Lustre 2.2.0)?
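
            For bottleneck (b), the serialization comes from the MDC allowing only one modifying metadata RPC in flight per client, guarded by a per-MDC rpc_lock. The sketch below paraphrases the shape of the Lustre 2.x client code (mdc_get_rpc_lock()/mdc_put_rpc_lock() around mdc_reint()); exact names, lock types, and the intent-based exemptions differ between versions, so read it as an illustration, not a verbatim excerpt:

            /* Simplified: one mutex per MDC serializes all modifying
             * metadata RPCs (unlink included).  The real code exempts
             * plain getattr/lookup intents and carries extra state. */
            struct mdc_rpc_lock {
                    struct mutex          rpcl_mutex;
                    struct lookup_intent *rpcl_it;
            };

            static void mdc_get_rpc_lock(struct mdc_rpc_lock *lck,
                                         struct lookup_intent *it)
            {
                    mutex_lock(&lck->rpcl_mutex);
                    lck->rpcl_it = it;
            }

            static void mdc_put_rpc_lock(struct mdc_rpc_lock *lck,
                                         struct lookup_intent *it)
            {
                    lck->rpcl_it = NULL;
                    mutex_unlock(&lck->rpcl_mutex);
            }

            static int mdc_reint(struct ptlrpc_request *request,
                                 struct mdc_rpc_lock *rpc_lock, int level)
            {
                    int rc;

                    request->rq_send_state = level;

                    mdc_get_rpc_lock(rpc_lock, NULL);  /* only one modifying RPC in flight */
                    rc = ptlrpc_queue_wait(request);   /* send and wait for the MDS reply */
                    mdc_put_rpc_lock(rpc_lock, NULL);  /* next queued unlink may proceed */

                    return rc;
            }

            So even with the parent i_mutex out of the picture, N local processes still queue here one RPC at a time; MRP-59 and the LU-933 test patch mentioned in (b) are about relaxing exactly this restriction.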


            People

              Assignee: wc-triage WC Triage
              Reporter: nrutman Nathan Rutman
              Votes: 1
              Watchers: 10
