
LU-13287: DNE2 - Shared directory performance does not scale and starts to plateau beyond 2 MDTs

Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0

    Description

      While testing in an environment with a single parent directory and one shared subdirectory for all client mdtest ranks, we observe very little scaling when moving beyond 2 MDTs. See below for 0K file creates with 1 million objects per MDT:

      1 MDT  -  83,948
      2 MDTs - 115,929
      3 MDTs - 123,186
      4 MDTs - 130,846

      Stats and deletes show similar results. Performance does not scale linearly; instead it plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the bottom): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
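
      For context, the kind of DNE2 striped (shared) test directory described above is typically created and checked with lfs; the mount point and stripe count below are placeholders rather than values taken from this report:

      # Assumption: a Lustre client mount at /mnt/lustre (placeholder path).
      # Stripe the shared test directory across 4 MDTs so that a single
      # shared subdirectory is distributed by DNE2.
      lfs mkdir -c 4 /mnt/lustre/shared_dir

      # Confirm the stripe count and which MDTs hold the directory shards.
      lfs getdirstripe /mnt/lustre/shared_dir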

    Attachments

    Issue Links

    Activity

            [LU-13287] DNE2 - Shared directory performance does not scale and starts to plateau beyond 2 MDTs

            koutoupis Petros Koutoupis added a comment -

            I have also shared shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test corresponding to the numbers posted in the description. Note that in the tarball, mdt0-1total holds the single-MDT run, while the remaining subdirectories in the archive are the per-MDT data from a 4-MDT configuration.

            koutoupis Petros Koutoupis added a comment -

            I have attached some flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for a 2-MDT and then a 4-MDT configuration under load. I can provide any additional data or traces upon request. I have also run the same tests using normal creates and again with mknod, with the same scaling results. Any thoughts or ideas?
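
            Flamegraphs like the attached ones are commonly generated on the MDS with perf together with Brendan Gregg's FlameGraph scripts; as a sketch, the sampling rate, duration, and file names below are assumptions, not the exact procedure used for these attachments:

            # Sample all CPUs on the MDS with call graphs while the mdtest
            # load is running (99 Hz for 60 s are placeholder choices).
            perf record -F 99 -a -g -- sleep 60
            perf script > out.perf

            # FlameGraph scripts from https://github.com/brendangregg/FlameGraph
            ./stackcollapse-perf.pl out.perf > out.folded
            ./flamegraph.pl out.folded > mds-flamegraph.svg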

            koutoupis Petros Koutoupis added a comment -

            @Olaf Faaland,

            I have tested master from about 3 weeks ago and the results are the same. Part of the challenge I am facing here is: how much of this minimal scaling is expected, and how much room do we have to make it better? Earlier presentations posted online show some scaling for mknod tests between 1 and 4 MDTs, but those were run on an older build of Lustre, and single-MDT performance has improved dramatically since then. Today, when I run mknod tests, the scaling results are no different from my creates.
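
            For a creates-versus-mknod comparison in a single shared directory outside of mdtest, the Lustre createmany test utility is one option; a minimal sketch, assuming createmany from lustre-tests is installed and using placeholder paths and counts:

            # Placeholder: /mnt/lustre/shared_dir is a DNE2 striped directory.
            # Create 100,000 files via mknod, then via open/creat, and compare
            # the rates that createmany reports.
            createmany -m /mnt/lustre/shared_dir/mknod_f- 100000
            createmany -o /mnt/lustre/shared_dir/open_f- 100000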
            ofaaland Olaf Faaland added a comment -

            We haven't tested this with recent 2.12 or master, but we also saw cases of poor DNE2 scaling in the past.

            spitzcor Cory Spitz added a comment (edited) -

            > Andreas wrote:
            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            > Petros wrote:
            > The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

            More specifically, the scaling in that chart is for the easy mdtest (shared dir?) and stats. I think the focus of the problem is the scaling/performance of creates in a single shared directory.

            koutoupis Petros Koutoupis added a comment -

            Andreas,

            > Is this with one MDT per MDS, or are all four MDTs on the same MDS?

            One MDT per MDS (each MDT on its own MDS).

            > How many clients are being used for this test?

            It was 60 clients.

            > Does the performance improve when there are additional clients added for the 3/4 MDT test cases?

            We have not added more clients than this.

            > Having the actual test command line included in the problem description would make this report a lot more useful.

            We had four MDTs:

            lfs mkdir -c 4 <remote directory> [-D]

            mdtest -i 3 -p 30 -F -C -E -T -r -n $(( 1048576 / $PROCS * Num_MDTs )) -v -d <remote directory>/OUTDIR

            We aim for 1 million objects per MDT, so for this test with 4 MDTs we created 4 million objects. Again, 60 clients.

            With the same mdtest run with the -u flag (a unique working directory per rank), we see good scaling with 4 MDTs; remove the -u flag so that all ranks operate in a single shared directory, and the lack of scaling appears. We even tried mdtest with and without the -g flag (on the latest mainline builds), with the same behavior.

            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?
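
            Tying the command lines above together, a minimal sketch of this kind of scaling sweep; the mount point, rank count, hostfile, and loop bounds are placeholder assumptions rather than values from this ticket:

            # Assumptions: Lustre mounted at /mnt/lustre, mpirun with a hostfile
            # listing the client nodes, and PROCS MPI ranks in total.
            PROCS=480                       # total mdtest ranks (placeholder)
            for NUM_MDTS in 1 2 3 4; do
                DIR=/mnt/lustre/shared_${NUM_MDTS}mdt
                lfs mkdir -c ${NUM_MDTS} ${DIR}

                # 1 million objects per MDT, split evenly across all ranks,
                # all ranks working in the same shared directory (no -u).
                mpirun -np ${PROCS} --hostfile clients \
                    mdtest -i 3 -p 30 -F -C -E -T -r \
                        -n $(( 1048576 * NUM_MDTS / PROCS )) \
                        -v -d ${DIR}
            done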

            adilger Andreas Dilger added a comment -

            Is this with one MDT per MDS, or are all four MDTs on the same MDS? If all MDTs are on the same MDS, then this is totally expected, as there just isn't enough unused CPU/network on the MDS to double or quadruple the performance on that node.

            How many clients are being used for this test? Does the performance improve when there are additional clients added for the 3/4 MDT test cases? Having the actual test command line included in the problem description would make this report a lot more useful.

            Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test. I suspect in that case they didn't have enough clients to drive the aggregate MDT performance to saturation.
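
            One way to check whether the create load is actually being spread across all MDTs, and whether a single MDS is saturating, is to sample the per-MDT metadata stats on each MDS while the benchmark runs; a sketch with a placeholder fsname (parameter names can vary slightly between Lustre versions):

            # testfs is a placeholder filesystem name; run on each MDS.
            lctl set_param mdt.testfs-MDT*.md_stats=clear
            sleep 60
            lctl get_param mdt.testfs-MDT*.md_stats

            # Watch MDS CPU utilization at the same time to spot saturation.
            top -b -n 1 | head -20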
            spitzcor Cory Spitz added a comment -

            Possibly also reported in, and related to, LU-9436.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Petros Koutoupis (koutoupis)
              Votes: 0
              Watchers: 7
