DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs

Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0

    Description

      While testing in an environment with a single parent directory followed by one shared subdirectory used by all client mdtest ranks, we observe very little scaling when moving beyond 2 MDTs. See below for 0K file creates at 1 million objects per MDT:

      1 MDT  - 83,948
      2 MDTs - 115,929
      3 MDTs - 123,186
      4 MDTs - 130,846

      Stats and deletes show similar results. Performance does not scale linearly but instead plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the end): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
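
      For reference, a minimal sketch of how a directory striped across the MDTs can be created and verified before running mdtest (the mount point and directory names below are placeholders, not the actual test paths):

      lfs mkdir -c 4 /mnt/lustre/shared_dir     # stripe the new directory across 4 MDTs
      lfs getdirstripe /mnt/lustre/shared_dir   # confirm the stripe count and MDT indices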

      Attachments

        Issue Links

          Activity

            [LU-13287] DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs
            pjones Peter Jones added a comment -

            Sorry about that - two similarly named groups lured me into an error. Please have another go - I think I got it this time


            koutoupis Petros Koutoupis added a comment -

            @Peter Jones

            Still cannot close the ticket. It is not even an option.

            pjones Peter Jones added a comment -

            koutoupis try again


            koutoupis Petros Koutoupis added a comment -

            @Andreas Dilger,

            It seems that I do not have the proper rights to close this ticket. Please advise.


            koutoupis Petros Koutoupis added a comment -

            Added the tarball archive_smaller_inodes-tests.tar.gz and an accompanying PowerPoint archive_smaller_inodes-tests.pptx, which highlight DNE2 single-shared-directory scaling on the large Moon cluster at LANL. We were able to drive load from 512 clients and, starting from a single server, double the MDT count at each iteration until we reached 32 MDTs. With enough clients there was a reasonable amount of scaling, and this issue becomes much less of a concern. I will close this ticket unless there are objections to my doing so.


            koutoupis Petros Koutoupis added a comment -

            I also shared shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test corresponding to the numbers posted in the description above. Note that in the tarball, mdt0-1total covers the single-MDT testing, while the remaining subdirectories in the archive cover each MDT of a 4-MDT configuration.


            koutoupis Petros Koutoupis added a comment -

            I have attached flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for 2-MDT and 4-MDT configurations under load, and I can provide any additional data or traces upon request. I have also run the same tests using normal creates and again with mknod, with the same scaling results. Any thoughts or ideas?
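
            For reference, a rough sketch of how flamegraphs like these can be captured on an MDS during the mdtest load (the FlameGraph script locations are an assumption; adjust the paths to wherever stackcollapse-perf.pl and flamegraph.pl are installed):

            perf record -F 99 -a -g -- sleep 60                                  # sample all CPUs with call graphs for 60s of load
            perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > mds.svg    # fold the stacks and render the flamegraph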


            koutoupis Petros Koutoupis added a comment -

            @Olaf Faaland,

            I have tested master from about three weeks ago and the results are the same. Part of the challenge here is knowing how much of this minimal scaling is expected and how much room we have to make it better. Earlier presentations posted online show some scaling between 1 and 4 MDTs for mknod tests, but those were running an older build of Lustre, and since then our single-MDT performance has improved dramatically. Today, when I run mknod tests, the scaling results are no different from my creates.

            ofaaland Olaf Faaland added a comment -

            We haven't tested this with recent 2.12 or master, but we also saw cases of poor DNE2 scaling in the past.

            spitzcor Cory Spitz added a comment - - edited

            > Andreas wrote:
            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            > Petros wrote:
            > The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

            More specifically, the scaling in that chart is for the easy mdtest (shared dir?) and stats. I think the focus of this problem is the scaling/performance of creates in a single shared directory.


            koutoupis Petros Koutoupis added a comment -

            Andreas,

             

            > Is this with one MDT per MDS, or are all four MDTs on the same MDS? 

            One MDT per MDS (each MDT on its own server).

             

            > How many clients are being used for this test?

            It was 60 clients.

             

            > Does the performance improve when there are additional clients added for the 3/4 MDT test cases?

            We have not added more clients than this.

             

            > Having the actual test command line included in the problem description would make this report a lot more useful. 

            We had four MDTs:

            lfs mkdir -c 4 <remote directory> [-D]

            mdtest -i 3 -p 30 -F -C -E -T -r -n $(( 1048576 / $PROCS * Num_MDTs )) -v -d <remote directory>/OUTDIR

             

            We create 1 million objects per MDT, so for this test with 4 MDTs we created 4 million objects, again with 60 clients.

            With the same mdtest run using the -u flag (a unique working directory per rank), we see good scaling with 4 MDTs; when we remove the -u flag so that all ranks operate in a single shared directory, the lack of scaling returns. We also tried mdtest with and without the -g flag [on the latest mainline builds], with the same behavior.
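
            As a rough illustration of that comparison (the launcher, paths, and variable values are placeholders, not the exact job script), the two runs differ only in the -u flag:

            # shared directory: every rank creates in the same striped directory (poor scaling)
            mpirun -np $PROCS mdtest -i 3 -F -C -E -T -r -n $(( 1048576 / $PROCS * 4 )) -d <remote directory>/OUTDIR
            # unique working directory per rank: add -u (scales well with 4 MDTs)
            mpirun -np $PROCS mdtest -i 3 -F -C -E -T -r -n $(( 1048576 / $PROCS * 4 )) -u -d <remote directory>/OUTDIR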

             

            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test. 

            The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?


            People

              Assignee: wc-triage WC Triage
              Reporter: koutoupis Petros Koutoupis
              Votes: 0
              Watchers: 7
