[LU-13287] DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs Created: 21/Feb/20  Updated: 20/May/20  Resolved: 20/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Petros Koutoupis Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: dne2, llnl, performance

Attachments: Zip Archive MDT-DNE2-shareddir_flamegraphs.zip     PNG File Screen Shot 2020-02-21 at 16.11.52.png     Microsoft PowerPoint archive_smaller_inodes-tests.pptx     File archive_smaller_inodes-tests.tar.gz     File shared-directory_mdt-perf.tar.gz    
Issue Links:
Related
is related to LU-9436 DNE2 - performance improvement with w... Open

 Description   

While testing in an environment with a single parent directory followed by one shared subdirectory for all client mdtest ranks, we observe very little scaling when moving beyond 2 MDTs. See below for 1 million objects per MDT, 0K file creates:

1 MDTs - 83,948
2 MDTs - 115,929
3 MDTs - 123,186
4 MDTs - 130,846

Stats and deletes show similar results. Performance does not follow a linear scale but instead plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the bottom): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
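For reference, a minimal sketch of how a shared striped directory of this kind is typically created and verified from a client (the path and stripe count here are illustrative, not the exact ones used in this test):

# create a parent directory striped across 4 MDTs (illustrative path)
lfs mkdir -c 4 /mnt/lustre/shared_parent

# confirm the directory stripe layout
lfs getdirstripe /mnt/lustre/shared_parent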



 Comments   
Comment by Cory Spitz [ 21/Feb/20 ]

Possibly also reported and related to LU-9436.

Comment by Andreas Dilger [ 21/Feb/20 ]

Is this with one MDT per MDS, or are all four MDTs on the same MDS? If all MDTs are on the same MDS, then this is totally expected, as there just isn't enough unused CPU/network on the MDS to double or quadruple the performance on that node.
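(A hedged way to confirm the MDT-to-MDS mapping, assuming a typical 2.x setup: on a client, each MDC device reports the UUID of the MDS it is connected to, and on each MDS the local device list shows which MDTs it hosts.)

# on a client: MDS server UUID per MDT connection
lctl get_param mdc.*.mds_server_uuid

# on each MDS: list local MDT devices
lctl dl | grep -i mdt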

How many clients are being used for this test? Does the performance improve when there are additional clients added for the 3/4 MDT test cases? Having the actual test command line included in the problem description would make this report a lot more useful.

Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test. I suspect in that case they didn't have enough clients to drive the aggregate MDT performance to saturation.

Comment by Petros Koutoupis [ 24/Feb/20 ]

Andreas,

 

> Is this with one MDT per MDS, or are all four MDTs on the same MDS? 

One MDT per MDS.

 

> How many clients are being used for this test?

It was 60 clients.

 

> Does the performance improve when there are additional clients added for the 3/4 MDT test cases?

We have not added more clients than this.

 

> Having the actual test command line included in the problem description would make this report a lot more useful. 

We had four MDTs:

lfs mkdir -c 4 <remote directory> [-D]

mdtest -i 3 -p 30 -F -C -E -T -r -n $(( 1048576 * Num_MDTs / PROCS )) -v -d <remote directory>/OUTDIR

 

We create 1 million objects per MDT, so for this test with 4 MDTs we created 4 million objects total. Again, 60 clients.

With the same mdtest command plus the -u flag, we see good scaling with 4 MDTs. Removing the -u flag, so that ranks do not use a unique directory each and instead operate in a single shared directory, the lack of scaling is present. We also tried mdtest with and without the -g flag [on the latest mainline builds]; same behavior. The two variants are sketched below.
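(A hedged sketch of the two variants, reusing the flags and placeholders from the command quoted above; -n value abbreviated as <files-per-proc>:)

# unique working directory per rank (-u): scales well across 4 MDTs
mdtest -i 3 -p 30 -F -C -E -T -r -u -n <files-per-proc> -v -d <remote directory>/OUTDIR

# shared directory (no -u): all ranks create in one directory, scaling plateaus
mdtest -i 3 -p 30 -F -C -E -T -r -n <files-per-proc> -v -d <remote directory>/OUTDIR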

 

> Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test. 

The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

Comment by Cory Spitz [ 24/Feb/20 ]

>> Andreas wrote:

Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

> Petros wrote:

The scaling in the presentation is very minimal as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

More specifically, the scaling in the chart is for the mdtest easy phase (shared dir?) and stats. I think the focus of this problem is the scaling/performance of creates in a single shared directory.

Comment by Olaf Faaland [ 25/Feb/20 ]

We haven't tested this with recent 2.12 or master, but we also saw cases of poor DNE2 scaling in the past.

Comment by Petros Koutoupis [ 28/Feb/20 ]

@Olaf Faaland,

I have tested master from about 3 weeks ago and the results are the same. Part of the challenge I am facing here is determining how much of this minimal scaling is expected and how much room we have to improve it. Earlier presentations posted online show some scaling for mknod tests between 1 and 4 MDTs, but those were run on an older build of Lustre, and single-MDT performance has improved dramatically since then. Today, when I run mknod tests, the scaling results are no different from my creates.

Comment by Petros Koutoupis [ 09/Mar/20 ]

I have attached some flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for 2-MDT and 4-MDT configurations under load. I can provide additional data or traces upon request. I have also run the same tests using normal creates and again with mknod, with the same scaling results. Any thoughts, ideas, etc.?
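(For anyone reproducing these traces, a minimal sketch of how such flamegraphs are commonly captured on an MDS, assuming perf and Brendan Gregg's FlameGraph scripts are available; the exact options used for the attached graphs were not recorded in this ticket:)

# sample all CPUs on the MDS for 60 seconds while mdtest runs
perf record -F 99 -a -g -- sleep 60

# fold the stacks and render an SVG flamegraph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > mds-create.svg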

Comment by Petros Koutoupis [ 09/Mar/20 ]

I also attached shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test corresponding to the numbers posted in the description. Note that in the tarball, mdt0-1total holds the single-MDT results, while the remaining subdirectories each correspond to one MDT in a 4-MDT configuration.

Comment by Petros Koutoupis [ 19/May/20 ]

Added the tarball archive_smaller_inodes-tests.tar.gz and an accompanying PowerPoint archive_smaller_inodes-tests.pptx, which highlight DNE2 single-shared-directory scaling using the large Moon cluster over at LANL. We were able to drive load from 512 clients and, starting from a single server, doubled the MDT count at each iteration until we reached 32 MDTs. With enough clients there is a reasonable amount of scaling, and this issue becomes much less of a concern. I will close this ticket unless there are objections.
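(A hedged sketch of that doubling sweep; paths are illustrative, PROCS stands for the total mdtest rank count across the 512 clients as in the command quoted earlier, and the actual job scripts are in the attached tarball:)

# for each MDT count, create a striped parent and run the shared-directory workload
for MDTS in 1 2 4 8 16 32; do
    lfs mkdir -c $MDTS /mnt/lustre/run_${MDTS}mdt
    mdtest -i 3 -F -C -E -T -r -n $(( MDTS * 1048576 / PROCS )) -v -d /mnt/lustre/run_${MDTS}mdt
done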

Comment by Petros Koutoupis [ 19/May/20 ]

@Andreas Dilger,

It seems that I do not have the proper rights to close this ticket. Please advise.

Comment by Peter Jones [ 19/May/20 ]

koutoupis try again

Comment by Petros Koutoupis [ 19/May/20 ]

@Peter Jones

Still cannot close the ticket. It is not even an option.

Comment by Peter Jones [ 19/May/20 ]

Sorry about that - two similarly named groups lured me into an error. Please have another go - I think I got it this time

Comment by Petros Koutoupis [ 19/May/20 ]

It would seem that with enough client load we are able to drive proper DNE2 single-shared-directory scaling.

Comment by Andreas Dilger [ 19/May/20 ]

Reopen to change resolution.
