
LU-13287: DNE2 - Shared directory performance does not scale and starts to plateau beyond 2 MDTs

Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0

    Description

      While testing in an environment with a single parent directory and one shared subdirectory for all client mdtest ranks, we observe very little scaling when moving beyond 2 MDTs. See below for 0K file creates with 1 million objects per MDT:

      1 MDT  -  83,948
      2 MDTs - 115,929
      3 MDTs - 123,186
      4 MDTs - 130,846

      Stats and deletes show similar results. Performance does not scale linearly; instead it plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the bottom): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
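
      For context, the kind of DNE2 striped (shared) test directory described above is typically created and checked with lfs; the mount point and stripe count below are placeholders rather than values taken from this report:

      # Assumption: a Lustre client mount at /mnt/lustre (placeholder path).
      # Stripe the shared test directory across 4 MDTs so that a single
      # shared subdirectory is distributed by DNE2.
      lfs mkdir -c 4 /mnt/lustre/shared_dir

      # Confirm the stripe count and which MDTs hold the directory shards.
      lfs getdirstripe /mnt/lustre/shared_dir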

    Attachments

    Issue Links

    Activity

            [LU-13287] DNE2 - Shared directory performance does not scale and starts to plateau beyond 2 MDTs

            koutoupis Petros Koutoupis added a comment -

            I have also shared shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test corresponding to the numbers posted in the description. Note that in the tarball, mdt0-1total holds the single-MDT run, while the remaining subdirectories in the archive are the per-MDT data from a 4-MDT configuration.

            koutoupis Petros Koutoupis added a comment -

            I have attached some flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for a 2-MDT and then a 4-MDT configuration under load. I can provide any additional data or traces upon request. I have also run the same tests using normal creates and again with mknod, with the same scaling results. Any thoughts or ideas?
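
            Flamegraphs like the attached ones are commonly generated on the MDS with perf together with Brendan Gregg's FlameGraph scripts; as a sketch, the sampling rate, duration, and file names below are assumptions, not the exact procedure used for these attachments:

            # Sample all CPUs on the MDS with call graphs while the mdtest
            # load is running (99 Hz for 60 s are placeholder choices).
            perf record -F 99 -a -g -- sleep 60
            perf script > out.perf

            # FlameGraph scripts from https://github.com/brendangregg/FlameGraph
            ./stackcollapse-perf.pl out.perf > out.folded
            ./flamegraph.pl out.folded > mds-flamegraph.svg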

            koutoupis Petros Koutoupis added a comment -

            @Olaf Faaland,

            I have tested master from about 3 weeks ago and the results are the same. Part of the challenge I am facing here is: how much of this minimal scaling is expected, and how much room do we have to make it better? Earlier presentations posted online show some scaling for mknod tests between 1 and 4 MDTs, but those were run on an older build of Lustre, and single-MDT performance has improved dramatically since then. Today, when I run mknod tests, the scaling results are no different from my creates.
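
            For a creates-versus-mknod comparison in a single shared directory outside of mdtest, the Lustre createmany test utility is one option; a minimal sketch, assuming createmany from lustre-tests is installed and using placeholder paths and counts:

            # Placeholder: /mnt/lustre/shared_dir is a DNE2 striped directory.
            # Create 100,000 files via mknod, then via open/creat, and compare
            # the rates that createmany reports.
            createmany -m /mnt/lustre/shared_dir/mknod_f- 100000
            createmany -o /mnt/lustre/shared_dir/open_f- 100000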
            ofaaland Olaf Faaland added a comment -

            We haven't tested this with recent 2.12 or master, but we also saw cases of poor DNE2 scaling in the past.

            spitzcor Cory Spitz added a comment (edited) -

            > Andreas wrote:
            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            > Petros wrote:
            > The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

            More specifically, the scaling in that chart is for the easy mdtest (shared dir?) and stats. I think the focus of the problem is the scaling/performance of creates in a single shared directory.

            koutoupis Petros Koutoupis added a comment -

            Andreas,

            > Is this with one MDT per MDS, or are all four MDTs on the same MDS?

            One MDT per MDS (each MDT on its own MDS).

            > How many clients are being used for this test?

            It was 60 clients.

            > Does the performance improve when there are additional clients added for the 3/4 MDT test cases?

            We have not added more clients than this.

            > Having the actual test command line included in the problem description would make this report a lot more useful.

            We had four MDTs:

            lfs mkdir -c 4 <remote directory> [-D]

            mdtest -i 3 -p 30 -F -C -E -T -r -n $(( 1048576 / $PROCS * Num_MDTs )) -v -d <remote directory>/OUTDIR

            We aim for 1 million objects per MDT, so for this test with 4 MDTs we created 4 million objects. Again, 60 clients.

            With the same mdtest run with the -u flag (a unique working directory per rank), we see good scaling with 4 MDTs; remove the -u flag so that all ranks operate in a single shared directory, and the lack of scaling appears. We even tried mdtest with and without the -g flag (on the latest mainline builds), with the same behavior.

            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?
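
            Tying the command lines above together, a minimal sketch of this kind of scaling sweep; the mount point, rank count, hostfile, and loop bounds are placeholder assumptions rather than values from this ticket:

            # Assumptions: Lustre mounted at /mnt/lustre, mpirun with a hostfile
            # listing the client nodes, and PROCS MPI ranks in total.
            PROCS=480                       # total mdtest ranks (placeholder)
            for NUM_MDTS in 1 2 3 4; do
                DIR=/mnt/lustre/shared_${NUM_MDTS}mdt
                lfs mkdir -c ${NUM_MDTS} ${DIR}

                # 1 million objects per MDT, split evenly across all ranks,
                # all ranks working in the same shared directory (no -u).
                mpirun -np ${PROCS} --hostfile clients \
                    mdtest -i 3 -p 30 -F -C -E -T -r \
                        -n $(( 1048576 * NUM_MDTS / PROCS )) \
                        -v -d ${DIR}
            done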

            adilger Andreas Dilger added a comment -

            Is this with one MDT per MDS, or are all four MDTs on the same MDS? If all MDTs are on the same MDS, then this is totally expected, as there just isn't enough unused CPU/network on the MDS to double or quadruple the performance on that node.

            How many clients are being used for this test? Does the performance improve when there are additional clients added for the 3/4 MDT test cases? Having the actual test command line included in the problem description would make this report a lot more useful.

            Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test. I suspect in that case they didn't have enough clients to drive the aggregate MDT performance to saturation.
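
            One way to check whether the create load is actually being spread across all MDTs, and whether a single MDS is saturating, is to sample the per-MDT metadata stats on each MDS while the benchmark runs; a sketch with a placeholder fsname (parameter names can vary slightly between Lustre versions):

            # testfs is a placeholder filesystem name; run on each MDS.
            lctl set_param mdt.testfs-MDT*.md_stats=clear
            sleep 60
            lctl get_param mdt.testfs-MDT*.md_stats

            # Watch MDS CPU utilization at the same time to spot saturation.
            top -b -n 1 | head -20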
            spitzcor Cory Spitz added a comment -

            Possibly also reported in, and related to, LU-9436.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Petros Koutoupis (koutoupis)
              Votes: 0
              Watchers: 7
