DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs

Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0

    Description

      While testing in an environment with a single parent directory followed by one shared subdirectory used by all client mdtest ranks, we observe very little scaling when moving beyond 2 MDTs. See below for 0K file creates at 1 million objects per MDT:

      1 MDT  - 83,948
      2 MDTs - 115,929
      3 MDTs - 123,186
      4 MDTs - 130,846

      Stats and deletes show similar results. Performance does not scale linearly but instead plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the end): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
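
      For reference, a minimal sketch of how a directory striped across the MDTs can be created and verified before running mdtest (the mount point and directory names below are placeholders, not the actual test paths):

      lfs mkdir -c 4 /mnt/lustre/shared_dir     # stripe the new directory across 4 MDTs
      lfs getdirstripe /mnt/lustre/shared_dir   # confirm the stripe count and MDT indices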

      Attachments

        Issue Links

          Activity

            [LU-13287] DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs
            pjones Peter Jones added a comment -

            Sorry about that - two similarly named groups lured me into an error. Please have another go - I think I got it this time


            koutoupis Petros Koutoupis added a comment -

            @Peter Jones

            Still cannot close the ticket. It is not even an option.

            pjones Peter Jones added a comment -

            koutoupis try again


            koutoupis Petros Koutoupis added a comment -

            @Andreas Dilger,

            It seems that I do not have the proper rights to close this ticket. Please advise.


            koutoupis Petros Koutoupis added a comment -

            Added the tarball archive_smaller_inodes-tests.tar.gz and an accompanying PowerPoint archive_smaller_inodes-tests.pptx, which highlight DNE2 single-shared-directory scaling on the large Moon cluster at LANL. We were able to drive load from 512 clients and, starting from a single server, double the MDT count at each iteration until we reached 32 MDTs. With enough clients there was a reasonable amount of scaling, and this issue becomes much less of a concern. I will close this ticket unless there are objections to my doing so.


            koutoupis Petros Koutoupis added a comment -

            I also shared shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test corresponding to the numbers posted in the description above. Note that in the tarball, mdt0-1total covers the single-MDT testing, while the remaining subdirectories in the archive cover each MDT of a 4-MDT configuration.


            koutoupis Petros Koutoupis added a comment -

            I have attached flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for 2-MDT and 4-MDT configurations under load, and I can provide any additional data or traces upon request. I have also run the same tests using normal creates and again with mknod, with the same scaling results. Any thoughts or ideas?
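
            For reference, a rough sketch of how flamegraphs like these can be captured on an MDS during the mdtest load (the FlameGraph script locations are an assumption; adjust the paths to wherever stackcollapse-perf.pl and flamegraph.pl are installed):

            perf record -F 99 -a -g -- sleep 60                                  # sample all CPUs with call graphs for 60s of load
            perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > mds.svg    # fold the stacks and render the flamegraph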


            koutoupis Petros Koutoupis added a comment -

            @Olaf Faaland,

            I have tested master from about three weeks ago and the results are the same. Part of the challenge here is knowing how much of this minimal scaling is expected and how much room we have to make it better. Earlier presentations posted online show some scaling between 1 and 4 MDTs for mknod tests, but those were running an older build of Lustre, and since then our single-MDT performance has improved dramatically. Today, when I run mknod tests, the scaling results are no different from my creates.

            ofaaland Olaf Faaland added a comment -

            We haven't tested this with recent 2.12 or master, but we also saw cases of poor DNE2 scaling in the past.

            spitzcor Cory Spitz added a comment - - edited

            > Andreas wrote:
            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            > Petros wrote:
            > The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

            More specifically, the scaling in that chart is for the easy mdtest (shared dir?) and stats. I think the focus of this problem is the scaling/performance of creates in a single shared directory.


            koutoupis Petros Koutoupis added a comment -

            Andreas,

             

            > Is this with one MDT per MDS, or are all four MDTs on the same MDS? 

            One MDT per MDS (each MDT on its own server).

             

            > How many clients are being used for this test?

            It was 60 clients.

             

            > Does the performance improve when there are additional clients added for the 3/4 MDT test cases?

            We have not added more clients than this.

             

            > Having the actual test command line included in the problem description would make this report a lot more useful. 

            We had four MDTs:

            lfs mkdir -c 4 <remote directory> [-D]

            mdtest -i 3 -p 30 -F -C -E -T -r -n $(( 1048576 / $PROCS * Num_MDTs )) -v -d <remote directory>/OUTDIR

             

            We create 1 million objects per MDT, so for this test with 4 MDTs we created 4 million objects, again with 60 clients.

            With the same mdtest run using the -u flag (a unique working directory per rank), we see good scaling with 4 MDTs; when we remove the -u flag so that all ranks operate in a single shared directory, the lack of scaling returns. We also tried mdtest with and without the -g flag [on the latest mainline builds], with the same behavior.
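
            As a rough illustration of that comparison (the launcher, paths, and variable values are placeholders, not the exact job script), the two runs differ only in the -u flag:

            # shared directory: every rank creates in the same striped directory (poor scaling)
            mpirun -np $PROCS mdtest -i 3 -F -C -E -T -r -n $(( 1048576 / $PROCS * 4 )) -d <remote directory>/OUTDIR
            # unique working directory per rank: add -u (scales well with 4 MDTs)
            mpirun -np $PROCS mdtest -i 3 -F -C -E -T -r -n $(( 1048576 / $PROCS * 4 )) -u -d <remote directory>/OUTDIR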

             

            > Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test. 

            The scaling in the presentation is very minimal, as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?


            People

              Assignee: wc-triage WC Triage
              Reporter: koutoupis Petros Koutoupis
              Votes: 0
              Watchers: 7
