
DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs

Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0

    Description

      While testing in an environment with a single parent directory followed by one shared subdirectory for all client mdtest ranks, we observe very little scaling when moving to more than 2 MDTs. See below for 1 million objects per MDT, 0K file creates:

      1 MDT  - 83,948
      2 MDTs - 115,929
      3 MDTs - 123,186
      4 MDTs - 130,846

      Stats and deletes show similar results. Performance does not follow a linear scale but instead plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the bottom): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
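      For reference, a minimal sketch of how this kind of shared-directory create test can be driven against an N-way striped directory. The mountpoint, rank count, item count, and mdtest flags below are illustrative assumptions, not the exact values behind the numbers above:

#!/usr/bin/env python3
# Sketch: create an N-way DNE2 striped directory and run an mdtest
# shared-directory pass (0-byte file creates, stats, removes) against it.
# Paths, rank counts, and item counts are placeholders, not the values
# used in this ticket.
import subprocess

LUSTRE_MOUNT = "/mnt/lustre"   # assumed client mountpoint
MDT_COUNT = 4                  # MDTs to stripe the shared directory over
RANKS = 512                    # total mdtest ranks across all clients
ITEMS_PER_RANK = 4096          # files created per rank

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

testdir = f"{LUSTRE_MOUNT}/shared-{MDT_COUNT}mdt"

# Stripe the new test directory across MDT_COUNT MDTs starting at MDT0000.
run(["lfs", "setdirstripe", "-c", str(MDT_COUNT), "-i", "0", testdir])
run(["lfs", "getdirstripe", testdir])   # confirm the layout

# No -u flag, so all ranks operate on files in the one shared directory.
run(["mpirun", "-np", str(RANKS),
     "mdtest", "-F", "-C", "-T", "-r",
     "-n", str(ITEMS_PER_RANK),
     "-d", testdir])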


          Activity

            [LU-13287] DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs

            koutoupis Petros Koutoupis added a comment -

            It would seem that with enough client load we are able to drive proper DNE2 single shared directory scaling.

            pjones Peter Jones added a comment -

            Sorry about that - two similarly named groups lured me into an error. Please have another go - I think I got it this time


            koutoupis Petros Koutoupis added a comment -

            @Peter Jones

            Still cannot close the ticket. It is not even an option.

            pjones Peter Jones added a comment -

            koutoupis try again


            koutoupis Petros Koutoupis added a comment -

            @Andreas Dilger,

            It seems that I do not have the proper rights to close this ticket. Please advise.


            koutoupis Petros Koutoupis added a comment -

            Added the tarball archive_smaller_inodes-tests.tar.gz and an accompanying PowerPoint archive_smaller_inodes-tests.pptx, which highlight DNE2 single shared directory scaling on the large Moon cluster at LANL. We were able to drive load from 512 clients and, starting from a single server, double the MDT count at each iteration until we reached 32 MDTs. With enough clients it seems there was a reasonable amount of scaling, and this issue becomes much less of a concern. I will close this ticket unless there are objections to my doing so.

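            For reference, the doubling sweep described in the comment above could be scripted along these lines. This is only a sketch; the mountpoint, fixed rank count, and per-rank item count are assumptions, and each iteration points at a freshly striped directory:

#!/usr/bin/env python3
# Sketch of a 1 -> 32 MDT doubling sweep at a fixed client/rank count.
# Mountpoint, rank count, and item count are illustrative assumptions.
import subprocess

LUSTRE_MOUNT = "/mnt/lustre"
RANKS = 512                    # held constant while the MDT count doubles

for mdt_count in (1, 2, 4, 8, 16, 32):
    testdir = f"{LUSTRE_MOUNT}/shared-{mdt_count}mdt"
    # Fresh directory striped across mdt_count MDTs for this iteration.
    subprocess.run(["lfs", "setdirstripe", "-c", str(mdt_count), "-i", "0",
                    testdir], check=True)
    # Shared-directory 0-byte file creates from all ranks.
    subprocess.run(["mpirun", "-np", str(RANKS),
                    "mdtest", "-F", "-C", "-n", "4096", "-d", testdir],
                   check=True)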

            koutoupis Petros Koutoupis added a comment -

            I also shared shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test that correlate to the numbers posted in the description above. Note that in the tarball, mdt0-1total covers the single MDT testing, while the remaining subdirectories in the archive correspond to each MDT in a 4 MDT configuration.


            koutoupis Petros Koutoupis added a comment -

            I have attached some flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for a 2 MDT and then a 4 MDT configuration under load. I am able to provide any additional data or traces upon request. I have also run the same tests using normal creates and then again with mknod, with the same scaling results. Any thoughts or ideas?

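            For anyone wanting to collect comparable profiles, a sketch of MDS-side flamegraph capture while the mdtest load is running. It assumes perf is installed on the MDS and that Brendan Gregg's FlameGraph scripts are cloned at the path shown; both are assumptions, not a description of how the attached data was gathered:

#!/usr/bin/env python3
# Sketch: sample MDS CPU call stacks during the mdtest run and fold them
# into a flamegraph. FlameGraph path and sample duration are assumptions.
import subprocess

FLAMEGRAPH_DIR = "/root/FlameGraph"   # assumed clone of brendangregg/FlameGraph
DURATION_SECS = "60"                  # sample window while mdtest is mid-run

# System-wide sampled call stacks at 99 Hz.
subprocess.run(["perf", "record", "-F", "99", "-a", "-g", "--",
                "sleep", DURATION_SECS], check=True)

# Fold the stacks and render the SVG.
with open("out.stacks", "w") as stacks:
    script = subprocess.Popen(["perf", "script"], stdout=subprocess.PIPE)
    subprocess.run([f"{FLAMEGRAPH_DIR}/stackcollapse-perf.pl"],
                   stdin=script.stdout, stdout=stacks, check=True)
    script.wait()

with open("mds-flamegraph.svg", "w") as svg:
    subprocess.run([f"{FLAMEGRAPH_DIR}/flamegraph.pl", "out.stacks"],
                   stdout=svg, check=True)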

            koutoupis Petros Koutoupis added a comment -

            @Olaf Faaland,

            I tested master from roughly three weeks ago and the results are the same. Part of the challenge I am facing here is determining how much of this minimal scaling is expected and how much room we have to make it better. Earlier presentations posted online show some scaling between 1 and 4 MDTs for mknod tests, but those were running an older build of Lustre, and our single MDT performance has improved dramatically since then. Today, when I run mknod tests, the scaling results are no different from my creates.

            ofaaland Olaf Faaland added a comment -

            We haven't tested this with recent 2.12 or master, but we also saw cases of poor DNE2 scaling in the past.

            spitzcor Cory Spitz added a comment - edited

            > Andreas wrote:

            Looking at the referenced slide from the Cambridge presentation (attached), it actually shows almost linear scaling for additional MDTs (one per MDS) up to 48, excluding the 48-MDT stat test.

            > Petros wrote:

            The scaling in the presentation is very minimal as it was in some of our older tests with larger MDT/client counts (up to 512 clients). Is this to be expected?

            More specifically, the scaling in the chart is about the easy mdtest (shared dir?) and stats. I think the focus of the problem is the scaling/performance of create in a single shared directory.


            People

              Assignee: wc-triage WC Triage
              Reporter: koutoupis Petros Koutoupis
              Votes: 0
              Watchers: 7
