Lustre / LU-13287

DNE2 - Shared directory performance does not scale and starts to plateau beyond 2MDTs

Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0

    Description

      While testing in an environment with a single parent directory followed by one shared subdirectory for all client mdtest ranks, we observe very little scaling when moving beyond 2 MDTs. See below for 1 million objects per MDT, 0K file creates:

      1 MDT  -  83,948
      2 MDTs - 115,929
      3 MDTs - 123,186
      4 MDTs - 130,846
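The plateau is easier to see as per-MDT efficiency. A quick calculation from the rates above (assuming they are creates per second) shows efficiency falling below 40% at 4 MDTs:

```shell
# Speedup and per-MDT efficiency relative to the 1-MDT baseline,
# computed from the create rates reported above.
awk 'BEGIN {
    n[1] = 83948; n[2] = 115929; n[3] = 123186; n[4] = 130846
    for (m = 1; m <= 4; m++)
        printf "%d MDT(s): speedup %.2fx, per-MDT efficiency %.0f%%\n",
               m, n[m] / n[1], 100 * n[m] / n[1] / m
}'
```

At 4 MDTs the speedup is only about 1.56x, i.e. roughly 39% per-MDT efficiency, which matches the plateau described above.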

      Stats and deletes show similar results. Performance does not follow a linear scale but instead plateaus. It would also seem that we are not the only ones to observe this: a recent Cambridge University IO-500 presentation includes a slide with very similar results (fourth from the bottom): https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
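For context, a setup along these lines would create the shared parent as a DNE2 striped directory and drive all mdtest ranks into one shared subdirectory. This is a hypothetical sketch only; the mount point, MDT count, rank count, and per-rank item count are illustrative assumptions, not the original test parameters:

```shell
# Hypothetical sketch - mount point, MDT/rank counts, and item
# counts are placeholders, not the original test's values.
MNT=/mnt/lustre

# Stripe the parent directory across 4 MDTs (DNE2 striped directory).
lfs setdirstripe -c 4 "$MNT/parent"
mkdir "$MNT/parent/shared"

# 256 ranks x 15625 files each = 4M 0-byte files (~1M per MDT), all
# in the one shared directory (mdtest shares the working directory
# across ranks by default; -u would instead give each rank a private
# subdirectory).
mpirun -np 256 mdtest -F -C -T -r -n 15625 -d "$MNT/parent/shared"
```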

Attachments

Issue Links

Activity


            adilger Andreas Dilger added a comment - Reopen to change resolution.

            koutoupis Petros Koutoupis added a comment - It would seem that, with enough client load, we are able to drive proper DNE2 single shared directory scaling.
            pjones Peter Jones added a comment - Sorry about that - two similarly named groups led me into an error. Please have another go - I think I got it right this time.

            koutoupis Petros Koutoupis added a comment - @Peter Jones Still cannot close the ticket. It is not even an option.
            pjones Peter Jones added a comment - koutoupis try again

            koutoupis Petros Koutoupis added a comment - @Andreas Dilger, it seems that I do not have the proper rights to close this ticket. Please advise.

            koutoupis Petros Koutoupis added a comment - Added the tarball archive_smaller_inodes-tests.tar.gz and an accompanying PowerPoint archive_smaller_inodes-tests.pptx, which highlight DNE2 single shared directory scaling on the large Moon cluster at LANL. We were able to drive load from 512 clients, starting from a single server and doubling the MDT count at each iteration until we reached 32 MDTs. With enough clients, there was a reasonable amount of scaling, and this issue becomes much less of a concern. I will close this ticket unless there are objections.

            koutoupis Petros Koutoupis added a comment - I also shared shared-directory_mdt-perf.tar.gz, which contains the flamegraphs from the original test corresponding to the numbers posted in the description. Note that in the tarball, mdt0-1total holds the single-MDT results, while the remaining subdirectories each hold one MDT's results from a 4-MDT configuration.

            koutoupis Petros Koutoupis added a comment - I have attached flamegraphs and perf reports (MDT-DNE2-shareddir_flamegraphs.zip) for 2-MDT and 4-MDT configurations under load, and I can provide any further data or traces upon request. I have also run the same tests using normal creates and again with mknod, with the same scaling results. Any thoughts or ideas?

            koutoupis Petros Koutoupis added a comment - @Olaf Faaland, I have tested master from about 3 weeks ago and the results are the same. Part of the challenge I am facing here is: how much of this minimal scaling is expected, and how much room do we have to improve it? Earlier presentations posted online show some scaling between 1 and 4 MDTs for mknod tests, but those were running an older build of Lustre, and our single-MDT performance has improved dramatically since then. Today, when I run mknod tests, the scaling results are no different from my creates.
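On the create-versus-mknod comparison above: one way to isolate the two creation paths on a Lustre filesystem is with createmany from the Lustre test suite. This is an illustrative sketch only; the directory path and file count are placeholders:

```shell
# Illustrative sketch - createmany ships with the Lustre test suite;
# the path and file count here are placeholders.
DIR=/mnt/lustre/parent/shared

# Create files via open(O_CREAT) - the "normal create" path ...
createmany -o "$DIR/create-f" 100000

# ... versus via mknod(2), which creates the file without the
# open/close that the O_CREAT path performs.
createmany -m "$DIR/mknod-f" 100000
```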

            People

              Assignee: wc-triage WC Triage
              Reporter: koutoupis Petros Koutoupis
              Votes: 0
              Watchers: 7
