LU-1167: Poor mdtest unlink performance with multiple processes per node

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1
    • Labels: None
    • Environment: SL6.1, 2.6.32-131.12.1.el6.lustre.20.x86_64

    Description

      We have noticed in testing that running multiple mdtest processes per node severely degrades unlink performance.

      • Lustre is mounted once per client (not multi-mounted).
      • Shared-directory case.

      This can be seen at a wide range of node counts (5-128) and with different backends, to varying degrees. Interestingly, scaling up the client count does not have nearly the same negative impact on performance; only the number of processes per node (ppn) seems to matter.
      nodes    jobs (total)    unlink kops
      8        8               11.4
      8        16              9.7
      8        32              8.6
      8        64              7.0
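
      For scale (a rough back-of-the-envelope, assuming "jobs" is the total process count across the 8 nodes): at jobs=8 (1 ppn), 11.4 kops aggregate works out to roughly 1400 unlinks/s per process (~0.7 ms each), while at jobs=64 (8 ppn), 7 kops aggregate is only ~110 unlinks/s per process (~9 ms each). Per-process unlink latency grows by more than 10x as ppn increases even though the node count stays fixed.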
      

      We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers.

      Attachments

        1. license.txt
          2 kB
        2. metabench.tar
          4.39 MB
        3. metabench-compare.xlsx
          10 kB
        4. metabench-comparison.txt
          24 kB


          Activity

            [LU-1167] Poor mdtest unlink performance with multiple processes per node
            spitzcor Cory Spitz added a comment -

            It looks as though the comments that were made on 04/Apr/12 were about right. LU-3308 was opened to look into the regression aspect (which is seemingly unrelated to multiple processes per node).

            mmansk Mark Mansk added a comment -

            metabench source

            spitzcor Cory Spitz added a comment -

            This bug should be marked as affecting all 2.x versions.

            mmansk Mark Mansk added a comment -

            Data from metabench runs for 1.8.6, 2.3 & 2.4.

            mmansk Mark Mansk added a comment -

            Excel sheet comparing 1.8.6 to 2.3 & 2.4 using metabench; it shows the performance drop from 1.8.6 for deletes in the same directory on 2.x.

            spitzcor Cory Spitz added a comment -

            We could run some apples-to-apples numbers on Cray gear between 1.8.6 and 2.1.1. What kind of scale do you need?


            nrutman Nathan Rutman added a comment -

            Cory: yes, but it's a little unclear. I'm having trouble finding some old performance numbers at scale; I was really hoping Cliff had some older Hyperion numbers using 1.8.x.

            spitzcor Cory Spitz added a comment -

            If the HW was fixed for the comparison then that doesn't sound like a regression, just a major improvement with different dynamics.


            nrutman Nathan Rutman added a comment -

            Hmm, that's a good point, Cory. It gets a little obfuscated by the number of variations in client count, server type, and storage layout. Our 1.8.6 5-client test didn't show the drop with increasing ppn; it stayed constant at 5 kops. Our 2.1 8-client test did show the decrease, but it started from 13 kops and went down to 9 kops. So it's hard to call this a clear regression when the 2.1 numbers are all above the 1.8 numbers.

            spitzcor Cory Spitz added a comment -

            Nathan, I agree with the bottlenecks that you've identified, but I don't think that any of them are regressions. But, maybe I'm wrong wrt c.? In the description you wrote, "We see the same issue with 1.8.6 clients against the server; we do not see it with 1.8.6 servers." Shouldn't we first focus on the regression from 1.8.6 to 2.x?


            nrutman Nathan Rutman added a comment -

            Continuing to pursue this, there are four identified bottlenecks for unlink performance:

            a. Parent directory mutex in the Linux kernel VFS (taken in do_unlinkat). This greatly affects shared-directory operations within a single client (avg latency: 1 ppn = 0 µs, 4 ppn = 6460 µs, while the Lustre unlink RPC stays at ~250 µs for both 1 and 4 ppn). Measured with a dir-per-client layout, to remove the MDT shared-directory ldiskfs lock and shared-lock ldlm callbacks. (A condensed sketch of this code path follows the list, after item d.)

            b. Single MDC RPC-in-flight lock (rpc_lock) serializing RPCs. This also greatly affects multi-ppn operations (1 ppn = 0 µs, 2 ppn = 180 µs, 4 ppn = 1080 µs, 8 ppn = 3530 µs). Measured with a dir-per-process layout to avoid the parent mutex, but sharing the same dirs between clients to include the ldiskfs and ldlm effects. While it may be possible to remove this restriction (MRP-59), doing so may be very complex due to ordering issues. (Note that LU-933 has a patch to remove it in an unsafe way, for testing.) A sketch of this serialization pattern is at the end of this comment.

            c. Shared-dir MDT ldlm lock. Lock callback time increases slowly with increasing ppn (1 ppn = 130 µs, 8 ppn = 240 µs), possibly due to increasing client context-switch time. Measured as in b above. Not expected to increase with client count. It could possibly be eliminated by having the MDS act as a proxy lock holder for multiple shared-dir clients, but there is not much gain to be had here.

            d. Shared-dir ldiskfs lock. Contention for the same lock on the MDT directory; mostly independent of ppn, but it will increase with client count (subtracting out the lock callback latency: 1 ppn = 90 µs, 8 ppn = 130 µs, measured as in b with 8 clients). Pdirops (LU-50), in Lustre 2.2.0, should help with this.
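
            To make bottleneck (a) concrete, here is a condensed, paraphrased sketch of the 2.6.32-era VFS unlink path (trimmed from fs/namei.c; error handling and the path walk are omitted, so treat it as an illustration rather than an exact excerpt):

            /* Every unlink in a given directory takes the parent inode's
             * i_mutex and holds it across the child lookup and vfs_unlink(),
             * so N processes on one node unlinking in a shared directory are
             * serialized in the VFS before Lustre ever sees the request. */
            static long do_unlinkat(int dfd, const char __user *pathname)
            {
                    struct nameidata nd;
                    struct dentry *dentry;
                    struct inode *dir;
                    int error;

                    /* ... resolve the parent directory into nd ... */

                    dir = nd.path.dentry->d_inode;
                    mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);   /* bottleneck (a) */

                    dentry = lookup_hash(&nd);        /* child lookup under the parent lock */
                    error = vfs_unlink(dir, dentry);  /* Lustre's unlink RPC is issued in here */
                    dput(dentry);

                    mutex_unlock(&dir->i_mutex);
                    /* ... path/name release ... */
                    return error;
            }

            A dir-per-process layout sidesteps this mutex entirely, which is why the measurement in (b) used dir-per-process.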

            Cliff, I was wondering if you have any 1 ppn mdtest results on Hyperion for different client counts? I can't find any on Maloo (and indeed the links above seem to have lost their logs as well). Also, can you tell me whether the results above included pdirops (LU-50, Lustre 2.2.0)?
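
            For bottleneck (b), the serialization comes from the MDC allowing only one modifying metadata RPC in flight per client, guarded by a per-MDC rpc_lock. The sketch below paraphrases the shape of the Lustre 2.x client code (mdc_get_rpc_lock()/mdc_put_rpc_lock() around mdc_reint()); exact names, lock types, and the intent-based exemptions differ between versions, so read it as an illustration, not a verbatim excerpt:

            /* Simplified: one mutex per MDC serializes all modifying
             * metadata RPCs (unlink included).  The real code exempts
             * plain getattr/lookup intents and carries extra state. */
            struct mdc_rpc_lock {
                    struct mutex          rpcl_mutex;
                    struct lookup_intent *rpcl_it;
            };

            static void mdc_get_rpc_lock(struct mdc_rpc_lock *lck,
                                         struct lookup_intent *it)
            {
                    mutex_lock(&lck->rpcl_mutex);
                    lck->rpcl_it = it;
            }

            static void mdc_put_rpc_lock(struct mdc_rpc_lock *lck,
                                         struct lookup_intent *it)
            {
                    lck->rpcl_it = NULL;
                    mutex_unlock(&lck->rpcl_mutex);
            }

            static int mdc_reint(struct ptlrpc_request *request,
                                 struct mdc_rpc_lock *rpc_lock, int level)
            {
                    int rc;

                    request->rq_send_state = level;

                    mdc_get_rpc_lock(rpc_lock, NULL);  /* only one modifying RPC in flight */
                    rc = ptlrpc_queue_wait(request);   /* send and wait for the MDS reply */
                    mdc_put_rpc_lock(rpc_lock, NULL);  /* next queued unlink may proceed */

                    return rc;
            }

            So even with the parent i_mutex out of the picture, N local processes still queue here one RPC at a time; MRP-59 and the LU-933 test patch mentioned in (b) are about relaxing exactly this restriction.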


            People

              Assignee: wc-triage WC Triage
              Reporter: nrutman Nathan Rutman
              Votes: 1
              Watchers: 10
