Performance regressions on unique directory removal

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.11.0, Lustre 2.10.4
    • 2.10 (and 2.11)
    • 3

    Description

      There is a performance regression in directory removal when each process works in its own unique directory (mdtest -u).

      Server and client : RHEL7.3
      Lustre version : 2.10.52
      Backend filesystem: ldiskfs

      mpirun --allow-run-as-root /work/tools/bin/mdtest -n 5000 -v -d /scratch0/mdtest.out -D -i 3 -p 10 -w 0 -u

      SUMMARY: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation:      89757.381      65618.928      74607.900      10774.356
         Directory stat    :     320946.433     319888.242     320294.264        465.749
         Directory removal :      19028.569      17837.487      18351.200        499.838
         Tree creation     :        434.446        158.826        318.943        116.860
         Tree removal      :         27.018         25.210         26.281          0.775
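
For comparing runs across versions, the SUMMARY block that mdtest prints can be turned into numbers with a few lines of Python. This is a minimal sketch, not part of mdtest: the function name `parse_mdtest_summary` and the regular expression are assumptions based only on the layout of the output shown in this ticket.

```python
import re

# Matches rows such as
#   "Directory removal :      19028.569      17837.487      18351.200        499.838"
# The pattern is an assumption derived from the SUMMARY layout in this ticket.
ROW = re.compile(
    r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)

def parse_mdtest_summary(text):
    """Return {operation: {"max", "min", "mean", "stddev"}} for each SUMMARY row."""
    results = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name = match.group(1)
            values = [float(v) for v in match.groups()[1:]]
            results[name] = dict(zip(("max", "min", "mean", "stddev"), values))
    return results

# The SUMMARY block from the description above, pasted verbatim.
SAMPLE = """\
SUMMARY: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      89757.381      65618.928      74607.900      10774.356
   Directory stat    :     320946.433     319888.242     320294.264        465.749
   Directory removal :      19028.569      17837.487      18351.200        499.838
   Tree creation     :        434.446        158.826        318.943        116.860
   Tree removal      :         27.018         25.210         26.281          0.775
"""

summary = parse_mdtest_summary(SAMPLE)
print(summary["Directory removal"]["mean"])
```

Header, separator, and verbose lines such as `V-1: Entering timestamp...` do not match the pattern and are skipped.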
      

          Activity

            [LU-9972] Performance regressions on unique directory removal

            Our hardware config has changed a bit since 2.9, and we have seen noticeable improvements since changing the tuned-adm profile. Of course, all our old results are on Sharepoint.
            If you look at the most current spreadsheet, you will see the jump in Dir rm with the tuned-adm change:
            http://tinyurl.com/ydzx7gxp

            If you look at our last EE 3.0 runs from June 2017 (b_ieel3_0 build 214), you will see Dir rm is 4x better, so I would look at some deltas there: http://tinyurl.com/yanedznq

            cliffw Cliff White (Inactive) added the comment above.

            I think the problem has existed since at least b2_9.
            Here are the results of the same test on b2_9 (the client stayed on 2.10.1; only the server was changed to b2_9).

            mpirun -np 128 mdtest -n 5000 -v -d /scratch0/mdtest.out -i 3 -p 30 -D -u (unique directory)

            SUMMARY: (of 3 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      91409.935      72314.781      84242.568       8491.267
               Directory stat    :     184806.326     183688.542     184367.927        487.111
               Directory removal :      20718.518      20303.157      20555.893        181.147
               Tree creation     :        552.285        400.441        473.117         62.160
               Tree removal      :         40.413         29.341         35.321          4.563
            V-1: Entering timestamp...
            

            mpirun -np 128 mdtest -n 5000 -v -d /scratch0/mdtest.out -i 3 -p 30 -D (shared directory)

            SUMMARY: (of 3 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      70310.000      45717.790      58926.161      10122.282
               Directory stat    :     178080.331     175913.598     176783.485        934.667
               Directory removal :      86194.900      72838.446      79018.261       5498.119
               Tree creation     :       5527.274       2804.821       3744.496       1261.231
               Tree removal      :         80.959         24.059         61.936         26.784
            V-1: Entering timestamp...
            
            ihara Shuichi Ihara (Inactive) added the comment above.

            Cliff, do we have similar mdtest results from the performance test cluster, in particular 2.10.0/1, 2.10.52/53, and 2.9.x? That would give us a ballpark of where this performance regression has been introduced, and allow git bisect to narrow it down to a particular patch.

            adilger Andreas Dilger added the comment above.
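
The narrowing step suggested here is a binary search over an ordered history, which is what git bisect automates. The sketch below only illustrates that idea: `first_bad`, the commit labels, and the breakpoint are hypothetical placeholders, not real Lustre history or measurements.

```python
def first_bad(candidates, is_bad):
    """Binary search for the first candidate for which is_bad() returns True.

    Assumes candidates are ordered oldest to newest, candidates[0] is good,
    candidates[-1] is bad, and every candidate after the first bad one is
    also bad (the same assumptions git bisect relies on).
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid
        else:
            lo = mid
    return candidates[hi]

# Toy usage with placeholder commits; the regression point (11) is made up.
commits = [f"commit-{i}" for i in range(16)]
tested = []

def is_bad(commit):
    # In practice: build this commit, run the mdtest -u case, and compare the
    # "Directory removal" rate against a threshold taken from a known-good build.
    tested.append(commit)
    return int(commit.split("-")[1]) >= 11

culprit = first_bad(commits, is_bad)
print(culprit, "found after", len(tested), "test runs")
```

With 16 candidates between a known-good and a known-bad build, only 4 benchmark runs are needed, which is why a rough version bracket from existing results is worth having before bisecting.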

            Sorry for the delayed response. I needed to change the hardware configuration, but here are new results on b2_10 (2.10.1_RC1).
            rmdir on unique directories is clearly slower than the same benchmark on a shared directory.

            mpirun -np 128 mdtest -n 5000 -v -d /scratch0/mdtest.out -i 3 -p 30 -D (for shared directory)
            mpirun -np 128 mdtest -n 5000 -v -d /scratch0/mdtest.out -i 3 -p 30 -D -u (for unique directory)
            32 clients, 128 processes. Both runs were collected on exactly the same hardware configuration.

            Here are the directory operations on a shared directory.

            SUMMARY: (of 3 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      91979.485      69249.863      79842.797       9343.315
               Directory stat    :     197008.811     180039.999     189342.439       7023.422
               Directory removal :     140527.764     128798.718     133567.639       5032.803
               Tree creation     :       5462.720       1034.229       3084.207       1822.788
               Tree removal      :         92.639         74.702         86.019          8.041
            

            And here are the results for unique directories.

              
            SUMMARY: (of 3 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      84094.691      75575.177      80444.764       3583.407
               Directory stat    :     463370.743     431285.266     448299.685      13170.724
               Directory removal :      18722.965      18461.182      18558.573        116.903
               Tree creation     :        593.577        310.356        472.213        119.117
               Tree removal      :         37.275         33.999         35.691          1.340
            V-1: Entering timestamp...
            
            ihara Shuichi Ihara (Inactive) added the comment above.
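
To put a number on the gap in the b2_10 (2.10.1_RC1) run, the mean rates from the shared and unique SUMMARY tables can be divided directly. This is a small sketch; the helper name `shared_to_unique_ratio` is made up for illustration, and the values are copied from the Mean column of the two tables in the comment above.

```python
# Mean rates copied from the two b2_10 (2.10.1_RC1) SUMMARY tables.
shared = {
    "Directory creation": 79842.797,
    "Directory stat": 189342.439,
    "Directory removal": 133567.639,
}
unique = {
    "Directory creation": 80444.764,
    "Directory stat": 448299.685,
    "Directory removal": 18558.573,
}

def shared_to_unique_ratio(shared_means, unique_means):
    """Return shared/unique for every operation present in shared_means."""
    return {op: shared_means[op] / unique_means[op] for op in shared_means}

ratios = shared_to_unique_ratio(shared, unique)
for op, ratio in ratios.items():
    print(f"{op:<20} shared/unique = {ratio:.2f}")
```

Directory creation is roughly equal in both modes (ratio about 0.99) and stat is faster with unique directories (about 0.42), while removal is about 7.2 times faster in the shared directory, so the slowdown is specific to removal.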

            For example lustre-2.7(IEEL3.0)/CentOS7.3

            SUMMARY: (of 3 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      46577.991      42249.894      44871.081       1881.494
               Directory stat    :     373243.136     367643.706     370043.774       2354.791
               Directory removal :      78530.701      66152.245      72781.092       5091.584
               File creation     :     107283.764      96953.405     103118.187       4447.973
               File stat         :     385082.155     375112.919     379387.910       4191.828
               File read         :     185463.654     177089.199     182367.310       3750.818
               File removal      :     127467.768     113218.809     122566.251       6612.256
               Tree creation     :        349.409         91.996        262.234        120.388
               Tree removal      :         20.765         18.039         19.132          1.176
            

            I'm going to test lustre-2.9 to compare.

            ihara Shuichi Ihara (Inactive) added the comment above.

            Compared to which version/kernel?

            adilger Andreas Dilger added the comment above.

            People

              bzzz Alex Zhuravlev
              ihara Shuichi Ihara (Inactive)