Significant performance regression with patch LU-5264

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • None
    • master
    • 2
    • 9223372036854775807

    Description

      During our performance testing, we found a significant metadata performance regression with LU-5264 on master.

      # mpirun -np 128 -ppn 4 -hostfile ./hostfile /work/tools/bin/mdtest -n 1000 -p 10 -i 5 -d /scratch1/mdtest.out
      

      master

      SUMMARY: (of 5 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation:      39552.671      33039.129      37024.828       2875.617
         Directory stat    :      33462.417      29340.691      31662.586       1384.330
         Directory removal :      40938.777      40238.677      40571.960        283.701
         File creation     :      17696.663      17209.531      17542.185        171.470
         File stat         :      33892.041      33429.312      33680.603        170.577
         File read         :      11284.121      11012.694      11220.417        104.978
         File removal      :      39718.200      39449.348      39556.254         90.590
         Tree creation     :       4583.939        700.335       3652.356       1487.449
         Tree removal      :        170.563        156.738        162.935          5.172
      

      Same client version, but with patch 42fdf8355791cb682c6120f7950bb2ecd50f97aa (LU-5264 obdclass: fix race during key quiescency) reverted on the servers:

      SUMMARY: (of 5 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation:      44937.511      42117.095      43780.402       1335.927
         Directory stat    :     135310.427     129560.951     133625.293       2077.128
         Directory removal :      51525.499      46852.534      49965.297       1611.759
         File creation     :      42978.506      41435.145      42413.409        586.294
         File stat         :     135882.699     133344.886     134466.144        977.577
         File read         :     121788.787     111332.613     116374.190       3351.730
         File removal      :      84827.815      78120.995      80378.741       2522.662
         Tree creation     :       4650.004       3788.893       4268.099        336.241
         Tree removal      :        198.059        129.234        179.980         25.563
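
The two summaries above can be compared operation by operation. The following is a minimal sketch, not part of the original report: it copies the Mean column from both tables and prints the ratio of the reverted-patch rate to the master rate. The dictionary and variable names are illustrative. On these numbers the file read mean is roughly ten times higher with the patch reverted, and the directory and file stat means roughly four times higher.

```python
# Mean rates (ops/sec) copied from the two mdtest summaries above.
# "master" includes the LU-5264 patch; "reverted" has it reverted on the servers.
master = {
    "Directory creation": 37024.828,
    "Directory stat": 31662.586,
    "Directory removal": 40571.960,
    "File creation": 17542.185,
    "File stat": 33680.603,
    "File read": 11220.417,
    "File removal": 39556.254,
    "Tree creation": 3652.356,
    "Tree removal": 162.935,
}
reverted = {
    "Directory creation": 43780.402,
    "Directory stat": 133625.293,
    "Directory removal": 49965.297,
    "File creation": 42413.409,
    "File stat": 134466.144,
    "File read": 116374.190,
    "File removal": 80378.741,
    "Tree creation": 4268.099,
    "Tree removal": 179.980,
}

# Ratio of the reverted-patch mean to the master mean, per operation.
ratios = {op: reverted[op] / master[op] for op in master}

# Print the most affected operations first.
for op, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<20} {ratio:5.2f}x")
```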
      

          Activity

            [LU-6800] Significant performance regression with patch LU-5264

            The first tests run with patch #15558 at the TGCC site do not show the same read performance regression.
            The site will soon provide its numbers for this ticket.
            More instrumentation will be done.

            bfaccini Bruno Faccini (Inactive) added a comment

            We have removed the patch for LU-5264 from all our file systems. We will discuss whether we can try the current fix for LU-6800 on a test file system by the end of the month.

            I will keep you posted.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Since I am the author of the patch for LU-5264, and thus unfortunately responsible for this situation, and since the DDN team has already produced a very good but partial fix, I would like to work more actively on fixing this remaining read performance regression.

            Aurelien, Bruno, since contention between multiple clients seems to be the main trigger of the issue, would it be possible for me to work directly with you on a site where you hit this problem heavily?

            bfaccini Bruno Faccini (Inactive) added a comment
            ihara Shuichi Ihara (Inactive) added a comment - - edited

            Please re-open LU-6800. We understand that http://review.whamcloud.com/15558 helps a lot, but not all of the performance is back. Here are the test results, with 32 clients and 128 mdtest processes.

            test1 : master (commit-id: fe60e0135ee2334440247cde167b707b223cf11d) branch (includes LU-5264 and patch 15558 )

            # mpirun -np 128 -ppn 4 -hostfile ./hostfile /work/tools/bin/mdtest -i 3 -n 1000 -d /scratch1/mdtest.out
            
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      45237.210      36692.398      40159.293       3669.695
               Directory stat    :     132371.575     129820.230     131383.164       1118.004
               Directory removal :      53873.775      50985.149      52790.576       1285.107
               File creation     :      42732.503      37298.342      40070.221       2219.840
               File stat         :     131527.304     129333.170     130765.529       1013.515
               File read         :      87588.987      67919.964      80344.389       8825.741
               File removal      :      84046.477      80418.268      82668.050       1604.248
               Tree creation     :       4364.520       4032.985       4164.502        143.755
               Tree removal      :        203.587        194.749        200.008          3.799
            

            test2 : master + revert 15558

            # mpirun -np 128 -ppn 4 -hostfile ./hostfile /work/tools/bin/mdtest -i 3 -n 1000 -d /scratch1/mdtest.out
            
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      40422.683      20650.668      30457.842       8072.661
               Directory stat    :      33032.600      27110.270      30459.575       2479.308
               Directory removal :      41611.362      39640.289      40887.059        885.442
               File creation     :      17622.819      17537.572      17581.070         34.824
               File stat         :      33991.557      33935.386      33959.396         23.645
               File read         :      11241.112      10994.112      11104.383        102.558
               File removal      :      40024.327      39973.169      39998.669         20.886
               Tree creation     :       4185.932       3705.216       4007.822        215.092
               Tree removal      :        170.327        164.689        167.062          2.386
            

            test3 : master + revert 15558 + revert 13103

            # mpirun -np 128 -ppn 4 -hostfile ./hostfile /work/tools/bin/mdtest -i 3 -n 1000 -d /scratch1/mdtest.out
            
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      46423.406      37490.161      43188.774       4041.792
               Directory stat    :     134178.816     126241.328     130085.996       3245.214
               Directory removal :      53737.981      44389.098      50171.405       4125.732
               File creation     :      44199.169      37398.927      40834.020       2776.628
               File stat         :     135524.181     130626.893     132934.894       2009.179
               File read         :     100767.654      76374.732      91483.603      10776.519
               File removal      :      86318.162      82618.862      85021.870       1700.945
               Tree creation     :       4634.590       3557.510       4167.598        451.208
               Tree removal      :        201.814        194.397        197.894          3.043
            

            If we compare the test3 and test2 results, the test2 results are significantly worse, which means patch 13103 caused this performance regression.
            GuZhang at DDN pushed patch 15558 and, as far as we can see from the test1 results, performance is back except for the "file read" operation.
            So patch 15558 helps a lot, but even with it we still see a performance regression on the "file read" operation. We need to investigate this further to get all of the performance back.
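
The size of the remaining "file read" gap can be read off the three tables above. The sketch below is illustrative only and not part of the original comment: it copies the Mean and Std Dev values of the "File read" row from test1, test2 and test3, and expresses test1 and test2 as a fraction of test3, which is taken here as the baseline because both patches are reverted. The names `file_read` and `recovered_fraction` are hypothetical.

```python
# (mean, std dev) of "File read" in ops/sec, copied from the three tables above.
file_read = {
    "test1": (80344.389, 8825.741),   # master, includes LU-5264 and patch 15558
    "test2": (11104.383, 102.558),    # master + revert 15558
    "test3": (91483.603, 10776.519),  # master + revert 15558 + revert 13103
}

# test3 has both patches reverted, so it is used as the baseline here.
baseline_mean, baseline_std = file_read["test3"]


def recovered_fraction(name):
    """Return the file read mean of a test as a fraction of the test3 mean."""
    mean, _ = file_read[name]
    return mean / baseline_mean


for name in ("test1", "test2"):
    mean, std = file_read[name]
    gap = baseline_mean - mean
    print(f"{name}: {recovered_fraction(name):.1%} of test3, "
          f"gap {gap:.0f} ops/sec (test3 std dev {baseline_std:.0f})")
```

Each run used only 3 iterations (`-i 3`), so the standard deviations printed alongside the gap are worth keeping in view when reading the fractions.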


            Aurélien,

            The build issue for bullx has already been reported in the duplicate LU-6823. Bull is currently looking at LU-6800 carefully.

            bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/15648
            Subject: LU-6800 obdclass: change spinlock of key to rwlock
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 5adcce4242802b6be3441b425220bc422926a822

            gerrit Gerrit Updater added a comment

            Unfortunately, we did not test with 15558. Not sure we will be able to do this on the production system.

            adegremont Aurelien Degremont (Inactive) added a comment

            Hi Aurelien,

            Did you test with or without 15558? Does it help, or do you still see the same problem?

            lixi Li Xi (Inactive) added a comment

            FYI, at CEA, we saw heavy load on the MDT with several application codes. This caused poor performance and instability on the filesystem, so we decided to revert the LU-5264 patch for now, until we get something better.

            adegremont Aurelien Degremont (Inactive) added a comment

            As far as we know, http://review.whamcloud.com/15558/ is not perfect. It helps to get performance back on most metadata operations, but the file read operation is still slower than before LU-5264 was applied.
            I will post benchmark results soon.

            ihara Shuichi Ihara (Inactive) added a comment
            pjones Peter Jones added a comment -

            Landed for 2.8


            People

              bfaccini Bruno Faccini (Inactive)
              ihara Shuichi Ihara (Inactive)