
[LU-9972] Performance regressions on unique directory removal

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: 2.10 (and 2.11)
    • Severity: 3

    Description

      There is a performance regression in directory removal.

      Server and client OS: RHEL 7.3
      Lustre version: 2.10.52
      Backend filesystem: ldiskfs

      mpirun --allow-run-as-root /work/tools/bin/mdtest -n 5000 -v -d /scratch0/mdtest.out -D -i 3 -p 10 -w 0 -u

      SUMMARY: (of 3 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation:      89757.381      65618.928      74607.900      10774.356
         Directory stat    :     320946.433     319888.242     320294.264        465.749
         Directory removal :      19028.569      17837.487      18351.200        499.838
         Tree creation     :        434.446        158.826        318.943        116.860
         Tree removal      :         27.018         25.210         26.281          0.775
      
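      For reference, below is an annotated form of the reproducer. The flag descriptions follow the mdtest documentation; the mdtest path, test directory, and MPI launch setup are taken from the command above and depend on the test cluster.

      # Flag meanings (per the mdtest documentation):
      #   -n 5000  items (directories, since -D is given) per task
      #   -v       verbose output
      #   -d DIR   directory on the Lustre file system to run the test in
      #   -D       operate on directories only, no files
      #   -i 3     run 3 iterations (the SUMMARY block averages these)
      #   -p 10    10-second pre-iteration delay
      #   -w 0     write 0 bytes per file (not relevant with -D)
      #   -u       unique working directory for each task ("unique dir" mode)
      mpirun --allow-run-as-root /work/tools/bin/mdtest \
          -n 5000 -v -d /scratch0/mdtest.out -D -i 3 -p 10 -w 0 -u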


          Activity


            adilger Andreas Dilger added a comment -

            I've updated the results so the bisect test results are listed in commit order (rather than bisect order), to show there is a clear break between LU-7408 and the next patch, LU-7053.
            jhammond John Hammond added a comment -

            There must have been more runs than just these if you were able to isolate https://review.whamcloud.com/#/c/17092/.

            standan Saurabh Tandan (Inactive) added a comment - - edited

            There was approximately a 90% drop in performance for "Dir removal" in the "mdtestfpp" results from tag 2.7.65. Following is the data for all the runs:

             Tag                             Dir removal (ops/sec)   Bisect
             2.7.56                          18298
             2.7.57                         121954
             2.7.61                          64849
             2.7.64                         111655                   good
             2.7.64-g63a3e412 (LU-7419)      74384                   good
             2.7.64-gc965fc8a (LU-7450)      72374                   good
             2.7.64-g6765d785 (LU-7408)      92029                   good
             2.7.64-g9ae3a289 (LU-7053)      11517                   bad
             2.7.64-g0d3a07a8 (LU-7430)      15114                   bad
             2.7.64-g959f8f78 (LU-7573)      11530                   bad
             2.7.65                          11375                   bad
             2.7.66                          11403                   bad
             2.10.53                         12473
             2.10.54                          9649
            
            
            
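            For reference, a minimal sketch of how per-tag numbers like those above can be collected, assuming each tag has already been built and installed on the test cluster; the tag list below is an illustrative subset, and only the SUMMARY parsing is derived from the mdtest output format shown in the description:

                #!/bin/bash
                # Run the reproducer for each tag and print the mean "Directory removal"
                # rate (ops/sec) taken from the mdtest SUMMARY block.
                for tag in 2.7.64 2.7.64-g9ae3a289 2.7.65; do      # illustrative subset of tags
                    # (build and install the tag on servers/clients here, then run:)
                    rate=$(mpirun --allow-run-as-root /work/tools/bin/mdtest \
                               -n 5000 -d /scratch0/mdtest.out -D -i 3 -p 10 -w 0 -u 2>&1 |
                           awk '/SUMMARY/ {s=1} s && /Directory removal/ {print $(NF-1); exit}')
                    printf '%-32s %s\n' "$tag" "$rate"
                done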
            pjones Peter Jones added a comment - - edited

            Alex

            I daresay that Saurabh may elaborate, but I understand that he has found that your patch for LU-7053 (osd: don't lookup object at insert, https://review.whamcloud.com/#/c/17092/) is the one that introduced the performance regression with directory removal.

            Do you have any ideas on how to avoid this?

            Peter


            standan Saurabh Tandan (Inactive) added a comment -

            Still working on it, Shuichi; I will post some findings soon.

            ihara Shuichi Ihara (Inactive) added a comment -

            Any progress on finding the regression point?

            gerrit Gerrit Updater added a comment -

            Saurabh Tandan (saurabh.tandan@intel.com) uploaded a new patch: https://review.whamcloud.com/29126
            Subject: LU-9972 tests: Build required for LU-9972
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 708fec1e34058ec735be819217498f8cc90aa924

            adilger Andreas Dilger added a comment -

            Discussed this with Saurabh and Cliff. Cliff thinks the problem may date back to the DNE2 landings, since EE 2.7 predates the DNE2 changes, which appeared as early as 2.8.0.

            Saurabh will try a git bisect starting from v2_7_50 (== 2.7.0) to see whether that has good performance on our test cluster (good ~= 70k rmdir/sec) and go from there. We would like to keep the kernel version the same (RHEL 7.4) to avoid the results being affected by changes to the kernel or other configuration options.
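            The bisect described in this comment can be scripted with "git bisect run", sketched below under several assumptions: the bad endpoint tag name (v2_7_65) is inferred by analogy with the v2_7_50 naming used above, build-and-reinstall.sh is a hypothetical site-specific helper that rebuilds and reinstalls Lustre on the test cluster, and the pass/fail check reuses the ~70k rmdir/sec "good" level from this comment:

                # Illustrative only; not the exact procedure used in this ticket.
                git bisect start
                git bisect good v2_7_50                      # expected-good starting point
                git bisect bad  v2_7_65                      # tag reported as bad in the results above
                git bisect run sh -c '
                    sh ./build-and-reinstall.sh || exit 125  # hypothetical helper; 125 tells bisect to skip
                    rate=$(mpirun --allow-run-as-root /work/tools/bin/mdtest \
                               -n 5000 -d /scratch0/mdtest.out -D -i 3 -p 10 -w 0 -u 2>&1 |
                           awk "/SUMMARY/ {s=1} s && /Directory removal/ {print int(\$(NF-1)); exit}")
                    [ "$rate" -ge 70000 ]                    # good ~= 70k rmdir/sec
                '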
            pjones Peter Jones added a comment -

            Saurabh

            Please can you narrow down where the change occurred?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            Results from master builds (18 threads, 16 clients, mdtestfpp; rates in ops/sec):

            Build          Version        Dir create   Dir stat      Dir rm
            master-3596    2.9.58_22      21514        188773        11822
            master-3598    2.9.58_57      21570        209599        11653
            master-3601    2.9.59         21063        223101        11879
            master-2607    2.8.59-35      19328        211813        11797
            master-3637    2.10.52_83     25987        234954        15033
            
            

            Results from EE builds:

            Build          Version        Dir create   Dir stat      Dir rm
            b_ieel3_0-105  2.7.18         21239        167421        73112
            b_ieel3_0-89   2.7.16.1       19758        169050        77119
            b_ieel3_0-204  2.7.19.12      28330        288267        59444
            b_ieel3_0-214  2.7.20.2       28136        331563        60515
            

            cliffw Cliff White (Inactive) added a comment -

            Our hardware config has changed a bit since 2.9, and we have seen noticeable improvements since changing the tuned-adm profile. All of our old results are on SharePoint; if you look at the most current spreadsheet, you will see the jump in Dir rm after the tuned-adm change: http://tinyurl.com/ydzx7gxp

            If you look at our last EE 3.0 runs from June 2017 (b_ieel3_0 build 214), you will see Dir rm is 4x better, so I would look at some deltas there: http://tinyurl.com/yanedznq
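            For reference, tuned profiles are inspected and switched with the tuned-adm tool; the specific profile adopted on the test cluster is not stated here, so the profile name below is only an example:

                tuned-adm active                       # show the currently active profile
                tuned-adm list                         # list the available profiles
                tuned-adm profile latency-performance  # switch profile (example choice only)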

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 15