[LU-5663] mds-survey performance regress on master

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Blocker
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 15874

    Description

      Recently I ran mds-survey performance tests on master and b2_5 and found significant performance regressions on master.

      The performance data are as follows:

      b2_5:

      test name             create   lookup    getattr   setxattr   destroy
      p3700_sc0_32t_32dir   157343   1298292   876289    93022      143316
      p3700_sc0_64t_32dir   170724   1378018   914515    99558      138258
      p3700_sc1_32t_32dir   49900    1279590   881454    98467      33632
      p3700_sc1_64t_32dir   40075    1274074   901117    100437     35839

      MASTER (pre-2.7):

      test name             create   lookup    getattr   setxattr   destroy
      p3700_sc0_32t_32dir   95035    1124743   124594    76011      53653
      p3700_sc0_64t_32dir   40043    1069693   49133     56457      51161
      p3700_sc1_32t_32dir   29693    1106520   120479    60920      37890
      p3700_sc1_64t_32dir   26208    1165051   383138    59853      38974

      PS: p3700 is the test name;
      sc0 means mds-survey stripe count 0 test
      32t means 32 threads
      32dir means 32 directories.

      Therefore p3700_sc1_32t_32dir refers to the test that runs {create, lookup, getattr, setxattr, destroy} on files with 1 stripe, with 32 threads working against 32 different directories.
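      The naming convention above can be decoded mechanically. The following is a minimal Python sketch, not part of mds-survey itself; the `SurveyCase` type and `parse_case` helper are hypothetical names introduced only for illustration, and the pattern assumes every test name follows the `<label>_sc<N>_<N>t_<N>dir` layout described above.

      ```python
      import re
      from typing import NamedTuple


      class SurveyCase(NamedTuple):
          """Fields encoded in a test name (hypothetical helper type)."""
          label: str          # e.g. "p3700"
          stripe_count: int   # the number after "sc"
          threads: int        # the number before "t"
          directories: int    # the number before "dir"


      # Assumed layout: <label>_sc<stripes>_<threads>t_<dirs>dir
      _CASE_RE = re.compile(
          r"^(?P<label>\w+?)_sc(?P<sc>\d+)_(?P<threads>\d+)t_(?P<dirs>\d+)dir$"
      )


      def parse_case(name: str) -> SurveyCase:
          """Split a test name such as 'p3700_sc1_32t_32dir' into its fields."""
          match = _CASE_RE.match(name)
          if match is None:
              raise ValueError(f"unrecognized test name: {name!r}")
          return SurveyCase(
              label=match["label"],
              stripe_count=int(match["sc"]),
              threads=int(match["threads"]),
              directories=int(match["dirs"]),
          )


      if __name__ == "__main__":
          print(parse_case("p3700_sc1_32t_32dir"))
      ```

      Names that do not match the assumed layout raise `ValueError` rather than being silently misread.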

          Activity

            jay Jinshan Xiong (Inactive) added a comment -

            I reformatted the OSTs before each test. Most likely I ran the pre-2.7 test with a smaller journal size.

            adilger Andreas Dilger added a comment -

            Is it possible you formatted the device differently when you ran the master test for 2.7? Maybe specifying a smaller device size or similar? If the MDT is formatted with a smaller size, it could change the filesystem parameters (e.g. inode ratio, journal size, etc.).

            Another possibility is that there was something else in the filesystem that caused it to run more slowly (e.g. files from some previous testing). Did you also format the OSTs identically for all of the tests?
            jay Jinshan Xiong (Inactive) added a comment - - edited

            Sorry I made a terrible mistake here. I reran the test but I didn't see any performance regression. I must have used different parameters when I was running the test against b2_5.

            		                             Create	Lookup	Md_getattr	Setxattr	Destroy
            
            b2_5:						
            	p3700_sc0_32t_32dir	141313.71	1265631.61	887451.61	94046.43	131052.37
            	p3700_sc0_64t_32dir	170025.57	1337869.1	892227.34	103872.16	138735.83
            	p3700_sc1_32t_32dir	48335.6	        1253868.18	876506.64	97518.53	32697.35
            	p3700_sc1_64t_32dir	39041.31	1257847.46	735008.57	101502.42	33928.42
            
            Master:						
            	p3700_sc0_32t_32dir	138848.11	1263207.89	865148.7	88467.54	129140.86
            	p3700_sc0_64t_32dir	149196.48	1335493.91	875077.15	95105.77	129005.38
            	p3700_sc1_32t_32dir	48971.7	        1237754.48	839588.74	97515.56	35833.2
            	p3700_sc1_64t_32dir	39285.28	1297680.39	839741.44	94257.2	        32877.89
            

            Here is the latest result. BTW, neither reverting the patch nor applying Di's patch boosted the performance.
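
            To make the comparison between the two branches easier to read, the rerun figures can be expressed as a percentage change of master relative to b2_5. The following is a minimal Python sketch using only the numbers from the table above; the helper names `relative_change` and `compare` are hypothetical and not part of mds-survey.

            ```python
            # Operation order matches the table columns above.
            OPS = ("create", "lookup", "md_getattr", "setxattr", "destroy")

            # Rerun results copied from the table above (operations per second).
            B2_5 = {
                "p3700_sc0_32t_32dir": (141313.71, 1265631.61, 887451.61, 94046.43, 131052.37),
                "p3700_sc0_64t_32dir": (170025.57, 1337869.1, 892227.34, 103872.16, 138735.83),
                "p3700_sc1_32t_32dir": (48335.6, 1253868.18, 876506.64, 97518.53, 32697.35),
                "p3700_sc1_64t_32dir": (39041.31, 1257847.46, 735008.57, 101502.42, 33928.42),
            }
            MASTER = {
                "p3700_sc0_32t_32dir": (138848.11, 1263207.89, 865148.7, 88467.54, 129140.86),
                "p3700_sc0_64t_32dir": (149196.48, 1335493.91, 875077.15, 95105.77, 129005.38),
                "p3700_sc1_32t_32dir": (48971.7, 1237754.48, 839588.74, 97515.56, 35833.2),
                "p3700_sc1_64t_32dir": (39285.28, 1297680.39, 839741.44, 94257.2, 32877.89),
            }


            def relative_change(baseline: float, candidate: float) -> float:
                """Percentage change of candidate relative to baseline (negative = slower)."""
                return (candidate - baseline) / baseline * 100.0


            def compare(baseline: dict, candidate: dict) -> dict:
                """Per-test, per-operation percentage change of candidate vs. baseline."""
                return {
                    case: {
                        op: relative_change(base, cand)
                        for op, base, cand in zip(OPS, baseline[case], candidate[case])
                    }
                    for case in baseline
                }


            if __name__ == "__main__":
                for case, changes in compare(B2_5, MASTER).items():
                    row = "  ".join(f"{op}={pct:+.1f}%" for op, pct in changes.items())
                    print(f"{case}: {row}")
            ```

            A negative value means master was slower than b2_5 for that operation; a single run per configuration says nothing about run-to-run variance, so small differences should be read with care.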


            adilger Andreas Dilger added a comment -

            You could try reverting http://review.whamcloud.com/10376, which reduces the max transaction size by 1/2.
            di.wang Di Wang added a comment -

            This is not a fix; it just removes the redundant parts of the create object patch. Jinshan, could you please check whether this patch brings master back to 2.5 performance? Thanks!


            jay Jinshan Xiong (Inactive) added a comment -

            This issue may be related to LU-5621.

            People

              wc-triage WC Triage
              jay Jinshan Xiong (Inactive)