Lustre / LU-15736

Commit for LU-14792 introduces client side mdtest file create/remove regression and high std dev


Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical

    Description

      While testing 2.15 and comparing it to our 2.12 branch, I observed the following noticeable regressions:

      • client side file create regression
      • client side 32K file remove regression
      • the high std dev that we have been experiencing for creates/removes

       

      A git bisect revealed that this commit is the root cause (LU-14792):

      b9c4dc3c33 LU-14792 llite: enable filesystem-wide default LMV 

       

      More details:

      commit b9c4dc3c33fe87ecaa79a290190524ea21b7fa8a
      Author: Lai Siyao <lai.siyao@whamcloud.com>
      Date:   Mon Jun 21 11:52:01 2021 +0800
       
       
          LU-14792 llite: enable filesystem-wide default LMV
          
          This change includes three parts:
          1. save dir depth to ROOT after lookup on client side.
          2. once space balanced default LMV is set on ROOT, and
             max-inherit/max-inherit-rr is unlimited or not less than directory
             depth, new directory will be created in QOS or roundrobin mode.
          3. set ROOT default LMV max-inherit unlimited, and max-inherit-rr to
             3, and increase the ratio to create subdirectory on local MDT with
             the directory depth to ROOT, so that new directories will be
             created by space usage, and the deeper it's located it's more
             likely to create on local MDTs; and the top 3 layer will be created
             in roundrobin mode if system is balanced.
          
          Set default LMV in mkdir_on_mdt() to make sure its subdirectories are
          created on the same MDT. Add sanity 413d.
          
          Create a test directory on MDT0 for pjdfstest, because cross-MDT
          rename of symlink will migrate symlink to target MDT, which will cause
          inode change (LU-11631).

       

      All commits before this look great. All commits after this exhibit the above symptoms.

      git log on master:

      4668283cd1 LU-14806 o2iblnd: clear fatal error on successful failover
      ---> introduces regression b9c4dc3c33 LU-14792 llite: enable filesystem-wide default LMV
      ---> looks good b7bd4e3422 LU-14621 mdd: fix lock-tx order in mdd_xattr_merge()
      3e04b0fd6c LU-13417 mdd: set default LMV on ROOT
      4e05f3b70b (tag: v2_14_53, tag: 2.14.53) New tag 2.14.53
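
A bisect like the one described above could be automated along the following lines. This is only a sketch, not the procedure actually used for this report: the helper names `mdtest_mean` and `rate_ok`, the driver script name `bisect_step.sh`, and the 140000 ops/sec threshold are all hypothetical, and the bisect endpoints are simply taken from the git log above (4668283cd1 as bad, the 2.14.53 tag commit as good).

```shell
#!/bin/bash
# Print the Mean column for one operation from an mdtest SUMMARY table.
# $1 = operation label (e.g. "File creation"), $2 = mdtest output file.
mdtest_mean() {
    awk -F: -v op="$1" '
        { label = $1; gsub(/^[ \t]+|[ \t]+$/, "", label) }
        # Columns after the colon are: Max, Min, Mean, Std Dev.
        label == op { split($2, cols, " "); print cols[3]; exit }
    ' "$2"
}

# Exit status 0 if the mean rate is at least the threshold, 1 otherwise.
# $1 = measured mean, $2 = minimum acceptable rate (hypothetical value).
rate_ok() {
    awk -v mean="$1" -v min="$2" 'BEGIN { exit !(mean + 0 >= min + 0) }'
}

# Hypothetical driver (bisect_step.sh): rebuild and reinstall the client,
# rerun the mdtest job, then finish with:
#   mean=$(mdtest_mean "File creation" f_mdt0_0k_ost_uniq.out)
#   rate_ok "$mean" 140000
#
# and drive it with (bad commit first, then good):
#   git bisect start 4668283cd1 4e05f3b70b
#   git bisect run ./bisect_step.sh
```

`git bisect run` treats exit status 0 as good and 1 as bad, which is why `rate_ok` reports its verdict through its exit status. Given the run-to-run variance noted in this ticket, a single mdtest run per commit may not be enough to classify a commit reliably.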

       

      Testing b7bd4e3422 (before patch):

      SUMMARY rate: (of 5 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation        :     109280.683     100961.554     105818.622       3136.705
         Directory stat            :     410841.732     388930.761     404696.344       7969.689
         Directory removal         :     220323.614     150785.433     181709.288      25249.587
         File creation             :     154658.972     143961.530     149709.807       4125.522
         File stat                 :     700893.743     685670.701     692684.713       6583.956
         File read                 :     271890.920     183951.839     205427.555      33679.583
         File removal              :     147697.301     135354.855     140847.877       4338.359
         Tree creation             :        275.553        170.019        248.261         39.874
         Tree removal              :         99.770         85.408         91.795          5.479
       

       

      Testing b9c4dc3c33 (after patch):

      SUMMARY rate: (of 5 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation        :     108068.523     102899.926     105606.738       2004.020
         Directory stat            :     428322.427     395826.681     404486.906      12222.136
         Directory removal         :     236153.570     146400.162     179968.138      32242.271
         File creation             :     156681.218     101096.295     122707.414      23848.521
         File stat                 :     689022.637     677108.079     683537.598       4706.503
         File read                 :     276963.750     184493.079     241172.371      30923.700
         File removal              :     148977.883     100569.361     123812.878      18654.554
         Tree creation             :        280.232          0.994        142.324        123.201
         Tree removal              :         99.952         20.766         57.230         35.277
       

       

      Again, every test run on b9c4dc3c33 and later commits continues to exhibit the regressions and high deviations noted above. It varies from run to run, but I can get regressions of 15% or more for both file creates and file removes.
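
For reference, the relative change in the mean and the spread relative to the mean can be computed directly from the two SUMMARY tables above. The sketch below is illustrative only; the helper names `pct_change` and `cv_pct` are made up here, and the inputs are the Mean and Std Dev values from this one pair of runs.

```shell
#!/bin/bash
# Relative change of the mean, in percent: (after - before) / before * 100.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# Coefficient of variation, in percent: std dev / mean * 100.
cv_pct() {
    awk -v mean="$1" -v sd="$2" \
        'BEGIN { printf "%.1f\n", sd / mean * 100 }'
}

# Values copied from the b7bd4e3422 (before) and b9c4dc3c33 (after) tables.
echo "File creation mean change: $(pct_change 149709.807 122707.414)%"
echo "File removal mean change:  $(pct_change 140847.877 123812.878)%"
echo "File creation CV before:   $(cv_pct 149709.807 4125.522)%"
echo "File creation CV after:    $(cv_pct 122707.414 23848.521)%"
```

For the runs shown here this gives roughly -18.0% for file creation and -12.1% for file removal, with the file creation coefficient of variation rising from about 2.8% to about 19.4%.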

       

      mdtest script:

      #!/bin/bash

      NODES=21
      PPN=16
      PROCS=$(( NODES * PPN ))
      MDT_COUNT=1
      PAUSED=120    # passed to mdtest -p: pre-iteration delay in seconds

      # Unique directory per task (-u) #
      # 0-byte files
      srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -E -T -r -n $(( MDT_COUNT * 1048576 / PROCS )) -u -d /mnt/kjlmo13/pkoutoupis/mdt0/test.$(date +"%Y%m%d.%H%M%S") 2>&1 | tee f_mdt0_0k_ost_uniq.out

      # 32K files: write (-w) and read (-e) 32768 bytes per file
      srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $(( MDT_COUNT * 1048576 / PROCS )) -u -d /mnt/kjlmo13/pkoutoupis/mdt0/test.$(date +"%Y%m%d.%H%M%S") 2>&1 | tee f_mdt0_32k_ost_uniq.out

      # Shared directory #
      # 0-byte files
      srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -E -T -r -n $(( MDT_COUNT * 1048576 / PROCS )) -d /mnt/kjlmo13/pkoutoupis/mdt0/test.$(date +"%Y%m%d.%H%M%S") 2>&1 | tee f_mdt0_0k_ost_shared.out

      # 32K files: write (-w) and read (-e) 32768 bytes per file
      srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $(( MDT_COUNT * 1048576 / PROCS )) -d /mnt/kjlmo13/pkoutoupis/mdt0/test.$(date +"%Y%m%d.%H%M%S") 2>&1 | tee f_mdt0_32k_ost_shared.out
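
To make the `-n` argument in the script concrete, the arithmetic can be worked through as below. This is just an illustration using the values set at the top of the script; the variable names `PER_PROC` and `TOTAL` are introduced here and are not part of the original script.

```shell
#!/bin/bash
NODES=21
PPN=16
MDT_COUNT=1

PROCS=$(( NODES * PPN ))                      # 21 * 16 = 336 tasks
PER_PROC=$(( MDT_COUNT * 1048576 / PROCS ))   # integer division: 3120 items per task
TOTAL=$(( PER_PROC * PROCS ))                 # 3120 * 336 = 1048320 items overall

echo "tasks=$PROCS items_per_task=$PER_PROC total_items=$TOTAL"
```

Because bash arithmetic truncates, each iteration works on 1048320 items rather than the full 1048576.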

       

            People

              wc-triage WC Triage
              koutoupis Petros Koutoupis