Lustre / LU-15736

Commit for LU-14792 introduces client side mdtest file create/remove regression and high std dev

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical

    Description

      While testing 2.15 and comparing it to our 2.12 branch, I observed the following:

      • client side file create regression
      • client side 32K file remove regression
      • the high std dev that we have been experiencing for file creates/removes

       

      A git bisect revealed that this commit is the root cause (LU-14792):

      b9c4dc3c33 LU-14792 llite: enable filesystem-wide default LMV 

       

      More details:

      commit b9c4dc3c33fe87ecaa79a290190524ea21b7fa8a
      Author: Lai Siyao <lai.siyao@whamcloud.com>
      Date:   Mon Jun 21 11:52:01 2021 +0800
       
       
          LU-14792 llite: enable filesystem-wide default LMV
          
          This change includes three parts:
          1. save dir depth to ROOT after lookup on client side.
          2. once space balanced default LMV is set on ROOT, and
             max-inherit/max-inherit-rr is unlimited or not less than directory
             depth, new directory will be created in QOS or roundrobin mode.
          3. set ROOT default LMV max-inherit unlimited, and max-inherit-rr to
             3, and increase the ratio to create subdirectory on local MDT with
             the directory depth to ROOT, so that new directories will be
             created by space usage, and the deeper it's located it's more
             likely to create on local MDTs; and the top 3 layer will be created
             in roundrobin mode if system is balanced.
          
          Set default LMV in mkdir_on_mdt() to make sure its subdirectories are
          created on the same MDT. Add sanity 413d.
          
          Create a test directory on MDT0 for pjdfstest, because cross-MDT
          rename of symlink will migrate symlink to target MDT, which will cause
          inode change (LU-11631).

       

      All commits before this look great. All commits after this exhibit the above symptoms.

      git log on master:

      4668283cd1 LU-14806 o2iblnd: clear fatal error on successful failover
      ---> introduces regression b9c4dc3c33 LU-14792 llite: enable filesystem-wide default LMV
      ---> looks good b7bd4e3422 LU-14621 mdd: fix lock-tx order in mdd_xattr_merge()
      3e04b0fd6c LU-13417 mdd: set default LMV on ROOT
      4e05f3b70b (tag: v2_14_53, tag: 2.14.53) New tag 2.14.53

       

      Testing b7bd4e3422 (before patch):

      SUMMARY rate: (of 5 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation        :     109280.683     100961.554     105818.622       3136.705
         Directory stat            :     410841.732     388930.761     404696.344       7969.689
         Directory removal         :     220323.614     150785.433     181709.288      25249.587
         File creation             :     154658.972     143961.530     149709.807       4125.522
         File stat                 :     700893.743     685670.701     692684.713       6583.956
         File read                 :     271890.920     183951.839     205427.555      33679.583
         File removal              :     147697.301     135354.855     140847.877       4338.359
         Tree creation             :        275.553        170.019        248.261         39.874
         Tree removal              :         99.770         85.408         91.795          5.479
       

       

      Testing b9c4dc3c33 (after patch):

      SUMMARY rate: (of 5 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         Directory creation        :     108068.523     102899.926     105606.738       2004.020
         Directory stat            :     428322.427     395826.681     404486.906      12222.136
         Directory removal         :     236153.570     146400.162     179968.138      32242.271
         File creation             :     156681.218     101096.295     122707.414      23848.521
         File stat                 :     689022.637     677108.079     683537.598       4706.503
         File read                 :     276963.750     184493.079     241172.371      30923.700
         File removal              :     148977.883     100569.361     123812.878      18654.554
         Tree creation             :        280.232          0.994        142.324        123.201
         Tree removal              :         99.952         20.766         57.230         35.277
       

       

      Again, every test run at b9c4dc3c33 or later continues to exhibit the regressions and high deviations noted above. The size varies from run to run, but I can get regressions of 15% or more for both file creates and file removes.
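As a rough cross-check of the two SUMMARY tables above, the sketch below computes the change in mean rate and the coefficient of variation (std dev divided by mean) for the File creation and File removal rows. The helper names `pct_change` and `cv` are illustrative and not part of mdtest; the numbers are copied from the tables. For this pair of runs the mean drops by about 18% for file creation and about 12% for file removal.

```shell
#!/bin/bash
# Percent change of the mean rate and coefficient of variation (std dev / mean),
# using the File creation and File removal rows of the two SUMMARY tables.

# pct_change BEFORE AFTER -> signed percent change, two decimals
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.2f\n", (a - b) / b * 100 }'
}

# cv MEAN STDDEV -> std dev as a percentage of the mean, two decimals
cv() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.2f\n", s / m * 100 }'
}

echo "File creation mean change: $(pct_change 149709.807 122707.414)%"
echo "File removal  mean change: $(pct_change 140847.877 123812.878)%"
echo "File creation CV before/after: $(cv 149709.807 4125.522)% / $(cv 122707.414 23848.521)%"
echo "File removal  CV before/after: $(cv 140847.877 4338.359)% / $(cv 123812.878 18654.554)%"
```

Comparing the coefficient of variation rather than the raw std dev keeps the spread comparable between runs whose means differ.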

       

      mdtest script:

      #!/bin/bash

      NODES=21
      PPN=16
      PROCS=$(( NODES * PPN ))
      MDT_COUNT=1
      PAUSED=120   # delay in seconds before each iteration (mdtest -p)

      MDTEST=~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest
      BASEDIR=/mnt/kjlmo13/pkoutoupis/mdt0
      NFILES=$(( MDT_COUNT * 1048576 / PROCS ))   # items per process (mdtest -n)

      # Unique directory: 0-byte files, then 32K files #
      srun -N $NODES --ntasks-per-node $PPN $MDTEST -v -i 5 -p $PAUSED -C -E -T -r -n $NFILES -u -d $BASEDIR/test.$(date +"%Y%m%d.%H%M%S") |& tee f_mdt0_0k_ost_uniq.out

      srun -N $NODES --ntasks-per-node $PPN $MDTEST -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $NFILES -u -d $BASEDIR/test.$(date +"%Y%m%d.%H%M%S") |& tee f_mdt0_32k_ost_uniq.out

      # Shared directory: 0-byte files, then 32K files #
      srun -N $NODES --ntasks-per-node $PPN $MDTEST -v -i 5 -p $PAUSED -C -E -T -r -n $NFILES -d $BASEDIR/test.$(date +"%Y%m%d.%H%M%S") |& tee f_mdt0_0k_ost_shared.out

      srun -N $NODES --ntasks-per-node $PPN $MDTEST -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $NFILES -d $BASEDIR/test.$(date +"%Y%m%d.%H%M%S") |& tee f_mdt0_32k_ost_shared.out
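For scale, the `-n` argument in the script works out as follows. This is a small sketch using only the values from the script; the variable names `PER_PROC` and `TOTAL` are introduced here for illustration.

```shell
#!/bin/bash
# Values copied from the mdtest script in the description.
NODES=21
PPN=16
MDT_COUNT=1

PROCS=$(( NODES * PPN ))                      # number of MPI tasks
PER_PROC=$(( MDT_COUNT * 1048576 / PROCS ))   # integer division, remainder dropped
TOTAL=$(( PER_PROC * PROCS ))                 # items actually created per phase

echo "tasks=$PROCS items_per_task=$PER_PROC total_items=$TOTAL"
```

Because of the integer division, each run handles slightly fewer than 1048576 items in total.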

       

          Activity

            Unfortunately, I am unable to reproduce the original issue. If/when I do, I will reopen the ticket.

            koutoupis Petros Koutoupis added a comment

            Update: Despite seeing it on two separate systems at separate times, I am now unable to reproduce the issue (please refer to my previous post for details). I have been working with our internal architectural team to understand why, and experimenting with some of their suggestions in the hope of resurfacing the original issue.

            koutoupis Petros Koutoupis added a comment

            Andreas,

            Unfortunately, I am unable to reproduce this client-side issue. I previously observed it on two separate systems, but both have since been reformatted and reconfigured. This suggests that certain server-side conditions must be met in order to experience the regression noted in the description. Before these changes, the issue reproduced so consistently that I was able to root-cause it to:

            b9c4dc3c33 LU-14792 llite: enable filesystem-wide default LMV  

            Anyway, we have been working internally to understand:

            • What server-side conditions could have caused us to observe this regression in the first place?
            • And why are we not able to see it anymore?
            koutoupis Petros Koutoupis added a comment

            Andreas,

            The test directory "/mnt/kjlmo13/pkoutoupis/mdt0" is only tied to a single MDT and yes, there are two on the system.
            Ex. lfs mkdir -i 0 /mnt/kjlmo13/`whoami`/mdt0

            And yes, we have tested the second MDT and it shows worse performance in these areas although I am not entirely sure it is related (yet).

            I can gather the rest of that information and post it shortly.

            koutoupis Petros Koutoupis added a comment

            Petros, just to clarify, does the filesystem only have a single MDT, or is "MDT_COUNT=1" in the test config because the test directory "/mnt/kjlmo13/pkoutoupis/mdt0" is only using MDT0000, but there are actually multiple MDTs in the filesystem? Have you done any performance comparisons with multiple MDTs?

            It would be useful to collect the "lfs getdirstripe" and "lfs getdirstripe -D" output for that directory, and then, during the test run (or with the "-r" unlink phase disabled), check the distribution of the per-thread directories across MDTs.

            adilger Andreas Dilger added a comment
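The data collection requested above could be scripted roughly as follows. This is a sketch under assumptions: it assumes `lfs getdirstripe -m` prints only the MDT index of a directory, and that a search depth of 3 reaches mdtest's per-task directories. `tally_mdt_index` is a helper name introduced here, not a Lustre tool.

```shell
#!/bin/bash
# TESTDIR defaults to the directory from the ticket; override it for another filesystem.
TESTDIR=${TESTDIR:-/mnt/kjlmo13/pkoutoupis/mdt0}

# tally_mdt_index: read one MDT index per line on stdin, print a count per index.
tally_mdt_index() {
    sort -n | uniq -c
}

# Only touch the filesystem when the lfs tool and the directory are present.
if command -v lfs >/dev/null 2>&1 && [ -d "$TESTDIR" ]; then
    lfs getdirstripe "$TESTDIR"       # layout of the directory itself
    lfs getdirstripe -D "$TESTDIR"    # default layout inherited by new subdirectories

    # Assumption: "-m" prints just the MDT index; depth 3 is a guess at how deep
    # mdtest places its per-task directories.
    find "$TESTDIR" -mindepth 1 -maxdepth 3 -type d |
        while IFS= read -r dir; do lfs getdirstripe -m "$dir"; done |
        tally_mdt_index
fi
```

Running this while the test is in progress, or after a run without `-r`, would show whether the per-task directories stay on MDT0000 or spread to the second MDT.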

            There may be some imbalance in the directory creation because it is the clients which decide which MDT to use at mkdir time. They initially start on different MDTs (essentially "NID % MDTCOUNT"), but this can become unsync'd if the clients are doing different things.

            On the flip side, "unique dir" mdtest runs no longer need to manually set the directory layout of the output directory in order to use multiple MDTs. This would be $MDTCOUNT times faster for user applications that do not manually set their own directory layout (i.e. all of them, because users don't know about this).

            adilger Andreas Dilger added a comment
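The "NID % MDTCOUNT" starting point mentioned above can be pictured with a toy model. This is only an illustration: real NIDs are network addresses, so the integers 0..20 below are hypothetical stand-ins for the 21 client nodes, and `start_mdt` is a name made up for this sketch rather than a Lustre function.

```shell
#!/bin/bash
# Toy model only: client NIDs are reduced to hypothetical integers 0..20.
MDTCOUNT=2     # the ticket mentions two MDTs on the system
NCLIENTS=21    # NODES in the mdtest script

# start_mdt NID_NUMBER MDTCOUNT -> index of the MDT a client would start on
start_mdt() {
    echo $(( $1 % $2 ))
}

# Count how many clients start on each MDT.
for (( nid = 0; nid < NCLIENTS; nid++ )); do
    start_mdt "$nid" "$MDTCOUNT"
done | sort -n | uniq -c
```

Under this model the clients start out almost evenly split across the MDTs; the imbalance described in the comment would only appear later, as clients advance through the MDTs at different rates.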

            People

              wc-triage WC Triage
              koutoupis Petros Koutoupis