
imbalanced file creation in 'crush' striped directory

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.16.0
    • Lustre 2.15.0
    • None
    • 3
    • 9223372036854775807

    Description

      Performance regressions in a striped directory were found on 2.15.0 (commit: 4d93fd7) compared with b2_14 (commit: d4b9557).
      Here is the configuration.

      4 x MDS (1 x MDT per MDS)
      4 x OSS (2 x OST per OSS)
      40 x client
      
      [root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
      [root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/
      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun --allow-run-as-root -oversubscribe -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 /work/tools/bin/mdtest -n 2000 -F -i 3 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/
      

      Here are the test results.

      server: version=2.15.0_RC2_22_g4d93fd7
      client: version=2.15.0_RC2_22_g4d93fd7
      
      SUMMARY rate: (of 3 iterations)
         Operation                     Max            Min           Mean        Std Dev
         ---------                     ---            ---           ----        -------
         File creation              103733.203      76276.410      93728.713      15168.101
         File stat                  693152.731     656461.448     671671.960      19132.425
         File read                  259081.462     247951.008     253393.168       5569.308
         File removal               145137.390     142142.699     143590.068       1499.846
         Tree creation                  48.035          1.922         17.475         26.467
         Tree removal                   35.643         15.861         24.045         10.323
      
      server: version=2.14.0_21_gd4b9557
      client: version=2.14.0_21_gd4b9557
      
      SUMMARY rate: (of 3 iterations)
         Operation                     Max            Min           Mean        Std Dev
         ---------                     ---            ---           ----        -------
         File creation              138939.425      81336.388     117014.695      31167.261
         File stat                 1678888.952    1580356.340    1645190.276      56162.463
         File read                  569731.788     528830.155     546121.363      21170.387
         File removal               191837.291     186597.900     188595.661       2832.527
         Tree creation                 120.108          0.986         51.078         61.778
         Tree removal                   40.863         33.203         37.987          4.171
      

      As far as I can tell, this is a server-side regression, because performance with lustre-2.15 clients and lustre-2.14 servers was fine, as shown below.

      server: version=2.14.0_21_gd4b9557
      client: version=2.15.0_RC2_22_g4d93fd7
      
      SUMMARY rate: (of 3 iterations)
         Operation                     Max            Min           Mean        Std Dev
         ---------                     ---            ---           ----        -------
         File creation              132009.360      74074.615     106514.108      29585.056
         File stat                 1570754.679    1457120.401    1532703.082      65457.038
         File read                  563710.286     540228.432     553871.772      12194.544
         File removal               189557.092     186065.253     187536.946       1809.374
         Tree creation                  54.678          1.883         19.576         30.399
         Tree removal                   42.065         41.677         41.875          0.194
      

      The regressions appear to start with the following patch.

          LU-14459 lmv: change default hash type to crush
          
          Change the default hash type to CRUSH to minimize the number
          of directory entries that need to be migrated.
      
      server: version=2.14.51_197_gf269497
      client: version=2.15.0_RC2_22_g4d93fd7
      
      SUMMARY rate: (of 3 iterations)
         Operation                     Max            Min           Mean        Std Dev
         ---------                     ---            ---           ----        -------
         File creation              148072.690      87600.145     127000.919      34149.618
         File stat                 1523849.471    1388808.972    1441253.182      72393.681
         File read                  562840.721     505515.837     538333.864      29552.364
         File removal               197259.873     191117.823     194934.244       3331.372
         Tree creation                 111.869          1.707         39.426         62.755
         Tree removal                   44.113         30.518         36.562          6.922
      
      server: version=2.14.51_198_gbb60caa
      client: version=2.15.0_RC2_22_g4d93fd7
      
      SUMMARY rate: (of 3 iterations)
         Operation                     Max            Min           Mean        Std Dev
         ---------                     ---            ---           ----        -------
         File creation               86531.781      63506.794      72790.003      12142.761
         File stat                  808075.643     746570.771     784071.104      32898.551
         File read                  260064.500     249212.881     256291.924       6135.058
         File removal               159592.539     155603.788     157752.556       2012.224
         Tree creation                 120.060          1.138         41.069         68.410
         Tree removal                   37.780         37.263         37.450          0.287
      V-1: Entering PrintTimestamp...
      

      I found that MDT load balancing does not work well after the patch: files are distributed unevenly across MDTs at create time. For instance, here is a file-creation-only test in a striped directory.

      Before patch (commit:f269497)

      mpirun -np 640 mdtest -n 2000 -F -C -i 1 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/
      
      [root@ec01 ~]# lfs df -i | grep MDT
      exafs-MDT0000_UUID      83050496      320298    82730198   1% /exafs[MDT:0] 
      exafs-MDT0001_UUID      83050496      320283    82730213   1% /exafs[MDT:1] 
      exafs-MDT0002_UUID      83050496      320334    82730162   1% /exafs[MDT:2] 
      exafs-MDT0003_UUID      83050496      320293    82730203   1% /exafs[MDT:3]  

      After patch (commit:bb60caa)

      [root@ec01 ~]# lfs df -i | grep MDT
      exafs-MDT0000_UUID      83050496      192404    82858092   1% /exafs[MDT:0] 
      exafs-MDT0001_UUID      83050496      190698    82859798   1% /exafs[MDT:1] 
      exafs-MDT0002_UUID      83050496      177266    82873230   1% /exafs[MDT:2] 
      exafs-MDT0003_UUID      83050496      720852    82329644   1% /exafs[MDT:3] 
      

      That is why mdtest's numbers were lower: one MDS/MDT (MDT3 in this case) does far more work than the others and takes longer to finish, so mdtest's elapsed time is longer than in the balanced case.
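The skew can be quantified directly from the `lfs df -i` figures above. The following is a minimal sketch, assuming the IUsed column is a reasonable proxy for the number of files created on each MDT (it also counts inodes that existed before the test). The function name `imbalance` is hypothetical.

```python
# Inodes used (IUsed) per MDT, copied from the `lfs df -i` output above.
before_patch = [320298, 320283, 320334, 320293]   # commit f269497
after_patch = [192404, 190698, 177266, 720852]    # commit bb60caa


def imbalance(used):
    """Return (share held by the busiest MDT, busiest / mean) for a list of counts."""
    total = sum(used)
    mean = total / len(used)
    busiest = max(used)
    return busiest / total, busiest / mean


for label, used in (("before", before_patch), ("after", after_patch)):
    share, ratio = imbalance(used)
    print(f"{label}: busiest MDT holds {share:.1%} of inodes, {ratio:.2f}x the mean")
```

With these figures the busiest MDT holds about 25% of the inodes before the patch and about 56% after it, roughly 2.25 times the mean.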

          Activity

            [LU-15720] imbalanced file creation in 'crush' striped directory

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55075
            Subject: LU-15720 dne: add crush2 hash type
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: b9efabbbc24267530ffe66d5e2078449c0b78a41

            Gerrit Updater (gerrit) added the comment above.
            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47015/
            Subject: LU-15720 dne: add crush2 hash type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1ac4b9598ad6e2f94c4c672b4733186364255c6a

            Gerrit Updater (gerrit) added the comment above.

            Test Configurations

            4 x MDS(1xMDT/MDS, 12 CPU cores, 142GB RAM)
            4 x OSS(2xOST/OSS)
            40 x client(16 CPU cores, 96GB RAM)
            IB-HDR100 network
            
            [root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
            [root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/
            

            2.15.0_RC2_39_g42a6d1f (LU-15702 lov: remove lo_trunc_stripeno)

            [root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
            SUMMARY rate: (of 1 iterations)
               Operation                     Max            Min           Mean        Std Dev
               ---------                     ---            ---           ----        -------
               File creation              111117.930     111117.930     111117.930          0.000
               File stat                       0.000          0.000          0.000          0.000
               File read                       0.000          0.000          0.000          0.000
               File removal                    0.000          0.000          0.000          0.000
               Tree creation                 135.910        135.910        135.910          0.000
               Tree removal                    0.000          0.000          0.000          0.000
            
            [root@ec01 ~]# lfs df -i | grep MDT
            exafs-MDT0000_UUID      83050496      192396    82858100   1% /exafs[MDT:0] 
            exafs-MDT0001_UUID      83050496      720843    82329653   1% /exafs[MDT:1] 
            exafs-MDT0002_UUID      83050496      177209    82873287   1% /exafs[MDT:2] 
            exafs-MDT0003_UUID      83050496      190695    82859801   1% /exafs[MDT:3] 
            

            2.15.0_RC2_40_g0090b6f LU-15692 lmv: change default hash back to fnv_1a_64

            [root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
            SUMMARY rate: (of 1 iterations)
               Operation                     Max            Min           Mean        Std Dev
               ---------                     ---            ---           ----        -------
               File creation              150766.352     150766.352     150766.352          0.000
               File stat                       0.000          0.000          0.000          0.000
               File read                       0.000          0.000          0.000          0.000
               File removal                    0.000          0.000          0.000          0.000
               Tree creation                 153.942        153.942        153.942          0.000
               Tree removal                    0.000          0.000          0.000          0.000
            
            [root@ec01 ~]# lfs df -i | grep MDT
            exafs-MDT0000_UUID      83050496      320296    82730200   1% /exafs[MDT:0] 
            exafs-MDT0001_UUID      83050496      320282    82730214   1% /exafs[MDT:1] 
            exafs-MDT0002_UUID      83050496      320285    82730211   1% /exafs[MDT:2] 
            exafs-MDT0003_UUID      83050496      320283    82730213   1% /exafs[MDT:3] 
            

            2.15.0_RC3 + https://review.whamcloud.com/47015

            [root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
            SUMMARY rate: (of 1 iterations)
               Operation                     Max            Min           Mean        Std Dev
               ---------                     ---            ---           ----        -------
               File creation              149746.028     149746.028     149746.028          0.000
               File stat                       0.000          0.000          0.000          0.000
               File read                       0.000          0.000          0.000          0.000
               File removal                    0.000          0.000          0.000          0.000
               Tree creation                  14.232         14.232         14.232          0.000
               Tree removal                    0.000          0.000          0.000          0.000
            
            [root@ec01 ~]# lfs df -i | grep MDT
            exafs-MDT0000_UUID      83050496      320296    82730200   1% /exafs[MDT:0] 
            exafs-MDT0001_UUID      83050496      320289    82730207   1% /exafs[MDT:1] 
            exafs-MDT0002_UUID      83050496      320289    82730207   1% /exafs[MDT:2] 
            exafs-MDT0003_UUID      83050496      320290    82730206   1% /exafs[MDT:3] 
            

            Patch https://review.whamcloud.com/47015 works correctly for mdtest's file naming pattern, and no performance degradation was found.
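The relative cost of the imbalance can be read off the three single-iteration runs above. This is a minimal sketch that uses only the File creation rates reported in this comment. The helper name and the choice of the 2.15.0_RC2_40_g0090b6f run as the baseline are assumptions made for illustration.

```python
# File creation rates (ops/sec) from the three single-iteration runs above.
rates = {
    "2.15.0_RC2_39_g42a6d1f": 111117.930,
    "2.15.0_RC2_40_g0090b6f": 150766.352,
    "2.15.0_RC3 + 47015": 149746.028,
}
# Assumption: use the run with the balanced fnv_1a_64 default as the baseline.
baseline = rates["2.15.0_RC2_40_g0090b6f"]


def relative_to_baseline(rate: float) -> float:
    """Return the rate as a fraction of the baseline creation rate."""
    return rate / baseline


for label, rate in rates.items():
    print(f"{label}: {relative_to_baseline(rate):.1%} of baseline")
```

On these numbers the first run reaches about 74% of the baseline creation rate and the patched run about 99%. With one iteration each, small differences should not be over-read.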

            Shuichi Ihara (sihara) added the comment above.

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47015
            Subject: LU-15720 dne: add crush2 hash type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 13296e6231e07ca615d1e709b7c579b4878e1f16

            Gerrit Updater (gerrit) added the comment above.

            Two things about the "temp" filenames:

            • Files should still be created on the same MDT as the "non-temp" filename. However, if the filename is like "foo.12345678" then the hashing code will only ever use "foo" to determine the "proper" MDT index.
            • The "temp filename" code should not consider suffixes with only numbers as a temp filename. That is specifically to avoid putting all "foo.nnnnnnnn" filenames on the same MDT. However, if there is a mix of numbers and letters (e.g. hex suffix?) then it might be doing the wrong thing.

            The version of mdtest that I'm using locally only has numbers in the suffix:

            # ls mdtest-easy/test-dir.0-0/mdtest_tree.0.0
            file.mdtest.1.127
            file.mdtest.1.128
            file.mdtest.1.129
            file.mdtest.1.13
            file.mdtest.1.130
            file.mdtest.1.131
            file.mdtest.1.132
            

            However, it might be getting confused by the extra '.' in the name if there are more files, like "file.mdtest.1.345678" or "file.mdtest.12.45678"? This would incorrectly fail the "(digit >= suffixlen -1)" check because the second '.' is not counted in digit or upper or lower. There should probably be an additional check that there aren't non-alphanumeric characters in the suffix:

                    if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
                        upper == suffixlen || lower == suffixlen)
                            return false;
                    if (type == LMV_HASH_TYPE_CRUSH2 && digit + upper + lower != suffixlen)
                            return false;
            

            Unfortunately, this changes the hash function subtly, so a new "LMV_HASH_TYPE_CRUSH2" hash type is needed for the new behavior. Otherwise, clients may think they know which MDT a particular filename is on but it would be wrong.
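To illustrate why the extra '.' matters, here is a minimal Python sketch of the suffix classification described above. It is not the Lustre implementation: the function name `looks_like_temp`, the requirement that a '.' immediately precedes the suffix, and the explicit `suffixlen` argument are assumptions made for illustration. Only the two conditions are transcribed from the snippet above.

```python
def looks_like_temp(name: str, suffixlen: int, crush2: bool) -> bool:
    """Hypothetical port of the quoted check; True means 'treated as a temp file'."""
    # Assumption: a temp name is "<prefix>.<suffix>" with a suffix of suffixlen chars.
    if len(name) <= suffixlen + 1 or name[-suffixlen - 1] != ".":
        return False
    suffix = name[-suffixlen:]
    # Count ASCII character classes only, like isdigit()/isupper()/islower() in C.
    digit = sum("0" <= c <= "9" for c in suffix)
    upper = sum("A" <= c <= "Z" for c in suffix)
    lower = sum("a" <= c <= "z" for c in suffix)
    first_is_digit = "0" <= suffix[0] <= "9"
    # Condition quoted above: these suffixes are not temp files.
    if (digit >= suffixlen - 1 and not first_is_digit) or \
            upper == suffixlen or lower == suffixlen:
        return False
    # Additional check proposed for LMV_HASH_TYPE_CRUSH2: reject suffixes that
    # contain anything other than letters and digits (e.g. a second '.').
    if crush2 and digit + upper + lower != suffixlen:
        return False
    return True


for name in ("file.mdtest.1.345678", "file.mdtest.12.45678"):
    print(name,
          looks_like_temp(name, 8, crush2=False),
          looks_like_temp(name, 8, crush2=True))
```

Under these assumptions, and with an 8-character suffix chosen as an example, both names are classified as temp files without the extra check (seven digits out of eight, first character a digit) and as regular files with it.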

            I'm having a hard time convincing myself this code is correct. The comment in the commit message says:

            LU-13481 dne: improve temp file name check
            
            Previously if all but two characters in file name suffix are digit,
            it's not treated as temp file, as is too strict if suffix length is
            short, e.g. 6. Change it to allow one character, and this non-digit
            character should not be the starting character.
            

            Besides the problem with ".-_" characters in the suffix (which would make count of digits/upper/lower too small and fail the suffixlen check), it doesn't look like the isdigit() check is correct:

                    if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
                        upper == suffixlen || lower == suffixlen)
                            return false;
            

            If "digit >= suffixlen -1" (say name = "foo.12345678", digit = 8, suffixlen = 8) this check will fail (and return "true" for the temp filename check) because "1" is a digit. I think this is supposed to be just "isdigit(name[])" (no '!').
            "foo.12345678" should definitely not be considered a temp file, since this is a common case (e.g. file.YYYYMMDD). The chance of a random temp filename having an all-digit 6-character suffix is about 1/40k, and for an 8-character suffix it is less than 1/2M.
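The orders of magnitude quoted above can be checked with a short calculation. This sketch assumes each suffix character is drawn uniformly and independently from the 62 alphanumeric characters (a-z, A-Z, 0-9). That is an assumption about the temp-name generator, not something stated in this ticket, and the function name is hypothetical.

```python
from fractions import Fraction

ALPHABET = 62   # assumed: 26 lowercase + 26 uppercase + 10 digits
DIGITS = 10


def p_all_digits(suffixlen: int) -> Fraction:
    """Probability that a random suffix of the given length is digits only."""
    return Fraction(DIGITS, ALPHABET) ** suffixlen


for n in (6, 8):
    p = p_all_digits(n)
    print(f"{n}-character suffix: about 1 in {round(1 / p):,}")
```

Under this assumption the results are roughly 1 in 57 thousand and 1 in 2.2 million, the same order of magnitude as the figures quoted above.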

            The original code even considered 6/6, 5/6 and 4/6 numbers to not be temp files (i.e. "digit >= suffixlen - 2"), but 4/6 numbers were too easily hit by mktemp. It was supposed to keep 8/8 and 7/8 as non-temp files as long as 7/8 was like "file.f1234567". The problem is that the 8/8 case also fails because the first char is a digit, so "!isdigit(name[namelen-suffixlen])" fails, and it doesn't matter if the "(digit >= suffixlen - 1)" part is true or not because the "false" check is not met, and "true" is returned.

            The proper check should be something like:

                    if (digit == suffixlen || upper == suffixlen || lower == suffixlen ||
                        (digit == suffixlen - 1 && !isdigit(name[namelen - suffixlen])))
                            return false;
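The difference between the LU-13481 condition and the proposed one can be seen on a few sample suffixes. This is a minimal Python sketch, assuming the counts `digit`, `upper` and `lower` are taken over the trailing `suffixlen` characters. The helper names are hypothetical, and each function models only the "return false" (not a temp file) condition, not the surrounding Lustre code.

```python
def _is_digit(c: str) -> bool:
    """ASCII-only digit test, matching isdigit() in the C locale."""
    return "0" <= c <= "9"


def _counts(suffix: str):
    """Count ASCII digits, uppercase and lowercase letters in the suffix."""
    digit = sum(_is_digit(c) for c in suffix)
    upper = sum("A" <= c <= "Z" for c in suffix)
    lower = sum("a" <= c <= "z" for c in suffix)
    return digit, upper, lower


def not_temp_current(suffix: str) -> bool:
    """Condition from LU-13481 as quoted above."""
    n = len(suffix)
    digit, upper, lower = _counts(suffix)
    return (digit >= n - 1 and not _is_digit(suffix[0])) or upper == n or lower == n


def not_temp_proposed(suffix: str) -> bool:
    """Condition proposed above: an all-digit suffix is never a temp file."""
    n = len(suffix)
    digit, upper, lower = _counts(suffix)
    return (digit == n or upper == n or lower == n or
            (digit == n - 1 and not _is_digit(suffix[0])))


for suffix in ("12345678", "f1234567", "1234567f", "aB3xY9Qz"):
    print(suffix, not_temp_current(suffix), not_temp_proposed(suffix))
```

In this sketch the two conditions differ only for the all-digit suffix "12345678": the current condition does not exclude it from temp-file handling, while the proposed one does. The other three samples are treated identically.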
            
            Andreas Dilger (adilger) added the comment above.

            People

              adilger Andreas Dilger
              adilger Andreas Dilger