[LU-15720] imbalanced file creation in 'crush' striped directory Created: 05/Apr/22  Updated: 30/Sep/22  Resolved: 11/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Blocker
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Cloners
Clones LU-15692 performance regressions for files in ... Resolved
Related
is related to LU-15479 sanity: test_316 failed lfs mv: /mnt/... Open
is related to LU-15546 Shared Directory File Creates regress... Resolved
is related to LU-13481 sanity test_33h: MDT index mismatch 5... Resolved
is related to LU-11025 DNE3: directory restripe Resolved
is related to LU-14459 DNE3: directory auto split during create Open
is related to LU-16198 sanity test_33hh: MDT index match 49/... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Performance regressions in a striped directory on 2.15.0 (commit: 4d93fd7) were found against b2_14 (commit: d4b9557).
Here is the configuration.

4 x MDS (1 x MDT per MDS)
4 x OSS (2 x OST per OSS)
40 x client

[root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun --allow-run-as-root -oversubscribe -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 /work/tools/bin/mdtest -n 2000 -F -i 3 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/

Here are the test results.

server: version=2.15.0_RC2_22_g4d93fd7
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              103733.203      76276.410      93728.713      15168.101
   File stat                  693152.731     656461.448     671671.960      19132.425
   File read                  259081.462     247951.008     253393.168       5569.308
   File removal               145137.390     142142.699     143590.068       1499.846
   Tree creation                  48.035          1.922         17.475         26.467
   Tree removal                   35.643         15.861         24.045         10.323
server: version=2.14.0_21_gd4b9557
client: version=2.14.0_21_gd4b9557

SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              138939.425      81336.388     117014.695      31167.261
   File stat                 1678888.952    1580356.340    1645190.276      56162.463
   File read                  569731.788     528830.155     546121.363      21170.387
   File removal               191837.291     186597.900     188595.661       2832.527
   Tree creation                 120.108          0.986         51.078         61.778
   Tree removal                   40.863         33.203         37.987          4.171

As far as I observed, this seems to be a server-side regression, since performance with lustre-2.15 clients + lustre-2.14 servers was fine, as shown below.

server: version=2.14.0_21_gd4b9557
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              132009.360      74074.615     106514.108      29585.056
   File stat                 1570754.679    1457120.401    1532703.082      65457.038
   File read                  563710.286     540228.432     553871.772      12194.544
   File removal               189557.092     186065.253     187536.946       1809.374
   Tree creation                  54.678          1.883         19.576         30.399
   Tree removal                   42.065         41.677         41.875          0.194

It seems the regressions started with the following patch.

    LU-14459 lmv: change default hash type to crush
    
    Change the default hash type to CRUSH to minimize the number
    of directory entries that need to be migrated.
server: version=2.14.51_197_gf269497
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              148072.690      87600.145     127000.919      34149.618
   File stat                 1523849.471    1388808.972    1441253.182      72393.681
   File read                  562840.721     505515.837     538333.864      29552.364
   File removal               197259.873     191117.823     194934.244       3331.372
   Tree creation                 111.869          1.707         39.426         62.755
   Tree removal                   44.113         30.518         36.562          6.922
server: version=2.14.51_198_gbb60caa
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation               86531.781      63506.794      72790.003      12142.761
   File stat                  808075.643     746570.771     784071.104      32898.551
   File read                  260064.500     249212.881     256291.924       6135.058
   File removal               159592.539     155603.788     157752.556       2012.224
   Tree creation                 120.060          1.138         41.069         68.410
   Tree removal                   37.780         37.263         37.450          0.287

I just found that MDT load balancing does not seem to work well after the patch: file distribution across MDTs is unbalanced at create time. For instance, here is a file-creation-only test in a striped directory.

Before patch (commit:f269497)

mpirun -np 640 mdtest -n 2000 -F -C -i 1 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID      83050496      320298    82730198   1% /exafs[MDT:0] 
exafs-MDT0001_UUID      83050496      320283    82730213   1% /exafs[MDT:1] 
exafs-MDT0002_UUID      83050496      320334    82730162   1% /exafs[MDT:2] 
exafs-MDT0003_UUID      83050496      320293    82730203   1% /exafs[MDT:3]  

After patch (commit:bb60caa)

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID      83050496      192404    82858092   1% /exafs[MDT:0] 
exafs-MDT0001_UUID      83050496      190698    82859798   1% /exafs[MDT:1] 
exafs-MDT0002_UUID      83050496      177266    82873230   1% /exafs[MDT:2] 
exafs-MDT0003_UUID      83050496      720852    82329644   1% /exafs[MDT:3] 

That's why mdtest's numbers were slower: one MDS/MDT (MDT3 in this case) keeps working longer than the others, so mdtest's elapsed time is longer than in the balanced case.



 Comments   
Comment by Andreas Dilger [ 05/Apr/22 ]

Two things about the "temp" filenames:

  • files should still be created on the same MDT as the "non-temp" filename. However, if the filename is like "foo.12345678" then the hashing code will only ever use "foo" to determine the "proper" MDT index.
  • the "temp filename" code should not consider suffixes containing only numbers to be a temp filename. That is specifically to avoid putting all "foo.nnnnnnnn" filenames on the same MDT. However, if there is a mix of numbers and letters (e.g. a hex suffix?) then it might be doing the wrong thing.

The version of mdtest that I'm using locally only has numbers in the suffix:

# ls mdtest-easy/test-dir.0-0/mdtest_tree.0.0
file.mdtest.1.127
file.mdtest.1.128
file.mdtest.1.129
file.mdtest.1.13
file.mdtest.1.130
file.mdtest.1.131
file.mdtest.1.132

However, it might be getting confused by the extra '.' in the name if there are more files, like "file.mdtest.1.345678" or "file.mdtest.12.45678"? This would incorrectly fail the "(digit >= suffixlen - 1)" check because the second '.' is not counted in digit or upper or lower. There should probably be an additional check that there aren't non-alphanumeric characters in the suffix:

        if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
            upper == suffixlen || lower == suffixlen)
                return false;
        if (type == LMV_HASH_TYPE_CRUSH2 && digit + upper + lower != suffixlen)
                return false;

Unfortunately, this changes the hash function subtly, so a new "LMV_HASH_TYPE_CRUSH2" hash type is needed for the new behavior. Otherwise, clients may think they know which MDT a particular filename is on but it would be wrong.

I'm having a hard time convincing myself this code is correct. The comment in the commit message says:

LU-13481 dne: improve temp file name check

Previously if all but two characters in file name suffix are digit,
it's not treated as temp file, as is too strict if suffix length is
short, e.g. 6. Change it to allow one character, and this non-digit
character should not be the starting character.

Besides the problem with ".-_" characters in the suffix (which would make the count of digits/upper/lower too small and fail the suffixlen check), it doesn't look like the isdigit() check is correct:

        if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
            upper == suffixlen || lower == suffixlen)
                return false;

If "digit >= suffixlen -1" (say name = "foo.12345678", digit = 8, suffixlen = 8) this check will fail (and return "true" for the temp filename check) because "1" is a digit. I think this is supposed to be just "isdigit(name[])" (no '!').
Definitely "foo.12345678" should not be considered a temp file, since this is a common case (eg file.YYYYMMDD). The chance of a 6-number temp file is 1/40k, and an 8-number temp file being hit randomly is less than 1/2M.

The original code even considered 6/6, 5/6 and 4/6 numbers to not be temp files (i.e. "digit >= suffixlen - 2"), but 4/6 numbers was too easily hit by mktemp. It was supposed to keep 8/8 and 7/8 as non-temp files as long as 7/8 was like "file.f1234567". The problem is that the 8/8 case also fails because the first char is a digit, so "!isdigit(name[namelen - suffixlen])" fails, and it doesn't matter whether the "(digit >= suffixlen - 1)" part is true or not, because the "false" check is not met and "true" is returned.

The proper check should be something like:

        if (digit == suffixlen || upper == suffixlen || lower == suffixlen ||
            (digit == suffixlen - 1 && !isdigit(name[namelen - suffixlen])))
                return false;
Comment by Gerrit Updater [ 08/Apr/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47015
Subject: LU-15720 dne: add crush2 hash type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13296e6231e07ca615d1e709b7c579b4878e1f16

Comment by Shuichi Ihara [ 13/Apr/22 ]

Test Configurations

1 x MDS(1xMDT, 12 CPU cores, 142GB RAM)
4 x OSS(2xOST/OSS)
40 x client(16 CPU cores, 96GB RAM)
IB-HDR100 network

[root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/

2.15.0_RC2_39_g42a6d1f (LU-15702 lov: remove lo_trunc_stripeno)

[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
SUMMARY rate: (of 1 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              111117.930     111117.930     111117.930          0.000
   File stat                       0.000          0.000          0.000          0.000
   File read                       0.000          0.000          0.000          0.000
   File removal                    0.000          0.000          0.000          0.000
   Tree creation                 135.910        135.910        135.910          0.000
   Tree removal                    0.000          0.000          0.000          0.000

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID      83050496      192396    82858100   1% /exafs[MDT:0] 
exafs-MDT0001_UUID      83050496      720843    82329653   1% /exafs[MDT:1] 
exafs-MDT0002_UUID      83050496      177209    82873287   1% /exafs[MDT:2] 
exafs-MDT0003_UUID      83050496      190695    82859801   1% /exafs[MDT:3] 

2.15.0_RC2_40_g0090b6f LU-15692 lmv: change default hash back to fnv_1a_64

[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
SUMMARY rate: (of 1 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              150766.352     150766.352     150766.352          0.000
   File stat                       0.000          0.000          0.000          0.000
   File read                       0.000          0.000          0.000          0.000
   File removal                    0.000          0.000          0.000          0.000
   Tree creation                 153.942        153.942        153.942          0.000
   Tree removal                    0.000          0.000          0.000          0.000

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID      83050496      320296    82730200   1% /exafs[MDT:0] 
exafs-MDT0001_UUID      83050496      320282    82730214   1% /exafs[MDT:1] 
exafs-MDT0002_UUID      83050496      320285    82730211   1% /exafs[MDT:2] 
exafs-MDT0003_UUID      83050496      320283    82730213   1% /exafs[MDT:3] 

2.15.0_RC3 + https://review.whamcloud.com/47015

[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
SUMMARY rate: (of 1 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation              149746.028     149746.028     149746.028          0.000
   File stat                       0.000          0.000          0.000          0.000
   File read                       0.000          0.000          0.000          0.000
   File removal                    0.000          0.000          0.000          0.000
   Tree creation                  14.232         14.232         14.232          0.000
   Tree removal                    0.000          0.000          0.000          0.000

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID      83050496      320296    82730200   1% /exafs[MDT:0] 
exafs-MDT0001_UUID      83050496      320289    82730207   1% /exafs[MDT:1] 
exafs-MDT0002_UUID      83050496      320289    82730207   1% /exafs[MDT:2] 
exafs-MDT0003_UUID      83050496      320290    82730206   1% /exafs[MDT:3] 

Patch https://review.whamcloud.com/47015 handles mdtest's file naming pattern correctly, and no performance degradation was found.

Comment by Gerrit Updater [ 11/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47015/
Subject: LU-15720 dne: add crush2 hash type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1ac4b9598ad6e2f94c4c672b4733186364255c6a

Comment by Peter Jones [ 11/Jul/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:20:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.