[LU-15692] performance regressions for files in stripe directory Created: 25/Mar/22 Updated: 14/Sep/22 Resolved: 02/Apr/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Shuichi Ihara | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Performance regressions in a striped directory were found on 2.15.0 (commit 4d93fd7) against b2_14 (commit d4b9557).

4 x MDS (1 x MDT per MDS)
4 x OSS (2 x OST per OSS)
40 x client

[root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun --allow-run-as-root -oversubscribe -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 /work/tools/bin/mdtest -n 2000 -F -i 3 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/

Here are the test results.

server: version=2.15.0_RC2_22_g4d93fd7
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation          Max            Min            Mean           Std Dev
   ---------          ---            ---            ----           -------
   File creation      103733.203     76276.410      93728.713      15168.101
   File stat          693152.731     656461.448     671671.960     19132.425
   File read          259081.462     247951.008     253393.168     5569.308
   File removal       145137.390     142142.699     143590.068     1499.846
   Tree creation      48.035         1.922          17.475         26.467
   Tree removal       35.643         15.861         24.045         10.323

server: version=2.14.0_21_gd4b9557
client: version=2.14.0_21_gd4b9557

SUMMARY rate: (of 3 iterations)
   Operation          Max            Min            Mean           Std Dev
   ---------          ---            ---            ----           -------
   File creation      138939.425     81336.388      117014.695     31167.261
   File stat          1678888.952    1580356.340    1645190.276    56162.463
   File read          569731.788     528830.155     546121.363     21170.387
   File removal       191837.291     186597.900     188595.661     2832.527
   Tree creation      120.108        0.986          51.078         61.778
   Tree removal       40.863         33.203         37.987         4.171

As far as I have observed, this appears to be a server-side regression, because performance with lustre-2.15 clients against lustre-2.14 servers was fine:

server: version=2.14.0_21_gd4b9557
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation          Max            Min            Mean           Std Dev
   ---------          ---            ---            ----           -------
   File creation      132009.360     74074.615      106514.108     29585.056
   File stat          1570754.679    1457120.401    1532703.082    65457.038
   File read          563710.286     540228.432     553871.772     12194.544
   File removal       189557.092     186065.253     187536.946     1809.374
   Tree creation      54.678         1.883          19.576         30.399
   Tree removal       42.065         41.677         41.875         0.194

I am running 'git bisect' and can hopefully find the commit where the regression started soon. |
| Comments |
| Comment by Andreas Dilger [ 25/Mar/22 ] |
|
Tentatively adding 2.15.0 fix version for tracking until we understand this better. |
| Comment by Shuichi Ihara [ 25/Mar/22 ] |
|
It seems the following patch is where the regression started:

    LU-14459 lmv: change default hash type to crush

    Change the default hash type to CRUSH to minimize the number
    of directory entries that need to be migrated.

server: version=2.14.51_197_gf269497
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation          Max            Min            Mean           Std Dev
   ---------          ---            ---            ----           -------
   File creation      148072.690     87600.145      127000.919     34149.618
   File stat          1523849.471    1388808.972    1441253.182    72393.681
   File read          562840.721     505515.837     538333.864     29552.364
   File removal       197259.873     191117.823     194934.244     3331.372
   Tree creation      111.869        1.707          39.426         62.755
   Tree removal       44.113         30.518         36.562         6.922

server: version=2.14.51_198_gbb60caa
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
   Operation          Max            Min            Mean           Std Dev
   ---------          ---            ---            ----           -------
   File creation      86531.781      63506.794      72790.003      12142.761
   File stat          808075.643     746570.771     784071.104     32898.551
   File read          260064.500     249212.881     256291.924     6135.058
   File removal       159592.539     155603.788     157752.556     2012.224
   Tree creation      120.060        1.138          41.069         68.410
   Tree removal       37.780         37.263         37.450         0.287

The Lustre client version did not change, only the server version did. Does it make sense that patch https://review.whamcloud.com/#/c/43684/ impacts the server as well? |
| Comment by Andreas Dilger [ 25/Mar/22 ] |
|
It looks like LMV_HASH_TYPE_DEFAULT is used on both the client and server, so patch https://review.whamcloud.com/43684 "LU-14459 lmv: change default hash type to crush" would definitely affect behavior of the server if the client is not specifying the directory hash type (which it is not). |
| Comment by Lai Siyao [ 25/Mar/22 ] |
|
Ihara, can you collect a flamegraph on the MDS for both cases? |
| Comment by Andreas Dilger [ 25/Mar/22 ] |
|
Shuichi, do you have any idea why this is slower on the server? Higher CPU usage, uneven file distribution across MDTs, something else? If MDS CPU usage is high, would you be able to collect a flame graph from before/after the patch? The CRUSH hash is a bit more CPU intensive, but I wouldn't expect it to hurt performance by 50%; it would be good to know where the performance is lost so it can possibly be optimized. |
| Comment by Shuichi Ihara [ 25/Mar/22 ] |
|
Before collecting a flamegraph or other detailed information, I found that MDT load balancing does not seem to be working well after the patch. File distribution across MDTs is unbalanced at create time.

Before the patch (commit:f269497):

mpirun -np 640 mdtest -n 2000 -F -C -i 1 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID     83050496     320298     82730198   1% /exafs[MDT:0]
exafs-MDT0001_UUID     83050496     320283     82730213   1% /exafs[MDT:1]
exafs-MDT0002_UUID     83050496     320334     82730162   1% /exafs[MDT:2]
exafs-MDT0003_UUID     83050496     320293     82730203   1% /exafs[MDT:3]

After the patch (commit:bb60caa):

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID     83050496     192404     82858092   1% /exafs[MDT:0]
exafs-MDT0001_UUID     83050496     190698     82859798   1% /exafs[MDT:1]
exafs-MDT0002_UUID     83050496     177266     82873230   1% /exafs[MDT:2]
exafs-MDT0003_UUID     83050496     720852     82329644   1% /exafs[MDT:3]

That's why the mdtest numbers were slower: one MDS/MDT (MDT3 in this case) is working longer than the others, so mdtest's elapsed time ends up longer than in the balanced case. |
| Comment by Andreas Dilger [ 28/Mar/22 ] |
|
Lai, can you please look into the CRUSH hash to see if it is causing the imbalance, and why. It would be possible to just create a bunch of names in the mdtest format and count the distribution of files. One thing that might be happening is that the mdtest filename format may be triggering the "no rename across MDTs" heuristic for temp files, if the names end with a suffix of the form .xxxxxxxx?
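As a rough offline version of the counting suggested above, here is a hypothetical standalone sketch (not Lustre code): it generates names in the mdtest style shown later in this ticket ("file.mdtest.<rank>.<n>") and buckets them with a generic 64-bit FNV-1a hash modulo the stripe count. The hash and the modulo mapping are assumptions for illustration and are not claimed to match Lustre's fnv_1a_64 or CRUSH placement exactly; the point is only to show how a name-distribution check could be scripted.

#include <stdint.h>
#include <stdio.h>

/* Generic FNV-1a 64-bit hash (illustrative; not necessarily identical
 * to Lustre's fnv_1a_64 name hash). */
static uint64_t fnv1a_64(const char *s)
{
	uint64_t hash = 0xcbf29ce484222325ULL;

	while (*s) {
		hash ^= (unsigned char)*s++;
		hash *= 0x100000001b3ULL;
	}
	return hash;
}

int main(void)
{
	enum { STRIPES = 4 };              /* 4 MDTs in this ticket */
	unsigned long count[STRIPES] = { 0 };
	char name[64];
	int rank, n, i;

	/* 640 MPI ranks x 2000 files each, matching the mdtest runs above */
	for (rank = 0; rank < 640; rank++)
		for (n = 0; n < 2000; n++) {
			snprintf(name, sizeof(name),
				 "file.mdtest.%d.%d", rank, n);
			count[fnv1a_64(name) % STRIPES]++;
		}

	for (i = 0; i < STRIPES; i++)
		printf("stripe %d: %lu names\n", i, count[i]);
	return 0;
}

If a generic name hash spreads these names evenly here but the real filesystem still lands most of them on one MDT, that would point at the temp-file heuristic rather than the hash itself. |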
| Comment by Andreas Dilger [ 29/Mar/22 ] |
|
Shuichi, it should be possible to see if changing the hash function back to the old one solves the problem. Use:

# lfs mkdir -H fnv_1a_64 /exafs/d0/d1/d2/mdt_stripe
# lfs setdirstripe -H fnv_1a_64 -D /exafs/d0/d1/d2/mdt_stripe

It is easy enough to make a patch that changes the default hash function back to fnv_1a_64 for 2.15.0, but it would be preferable to understand why the performance/balance is so bad, because this hash function should be better for directory migration/restriping.

Also, did you give patch https://review.whamcloud.com/46696 " |
| Comment by Gerrit Updater [ 30/Mar/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46950 |
| Comment by Lai Siyao [ 30/Mar/22 ] |
|
If the testing system is brand new, /exafs/d0/d1/d2/mdt_stripe/ will be created on MDT3 (because max-inherit-rr is 3 by default: d0 is created on MDT0, d1 on MDT1, d2 on MDT2), and the new directories created by mdtest will all be located on MDT3 (master inode), because at this level max-inherit-rr no longer applies. This explains why there are far more inodes used on MDT3. But I don't understand why the inodes were used equally before this patch. Can you count the number of directories on each MDT? You can get it via:

for i in 0 1 2 3; do lfs getstripe -m /exafs/d0/d1/d2/mdt_stripe/* | grep -c $i; done |
| Comment by Gerrit Updater [ 30/Mar/22 ] |
|
|
| Comment by Shuichi Ihara [ 30/Mar/22 ] |
|
Andreas, "fnv_1a_64" worked fine below. [root@ec01 ~]# lfs mkdir -H fnv_1a_64 /exafs/d0/d1/d2/mdt_stripe [root@ec01 ~]# lfs setdirstripe -c 4 -H fnv_1a_64 -D /exafs/d0/d1/d2/mdt_stripe [root@ec01 ~]# lfs getdirstripe /exafs/d0/d1/d2/mdt_stripe lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none [root@ec01 ~]# lfs getdirstripe -D /exafs/d0/d1/d2/mdt_stripe lmv_stripe_count: 4 lmv_stripe_offset: -1 lmv_hash_type: fnv_1a_64 lmv_max_inherit: 3 lmv_max_inherit_rr: 0 [root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C SUMMARY rate: (of 1 iterations) Operation Max Min Mean Std Dev --------- --- --- ---- ------- File creation 146900.918 146900.918 146900.918 0.000 File stat 0.000 0.000 0.000 0.000 File read 0.000 0.000 0.000 0.000 File removal 0.000 0.000 0.000 0.000 Tree creation 34.638 34.638 34.638 0.000 Tree removal 0.000 0.000 0.000 0.000 [root@ec01 ~]# lfs df -i | grep MDT exafs-MDT0000_UUID 83050496 320295 82730201 1% /exafs[MDT:0] exafs-MDT0001_UUID 83050496 320290 82730206 1% /exafs[MDT:1] exafs-MDT0002_UUID 83050496 320285 82730211 1% /exafs[MDT:2] exafs-MDT0003_UUID 83050496 320284 82730212 1% /exafs[MDT:3] When I did test "fnv_1a_64" before to check, it didn't work becouse I did run another "lfs setdirstirpe -D" after "lfs setdirstripe -H fnv_1a_64". e.g. [root@ec01 ~]# lfs mkdir -H fnv_1a_64 /exafs/d0/d1/d2/mdt_stripe [root@ec01 ~]# lfs setdirstripe -H fnv_1a_64 -D /exafs/d0/d1/d2/mdt_stripe [root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe [root@ec01 ~]# lfs getdirstripe -D /exafs/d0/d1/d2/mdt_stripe lmv_stripe_count: 4 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 3 lmv_max_inherit_rr: 0 stripe count is configured properly, but hash_type changed back to default CRUSH algorithm.
let me test patch too. |
| Comment by Shuichi Ihara [ 30/Mar/22 ] |
Lai, /exafs/d0/d1/d2/mdt_stripe/ is a directory striped across all MDTs. I don't think inherit-rr is relevant in this case, no?

[root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs getdirstripe -D /exafs/d0/d1/d2/mdt_stripe
lmv_stripe_count: 4
lmv_stripe_offset: -1
lmv_hash_type: none
lmv_max_inherit: 3
lmv_max_inherit_rr: 0 |
| Comment by Lai Siyao [ 31/Mar/22 ] |
[root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/

According to the description, mdt_stripe is created with "mkdir", so it should be a plain directory, and it has a default LMV to stripe to all MDTs.
| Comment by Andreas Dilger [ 31/Mar/22 ] |
|
I don't think the MDT/striping of /exafs/d0/d1/d2/mdt_stripe/ itself is so critical, because it only holds another level of subdirectories, like /exafs/d0/d1/d2/mdt_stripe/test-dir.0-0/mdtest_tree.M.N/ (for different client nodes and MPI ranks), and those should be striped across all MDTs.

Rather than checking the MDT index of specific directories, it would probably be better to check all subdirectories like:

# lfs find /exafs/d0/d1/d2/mdt_stripe -type d | xargs lfs getdirstripe -m | sort | uniq -c

That would at least print the starting MDT index for each directory, and presumably all of those directories should be striped over all 4 MDTs. This would be useful for checking, in the crush hash test, whether the imbalance is a directory imbalance (e.g. 3/4 of subdirectories are created on MDT0003), or a file imbalance within a single directory (e.g. 3/4 of files in every directory are created on MDT0003).

The MDT distribution of files within each directory could probably be checked with something like:

# find /exafs/d0/d1/d2/mdt_stripe -mindepth 2 -maxdepth 2 -type d |
while read D; do
	echo $D
	find $D -type f | xargs lfs getstripe -m | sort | uniq -c
done

though this may take a few minutes if there are a lot of files. |
| Comment by Shuichi Ihara [ 31/Mar/22 ] |
|
This is mdtest on a single shared directory, not a unique-dir operation, so all files are created in one subdirectory.

[root@ec01 ~]# lfs find /exafs/d0/d1/d2/mdt_stripe -type d
/exafs/d0/d1/d2/mdt_stripe
[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID     83050496     720861     82329635   1% /exafs[MDT:0]
exafs-MDT0001_UUID     83050496     192389     82858107   1% /exafs[MDT:1]
exafs-MDT0002_UUID     83050496     190695     82859801   1% /exafs[MDT:2]
exafs-MDT0003_UUID     83050496     177213     82873283   1% /exafs[MDT:3]
[root@ec01 ~]# lfs find /exafs/d0/d1/d2/mdt_stripe -type d
/exafs/d0/d1/d2/mdt_stripe
/exafs/d0/d1/d2/mdt_stripe/test-dir.0-0
/exafs/d0/d1/d2/mdt_stripe/test-dir.0-0/mdtest_tree.0
[root@ec01 ~]# lfs find /exafs/d0/d1/d2/mdt_stripe -type d | xargs lfs getdirstripe -m
1
1
1

Also, the unbalanced distribution is not associated with a particular MDT:

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID     83050496        301     83050195   1% /exafs[MDT:0]
exafs-MDT0001_UUID     83050496        281     83050215   1% /exafs[MDT:1]
exafs-MDT0002_UUID     83050496        281     83050215   1% /exafs[MDT:2]
exafs-MDT0003_UUID     83050496        280     83050216   1% /exafs[MDT:3]
[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID     83050496     190707     82859789   1% /exafs[MDT:0]
exafs-MDT0001_UUID     83050496     192389     82858107   1% /exafs[MDT:1]
exafs-MDT0002_UUID     83050496     177215     82873281   1% /exafs[MDT:2]
exafs-MDT0003_UUID     83050496     720848     82329648   1% /exafs[MDT:3]
[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -v -d /exafs/d0/d1/d2/mdt_stripe/ -r
[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C
[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID     83050496     720861     82329635   1% /exafs[MDT:0]
exafs-MDT0001_UUID     83050496     192389     82858107   1% /exafs[MDT:1]
exafs-MDT0002_UUID     83050496     190703     82859793   1% /exafs[MDT:2]
exafs-MDT0003_UUID     83050496     177214     82873282   1% /exafs[MDT:3] |
| Comment by Lai Siyao [ 31/Mar/22 ] |
|
This should be because the files are treated as temp files, which are created on the same MDT. Filenames with an all-digit suffix (if the suffix length after "." is 8) are treated as temp files with the crush hash type. Ihara, can you list some typical filenames in the bottom-level directory? |
| Comment by Andreas Dilger [ 31/Mar/22 ] |
|
Lai - two things about the "temp" filenames:

The version of mdtest that I'm using locally only has numbers in the suffix:

# ls mdtest-easy/test-dir.0-0/mdtest_tree.0.0
file.mdtest.1.127  file.mdtest.1.128  file.mdtest.1.129  file.mdtest.1.13  file.mdtest.1.130  file.mdtest.1.131  file.mdtest.1.132

However, it might be getting confused by the extra '.' in the name if there are more files, like "file.mdtest.1.345678" or "file.mdtest.12.45678"? This would incorrectly fail the "(digit >= suffixlen - 1)" check because the second '.' is not counted in digit, upper, or lower. There should probably be an additional check that there are no non-alphanumeric characters in the suffix:

	if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
	    upper == suffixlen || lower == suffixlen)
		return false;
	if (type == LMV_HASH_TYPE_CRUSH2 && digit + upper + lower != suffixlen)
		return false;

Unfortunately, this changes the hash function subtly, so a new "LMV_HASH_TYPE_CRUSH2" hash type is needed for the new behavior. Otherwise, clients may think they know which MDT a particular filename is on, but they would be wrong.
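For reference, here is a minimal standalone sketch of that suffix check, assuming suffixlen = 8 and that digit/upper/lower count only the last 8 characters of the name. The helper name looks_like_temp_suffix() is hypothetical; this is a userspace simulation rather than the actual Lustre function, and the real code may walk the suffix differently. It is only meant to make the behaviour on names like "foo.12345678" and "file.mdtest.1.345678" easy to try out.

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace re-creation of the quoted check (CRUSH case,
 * i.e. without the extra LMV_HASH_TYPE_CRUSH2 clause). */
static bool looks_like_temp_suffix(const char *name)
{
	const int suffixlen = 8;
	int namelen = strlen(name);
	int digit = 0, upper = 0, lower = 0, i;

	if (namelen <= suffixlen)
		return false;

	/* count character classes in the last 8 characters */
	for (i = namelen - suffixlen; i < namelen; i++) {
		if (isdigit((unsigned char)name[i]))
			digit++;
		else if (isupper((unsigned char)name[i]))
			upper++;
		else if (islower((unsigned char)name[i]))
			lower++;
	}

	/* the check quoted in the comment above */
	if ((digit >= suffixlen - 1 && !isdigit((unsigned char)name[namelen - suffixlen])) ||
	    upper == suffixlen || lower == suffixlen)
		return false;

	return true;
}

int main(void)
{
	const char *names[] = {
		"file.mdtest.1.127",
		"file.mdtest.1.345678",
		"file.mdtest.12.45678",
		"foo.12345678",
		"foo.a1234567",
	};
	unsigned int i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		printf("%-22s temp=%d\n", names[i],
		       (int)looks_like_temp_suffix(names[i]));
	return 0;
}

Compiling and running this (e.g. gcc -o temp_check temp_check.c && ./temp_check) should make it easy to see which of the mdtest-style names fall into the temp-file path under this reading of the check. |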
| Comment by Andreas Dilger [ 31/Mar/22 ] |
|
I'm working on a patch to add the CRUSH2 hash, but for now we are planning to land patch https://review.whamcloud.com/46950 " |
| Comment by Lai Siyao [ 01/Apr/22 ] |
|
Andreas, how about retrying the lookup on the client? If the filename looks like a temp file in the above format, and it is not found on the MDT derived from the base name (e.g. "foo"), then look it up again with the full-name hash? |
| Comment by Andreas Dilger [ 02/Apr/22 ] |
That would only work for new clients that are patched to repeat the lookup, and at that point the clients could also be patched to understand CRUSH2. Older clients will not repeat the lookup, so they would not be able to find the filename on the MDT where it should be located (based on the older CRUSH hash). That might also cause the same filename to be created twice in the same directory (old clients on one MDT, new clients on a different MDT). Having a different hash function that old clients do not understand should result in them looking for the filename on all MDTs, and prevent them from creating new files in that directory (return -EBADFD).

I'm having a hard time convincing myself this code is correct. The commit message says:

    LU-13481 dne: improve temp file name check

    Previously if all but two characters in file name suffix are digit,
    it's not treated as temp file, as is too strict if suffix length is
    short, e.g. 6. Change it to allow one character, and this non-digit
    character should not be the starting character.

Besides the problem with ".-_" characters in the suffix (which would make the count of digits/upper/lower too small and fail the suffixlen check), it doesn't look like the isdigit() check is correct:
if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
upper == suffixlen || lower == suffixlen)
return false;
If "digit >= suffixlen -1" (say name = "foo.12345678", digit = 8, suffixlen = 8) this check will fail (and return "true" for the temp filename check) because "1" is a digit. I think this is supposed to be just "isdigit(name[])" (no '!'). |
| Comment by Lai Siyao [ 02/Apr/22 ] |
|
It's in accordance with the commit message, isn't it? The comment says "foo.a1234567" shouldn't be treated as a temp file, but "foo.12a34567" and "foo.12345678" should be. IMO "foo.123.4567" shouldn't be treated as a temp file either; that is, '.' should be checked here: if the suffix contains ".", it should return false. |
| Comment by Gerrit Updater [ 02/Apr/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46950/ |
| Comment by Andreas Dilger [ 02/Apr/22 ] |
|
Definitely "foo.12345678" should not be considered a temp file, since this is a common case (eg file.YYYYMMDD). The chance of a 6-number temp file is 1/40k, and an 8-number temp file being hit randomly is less than 1/2M. The original code even considered 6/6, 5/6 and 4/6 numbers to not be temp files (ie. "digit >= suffixlen - 2") , but 4/6 numbers was too easily hit by mktemp. It was supposed to keep 8/8 and 7/8 as non-temp files as long as 7/8 was like "file.f1234567". The problem is that the 8/8 case also fails because the first char is a digit, so "!isdigit(name[namelen-suffixlen])" fails, and it doesn't matter if the "(digit >= suffixlen - 1)" part is true or not because the "false" check is not met, and "true" is returned. The proper check should be something like:
if (digit == suffixlen || upper == suffixlen || lower == suffixlen ||
(digit == suffixlen - 1 && !isdigit(name[namelen - suffixlen])))
return false;
|
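To see what that proposed condition does, the earlier userspace sketch can be re-run with the condition swapped in. The variant below is again only an illustration (same hypothetical counting over the last 8 characters, not the final patch); with it, an all-digit 8-character suffix such as "foo.12345678" is no longer classified as a temp name.

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Variant of the earlier hypothetical sketch using the check proposed
 * in this comment (illustration only). */
static bool looks_like_temp_suffix_v2(const char *name)
{
	const int suffixlen = 8;
	int namelen = strlen(name);
	int digit = 0, upper = 0, lower = 0, i;

	if (namelen <= suffixlen)
		return false;

	for (i = namelen - suffixlen; i < namelen; i++) {
		if (isdigit((unsigned char)name[i]))
			digit++;
		else if (isupper((unsigned char)name[i]))
			upper++;
		else if (islower((unsigned char)name[i]))
			lower++;
	}

	/* proposed check: all-digit (or 7-digit with a non-digit lead),
	 * all-upper, and all-lower suffixes are not temp names */
	if (digit == suffixlen || upper == suffixlen || lower == suffixlen ||
	    (digit == suffixlen - 1 && !isdigit((unsigned char)name[namelen - suffixlen])))
		return false;

	return true;
}

int main(void)
{
	printf("foo.12345678  temp=%d\n", (int)looks_like_temp_suffix_v2("foo.12345678"));
	printf("foo.a1234567  temp=%d\n", (int)looks_like_temp_suffix_v2("foo.a1234567"));
	return 0;
}
|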
| Comment by Peter Jones [ 02/Apr/22 ] |
|
I believe that the part intended for 2.15 has now landed. There is still some outstanding work to resolve the issues with the new hash function so we can switch back to it, but that can be tracked under a new JIRA ticket for 2.16/2.15.x.