[LU-15720] imbalanced file creation in 'crush' striped directory Created: 05/Apr/22 Updated: 30/Sep/22 Resolved: 11/Jul/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Performance regressions in a striped directory were found on 2.15.0 (commit: 4d93fd7) compared with b2_14 (commit: d4b9557).

Test configuration:
4 x MDS (1 x MDT per MDS)
4 x OSS (2 x OST per OSS)
40 x client

[root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun --allow-run-as-root -oversubscribe -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 /work/tools/bin/mdtest -n 2000 -F -i 3 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/

Here are the test results.

server: version=2.15.0_RC2_22_g4d93fd7
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      103733.203    76276.410    93728.713    15168.101
File stat          693152.731   656461.448   671671.960    19132.425
File read          259081.462   247951.008   253393.168     5569.308
File removal       145137.390   142142.699   143590.068     1499.846
Tree creation          48.035        1.922       17.475       26.467
Tree removal           35.643       15.861       24.045       10.323

server: version=2.14.0_21_gd4b9557
client: version=2.14.0_21_gd4b9557

SUMMARY rate: (of 3 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      138939.425    81336.388   117014.695    31167.261
File stat         1678888.952  1580356.340  1645190.276    56162.463
File read          569731.788   528830.155   546121.363    21170.387
File removal       191837.291   186597.900   188595.661     2832.527
Tree creation         120.108        0.986       51.078       61.778
Tree removal           40.863       33.203       37.987        4.171

As far as I can tell, this is a server-side regression, because performance with lustre-2.15 clients and lustre-2.14 servers was OK, as shown below.
server: version=2.14.0_21_gd4b9557
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      132009.360    74074.615   106514.108    29585.056
File stat         1570754.679  1457120.401  1532703.082    65457.038
File read          563710.286   540228.432   553871.772    12194.544
File removal       189557.092   186065.253   187536.946     1809.374
Tree creation          54.678        1.883       19.576       30.399
Tree removal           42.065       41.677       41.875        0.194

The regression appears to start with the following patch:

LU-14459 lmv: change default hash type to crush
Change the default hash type to CRUSH to minimize the number
of directory entries that need to be migrated.

server: version=2.14.51_197_gf269497
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      148072.690    87600.145   127000.919    34149.618
File stat         1523849.471  1388808.972  1441253.182    72393.681
File read          562840.721   505515.837   538333.864    29552.364
File removal       197259.873   191117.823   194934.244     3331.372
Tree creation         111.869        1.707       39.426       62.755
Tree removal           44.113       30.518       36.562        6.922

server: version=2.14.51_198_gbb60caa
client: version=2.15.0_RC2_22_g4d93fd7

SUMMARY rate: (of 3 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation       86531.781    63506.794    72790.003    12142.761
File stat          808075.643   746570.771   784071.104    32898.551
File read          260064.500   249212.881   256291.924     6135.058
File removal       159592.539   155603.788   157752.556     2012.224
Tree creation         120.060        1.138       41.069       68.410
Tree removal           37.780       37.263       37.450        0.287

I just found that MDT load balancing does not work well after the patch: files are distributed unevenly across the MDTs at create time. For instance, here is a file-creation-only test in a striped directory.

Before patch (commit: f269497)

mpirun -np 640 mdtest -n 2000 -F -C -i 1 -p 10 -v -d /exafs/d0/d1/d2/mdt_stripe/

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID  83050496  320298  82730198  1%  /exafs[MDT:0]
exafs-MDT0001_UUID  83050496  320283  82730213  1%  /exafs[MDT:1]
exafs-MDT0002_UUID  83050496  320334  82730162  1%  /exafs[MDT:2]
exafs-MDT0003_UUID  83050496  320293  82730203  1%  /exafs[MDT:3]

After patch (commit: bb60caa)

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID  83050496  192404  82858092  1%  /exafs[MDT:0]
exafs-MDT0001_UUID  83050496  190698  82859798  1%  /exafs[MDT:1]
exafs-MDT0002_UUID  83050496  177266  82873230  1%  /exafs[MDT:2]
exafs-MDT0003_UUID  83050496  720852  82329644  1%  /exafs[MDT:3]

That is why the mdtest numbers were lower: one of the MDS/MDTs (MDT3 in this case) does more work, for longer, than the others.
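
The size of the imbalance can be estimated from the "lfs df -i" output above. The following is a minimal sketch in C, assuming the IUsed column (third field) is a fair proxy for the number of files created on each MDT. The function and array names are illustrative only and are not part of Lustre or mdtest.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fraction of all used inodes that sit on the busiest MDT.
 * A perfectly balanced 4-MDT directory would give 0.25.
 * Assumption: IUsed approximates the files created on that MDT.
 */
static double busiest_mdt_share(const unsigned long *iused, size_t count)
{
	unsigned long total = 0;
	unsigned long max = 0;

	assert(count > 0);
	for (size_t i = 0; i < count; i++) {
		total += iused[i];
		if (iused[i] > max)
			max = iused[i];
	}
	return total ? (double)max / (double)total : 0.0;
}

/* IUsed values copied from the two "lfs df -i" listings in this ticket. */
static const unsigned long iused_before[] = { 320298, 320283, 320334, 320293 };
static const unsigned long iused_after[]  = { 192404, 190698, 177266, 720852 };
```

With these numbers, busiest_mdt_share() is about one quarter before the patch and a little over one half after it, so a single MDT handles more than half of the creates.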
As a result, mdtest's elapsed time is longer than in the balanced case. |
| Comments |
| Comment by Andreas Dilger [ 05/Apr/22 ] |
|
Two things about the "temp" filenames:

The version of mdtest that I'm using locally only has numbers in the suffix:

# ls mdtest-easy/test-dir.0-0/mdtest_tree.0.0
file.mdtest.1.127
file.mdtest.1.128
file.mdtest.1.129
file.mdtest.1.13
file.mdtest.1.130
file.mdtest.1.131
file.mdtest.1.132

However, the check might be getting confused by the extra '.' in the name if there are more files, like "file.mdtest.1.345678" or "file.mdtest.12.45678"? This would incorrectly fail the "(digit >= suffixlen - 1)" check because the second '.' is not counted in digit, upper, or lower. There should probably be an additional check that there are no non-alphanumeric characters in the suffix:
	if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
	    upper == suffixlen || lower == suffixlen)
		return false;
	if (type == LMV_HASH_TYPE_CRUSH2 && digit + upper + lower != suffixlen)
		return false;
Unfortunately, this changes the hash function subtly, so a new "LMV_HASH_TYPE_CRUSH2" hash type is needed for the new behavior. Otherwise, clients may think they know which MDT a particular filename is on, but they would be wrong.

I'm having a hard time convincing myself that this code is correct. The commit message says:

LU-13481 dne: improve temp file name check
Previously if all but two characters in file name suffix are digit, it's not treated as temp file, as is too strict if suffix length is short, e.g. 6. Change it to allow one character, and this non-digit character should not be the starting character.

Besides the problem with ".-_" characters in the suffix (which would make the count of digits/upper/lower too small and fail the suffixlen check), it doesn't look like the isdigit() check is correct:
	if ((digit >= suffixlen - 1 && !isdigit(name[namelen - suffixlen])) ||
	    upper == suffixlen || lower == suffixlen)
		return false;
If "digit >= suffixlen - 1" holds (say name = "foo.12345678", digit = 8, suffixlen = 8), this check will not match (so the function returns "true" for the temp filename check) because "1" is a digit. I think this is supposed to be just "isdigit(name[])" (no '!').

The original code even considered 6/6, 5/6 and 4/6 digits to not be temp files (i.e. "digit >= suffixlen - 2"), but the 4/6 case was too easily hit by mktemp. The check was supposed to keep 8/8 and 7/8 digits as non-temp files, as long as the 7/8 case looked like "file.f1234567". The problem is that the 8/8 case also fails: the first character is a digit, so "!isdigit(name[namelen - suffixlen])" is false. It then doesn't matter whether the "(digit >= suffixlen - 1)" part is true, because the condition for returning "false" is not met and "true" is returned. The proper check should be something like:
	if (digit == suffixlen || upper == suffixlen || lower == suffixlen ||
	    (digit == suffixlen - 1 && !isdigit(name[namelen - suffixlen])))
		return false;
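
To make the difference between the two conditions concrete, here is a small standalone sketch in C. Only the two "if" conditions are taken from this comment; the struct, the helper names (count_suffix, looks_temp_existing, looks_temp_proposed) and the surrounding function shape are hypothetical, and any other preconditions the real Lustre function applies (for example, requiring a '.' before the suffix) are deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Per-class character counts for the last suffixlen bytes of a name. */
struct suffix_counts {
	int digit;
	int upper;
	int lower;
};

static struct suffix_counts count_suffix(const char *name, int suffixlen)
{
	struct suffix_counts c = { 0, 0, 0 };
	int namelen = (int)strlen(name);

	assert(suffixlen > 0 && suffixlen <= namelen);
	for (int i = namelen - suffixlen; i < namelen; i++) {
		unsigned char ch = (unsigned char)name[i];

		if (isdigit(ch))
			c.digit++;
		else if (isupper(ch))
			c.upper++;
		else if (islower(ch))
			c.lower++;
		/* '.', '-', '_' and similar are counted in no class */
	}
	return c;
}

/* Condition as quoted from the existing code (with the '!'). */
static bool looks_temp_existing(const char *name, int suffixlen)
{
	struct suffix_counts c = count_suffix(name, suffixlen);
	unsigned char first = (unsigned char)name[strlen(name) - suffixlen];

	if ((c.digit >= suffixlen - 1 && !isdigit(first)) ||
	    c.upper == suffixlen || c.lower == suffixlen)
		return false;
	return true;
}

/* Condition as proposed at the end of this comment. */
static bool looks_temp_proposed(const char *name, int suffixlen)
{
	struct suffix_counts c = count_suffix(name, suffixlen);
	unsigned char first = (unsigned char)name[strlen(name) - suffixlen];

	if (c.digit == suffixlen || c.upper == suffixlen ||
	    c.lower == suffixlen ||
	    (c.digit == suffixlen - 1 && !isdigit(first)))
		return false;
	return true;
}
```

Under these assumptions, "foo.12345678" with suffixlen = 8 is classified as a temp file by the existing condition and as a regular file by the proposed one, while "file.f1234567" is a regular file under both. "file.mdtest.1.345678" with suffixlen = 8 is still classified as temp by the proposed condition, because the '.' inside the suffix is not counted in any class; that is the case the extra "digit + upper + lower != suffixlen" check is meant to catch.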
|
| Comment by Gerrit Updater [ 08/Apr/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47015 |
| Comment by Shuichi Ihara [ 13/Apr/22 ] |
|
Test configuration:
1 x MDS(1xMDT, 12 CPU cores, 142GB RAM)
4 x OSS(2xOST/OSS)
40 x client(16 CPU cores, 96GB RAM)
IB-HDR100 network

[root@ec01 ~]# mkdir -p /exafs/d0/d1/d2/mdt_stripe/
[root@ec01 ~]# lfs setdirstripe -c 4 -D /exafs/d0/d1/d2/mdt_stripe/

2.15.0_RC2_39_g42a6d1f

[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C

SUMMARY rate: (of 1 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      111117.930   111117.930   111117.930        0.000
File stat               0.000        0.000        0.000        0.000
File read               0.000        0.000        0.000        0.000
File removal            0.000        0.000        0.000        0.000
Tree creation         135.910      135.910      135.910        0.000
Tree removal            0.000        0.000        0.000        0.000

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID  83050496  192396  82858100  1%  /exafs[MDT:0]
exafs-MDT0001_UUID  83050496  720843  82329653  1%  /exafs[MDT:1]
exafs-MDT0002_UUID  83050496  177209  82873287  1%  /exafs[MDT:2]
exafs-MDT0003_UUID  83050496  190695  82859801  1%  /exafs[MDT:3]

2.15.0_RC2_40_g0090b6f

[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C

SUMMARY rate: (of 1 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      150766.352   150766.352   150766.352        0.000
File stat               0.000        0.000        0.000        0.000
File read               0.000        0.000        0.000        0.000
File removal            0.000        0.000        0.000        0.000
Tree creation         153.942      153.942      153.942        0.000
Tree removal            0.000        0.000        0.000        0.000

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID  83050496  320296  82730200  1%  /exafs[MDT:0]
exafs-MDT0001_UUID  83050496  320282  82730214  1%  /exafs[MDT:1]
exafs-MDT0002_UUID  83050496  320285  82730211  1%  /exafs[MDT:2]
exafs-MDT0003_UUID  83050496  320283  82730213  1%  /exafs[MDT:3]

2.15.0_RC3 + https://review.whamcloud.com/47015

[root@ec01 ~]# mpirun -np 640 mdtest -n 2000 -F -i 1 -p 30 -v -d /exafs/d0/d1/d2/mdt_stripe/ -C

SUMMARY rate: (of 1 iterations)
Operation                 Max          Min         Mean      Std Dev
---------                 ---          ---         ----      -------
File creation      149746.028   149746.028   149746.028        0.000
File stat               0.000        0.000        0.000        0.000
File read               0.000        0.000        0.000        0.000
File removal            0.000        0.000        0.000        0.000
Tree creation          14.232       14.232       14.232        0.000
Tree removal            0.000        0.000        0.000        0.000

[root@ec01 ~]# lfs df -i | grep MDT
exafs-MDT0000_UUID  83050496  320296  82730200  1%  /exafs[MDT:0]
exafs-MDT0001_UUID  83050496  320289  82730207  1%  /exafs[MDT:1]
exafs-MDT0002_UUID  83050496  320289  82730207  1%  /exafs[MDT:2]
exafs-MDT0003_UUID  83050496  320290  82730206  1%  /exafs[MDT:3]

Patch https://review.whamcloud.com/47015 works correctly with mdtest's file naming pattern, and no performance degradation was found. |
| Comment by Gerrit Updater [ 11/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47015/ |
| Comment by Peter Jones [ 11/Jul/22 ] |
|
Landed for 2.16 |