[LU-14146] Massive directory metadata operation performance decrease Created: 21/Nov/20 Updated: 25/Aug/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | RHEL7 running the latest master. |
| Issue Links: | |
| Sub-Tasks: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While comparing the results of Lustre 2.12 LTS and the latest master version of Lustre, a noticeable decrease in performance was seen with mdtest. I ran a git bisect and traced the source of this regression to https://review.whamcloud.com/#/c/35825. The results before and after the patch landed are as follows.

Before the patch landed:

mdtest-3.4.0+dev was launched with 54 total task(s) on 9 node(s)
Command line used: /lustre/crius/stf008/scratch/jsimmons/x86_64/mdtest '-n' '1000' '-p' '10' '-e' '4096' '-w' '4096' '-i' '5' '-z' '2' '-d' '/lustre/crius/stf008/scratch/jsimmons/test_mdtest'
Path: /lustre/crius/stf008/scratch/jsimmons
FS: 806.0 TiB   Used FS: 0.0%   Inodes: 4298.4 Mi   Used Inodes: 0.0%
Nodemap: 111111000000000000000000000000000000000000000000000000
54 tasks, 53946 files/directories

SUMMARY rate: (of 5 iterations)
Operation            Max         Min         Mean        Std Dev
---------            ---         ---         ----        -------
Directory creation : 10929.296   10229.518   10551.707   269.772
Directory stat     : 45397.727   44566.564   45101.666   285.915
Directory removal  : 14509.663   13822.493   14198.406   282.821
File creation      : 6180.597    6097.217    6142.435    30.776
File stat          : 43473.036   31895.809   37446.331   4316.809
File read          : 18142.575   16228.362   17383.867   750.963
File removal       : 7412.350    7061.313    7227.328    118.574
Tree creation      : 3478.676    2899.108    3328.345    219.993
Tree removal       : 764.549     583.999     672.962     59.213
– finished at 11/20/2020 10:55:32 –

And after landing the patch:

mdtest-3.4.0+dev was launched with 54 total task(s) on 9 node(s)
Command line used: /lustre/crius/stf008/scratch/jsimmons/x86_64/mdtest '-n' '1000' '-p' '10' '-e' '4096' '-w' '4096' '-i' '5' '-z' '2' '-d' '/lustre/crius/stf008/scratch/jsimmons/test_mdtest'
Path: /lustre/crius/stf008/scratch/jsimmons
FS: 806.0 TiB   Used FS: 0.0%   Inodes: 4667.2 Mi   Used Inodes: 0.0%
Nodemap: 111111000000000000000000000000000000000000000000000000
54 tasks, 53946 files/directories

SUMMARY rate: (of 5 iterations)
Operation            Max         Min         Mean        Std Dev
---------            ---         ---         ----        -------
Directory creation : 1823.563    1497.613    1687.840    105.551
Directory stat     : 26132.733   18515.334   23994.365   2847.665
Directory removal  : 2721.120    1783.451    2383.377    329.561
File creation      : 6880.575    6428.112    6702.467    153.483
File stat          : 44519.556   38352.962   42705.219   2270.727
File read          : 19180.528   18379.633   18696.723   276.664
File removal       : 9229.889    8597.003    8889.050    222.742
Tree creation      : 48.123      42.574      46.095      1.908
Tree removal       : 39.628      10.159      28.961      9.911
– finished at 11/20/2020 10:18:56 – |
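As a point of reference, the bisect described above typically looks something like the following sketch; the known-good commit placeholder and the rebuild/reinstall step are assumptions for illustration, not the exact commands used here:

# mark the endpoints and let git walk the history between them
git bisect start
git bisect bad master                   # master shows the low directory create/remove rates
git bisect good <known-good-commit>     # e.g. the 2.12 LTS build that performed well
# at each step: rebuild Lustre, reinstall on the test nodes, rerun mdtest, then report the result
git bisect good                         # or: git bisect bad
git bisect reset                        # when finished, return to the original branch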
| Comments |
| Comment by Lai Siyao [ 22/Nov/20 ] |
|
Is this tested on a DNE system? |
| Comment by James A Simmons [ 22/Nov/20 ] |
|
Yes, 48 MDTs on 24 MDS servers. The default directory is set to just one MDT. |
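For context, pinning a test directory to a single MDT is usually done along these lines (a sketch; the MDT index and the directory name are illustrative, not taken from the actual run):

lfs setdirstripe -c 1 -i 0 /lustre/crius/testdir      # create the directory on MDT index 0 with a single stripe
lfs getdirstripe /lustre/crius/testdir                # verify the stripe count and MDT index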
| Comment by James A Simmons [ 23/Nov/20 ] |
|
Also, we tested on a single-MDT setup and it showed the same results. |
| Comment by Shuichi Ihara [ 07/Dec/20 ] |
|
Hm. I haven't yet been able to confirm the regression on my test environment, which was 2 x MDS/MDT, 4 x OSS/OST and 40 clients, 320 processes.

Single MDT, no DNE setup:

[root@ec01 ~]# mkdir /ai400x/mdt0/
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=8 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt0/

lustre-2.12.5
SUMMARY rate: (of 5 iterations)
Operation            Max          Min          Mean         Std Dev
---------            ---          ---          ----         -------
Directory creation : 40166.668    29889.711    36926.039    3721.687
Directory stat     : 181972.127   163686.767   171839.868   6830.690
Directory removal  : 72596.455    64023.722    67605.022    2865.954
File creation      : 61473.277    33357.894    49626.877    8756.461
File stat          : 182319.720   172986.277   176813.231   3043.802
File read          : 96716.113    91506.710    94630.270    1908.325
File removal       : 73915.610    71204.711    72434.411    1189.090
Tree creation      : 4883.894     4224.418     4489.395     238.875
Tree removal       : 121.542      119.264      120.320      0.870

master (commit: e5c8f66)
SUMMARY rate: (of 5 iterations)
Operation            Max          Min          Mean         Std Dev
---------            ---          ---          ----         -------
Directory creation : 42269.194    40350.392    41677.374    700.045
Directory stat     : 169511.255   151004.927   160570.614   7062.870
Directory removal  : 73562.337    66378.685    71053.900    2461.351
File creation      : 71462.132    38186.018    55635.280    8982.025
File stat          : 320154.330   289927.273   309750.141   10796.857
File read          : 88594.789    76983.081    83738.015    3793.636
File removal       : 69072.712    62536.441    65716.920    2125.631
Tree creation      : 4713.705     32.602       3367.272     1702.228
Tree removal       : 280.514      17.496       193.251      95.416

Two MDS/MDT, DNE setup:

[root@ec01 ~]# lfs setdirstripe -c 2 /ai400x/mdt_stripe
[root@ec01 ~]# lfs setdirstripe -c 2 -D /ai400x/mdt_stripe
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=8 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt_stripe/

lustre-2.12.5
SUMMARY rate: (of 5 iterations)
Operation            Max          Min          Mean         Std Dev
---------            ---          ---          ----         -------
Directory creation : 4091.011     3697.938     3995.214     150.244
Directory stat     : 160784.657   158579.052   159864.416   885.088
Directory removal  : 3346.025     3289.510     3319.668     18.116
File creation      : 71590.829    36867.505    61846.370    11509.343
File stat          : 353953.112   316962.501   339006.051   13982.944
File read          : 185607.391   180289.664   182559.629   1791.647
File removal       : 129448.873   127389.601   128672.603   719.608
Tree creation      : 543.402      3.326        111.930      215.737
Tree removal       : 116.905      97.208       104.334      6.869

master (commit: e5c8f66)
SUMMARY rate: (of 5 iterations)
Operation            Max          Min          Mean         Std Dev
---------            ---          ---          ----         -------
Directory creation : 4244.489     4153.787     4204.182     35.519
Directory stat     : 45417.105    44573.071    45017.015    327.182
Directory removal  : 3253.162     3166.240     3206.250     34.838
File creation      : 103608.274   64457.023    91383.228    10534.101
File stat          : 513544.947   489825.324   505082.825   9879.991
File read          : 169268.803   160600.607   165519.732   3198.057
File removal       : 116843.421   111635.972   114741.924   1985.082
Tree creation      : 189.871      4.421        42.099       73.888
Tree removal       : 218.595      190.424      208.575      10.343

We know there is a regression for directory stat in a DNE setup with the master branch. That's a known issue in |
| Comment by James A Simmons [ 07/Dec/20 ] |
|
Excellent. Let me try the latest master then. Looking at the fix, I think it only addressed the stat issue, not the creation and removal of directories. Directory removal and creation rates are about 1/10 of what 2.12 LTS can do. |
| Comment by Lai Siyao [ 08/Dec/20 ] |
|
Hi Ihara, can you help create flamegraphs on both the client and the MDS in your test? |
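For reference, one common way to produce such a flamegraph is perf plus Brendan Gregg's FlameGraph scripts; a sketch, where the sampling frequency, duration, and script location are assumptions:

# on the client or MDS node while mdtest is running
perf record -F 99 -a -g -- sleep 60                              # sample all CPUs with call graphs for 60 seconds
perf script > out.perf
/opt/FlameGraph/stackcollapse-perf.pl out.perf > out.folded      # fold the stacks
/opt/FlameGraph/flamegraph.pl out.folded > node-flamegraph.svg   # render the SVG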
| Comment by Shuichi Ihara [ 13/Dec/20 ] |
|
James, I still can't repro your problem on my test system, and the results with 2.12.5 and master are still consistent. Would you have a chance to test on the latest master again?

Operation            Max         Min         Mean        Std Dev
---------            ---         ---         ----        -------
Directory creation : 10929.296   10229.518   10551.707   269.772
Directory stat     : 45397.727   44566.564   45101.666   285.915
Directory removal  : 14509.663   13822.493   14198.406   282.821

By the way, regarding your higher directory creation and removal rates above, I wonder if you set the -D (inherited default) option in 'lfs setdirstripe' properly?

laisiyao, sorry for the confusion; what I wanted to say is that I couldn't see any regression with master on my test system. Please see my posted results, which you have probably already noticed. |
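For reference, a quick way to check whether the default stripe was set and is actually being inherited by new subdirectories (the paths are illustrative; -D on getdirstripe prints the default layout):

lfs getdirstripe -D /ai400x/mdt_stripe              # show the default (inherited) directory layout
mkdir /ai400x/mdt_stripe/checkdir
lfs getdirstripe /ai400x/mdt_stripe/checkdir        # a new subdirectory should show the same stripe count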
| Comment by James A Simmons [ 14/Dec/20 ] |
|
I'm using this setup:

lfs setdirstripe -c $MDTCOUNT -i -1 $OUTDIR
lfs setdirstripe -D -c $MDTCOUNT -i -1 $OUTDIR
lfs setstripe -c $OSTCOUNT $OUTDIR

and the mdtest (latest) command is:

/usr/lib64/openmpi/bin/mpirun -npernode 6 -mca pml ob1 -mca btl openib,sm,self -bind-to core:overload-allowed --allow-run-as-root -machinefile $BINDIR/$(arch)/hostfile $BINDIR/$(arch)/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i $ITER -z 2 -d $OUTDIR |
| Comment by James A Simmons [ 14/Dec/20 ] |
|
What is your testing setup? |
| Comment by James A Simmons [ 16/Dec/20 ] |
|
Fires have been put out. I'm looking at this now. |
| Comment by Shuichi Ihara [ 17/Dec/20 ] |
My configuration was included in my posted results, but it was two MDSs and two MDTs, and I used the exact same mdtest options you tested:

[root@ec01 ~]# lfs setdirstripe -c 2 /ai400x/mdt_stripe
[root@ec01 ~]# lfs setdirstripe -c 2 -D /ai400x/mdt_stripe
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=8 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt_stripe/ |
| Comment by James A Simmons [ 08/Jan/21 ] |
|
I'm using 48 MDTs (2 per MDS). This is with ZFS. The function costing the most time is dt_declare_create(), called by lod_sub_declare_create(). I wonder if we need a precreate mechanism like OSTs have. |
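For reference, one way to confirm how much MDS time lands in lod_sub_declare_create() is the kernel function profiler; a sketch, assuming ftrace with the function profiler is available on the MDS and the lod module's symbols are visible to it (dt_declare_create() is typically an inline wrapper, so it may not be directly traceable):

cd /sys/kernel/debug/tracing
echo lod_sub_declare_create > set_ftrace_filter       # restrict profiling to the function of interest
echo 1 > function_profile_enabled                     # start profiling while mdtest runs
sleep 60
echo 0 > function_profile_enabled
grep lod_sub_declare_create trace_stat/function*      # per-CPU hit counts and average time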
| Comment by Peter Jones [ 25/Jan/21 ] |
|
James, while I understand that there are ongoing investigations into how to address your performance issues, I don't think that these are unique to 2.14. Peter |
| Comment by James A Simmons [ 25/Jan/21 ] |
|
This started at the end of the 2.13 cycle. I hope to address this issue for 2.15. |
| Comment by Lai Siyao [ 07/Feb/22 ] |
|
Metadata performance on DNE systems can be improved in these aspects:
|
| Comment by Andreas Dilger [ 07/Feb/22 ] |
|
Lai, could you please file separate LU tickets for these issues and link them here? I think the other major improvement is to fix multiple OUT RPCs in flight per target (there is a patch for that already). I'm definitely interested to see some of these improvements you mentioned. Of course, even better than optimizing remote RPCs is to avoid doing them in the first place, so it makes sense to optimize the round-robin allocation to be smarter: avoiding remote subdirs if the parent was created by the same client until the SEQ runs out. |
| Comment by Lai Siyao [ 08/Mar/22 ] |
|
Ihara, https://review.whamcloud.com/#/c/46735/ contains the major changes of the DNE metadata improvements; will you run some benchmarks? |