[LU-14466] metadata performance slows if the metadata migration process is running Created: 23/Feb/21 Updated: 22/Mar/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Here is the baseline (with enable_dir_auto_split=0) of unlink speed on a single MDT in this configuration.

[root@ec01 ~]# mkdir /ai400x/testdir
[root@ec01 ~]# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C
[root@ec01 ~]# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 2000 -F -v -d /ai400x/testdir/ -r

SUMMARY rate: (of 1 iterations)
   Operation            Max        Min       Mean   Std Dev
   ---------            ---        ---       ----   -------
   File creation :     0.000      0.000      0.000     0.000
   File stat     :     0.000      0.000      0.000     0.000
   File read     :     0.000      0.000      0.000     0.000
   File removal  : 20607.477  20607.470  20607.473     0.002
   Tree creation :     0.000      0.000      0.000     0.000
   Tree removal  :     7.732      7.732      7.732     0.000
V-1: Entering PrintTimestamp...

Same test with auto restripe enabled (enable_dir_auto_split=1), unlinking files while the metadata migration is running in the background.

[root@ec01 ~]# mkdir /ai400x/testdir
[root@ec01 ~]# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C

Migration is already triggered:

[root@ec01 ~]# lfs df -i
UUID Inodes IUsed IFree IUse% Mounted on
ai400x-MDT0000_UUID 83050496 4116569 78933927 5% /ai400x[MDT:0]
ai400x-MDT0001_UUID 83050496 761581 82288915 1% /ai400x[MDT:1]
ai400x-MDT0002_UUID 83050496 761753 82288743 1% /ai400x[MDT:2]
ai400x-MDT0003_UUID 83050496 761155 82289341 1% /ai400x[MDT:3]
ai400x-OST0000_UUID 55574528 1279804 54294724 3% /ai400x[OST:0]
ai400x-OST0001_UUID 55574528 1281048 54293480 3% /ai400x[OST:1]
ai400x-OST0002_UUID 55574528 1284039 54290489 3% /ai400x[OST:2]
ai400x-OST0003_UUID 55574528 1288486 54286042 3% /ai400x[OST:3]
ai400x-OST0004_UUID 55574528 1310890 54263638 3% /ai400x[OST:4]
ai400x-OST0005_UUID 55574528 1296812 54277716 3% /ai400x[OST:5]
ai400x-OST0006_UUID 55574528 1292424 54282104 3% /ai400x[OST:6]
ai400x-OST0007_UUID 55574528 1293098 54281430 3% /ai400x[OST:7]
filesystem_summary: 332201984 6401058 325800926 2% /ai400x
[root@ec01 ~]# lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating
mdtidx FID[seq:oid:ver]
0 [0x200000e09:0xbaf6:0x0]
2 [0x2c0000c06:0x1e76c:0x0]
1 [0x300000c07:0x1e6c4:0x0]
3 [0x340000c07:0x1e88c:0x0]
Start removing all files.

[root@ec01 ~]# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r

SUMMARY rate: (of 1 iterations)
   Operation            Max        Min       Mean   Std Dev
   ---------            ---        ---       ----   -------
   File creation :     0.000      0.000      0.000     0.000
   File stat     :     0.000      0.000      0.000     0.000
   File read     :     0.000      0.000      0.000     0.000
   File removal  :  5268.140   5268.139   5268.139     0.000
   Tree creation :     0.000      0.000      0.000     0.000
   Tree removal  :    11.465     11.465     11.465     0.000
V-1: Entering PrintTimestamp...

So, 20K unlink ops/sec (single MDT without migration) vs 5K (4 x MDT with migration running). |
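For reference, a minimal sketch of how the two configurations compared above could be toggled on the MDS nodes. This is not taken from the original test logs; it only uses the standard lctl set_param interface for the tunables named in this ticket:

# baseline run: keep automatic directory split disabled
mds# lctl set_param mdt.*.enable_dir_auto_split=0
# second run: enable automatic directory split so the restripe/migration starts during the create phase
mds# lctl set_param mdt.*.enable_dir_auto_split=1
# confirm the current settings
mds# lctl get_param mdt.*.enable_dir_auto_split mdt.*.dir_restripe_nsonly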
| Comments |
| Comment by Andreas Dilger [ 23/Feb/21 ] |
|
I think that setting "dir_restripe_nsonly=0", which causes the inodes themselves to be migrated, will make the performance much slower than leaving the default "dir_restripe_nsonly=1", which only moves the filenames. Also, if the auto-split happens earlier during creation, the number of entries moved will be much smaller (i.e. 37k) than if the split happens later (i.e. 5.25M). That will reduce the impact of the restripe significantly, as it will complete more quickly, and there will be fewer remote entries that need 2 RPCs to unlink. |
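A minimal sketch, using the same lfs getdirstripe command as in the description, of how the remote entries mentioned here can be observed (the directory path is the one from this test run):

[root@ec01 ~]# lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/
# stripes whose mdtidx differs from the parent's MDT hold remote entries, whose unlink needs an
# extra RPC; the "migrating" flag in lmv_hash_type indicates the restripe is still in progress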
| Comment by Andreas Dilger [ 23/Feb/21 ] |
|
Shuichi, you could test the effect of having an earlier auto-split by running a "stat" on the directory shortly after the mdtest starts. Please also set "dir_restripe_nsonly=1" for your future testing. |
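A sketch of this suggested sequence, assuming the same mdtest invocation and directory layout as in the description (the exact timing of the stat is illustrative):

# on the MDS nodes, use name-space-only restripe
mds# lctl set_param mdt.*.dir_restripe_nsonly=1
# start the mdtest create phase as before, then shortly afterwards stat the test directory from a
# client, so the effect of an earlier auto-split can be tested as suggested above
client# stat /ai400x/testdir/test-dir.0-0/mdtest_tree.0/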
| Comment by Shuichi Ihara [ 24/Feb/21 ] |
I thought I had used the default dir_restripe_nsonly=1, but there was no big difference; let me re-test to double check.
Sure, but in this test I wanted to see the performance impact while the migration process is running. Not only auto restriping, but also, for example, the case where an administrator triggers metadata migration while users remove files. |
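For the administrator-triggered case, a hedged example of explicitly migrating a directory's metadata to another MDT with lfs migrate -m (the MDT index and path below are illustrative, not from this test):

[root@ec01 ~]# lfs migrate -m 1 /ai400x/some_big_dir
# while the migration runs, lfs getdirstripe on the directory shows the "migrating" flag in
# lmv_hash_type, as in the output earlier in this ticket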
| Comment by Shuichi Ihara [ 24/Feb/21 ] |
|
Lai, Andreas, btw is it possible to add additional stats to see the progress of the migration in detail? e.g. "number of files whose migration is completed / total number of files". |
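Until such a counter exists, a rough sketch of what can be watched today with the commands already used in this ticket (this only gives indirect progress, not a per-file count):

# the "migrating" flag stays in lmv_hash_type until the restripe completes
[root@ec01 ~]# lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/
# and, when inodes are also migrated (dir_restripe_nsonly=0), the per-MDT inode usage
# shifts between MDTs as entries are moved
[root@ec01 ~]# lfs df -i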
| Comment by Shuichi Ihara [ 24/Feb/21 ] |
[root@es400nvx1-vm1 ~]# clush -a lctl get_param mdt.*.enable_dir_restripe mdt.*.enable_dir_auto_split mdt.*.dir_split_count mdt.*.dir_split_delta mdt.*.dir_restripe_nsonly lod.*.mdt_hash | dshbak
----------------
es400nvx1-vm1
----------------
mdt.ai400x-MDT0000.enable_dir_restripe=0
mdt.ai400x-MDT0000.enable_dir_auto_split=1
mdt.ai400x-MDT0000.dir_split_count=50000
mdt.ai400x-MDT0000.dir_split_delta=4
mdt.ai400x-MDT0000.dir_restripe_nsonly=1
lod.ai400x-MDT0000-mdtlov.mdt_hash=fnv_1a_64
....

[root@ec01 ~]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
ai400x-MDT0000_UUID     83050496     4867975    78182521   6% /ai400x[MDT:0]
ai400x-MDT0001_UUID     83050496      646310    82404186   1% /ai400x[MDT:1]
ai400x-MDT0002_UUID     83050496      645212    82405284   1% /ai400x[MDT:2]
ai400x-MDT0003_UUID     83050496      645924    82404572   1% /ai400x[MDT:3]
ai400x-OST0000_UUID     55574528      806686    54767842   2% /ai400x[OST:0]
ai400x-OST0001_UUID     55574528      806334    54768194   2% /ai400x[OST:1]
ai400x-OST0002_UUID     55574528      811480    54763048   2% /ai400x[OST:2]
ai400x-OST0003_UUID     55574528      811520    54763008   2% /ai400x[OST:3]
ai400x-OST0004_UUID     55574528      810951    54763577   2% /ai400x[OST:4]
ai400x-OST0005_UUID     55574528      811091    54763437   2% /ai400x[OST:5]
ai400x-OST0006_UUID     55574528      807026    54767502   2% /ai400x[OST:6]
ai400x-OST0007_UUID     55574528      806912    54767616   2% /ai400x[OST:7]
filesystem_summary:    332201984     6805421   325396563   3% /ai400x

[root@ec01 ~]# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r

SUMMARY rate: (of 1 iterations)
   Operation            Max        Min       Mean   Std Dev
   ---------            ---        ---       ----   -------
   File creation :     0.000      0.000      0.000     0.000
   File stat     :     0.000      0.000      0.000     0.000
   File read     :     0.000      0.000      0.000     0.000
   File removal  :  6134.857   6134.856   6134.857     0.000
   Tree creation :     0.000      0.000      0.000     0.000
   Tree removal  :    10.607     10.607     10.607     0.000
V-1: Entering PrintTimestamp...
-- finished at 02/24/2021 17:14:50 --

There was no big difference in the performance impact between dir_restripe_nsonly=0 and dir_restripe_nsonly=1. So, we still need two RPCs for unlink while migration is running, but being 4x slower is a bit too much overhead, isn't it? |
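Related to Andreas' earlier point about splitting earlier, one knob that could also be experimented with is the split threshold shown in the get_param output above; the value below is purely illustrative, and whether it actually helps depends on when the MDS re-evaluates the directory size, as discussed in the comments above:

# split directories once they reach ~10000 entries instead of the current 50000
mds# lctl set_param mdt.*.dir_split_count=10000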
| Comment by Andreas Dilger [ 24/Feb/21 ] |
|
| Comment by Andreas Dilger [ 24/Feb/21 ] |
|
It looks like LU-14212 is the right ticket for directory split/migration monitoring. |