[LU-14466] metadata performance slows if the metadata migration process is running Created: 23/Feb/21  Updated: 22/Mar/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14465 unlink fails when if the metadata mig... Open
is related to LU-14459 DNE3: directory auto split during create Open
is related to LU-14212 DNE3: directory migration progress mo... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Here is the baseline unlink speed (with enable_dir_auto_split=0) on a single MDT in this configuration.

[root@ec01 ~]# mkdir  /ai400x/testdir
[root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C
[root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 2000 -F -v -d /ai400x/testdir/ -r

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation             :          0.000          0.000          0.000          0.000
   File stat                 :          0.000          0.000          0.000          0.000
   File read                 :          0.000          0.000          0.000          0.000
   File removal              :      20607.477      20607.470      20607.473          0.002
   Tree creation             :          0.000          0.000          0.000          0.000
   Tree removal              :          7.732          7.732          7.732          0.000
V-1: Entering PrintTimestamp...

Same test with auto split enabled (enable_dir_auto_split=1), unlinking files while the metadata migration is running in the background.

[root@ec01 ~]# mkdir  /ai400x/testdir
[root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C

Migration was already triggered:

[root@ec01 ~]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
ai400x-MDT0000_UUID     83050496     4116569    78933927   5% /ai400x[MDT:0] 
ai400x-MDT0001_UUID     83050496      761581    82288915   1% /ai400x[MDT:1] 
ai400x-MDT0002_UUID     83050496      761753    82288743   1% /ai400x[MDT:2] 
ai400x-MDT0003_UUID     83050496      761155    82289341   1% /ai400x[MDT:3] 
ai400x-OST0000_UUID     55574528     1279804    54294724   3% /ai400x[OST:0] 
ai400x-OST0001_UUID     55574528     1281048    54293480   3% /ai400x[OST:1] 
ai400x-OST0002_UUID     55574528     1284039    54290489   3% /ai400x[OST:2] 
ai400x-OST0003_UUID     55574528     1288486    54286042   3% /ai400x[OST:3] 
ai400x-OST0004_UUID     55574528     1310890    54263638   3% /ai400x[OST:4] 
ai400x-OST0005_UUID     55574528     1296812    54277716   3% /ai400x[OST:5] 
ai400x-OST0006_UUID     55574528     1292424    54282104   3% /ai400x[OST:6] 
ai400x-OST0007_UUID     55574528     1293098    54281430   3% /ai400x[OST:7] 

filesystem_summary:    332201984     6401058   325800926   2% /ai400x

[root@ec01 ~]#  lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating
mdtidx		 FID[seq:oid:ver]
     0		 [0x200000e09:0xbaf6:0x0]		
     2		 [0x2c0000c06:0x1e76c:0x0]		
     1		 [0x300000c07:0x1e6c4:0x0]		
     3		 [0x340000c07:0x1e88c:0x0]		

Start removing all files:

[root@ec01 ~]#  clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r
SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation             :          0.000          0.000          0.000          0.000
   File stat                 :          0.000          0.000          0.000          0.000
   File read                 :          0.000          0.000          0.000          0.000
   File removal              :       5268.140       5268.139       5268.139          0.000
   Tree creation             :          0.000          0.000          0.000          0.000
   Tree removal              :         11.465         11.465         11.465          0.000
V-1: Entering PrintTimestamp...

So, ~20K unlink ops/sec (single MDT without migration) vs ~5K (4 x MDT with migration running).
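For reference, the slowdown can be computed directly from the two mdtest file-removal rates above (a quick sanity check, not part of the original test runs):

```python
# Mean file-removal rates reported by mdtest above (ops/sec).
baseline_rate = 20607.473   # single MDT, no migration running
migrating_rate = 5268.139   # 4 x MDT, migration running in the background

slowdown = baseline_rate / migrating_rate
print(f"unlink slowdown: {slowdown:.1f}x")  # -> unlink slowdown: 3.9x
```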



 Comments   
Comment by Andreas Dilger [ 23/Feb/21 ]

I think that setting "dir_restripe_nsonly=0", which causes inodes to be migrated, will make the performance much slower than leaving the default "dir_restripe_nsonly=1", which only moves the filenames.

Also, once the auto-split happens earlier during creation, the number of entries moved will be far fewer (i.e. ~37k) than if the split happens later (i.e. ~5.25M). That will reduce the impact of the restripe significantly, as it will complete more quickly, and there will be fewer remote entries that need 2 RPCs to unlink.
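The effect of split timing can be sketched with back-of-the-envelope arithmetic (my own sketch, assuming a hash split from 1 stripe to 4 stripes relocates roughly 3/4 of the existing entries; the threshold matches the dir_split_count=50000 tunable shown later in this ticket):

```python
def entries_moved(total_entries, stripe_count=4):
    """Approximate entries relocated when a 1-stripe directory is
    split to stripe_count stripes: all but 1/stripe_count of the
    entries hash to a new stripe and must move."""
    return total_entries * (stripe_count - 1) // stripe_count

# Early split at the dir_split_count threshold (50000 entries):
print(entries_moved(50_000))     # -> 37500, i.e. ~37k
# Late split after millions of files already exist:
print(entries_moved(7_000_000))  # -> 5250000, i.e. ~5.25M
```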

Comment by Andreas Dilger [ 23/Feb/21 ]

Shuichi, you could test the effect of having an earlier auto-split by running a "stat" on the directory shortly after the mdtest starts. Please also set "dir_restripe_nsonly=1" for your future testing.
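A sketch of the suggested steps (hostnames and paths are taken from the runs above; the tunable name assumes the Lustre 2.14 parameters listed later in this ticket):

```shell
# Keep inode data in place during restripe; only move directory entries.
clush -w es400nvx1-vm[1-4] "lctl set_param mdt.*.dir_restripe_nsonly=1"

# Shortly after mdtest starts creating files, stat the test directory
# so the MDS revalidates it and can trigger the auto-split earlier.
stat /ai400x/testdir
```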

Comment by Shuichi Ihara [ 24/Feb/21 ]

I think that setting "dir_restripe_nsonly=0", which causes inodes to be migrated, will make the performance much slower than leaving the default "dir_restripe_nsonly=1", which only moves the filenames.

I thought so too and tested with the default dir_restripe_nsonly=1, but there was no big difference. Let me re-test to double-check.

Also, once the auto-split happens earlier during creation, the number of entries moved will be far fewer (i.e. ~37k) than if the split happens later (i.e. ~5.25M). That will reduce the impact of the restripe significantly, as it will complete more quickly, and there will be fewer remote entries that need 2 RPCs to unlink.

Sure, but in this test I wanted to see the performance impact while the migration process is running. Not only auto restriping, but also cases where e.g. an administrator triggers metadata migration while a user is removing files.

Comment by Shuichi Ihara [ 24/Feb/21 ]

Lai, Andreas, btw is it possible to add additional stats to see the progress of migration in detail? e.g. "number of files migrated / total number of files"
That would be useful not only for performance evaluation like this, but also in real use cases: a user/administrator could check migration progress in detail and estimate the time to complete.

Comment by Shuichi Ihara [ 24/Feb/21 ]
[root@es400nvx1-vm1 ~]#  clush -a lctl get_param mdt.*.enable_dir_restripe mdt.*.enable_dir_auto_split mdt.*.dir_split_count mdt.*.dir_split_delta mdt.*.dir_restripe_nsonly lod.*.mdt_hash | dshbak
----------------
es400nvx1-vm1
----------------
mdt.ai400x-MDT0000.enable_dir_restripe=0
mdt.ai400x-MDT0000.enable_dir_auto_split=1
mdt.ai400x-MDT0000.dir_split_count=50000
mdt.ai400x-MDT0000.dir_split_delta=4
mdt.ai400x-MDT0000.dir_restripe_nsonly=1
lod.ai400x-MDT0000-mdtlov.mdt_hash=fnv_1a_64
....
[root@ec01 ~]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
ai400x-MDT0000_UUID     83050496     4867975    78182521   6% /ai400x[MDT:0] 
ai400x-MDT0001_UUID     83050496      646310    82404186   1% /ai400x[MDT:1] 
ai400x-MDT0002_UUID     83050496      645212    82405284   1% /ai400x[MDT:2] 
ai400x-MDT0003_UUID     83050496      645924    82404572   1% /ai400x[MDT:3] 
ai400x-OST0000_UUID     55574528      806686    54767842   2% /ai400x[OST:0] 
ai400x-OST0001_UUID     55574528      806334    54768194   2% /ai400x[OST:1] 
ai400x-OST0002_UUID     55574528      811480    54763048   2% /ai400x[OST:2] 
ai400x-OST0003_UUID     55574528      811520    54763008   2% /ai400x[OST:3] 
ai400x-OST0004_UUID     55574528      810951    54763577   2% /ai400x[OST:4] 
ai400x-OST0005_UUID     55574528      811091    54763437   2% /ai400x[OST:5] 
ai400x-OST0006_UUID     55574528      807026    54767502   2% /ai400x[OST:6] 
ai400x-OST0007_UUID     55574528      806912    54767616   2% /ai400x[OST:7] 

filesystem_summary:    332201984     6805421   325396563   3% /ai400x

[root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation             :          0.000          0.000          0.000          0.000
   File stat                 :          0.000          0.000          0.000          0.000
   File read                 :          0.000          0.000          0.000          0.000
   File removal              :       6134.857       6134.856       6134.857          0.000
   Tree creation             :          0.000          0.000          0.000          0.000
   Tree removal              :         10.607         10.607         10.607          0.000
V-1: Entering PrintTimestamp...
-- finished at 02/24/2021 17:14:50 --

dir_restripe_nsonly=0 and 1 made no big difference in the performance impact. So we still need two RPCs per unlink while migration is running, and it is still roughly 4x slower. That is a bit too much overhead, isn't it?

Comment by Andreas Dilger [ 24/Feb/21 ]

is it possible to add additional stats to see the progress of migration in detail? e.g. "number of files migrated / total number of files"
That would be useful not only for performance evaluation like this, but also in real use cases: a user/administrator could check migration progress in detail and estimate the time to complete.

LU-13482 was filed for "lfs migrate" stats, but it is mostly focussed on OST object migration. It probably makes sense to have a separate ticket for tracking stats for directory migration since this is done on the MDS instead of the client.

Comment by Andreas Dilger [ 24/Feb/21 ]

It looks like LU-14212 is the right ticket for directory split/migration monitoring.

Generated at Sat Feb 10 03:10:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.