LU-14466: metadata performance slows if the metadata migration process is running

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3

    Description

      Here is the baseline (with enable_dir_auto_split=0) of unlink speed on a single MDT in this configuration.
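
      (For reference: directory auto-split is a per-MDT tunable, so the baseline presumably had it turned off on the MDS side with something like the command below. Treat this as a sketch; the assumption that the MDTs are hosted on es400nvx1-vm[1-4] comes from the lctl output in the comments.)

      [root@es400nvx1-vm1 ~]# clush -w es400nvx1-vm[1-4] "lctl set_param mdt.*.enable_dir_auto_split=0"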

      [root@ec01 ~]# mkdir  /ai400x/testdir
      [root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C
      
      [root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 2000 -F -v -d /ai400x/testdir/ -r
      
      SUMMARY rate: (of 1 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :          0.000          0.000          0.000          0.000
         File stat                 :          0.000          0.000          0.000          0.000
         File read                 :          0.000          0.000          0.000          0.000
         File removal              :      20607.477      20607.470      20607.473          0.002
         Tree creation             :          0.000          0.000          0.000          0.000
         Tree removal              :          7.732          7.732          7.732          0.000
      V-1: Entering PrintTimestamp...
      

      Same test with auto restripe enabled (enable_dir_auto_split=1), unlinking files while the metadata migration is running in the background.
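
      (Conversely, a sketch of turning auto-split back on before this run, under the same assumption about which nodes host the MDTs:)

      [root@es400nvx1-vm1 ~]# clush -w es400nvx1-vm[1-4] "lctl set_param mdt.*.enable_dir_auto_split=1"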

      [root@ec01 ~]# mkdir  /ai400x/testdir
      [root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C
      

      The migration was already triggered:

      [root@ec01 ~]# lfs df -i
      UUID                      Inodes       IUsed       IFree IUse% Mounted on
      ai400x-MDT0000_UUID     83050496     4116569    78933927   5% /ai400x[MDT:0] 
      ai400x-MDT0001_UUID     83050496      761581    82288915   1% /ai400x[MDT:1] 
      ai400x-MDT0002_UUID     83050496      761753    82288743   1% /ai400x[MDT:2] 
      ai400x-MDT0003_UUID     83050496      761155    82289341   1% /ai400x[MDT:3] 
      ai400x-OST0000_UUID     55574528     1279804    54294724   3% /ai400x[OST:0] 
      ai400x-OST0001_UUID     55574528     1281048    54293480   3% /ai400x[OST:1] 
      ai400x-OST0002_UUID     55574528     1284039    54290489   3% /ai400x[OST:2] 
      ai400x-OST0003_UUID     55574528     1288486    54286042   3% /ai400x[OST:3] 
      ai400x-OST0004_UUID     55574528     1310890    54263638   3% /ai400x[OST:4] 
      ai400x-OST0005_UUID     55574528     1296812    54277716   3% /ai400x[OST:5] 
      ai400x-OST0006_UUID     55574528     1292424    54282104   3% /ai400x[OST:6] 
      ai400x-OST0007_UUID     55574528     1293098    54281430   3% /ai400x[OST:7] 
      
      filesystem_summary:    332201984     6401058   325800926   2% /ai400x
      
      [root@ec01 ~]#  lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/
      lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating
      mdtidx		 FID[seq:oid:ver]
           0		 [0x200000e09:0xbaf6:0x0]		
           2		 [0x2c0000c06:0x1e76c:0x0]		
           1		 [0x300000c07:0x1e6c4:0x0]		
           3		 [0x340000c07:0x1e88c:0x0]		
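
      (The ",migrating" flag in lmv_hash_type above shows that the restripe is still in progress. A minimal sketch for polling it, e.g. to confirm it is still running or to wait for it to finish, assuming the flag is cleared once the migration completes:)

      [root@ec01 ~]# while lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/ | grep -q migrating; do sleep 10; done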
      

      Start removing all files:

      [root@ec01 ~]#  clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r
      SUMMARY rate: (of 1 iterations)
         Operation                      Max            Min           Mean        Std Dev
         ---------                      ---            ---           ----        -------
         File creation             :          0.000          0.000          0.000          0.000
         File stat                 :          0.000          0.000          0.000          0.000
         File read                 :          0.000          0.000          0.000          0.000
         File removal              :       5268.140       5268.139       5268.139          0.000
         Tree creation             :          0.000          0.000          0.000          0.000
         Tree removal              :         11.465         11.465         11.465          0.000
      V-1: Entering PrintTimestamp...
      

      So, ~20K unlink ops/sec (single MDT without migration) vs ~5K unlink ops/sec (4 x MDT with migration running), roughly a 4x slowdown.
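
      (Not captured in this run, but for future tests it may be worth spot-checking, before the removal phase, how the directory entries are spread across MDTs. A rough sketch, assuming "lfs getstripe -m" reports the MDT index of a file, and sampling only a subset since the directory is large:)

      [root@ec01 ~]# find /ai400x/testdir/test-dir.0-0/mdtest_tree.0/ -type f | head -n 10000 | while read f; do lfs getstripe -m "$f"; done | sort | uniq -c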

          Activity

            adilger Andreas Dilger added a comment - It looks like LU-14212 is the right ticket for directory split/migration monitoring.

            adilger Andreas Dilger added a comment -

            > is it possible to add additional stats to see the progress of the migration in detail? e.g. "completed migration of files / total number of files"
            > That is useful not only for performance evaluation like this, but also for real use cases: a user/administrator could check the migration progress in detail and estimate the time to completion.

            LU-13482 was filed for "lfs migrate" stats, but it is mostly focussed on OST object migration. It probably makes sense to have a separate ticket for tracking stats for directory migration, since this is done on the MDS instead of the client.

            sihara Shuichi Ihara added a comment -
            [root@es400nvx1-vm1 ~]#  clush -a lctl get_param mdt.*.enable_dir_restripe mdt.*.enable_dir_auto_split mdt.*.dir_split_count mdt.*.dir_split_delta mdt.*.dir_restripe_nsonly lod.*.mdt_hash | dshbak
            ----------------
            es400nvx1-vm1
            ----------------
            mdt.ai400x-MDT0000.enable_dir_restripe=0
            mdt.ai400x-MDT0000.enable_dir_auto_split=1
            mdt.ai400x-MDT0000.dir_split_count=50000
            mdt.ai400x-MDT0000.dir_split_delta=4
            mdt.ai400x-MDT0000.dir_restripe_nsonly=1
            lod.ai400x-MDT0000-mdtlov.mdt_hash=fnv_1a_64
            ....
            
            [root@ec01 ~]# lfs df -i
            UUID                      Inodes       IUsed       IFree IUse% Mounted on
            ai400x-MDT0000_UUID     83050496     4867975    78182521   6% /ai400x[MDT:0] 
            ai400x-MDT0001_UUID     83050496      646310    82404186   1% /ai400x[MDT:1] 
            ai400x-MDT0002_UUID     83050496      645212    82405284   1% /ai400x[MDT:2] 
            ai400x-MDT0003_UUID     83050496      645924    82404572   1% /ai400x[MDT:3] 
            ai400x-OST0000_UUID     55574528      806686    54767842   2% /ai400x[OST:0] 
            ai400x-OST0001_UUID     55574528      806334    54768194   2% /ai400x[OST:1] 
            ai400x-OST0002_UUID     55574528      811480    54763048   2% /ai400x[OST:2] 
            ai400x-OST0003_UUID     55574528      811520    54763008   2% /ai400x[OST:3] 
            ai400x-OST0004_UUID     55574528      810951    54763577   2% /ai400x[OST:4] 
            ai400x-OST0005_UUID     55574528      811091    54763437   2% /ai400x[OST:5] 
            ai400x-OST0006_UUID     55574528      807026    54767502   2% /ai400x[OST:6] 
            ai400x-OST0007_UUID     55574528      806912    54767616   2% /ai400x[OST:7] 
            
            filesystem_summary:    332201984     6805421   325396563   3% /ai400x
            
            [root@ec01 ~]# clush -w  es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
            [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r
            
            SUMMARY rate: (of 1 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               File creation             :          0.000          0.000          0.000          0.000
               File stat                 :          0.000          0.000          0.000          0.000
               File read                 :          0.000          0.000          0.000          0.000
               File removal              :       6134.857       6134.856       6134.857          0.000
               Tree creation             :          0.000          0.000          0.000          0.000
               Tree removal              :         10.607         10.607         10.607          0.000
            V-1: Entering PrintTimestamp...
            -- finished at 02/24/2021 17:14:50 --
            

            dir_restripe_nsonly=0 or 1 made no big difference to the performance impact. So we still need two RPCs for unlink while migration is running, but 4x slower is still quite a lot of overhead, isn't it?


            sihara Shuichi Ihara added a comment -

            Lai, Andreas, btw is it possible to add additional stats to see the progress of the migration in detail? e.g. "completed migration of files / total number of files"
            That is useful not only for performance evaluation like this, but also for real use cases: a user/administrator could check the migration progress in detail and estimate the time to completion.

            sihara Shuichi Ihara added a comment -

            > I think that setting "dir_restripe_nsonly=0", which causes inodes to be migrated, will make the performance much slower than leaving the default "dir_restripe_nsonly=1", which only moves the filenames.

            That was my assumption too, and I did use the default dir_restripe_nsonly=1, but there was no big difference. Let me re-test to double-check.

            > Also, once the auto-split happens earlier during creation, the number of entries moved will be much smaller (i.e. 37k) than if the split happens later (i.e. 5.25M). That will reduce the impact of the restripe significantly, as it will complete more quickly, and there will be fewer remote entries that need 2 RPCs to unlink.

            Sure, but in this test I wanted to see the performance impact while the migration process is running. Not only for auto restriping, but also for cases where e.g. an administrator triggers metadata migration and a user removes files.

            adilger Andreas Dilger added a comment -

            Shuichi, you could test the effect of having an earlier auto-split by running a "stat" on the directory shortly after the mdtest starts. Please also set "dir_restripe_nsonly=1" for your future testing.
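
            (A sketch of that approach, run on a client while the mdtest creation phase is still in progress; the path and the 60-second delay are illustrative only:)

            [root@ec01 ~]# sleep 60; stat /ai400x/testdir/test-dir.0-0/mdtest_tree.0/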

            adilger Andreas Dilger added a comment -

            I think that setting "dir_restripe_nsonly=0", which causes inodes to be migrated, will make the performance much slower than leaving the default "dir_restripe_nsonly=1", which only moves the filenames.

            Also, once the auto-split happens earlier during creation, the number of entries moved will be much smaller (i.e. 37k) than if the split happens later (i.e. 5.25M). That will reduce the impact of the restripe significantly, as it will complete more quickly, and there will be fewer remote entries that need 2 RPCs to unlink.

            People

              Assignee: wc-triage WC Triage
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 5