Lustre / LU-14146

Massive directory metadata operation performance decrease

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
    • Component/s: None
    • Environment: RHEL7 running the latest master.
    • Severity: 3

    Description

      While comparing the results of the Lustre 2.12 LTS release and the latest master version of Lustre, a noticeable performance decrease was seen with mdtest. I ran a git bisect to discover the source of this regression: https://review.whamcloud.com/#/c/35825. The results before and after the patch landed are as follows:

      mdtest-3.4.0+dev was launched with 54 total task(s) on 9 node(s)

      Command line used: /lustre/crius/stf008/scratch/jsimmons/x86_64/mdtest '-n' '1000' '-p' '10' '-e' '4096' '-w' '4096' '-i' '5' '-z' '2' '-d' '/lustre/crius/stf008/scratch/jsimmons/test_mdtest'

      Path: /lustre/crius/stf008/scratch/jsimmons

      FS: 806.0 TiB   Used FS: 0.0%   Inodes: 4298.4 Mi   Used Inodes: 0.0%

       

      Nodemap: 111111000000000000000000000000000000000000000000000000

      54 tasks, 53946 files/directories

       

      SUMMARY rate: (of 5 iterations)

         Operation                      Max            Min           Mean        Std Dev

         ---------                      ---            ---            ----        -------

         Directory creation        :      10929.296      10229.518      10551.707        269.772

         Directory stat            :      45397.727      44566.564      45101.666        285.915

         Directory removal         :      14509.663      13822.493      14198.406        282.821

         File creation             :       6180.597       6097.217       6142.435         30.776

         File stat                 :      43473.036      31895.809      37446.331       4316.809

         File read                 :      18142.575      16228.362      17383.867        750.963

         File removal              :       7412.350       7061.313       7227.328        118.574

         Tree creation             :       3478.676       2899.108       3328.345        219.993

         Tree removal              :        764.549        583.999        672.962         59.213

      -- finished at 11/20/2020 10:55:32 --

      And after landing the patch:

      mdtest-3.4.0+dev was launched with 54 total task(s) on 9 node(s)

      Command line used: /lustre/crius/stf008/scratch/jsimmons/x86_64/mdtest '-n' '1000' '-p' '10' '-e' '4096' '-w' '4096' '-i' '5' '-z' '2' '-d' '/lustre/crius/stf008/scratch/jsimmons/test_mdtest'

      Path: /lustre/crius/stf008/scratch/jsimmons

      FS: 806.0 TiB   Used FS: 0.0%   Inodes: 4667.2 Mi   Used Inodes: 0.0%

       

      Nodemap: 111111000000000000000000000000000000000000000000000000

      54 tasks, 53946 files/directories

       

      SUMMARY rate: (of 5 iterations)

         Operation                      Max            Min           Mean        Std Dev

         ---------                      ---            ---            ----        -------

         Directory creation        :       1823.563       1497.613       1687.840        105.551

         Directory stat            :      26132.733      18515.334      23994.365       2847.665

         Directory removal         :       2721.120       1783.451       2383.377        329.561

         File creation             :       6880.575       6428.112       6702.467        153.483

         File stat                 :      44519.556      38352.962      42705.219       2270.727

         File read                 :      19180.528      18379.633      18696.723        276.664

         File removal              :       9229.889       8597.003       8889.050        222.742

         Tree creation             :         48.123         42.574         46.095          1.908

         Tree removal              :         39.628         10.159         28.961          9.911

      -- finished at 11/20/2020 10:18:56 --


          Activity

            laisiyao Lai Siyao added a comment -

            Ihara, https://review.whamcloud.com/#/c/46735/ contains the major changes of DNE metadata improvements, will you run some benchmarks?


            adilger Andreas Dilger added a comment -

            Lai, could you please file separate LU tickets for these issues and link them here? I think the other major improvement is to fix multiple OUT RPCs in flight per target (there is already a patch for that).

            I'm definitely interested to see some of these improvements you mentioned. Of course, even better than optimizing remote RPCs is avoiding them in the first place, so it makes sense to make the round-robin allocation smarter: avoid remote subdirs if the parent was created by the same client, until the SEQ runs out.

            laisiyao Lai Siyao added a comment - - edited

            Metadata performance on a DNE system can be improved in these aspects:

            1. Support remote PDO locks. This will greatly reduce conflicts on the parent lock and improve remote (non-striped) directory creation/unlink performance. Since a filesystem-wide default directory layout is set by default, this will speed up the mdtest directory creation/unlink tests.
            2. Remove the dependency between distributed transactions started from the same MDT. On a single-MDT system, recovery replays transactions in transaction order, but in DNE recovery the transactions are replayed on all MDTs in parallel. Currently, if two transactions have a dependency and the latter is a distributed transaction, the dependency is eliminated by commit-on-sharing; however, if the two transactions are started by the same MDT, they are replayed in transaction order anyway, so commit-on-sharing is not needed there.
            3. Downgrade remote PW/EX locks taken in distributed transactions to COS mode asynchronously after transaction stop. This avoids commit-on-sharing triggered by a stat following a distributed transaction (e.g. striped directory creation).
            4. Directory migration currently locks all stripes of the parent directory; it can be changed to lock only the source and target parent stripes. Together with the first change above, lock conflicts can be largely reduced, improving directory migration/restripe/auto-split performance.
            5. Make mdt_getattr_name_lock() return -EREMOTE immediately if the request was sent to the MDT where the parent object is located while the object itself is on another MDT, which avoids one mdo_getattr() call.
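The reasoning in item 2 above can be sketched as a small predicate. This is a hypothetical model, not Lustre's actual code; `Txn`, `origin_mdt`, and `distributed` are illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    """Toy model of a transaction in a DNE recovery scenario."""
    origin_mdt: int      # MDT that started the transaction
    distributed: bool    # does it span multiple MDTs?

def needs_commit_on_sharing(prev: Txn, cur: Txn, dependent: bool) -> bool:
    """Sketch of when commit-on-sharing (COS) is required to preserve
    replay ordering between prev and cur."""
    if not dependent:
        return False         # independent txns can replay in any order
    if not cur.distributed:
        return False         # local txn replays in txn order on its MDT
    if cur.origin_mdt == prev.origin_mdt:
        return False         # same-origin txns replay in order anyway (item 2)
    return True              # cross-MDT dependency: COS must force a commit
```

In this model, the proposed optimization is exactly the third branch: a dependent distributed transaction started by the same MDT as its predecessor no longer pays the commit-on-sharing cost.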

            simmonsja James A Simmons added a comment -

            This started at the end of the 2.13 cycle. I hope to address this issue for 2.15.

            pjones Peter Jones added a comment -

            James

            While I understand that there are ongoing investigations into how to address your performance issues, I don't think that these are unique to 2.14.

            Peter

            simmonsja James A Simmons added a comment - - edited

            I'm using 48 MDTs (2 per MDS). This is with ZFS. The function costing the most time is dt_declare_create(), called from lod_sub_declare_create(). I wonder if we need a precreate mechanism like the OSTs have.


            sihara Shuichi Ihara added a comment -

            What is your testing setup?

            My configuration was included in my posted results, but it was two MDSs and two MDTs, and I used the exact same mdtest options you tested below.

            [root@ec01 ~]# lfs setdirstripe -c 2 /ai400x/mdt_stripe
            [root@ec01 ~]# lfs setdirstripe -c 2 -D /ai400x/mdt_stripe
            [root@ec01 ~]#  salloc -p 40n -N 40 --ntasks-per-node=8  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt_stripe/
            

            simmonsja James A Simmons added a comment -

            Fires have been put out. I'm looking at this now.


            simmonsja James A Simmons added a comment -

            What is your testing setup?


            simmonsja James A Simmons added a comment -

            I'm using this setup:

            lfs setdirstripe -c $MDTCOUNT -i -1 $OUTDIR       

            lfs setdirstripe -D -c $MDTCOUNT -i -1 $OUTDIR       

            lfs setstripe -c $OSTCOUNT $OUTDIR

            and mdtest (latest) command is:

            /usr/lib64/openmpi/bin/mpirun -npernode 6 -mca pml ob1 -mca btl openib,sm,self -bind-to core:overload-allowed --allow-run-as-root -machinefile $BINDIR/$(arch)/hostfile $BINDIR/$(arch)/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i $ITER -z 2 -d $OUTDIR


            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: James A Simmons (simmonsja)
              Votes: 0
              Watchers: 5