[LU-14146] Massive directory metadata operation performance decrease Created: 21/Nov/20  Updated: 25/Aug/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None
Environment:

RHEL7 running the latest master.


Issue Links:
Related
is related to LU-14172 DIR Stat performance regression in st... Resolved
is related to LU-12624 DNE3: striped directory allocate stri... Resolved
is related to LU-14459 DNE3: directory auto split during create Open
Sub-Tasks:
Key
Summary
Type
Status
Assignee
LU-15526 PDO lock for object on remote MDT Technical task Resolved Lai Siyao  
LU-15527 remove dependency between transaction... Technical task Closed Lai Siyao  
LU-15528 downgrade remote PW/EX lock taken in ... Technical task Open WC Triage  
LU-15529 optimize directory migration parent l... Technical task Resolved Lai Siyao  
LU-15530 getattr_by_name calls mdo_getattr twi... Technical task Open WC Triage  
LU-15531 optimize round-robin allocation to cr... Technical task Open WC Triage  
LU-6864 DNE3: Support multiple modify RPCs in... Technical task Resolved Hongchao Zhang  
LU-15597 add an LDLM flag to mark lock is save... Technical task Closed Lai Siyao  
LU-17003 DNE system doesn't need to support RE... Technical task Open Lai Siyao  
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While comparing Lustre 2.12 LTS against the latest master version of Lustre, a noticeable drop in mdtest performance was seen. I ran a git bisect and traced the source of this regression to https://review.whamcloud.com/#/c/35825. The results before and after the patch landed are as follows:

mdtest-3.4.0+dev was launched with 54 total task(s) on 9 node(s)

Command line used: /lustre/crius/stf008/scratch/jsimmons/x86_64/mdtest '-n' '1000' '-p' '10' '-e' '4096' '-w' '4096' '-i' '5' '-z' '2' '-d' '/lustre/crius/stf008/scratch/jsimmons/test_mdtest'

Path: /lustre/crius/stf008/scratch/jsimmons

FS: 806.0 TiB   Used FS: 0.0%   Inodes: 4298.4 Mi   Used Inodes: 0.0%

 

Nodemap: 111111000000000000000000000000000000000000000000000000

54 tasks, 53946 files/directories

 

SUMMARY rate: (of 5 iterations)

   Operation                      Max            Min           Mean        Std Dev

   ---------                      ---            ---           ----        -------

   Directory creation        :      10929.296      10229.518      10551.707        269.772

   Directory stat            :      45397.727      44566.564      45101.666        285.915

   Directory removal         :      14509.663      13822.493      14198.406        282.821

   File creation             :       6180.597       6097.217       6142.435         30.776

   File stat                 :      43473.036      31895.809      37446.331       4316.809

   File read                 :      18142.575      16228.362      17383.867        750.963

   File removal              :       7412.350       7061.313       7227.328        118.574

   Tree creation             :       3478.676       2899.108       3328.345        219.993

   Tree removal              :        764.549        583.999        672.962         59.213

-- finished at 11/20/2020 10:55:32 --

And after landing the patch:

mdtest-3.4.0+dev was launched with 54 total task(s) on 9 node(s)

Command line used: /lustre/crius/stf008/scratch/jsimmons/x86_64/mdtest '-n' '1000' '-p' '10' '-e' '4096' '-w' '4096' '-i' '5' '-z' '2' '-d' '/lustre/crius/stf008/scratch/jsimmons/test_mdtest'

Path: /lustre/crius/stf008/scratch/jsimmons

FS: 806.0 TiB   Used FS: 0.0%   Inodes: 4667.2 Mi   Used Inodes: 0.0%

 

Nodemap: 111111000000000000000000000000000000000000000000000000

54 tasks, 53946 files/directories

 

SUMMARY rate: (of 5 iterations)

   Operation                      Max            Min           Mean        Std Dev

   ---------                      ---            ---           ----        -------

   Directory creation        :       1823.563       1497.613       1687.840        105.551

   Directory stat            :      26132.733      18515.334      23994.365       2847.665

   Directory removal         :       2721.120       1783.451       2383.377        329.561

   File creation             :       6880.575       6428.112       6702.467        153.483

   File stat                 :      44519.556      38352.962      42705.219       2270.727

   File read                 :      19180.528      18379.633      18696.723        276.664

   File removal              :       9229.889       8597.003       8889.050        222.742

   Tree creation             :         48.123         42.574         46.095          1.908

   Tree removal              :         39.628         10.159         28.961          9.911

-- finished at 11/20/2020 10:18:56 --



 Comments   
Comment by Lai Siyao [ 22/Nov/20 ]

Is this tested on a DNE system?

Comment by James A Simmons [ 22/Nov/20 ]

Yes, 48 MDTs on 24 MDS servers. The default directory is set to just one MDT.

Comment by James A Simmons [ 23/Nov/20 ]

We also tested on a single-MDT setup and it showed the same results.

Comment by Shuichi Ihara [ 07/Dec/20 ]

Hm. I haven't been able to confirm the regression yet on my test environment, which was 2 x MDS/MDT, 4 x OSS/OST, and 40 clients running 320 processes.

Single MDT, no DNE setup

[root@ec01 ~]# mkdir /ai400x/mdt0/
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=8  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt0/

lustre-2.12.5
SUMMARY rate: (of 5 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :      40166.668      29889.711      36926.039       3721.687
   Directory stat            :     181972.127     163686.767     171839.868       6830.690
   Directory removal         :      72596.455      64023.722      67605.022       2865.954
   File creation             :      61473.277      33357.894      49626.877       8756.461
   File stat                 :     182319.720     172986.277     176813.231       3043.802
   File read                 :      96716.113      91506.710      94630.270       1908.325
   File removal              :      73915.610      71204.711      72434.411       1189.090
   Tree creation             :       4883.894       4224.418       4489.395        238.875
   Tree removal              :        121.542        119.264        120.320          0.870

master (commit: e5c8f66)
SUMMARY rate: (of 5 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :      42269.194      40350.392      41677.374        700.045
   Directory stat            :     169511.255     151004.927     160570.614       7062.870
   Directory removal         :      73562.337      66378.685      71053.900       2461.351
   File creation             :      71462.132      38186.018      55635.280       8982.025
   File stat                 :     320154.330     289927.273     309750.141      10796.857
   File read                 :      88594.789      76983.081      83738.015       3793.636
   File removal              :      69072.712      62536.441      65716.920       2125.631
   Tree creation             :       4713.705         32.602       3367.272       1702.228
   Tree removal              :        280.514         17.496        193.251         95.416

Two MDS/MDT, DNE setup

[root@ec01 ~]# lfs setdirstripe -c 2 /ai400x/mdt_stripe
[root@ec01 ~]# lfs setdirstripe -c 2 -D /ai400x/mdt_stripe
[root@ec01 ~]#  salloc -p 40n -N 40 --ntasks-per-node=8  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt_stripe/

lustre-2.12.5
SUMMARY rate: (of 5 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       4091.011       3697.938       3995.214        150.244
   Directory stat            :     160784.657     158579.052     159864.416        885.088
   Directory removal         :       3346.025       3289.510       3319.668         18.116
   File creation             :      71590.829      36867.505      61846.370      11509.343
   File stat                 :     353953.112     316962.501     339006.051      13982.944
   File read                 :     185607.391     180289.664     182559.629       1791.647
   File removal              :     129448.873     127389.601     128672.603        719.608
   Tree creation             :        543.402          3.326        111.930        215.737
   Tree removal              :        116.905         97.208        104.334          6.869


master (commit: e5c8f66)
SUMMARY rate: (of 5 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       4244.489       4153.787       4204.182         35.519
   Directory stat            :      45417.105      44573.071      45017.015        327.182
   Directory removal         :       3253.162       3166.240       3206.250         34.838
   File creation             :     103608.274      64457.023      91383.228      10534.101
   File stat                 :     513544.947     489825.324     505082.825       9879.991
   File read                 :     169268.803     160600.607     165519.732       3198.057
   File removal              :     116843.421     111635.972     114741.924       1985.082
   Tree creation             :        189.871          4.421         42.099         73.888
   Tree removal              :        218.595        190.424        208.575         10.343
 

We know of a regression for directory stat in a DNE setup with the master branch. That's a known issue, tracked in LU-14172, and patch https://review.whamcloud.com/#/c/40863/ solved the problem.

Comment by James A Simmons [ 07/Dec/20 ]

Excellent. Let me try the latest master then. Looking at the fix, I think it only addressed the stat issues, not the creation and removal of directories. Removal and creation rates are 1/10 of what 2.12 LTS can do.

Comment by Lai Siyao [ 08/Dec/20 ]

Hi Ihara, can you help create flamegraph on both client and MDS in your test?
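For reference, a common way to capture such flamegraphs is with perf and Brendan Gregg's FlameGraph scripts; this is only a sketch, and the sample duration, checkout path, and output filenames below are assumptions, not anything prescribed in this ticket:

```shell
# On the MDS (and likewise on a client) while mdtest is running:
# sample all CPUs at 99 Hz with call-graph stacks for 60 seconds
perf record -F 99 -a -g -- sleep 60

# Fold the stacks and render an SVG using the FlameGraph tools
# (assumed cloned from https://github.com/brendangregg/FlameGraph
#  into /opt/FlameGraph; adjust the path as needed)
perf script > out.perf
/opt/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
/opt/FlameGraph/flamegraph.pl out.folded > mds-flamegraph.svg
```

Kernel debuginfo (or at least unstripped Lustre modules) is needed on the MDS for the kernel-side stacks to resolve to symbol names.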

Comment by Shuichi Ihara [ 13/Dec/20 ]

James, I still can't reproduce your problem on my test system, and the results with 2.12.5 and master are still consistent. Would you have a chance to test on the latest master again?

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :      10929.296      10229.518      10551.707        269.772
   Directory stat            :      45397.727      44566.564      45101.666        285.915
   Directory removal         :      14509.663      13822.493      14198.406        282.821

BTW, regarding your higher directory creation and removal rates above, I wonder whether you set the -D (inherited) option in 'lfs setdirstripe' properly?
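One way to check this is to compare the directory's own layout against its inherited default; this is a sketch, and the test-directory path and stripe count are placeholders standing in for the values used in the actual runs:

```shell
# stripe the test directory itself across 2 MDTs
lfs setdirstripe -c 2 /lustre/testdir
# set the *default* layout with -D so subdirectories created by
# mdtest inherit the striping instead of landing on a single MDT
lfs setdirstripe -c 2 -D /lustre/testdir

# verify: the first shows the directory's layout, the second the
# inherited default that new subdirectories will receive
lfs getdirstripe /lustre/testdir
lfs getdirstripe -D /lustre/testdir
```

If the `-D` line was omitted, every subdirectory mdtest creates falls back to a single MDT, which would explain differing creation/removal rates between otherwise similar setups.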

laisiyao, sorry for the confusion; what I wanted to say is that I couldn't see any regressions in master on my test system. Please see my posted results, as you already noticed.

Comment by James A Simmons [ 14/Dec/20 ]

I'm using this setup:

lfs setdirstripe -c $MDTCOUNT -i -1 $OUTDIR       

lfs setdirstripe -D -c $MDTCOUNT -i -1 $OUTDIR       

lfs setstripe -c $OSTCOUNT $OUTDIR

and mdtest (latest) command is:

/usr/lib64/openmpi/bin/mpirun -npernode 6 -mca pml ob1 -mca btl openib,sm,self -bind-to core:overload-allowed --allow-run-as-root -machinefile $BINDIR/$(arch)/hostfile $BINDIR/$(arch)/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i $ITER -z 2 -d $OUTDIR

Comment by James A Simmons [ 14/Dec/20 ]

What is your testing setup?

Comment by James A Simmons [ 16/Dec/20 ]

Fires have been put out. I'm looking at this now.

Comment by Shuichi Ihara [ 17/Dec/20 ]

What is your testing setup?

My configuration was included in my posted results: two MDSs and two MDTs, and I used the exact same mdtest options you tested, as shown below.

[root@ec01 ~]# lfs setdirstripe -c 2 /ai400x/mdt_stripe
[root@ec01 ~]# lfs setdirstripe -c 2 -D /ai400x/mdt_stripe
[root@ec01 ~]#  salloc -p 40n -N 40 --ntasks-per-node=8  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 1000 -p 10 -e 4096 -w 4096 -i 5 -z 2 -d /ai400x/mdt_stripe/
Comment by James A Simmons [ 08/Jan/21 ]

I'm using 48 MDTs (2 per MDS). This is with ZFS. The function costing the most time is dt_declare_create(), called from lod_sub_declare_create(). I wonder if we need a precreate mechanism like the OSTs have.

Comment by Peter Jones [ 25/Jan/21 ]

James

While I understand that there are ongoing investigations into how to address your performance issues, I don't think that these are unique to 2.14.

Peter

Comment by James A Simmons [ 25/Jan/21 ]

This started at the end of the 2.13 cycle. I hope to address this issue for 2.15.

Comment by Lai Siyao [ 07/Feb/22 ]

Metadata performance on DNE system can be improved in these aspects:

  1. Support a remote PDO lock. This will greatly reduce conflicts on the parent lock and improve remote (non-striped) directory creation/unlink performance. Since a filesystem-wide default directory layout is set by default, this will speed up the mdtest directory creation/unlink tests.
  2. Remove the dependency between distributed transactions started from the same MDT. During recovery on a single-MDT system, transactions are replayed in transaction order, but on a DNE system they are replayed on all MDTs in parallel. Currently, if two transactions have a dependency and the latter is a distributed transaction, the dependency is eliminated by commit-on-sharing; however, if both transactions were started by the same MDT they are replayed in transaction order anyway, so commit-on-sharing is not needed in that case.
  3. Downgrade remote PW/EX locks taken in distributed transactions to COS mode asynchronously after transaction stop. This avoids commit-on-sharing triggered by a subsequent stat following a distributed transaction (e.g. striped directory creation).
  4. Directory migration currently locks all stripes of the parent directory; it can be changed to lock only the source and target parent stripes. Combined with the first change above, lock conflicts can be largely reduced, which improves directory migration/restripe/auto-split performance.
  5. Make mdt_getattr_name_lock() return -EREMOTE immediately when the request is sent to the MDT where the parent object is located but the object itself is on another MDT, which avoids one mdo_getattr() call.
Comment by Andreas Dilger [ 07/Feb/22 ]

Lai, could you please file separate LU tickets for these issues and link them here? I think the other major improvement is to support multiple OUT RPCs in flight per target (there is already a patch for that).

I'm definitely interested in seeing some of these improvements. Of course, even better than optimizing remote RPCs is avoiding them in the first place, so it makes sense to make the round-robin allocation smarter: avoid remote subdirs if the parent was created by the same client, until the SEQ runs out.

Comment by Lai Siyao [ 08/Mar/22 ]

Ihara, https://review.whamcloud.com/#/c/46735/ contains the major changes of the DNE metadata improvements. Would you run some benchmarks?

Generated at Sat Feb 10 03:07:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.