[LU-14172] DIR Stat performance regression in striped dir Created: 02/Dec/20  Updated: 09/Dec/20  Resolved: 09/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Critical
Reporter: Shuichi Ihara Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13437 rename may miss revoking LOOKUP lock ... Resolved
is related to LU-14146 Massive directory metadata operation ... Open
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

 There is a metadata (DIR Stat) performance regression in 2.12.6 (RC1). The regression appears to affect striped directories and to be on the server side.
Here are a reproducer and test results.

client: version=2.12.6_RC1_1_g327c8b7
server: version=2.12.6_RC1_1_g327c8b7 or lustre-2.12.5
# mkdir /ai400x/mdt0
# lfs setdirstripe -c 4 /ai400x/mdt_stripe
# lfs setdirstripe -c 4 -D /ai400x/mdt_stripe

#  salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -i 3 -p 10 -n 1500 -u -D -d $PATH

Single MDT without DNE
Server: Lustre-2.12.5

SUMMARY rate: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :      54315.552      50037.128      52618.576       1855.106
   Directory stat            :     186516.109     184354.609     185726.143        972.887
   Directory removal         :      66572.651      64990.546      65627.103        681.777
   Tree creation             :         46.771         24.099         36.301          9.336
   Tree removal              :         16.926         13.890         15.720          1.316

Server: Lustre-2.12.6-RC1

SUMMARY rate: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :      68098.113      59185.349      62208.643       4164.966
   Directory stat            :     193338.869     192650.348     193031.824        285.743
   Directory removal         :      65905.804      64842.618      65212.728        490.440
   Tree creation             :         44.234         33.906         39.452          4.251
   Tree removal              :         17.024         15.068         16.279          0.864

Directory striped across four MDTs
Server: Lustre-2.12.5

SUMMARY rate: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6385.748       5929.670       6113.851        196.251
   Directory stat            :     166190.895     162991.180     164733.372       1321.263
   Directory removal         :       4789.518       4294.122       4584.600        211.099
   Tree creation             :         13.200          1.102          6.937          4.948
   Tree removal              :          9.126          8.479          8.810          0.264

Server: Lustre-2.12.6-RC1

SUMMARY rate: (of 3 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6694.539       6505.265       6613.160         79.512
   Directory stat            :      49873.850      48817.530      49260.117        447.881   <--- This is the regression.
   Directory removal         :       4768.841       4253.124       4592.927        240.327
   Tree creation             :         13.490          0.705          7.321          5.229
   Tree removal              :          9.051          8.441          8.774          0.252
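
To quantify the drop shown in the striped-directory tables above, the mdtest SUMMARY rows can be parsed and their mean rates compared. The following is a minimal sketch, not part of the original report: the helper names `parse_summary` and `percent_change` are hypothetical, and the sample text is abbreviated to three rows copied from the results above.

```python
import re

# Matches one mdtest SUMMARY row: "<operation> : <max> <min> <mean> <std dev>".
# Header and separator lines contain no colon followed by numbers, so they are skipped.
ROW = re.compile(
    r"^\s*(?P<op>[A-Za-z ]+?)\s*:\s*"
    r"(?P<max>[\d.]+)\s+(?P<min>[\d.]+)\s+(?P<mean>[\d.]+)\s+(?P<std>[\d.]+)"
)


def parse_summary(text):
    """Return {operation: mean rate} for every SUMMARY row found in text."""
    means = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            means[match.group("op")] = float(match.group("mean"))
    return means


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Rows copied from the four-MDT striped directory results (server 2.12.5).
SERVER_2_12_5 = """
   Directory creation        :       6385.748       5929.670       6113.851        196.251
   Directory stat            :     166190.895     162991.180     164733.372       1321.263
   Directory removal         :       4789.518       4294.122       4584.600        211.099
"""

# Rows copied from the four-MDT striped directory results (server 2.12.6-RC1).
SERVER_2_12_6_RC1 = """
   Directory creation        :       6694.539       6505.265       6613.160         79.512
   Directory stat            :      49873.850      48817.530      49260.117        447.881
   Directory removal         :       4768.841       4253.124       4592.927        240.327
"""

base = parse_summary(SERVER_2_12_5)
new = parse_summary(SERVER_2_12_6_RC1)
changes = {op: percent_change(base[op], new[op]) for op in base}

for op, change in changes.items():
    print(f"{op:20s} {change:+7.1f}%")
```

Only the Directory stat row changes by a large margin between the two server versions; creation and removal stay within a few percent of each other.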


 Comments   
Comment by Peter Jones [ 02/Dec/20 ]

Lai

Is this related to the LU-13437 changes?

Peter

Comment by Lai Siyao [ 04/Dec/20 ]

Yes. The cause is that directory stripe revalidation takes more time to check whether the object is a stripe (see mdt_object_is_shard()). I made a simple fix and the result looks good; I'll tidy it up and push it later.

Comment by Gerrit Updater [ 04/Dec/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40863
Subject: LU-14172 lmv: optimize dir shard revalidate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1e31225721c98ab48c8a4572cc59b3661cbe1dda

Comment by Shuichi Ihara [ 04/Dec/20 ]

Here are test results on the master branch (commit:e5c8f66). They reproduce the same DIR stat regression that I saw on lustre-2.12.6-RC1.

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6323.394       6141.754       6238.210         74.580
   Directory stat            :      48295.593      46827.765      47645.451        610.794
   Directory removal         :       4336.014       4274.571       4315.516         28.952
   Tree creation             :         11.842          0.614          4.587          5.138
   Tree removal              :          9.204          8.894          9.048          0.126

Unfortunately, patch 40863 against master doesn't solve the problem.

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6437.097       6084.071       6279.844        146.672
   Directory stat            :      47235.709      44762.233      46347.987       1123.868
   Directory removal         :       4745.993       4348.202       4504.821        173.053
   Tree creation             :          6.530          0.789          2.762          2.665
   Tree removal              :          8.983          8.477          8.741          0.207

Comment by Gerrit Updater [ 04/Dec/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40872
Subject: LU-14172 mds: disable GETATTR_PFID feature
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: df61386547f026e6d4f6ca7878d1485d15f7e784

Comment by Shuichi Ihara [ 05/Dec/20 ]

It looks like patch https://review.whamcloud.com/40863 fixes the regression once it is applied on both the server and the client. In the previous test the patch was applied only on the server side, but I then realized that the patch contains both server and client changes.

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6417.575       6007.465       6186.944        171.288
   Directory stat            :     143940.330     139376.396     141106.210       2020.042
   Directory removal         :       4677.840       4377.965       4569.627        135.902
   Tree creation             :         13.348          0.656          5.006          5.901
   Tree removal              :          8.832          8.782          8.810          0.021

These numbers are still a bit lower than 2.12.5, but I don't have a baseline on master without this regression, so the gap against 2.12.5 might be caused by other issues in master.
For b2_12, I will go back to b2_12 and check whether the backported patch that Lai provided restores performance to the 2.12.5 level.

Comment by Gerrit Updater [ 05/Dec/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40875
Subject: LU-14172 lmv: optimize dir shard revalidate
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 0d603e858ee236c779516d7672c14deaa6749e5c

Comment by Shuichi Ihara [ 05/Dec/20 ]

Here are the final apples-to-apples test results.

# mkdir /ai400x/mdt0
# lfs setdirstripe -c 4 /ai400x/mdt_stripe
# lfs setdirstripe -c 4 -D /ai400x/mdt_stripe

#  salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -i 3 -p 10 -n 1500 -u -D -d $PATH

2.12.5

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6173.268       5805.240       5933.825        169.465
   Directory stat            :     151800.690     148071.970     150256.305       1587.764
   Directory removal         :       4648.674       4173.113       4417.583        194.376
   Tree creation             :         12.984          0.756          6.940          4.993
   Tree removal          

2.12.6-RC1

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6344.954       5834.617       6020.427        230.277
   Directory stat            :      44887.807      43460.779      43964.038        654.049
   Directory removal         :       4559.802       4114.146       4392.390        198.099
   Tree creation             :         13.336          0.734          7.153          5.148
   Tree removal              :          8.723          8.120          8.359          0.261

2.12.6-RC1 + https://review.whamcloud.com/40872

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6117.054       5850.628       5941.143        124.404
   Directory stat            :     151638.423     143319.490     148509.338       3695.492
   Directory removal         :       4498.161       3971.102       4219.711        216.202
   Tree creation             :         12.974          0.990          8.916          5.605
   Tree removal              :          8.616          8.349          8.458          0.114

2.12.6-RC1 + patch https://review.whamcloud.com/40875

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation        :       6328.334       5993.743       6113.977        151.946
   Directory stat            :     154744.046     148145.747     152434.570       3035.537
   Directory removal         :       4628.371       4174.092       4457.011        201.538
   Tree creation             :         13.789          1.132          7.503          5.167
   Tree removal              :          8.654          8.373          8.499          0.117

I think that patch 40875 solves the regression and the numbers are consistent.
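
The comparison behind this conclusion can be sketched as the ratio of each build's Directory stat mean to the 2.12.5 baseline. The figures are copied from the final tables above; the dictionary layout and the `within_noise` check (gap to baseline no larger than the sum of the two standard deviations) are illustrative assumptions, not a method used in this ticket.

```python
# Directory stat (mean, std dev) for each build in the final test run.
results = {
    "2.12.5": (150256.305, 1587.764),
    "2.12.6-RC1": (43964.038, 654.049),
    "2.12.6-RC1 + 40872": (148509.338, 3695.492),
    "2.12.6-RC1 + 40875": (152434.570, 3035.537),
}

base_mean, base_std = results["2.12.5"]


def within_noise(mean, std):
    """Crude check: gap to baseline is no larger than the combined std devs."""
    return abs(mean - base_mean) <= std + base_std


# Map each build to (ratio to baseline, whether it is within run-to-run noise).
summary = {
    name: (mean / base_mean, within_noise(mean, std))
    for name, (mean, std) in results.items()
}

for name, (ratio, ok) in summary.items():
    status = "within noise" if ok else "outside noise"
    print(f"{name:22s} {ratio:6.3f}x of 2.12.5  ({status})")
```

Under this crude criterion both patched builds are indistinguishable from 2.12.5, while unpatched 2.12.6-RC1 is far outside the run-to-run variation.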

Comment by Gerrit Updater [ 07/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40875/
Subject: LU-14172 lmv: optimize dir shard revalidate
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 94ec63ed67c6f09a2b15b2227ef6b189df623f4d

Comment by Gerrit Updater [ 09/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40863/
Subject: LU-14172 lmv: optimize dir shard revalidate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: de47c7671f29b2a3a79f6a126b7e01f0b2c5991a

Comment by Peter Jones [ 09/Dec/20 ]

Landed for 2.14 and 2.12.6
