[LU-12151] metadata performance difference on root and non-root user Created: 03/Apr/19  Updated: 11/Feb/20  Resolved: 13/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-2.10.5-RC2/ldiskfs


Issue Links:
Duplicate
Related
is related to LU-13239 ldiskfs: pass initial inode attribute... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We found a large performance difference in file creation between the root user and a non-root user.

  • 1 x MDS(1 x Platinum 8160, 96GB memory, EDR)
  • 32 x client(2 x E5-2650 v4, 128GB memory, EDR)
  • 1 x ES14K (40 x SSD)

root user

[root@c01 ~]# salloc -N 32 --ntasks-per-node=20 mpirun --allow-run-as-root /work/tools/bin/mdtest -n 1000 -F -v -u -d /scratch0/bmuser/ -C
SUMMARY: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     151328.449     151328.449     151328.449          0.000
   File stat         :          0.000          0.000          0.000          0.000
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :         42.057         42.057         42.057          0.000
   Tree removal      :          0.000          0.000          0.000          0.000
V-1: Entering timestamp...

Non-root user

[bmuser@c01 ~]$ salloc -N 32 --ntasks-per-node=20 mpirun --allow-run-as-root /work/tools/bin/mdtest -n 1000 -F -v -u -d /scratch0/bmuser/ -C
SUMMARY: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     102825.662     102825.662     102825.662          0.000
   File stat         :          0.000          0.000          0.000          0.000
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :         30.589         30.589         30.589          0.000
   Tree removal      :          0.000          0.000          0.000          0.000
V-1: Entering timestamp...

~150K ops/sec (root) vs ~100K ops/sec (non-root) for file creation.



 Comments   
Comment by Shuichi Ihara [ 03/Apr/19 ]

It seems related to quota accounting. With just a quick hack to disable the quota accounting transfer (returning early from osd_quota_transfer()), performance for the non-root user is back:


diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
index 060cbb8..8f68d91 100644
--- a/lustre/osd-ldiskfs/osd_handler.c
+++ b/lustre/osd-ldiskfs/osd_handler.c
@@ -2631,6 +2631,8 @@ static int osd_quota_transfer(struct inode *inode, const struct lu_attr *attr)
 {
 	int rc;
 
+	return 0;
+
 	if ((attr->la_valid & LA_UID && attr->la_uid != i_uid_read(inode)) ||
 	    (attr->la_valid & LA_GID && attr->la_gid != i_gid_read(inode))) {
 		struct iattr iattr;

[bmuser@c01 ~]$ salloc -N 32 --ntasks-per-node=20 mpirun --allow-run-as-root /work/tools/bin/mdtest -n 1000 -F -v -u -d /scratch0/bmuser/ -C
SUMMARY: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     151046.528     151046.528     151046.528          0.000
   File stat         :          0.000          0.000          0.000          0.000
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :         17.299         17.299         17.299          0.000
   Tree removal      :          0.000          0.000          0.000          0.000
V-1: Entering timestamp...
Comment by Alex Zhuravlev [ 03/Apr/19 ]

Was quota enforcement enabled?

Comment by Wang Shilong (Inactive) [ 03/Apr/19 ]

Alex,

I guess quota enforcement was not enabled in Ihara's test.

But the problem is that for a non-root user the uid/gid is 0 at precreation, and we then need to transfer the space accounting for it, which hits a lock bottleneck here. I have a patch locally but have not pushed it or confirmed it with testing yet.

Comment by Shuichi Ihara [ 03/Apr/19 ]

Correct. The quota slave ("-O quota") was enabled (default), but quota enforcement was not enabled.

Comment by Alex Zhuravlev [ 03/Apr/19 ]

Sorry, I don't understand. We do not change uid/gid for precreated objects in the create path?

Comment by Wang Shilong (Inactive) [ 03/Apr/19 ]

Alex,

Oops, OST object uid/gid is only changed at first write; I think the problem exists on the MDS.
That might be because of the following call path:

|->osd_create
  |->osd_create_type_f
     |->osd_mkreg
        |->ldiskfs_create_inode
           |->ext4_new_inode()     <- owner passed as NULL, so the inode is created with uid/gid 0
  |->osd_attr_init
     |->osd_quota_transfer         <- which then changes the uid/gid again for the above

I think the more efficient way might be to pass the owner down to ldiskfs_create_inode(), rather than transferring it later.
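
A minimal sketch of the idea (simplified, hypothetical prototypes; not the actual patch): instead of allocating the inode with uid/gid 0 and re-charging the quota afterwards, the intended owner is set while the inode is initialized, inside the same transaction. The two-element owner array mirrors what ext4's inode allocator accepts.

/* Before: the inode is allocated with uid/gid 0 and the space
 * accounting is moved to the real owner afterwards, taking the
 * dquot locks a second time. */
static struct inode *create_then_transfer(handle_t *handle, struct inode *dir,
					  umode_t mode, const struct lu_attr *attr)
{
	struct inode *inode;

	inode = ldiskfs_create_inode(handle, dir, mode);	/* uid/gid = 0 */
	if (!IS_ERR(inode))
		osd_quota_transfer(inode, attr);	/* extra dquot transfer */
	return inode;
}

/* After: the intended owner is passed down, so i_uid/i_gid are set
 * while the inode is initialized and no later transfer is needed.
 * (owner[0] = uid, owner[1] = gid, as ext4's allocator expects) */
static struct inode *create_with_owner(handle_t *handle, struct inode *dir,
				       umode_t mode, const struct lu_attr *attr)
{
	uid_t owner[2] = { attr->la_uid, attr->la_gid };

	return ldiskfs_create_inode(handle, dir, mode, owner);
}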

Comment by Gerrit Updater [ 03/Apr/19 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/34581
Subject: LU-12151 osd-ldiskfs: pass owner down rather than transfer it
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e532ad765df8a97d9a946236648144efea719ca5

Comment by Shuichi Ihara [ 03/Apr/19 ]

Here is the current file creation speed on the master branch for root and non-root users.
170K ops/sec (root user) vs 100K ops/sec (non-root user) for file creation.

[root@c01 ~]# id
uid=0(root) gid=0(root) groups=0(root)
[root@c01 ~]# salloc -N 32 --ntasks-per-node=24 mpirun -np 768 --allow-run-as-root /work/tools/bin/mdtest -n 2000 -F -u -d /cache1/mdt0
salloc: Granted job allocation 6045
-- started at 04/03/2019 18:55:10 --

mdtest-1.9.3 was launched with 768 total task(s) on 32 node(s)
Command line used: /work/tools/bin/mdtest "-n" "2000" "-F" "-u" "-d" "/cache1/mdt0"
Path: /cache1
FS: 3.9 TiB   Used FS: 0.0%   Inodes: 160.0 Mi   Used Inodes: 0.0%

768 tasks, 1536000 files

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     175749.654     175738.010     175741.938          1.705
   File stat         :     495658.996     495619.768     495634.739          6.634
   File read         :     257464.620     257412.150     257446.462         12.140
   File removal      :     197592.306     197444.295     197539.519         51.355
   Tree creation     :         51.695         51.695         51.695          0.000
   Tree removal      :         14.876         14.876         14.876          0.000

[sihara@c01 ~]$ id
uid=10000(sihara) gid=100(users) groups=100(users)
[sihara@c01 ~]$ salloc  -N 32 --ntasks-per-node=24 mpirun -np 768 /work/tools/bin/mdtest -n 2000 -F -u -d /cache1/mdt0
salloc: Granted job allocation 6043
-- started at 04/03/2019 18:44:27 --

mdtest-1.9.3 was launched with 768 total task(s) on 32 node(s)
Command line used: /work/tools/bin/mdtest "-n" "2000" "-F" "-u" "-d" "/cache1/mdt0"
Path: /cache1
FS: 3.9 TiB   Used FS: 0.0%   Inodes: 160.0 Mi   Used Inodes: 0.0%

768 tasks, 1536000 files

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     108634.397     108630.106     108631.673          0.614
   File stat         :     468761.147     468723.486     468736.693          6.927
   File read         :     261685.099     261646.894     261671.608          8.099
   File removal      :     180895.760     180851.349     180876.868          9.373
   Tree creation     :         61.624         61.624         61.624          0.000
   Tree removal      :

 

After applying patch https://review.whamcloud.com/34581, the non-root user gets the same file creation rate as the root user.

[sihara@c01 ~]$ id
uid=10000(sihara) gid=100(users) groups=100(users)
[sihara@c01 ~]$ salloc  -N 32 --ntasks-per-node=24 mpirun -np 768 /work/tools/bin/mdtest -n 2000 -F -u -d /cache1/mdt0
salloc: Granted job allocation 6048
-- started at 04/03/2019 19:11:49 --

mdtest-1.9.3 was launched with 768 total task(s) on 32 node(s)
Command line used: /work/tools/bin/mdtest "-n" "2000" "-F" "-u" "-d" "/cache1/mdt0"
Path: /cache1
FS: 3.9 TiB   Used FS: 0.0%   Inodes: 160.0 Mi   Used Inodes: 0.0%

768 tasks, 1536000 files

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     185227.246     185213.609     185218.466          2.187
   File stat         :     472370.658     472306.853     472329.189         13.733
   File read         :     262557.843     262528.916     262540.418          8.698
   File removal      :     177183.588     176814.351     177165.183         25.934
   Tree creation     :         43.364         43.364         43.364          0.000
   Tree removal      :         13.871         13.871         13.871          0.000
 

And no regression was found for the root user either (checked just in case).

[root@c01 ~]# id
uid=0(root) gid=0(root) groups=0(root)
[root@c01 ~]# salloc -N 32 --ntasks-per-node=24 mpirun -np 768 --allow-run-as-root /work/tools/bin/mdtest -n 2000 -F -u -d /cache1/mdt0
salloc: Granted job allocation 6050
-- started at 04/03/2019 19:14:49 --

mdtest-1.9.3 was launched with 768 total task(s) on 32 node(s)
Command line used: /work/tools/bin/mdtest "-n" "2000" "-F" "-u" "-d" "/cache1/mdt0"
Path: /cache1
FS: 3.9 TiB   Used FS: 0.0%   Inodes: 160.0 Mi   Used Inodes: 0.0%

768 tasks, 1536000 files

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :     184781.517     184747.142     184775.286          4.355
   File stat         :     471423.053     471288.526     471350.414         16.800
   File read         :     259265.668     259197.540     259250.629         12.143
   File removal      :     180106.410     180034.379     180086.385         10.014
   Tree creation     :         45.413         45.413         45.413          0.000
   Tree removal      :         13.507         13.507         13.507          0.000
Comment by Wang Shilong (Inactive) [ 03/Apr/19 ]

Thanks, Ihara, for testing the patch; I will include the results in the patch commit message.

Comment by Gerrit Updater [ 13/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34581/
Subject: LU-12151 osd-ldiskfs: pass owner down rather than transfer it
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 697f2d95bfdca13565ccc5d50e106114604c1724

Comment by Peter Jones [ 13/Apr/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 16/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34685
Subject: LU-12151 osd-ldiskfs: pass owner down rather than transfer it
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a6996b311a4c852a8eeb68b684c046d00fbac127

Comment by Gerrit Updater [ 21/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34685/
Subject: LU-12151 osd-ldiskfs: pass owner down rather than transfer it
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f3d83215acd79ad062d3c605ca7dc8ba373be65d

Comment by Andrew Perepechko [ 29/May/19 ]

Passing the xtimes down (even with resolution as low as 1 s) can sometimes be beneficial as well: https://github.com/Xyratex/lustre-stable/commit/7ab00b00eb057f6963c0b5641686240ef95e1388#diff-89ce3dab611fea06ce62efa5bed4ae63

Comment by Wang Shilong (Inactive) [ 29/May/19 ]

Hi Andrew Perepechko,

Yup, you guys had a similar optimization three years ago. It is a pity that Lustre upstream did not have something similar for such a long time.

Passing the xtimes down could avoid an extra ext4 inode dirty operation (which reduces jbd2 memory operations). It is not as big an improvement as the uid/gid case, but it is still worth doing.
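
For illustration, a rough sketch of that pattern (the helper names and prototypes below are hypothetical, not the Xyratex or Lustre code):

/* Stamping the initial a/m/ctime while the inode is being built means
 * the only mark_inode_dirty() is the one the allocator already does;
 * a separate follow-up attribute update would dirty the inode (and copy
 * its buffer into the jbd2 journal) a second time. */
static struct inode *osd_mkfile_sketch(handle_t *handle, struct inode *dir,
				       umode_t mode, uid_t owner[2],
				       struct timespec ts)
{
	struct inode *inode;

	/* current behaviour: create, then dirty the inode again for the times */
	inode = ldiskfs_create_inode(handle, dir, mode, owner);
	if (!IS_ERR(inode)) {
		inode->i_atime = inode->i_mtime = inode->i_ctime = ts;
		mark_inode_dirty(inode);	/* extra journalled update */
	}

	/* proposed: also pass the initial times down, e.g.
	 *   inode = ldiskfs_create_inode_attrs(handle, dir, mode, owner, &ts);
	 * so no second dirty is needed */
	return inode;
}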

Do you agree we should open a separate ticket for that?

Thank you,
Shilong

Comment by Andrew Perepechko [ 29/May/19 ]

Hi Wang Shilong,

Unfortunately, the patch was dropped from the porting list and forgotten for a while.

I'll measure how the xtime optimization improves performance in addition to the owner optimization and open a new ticket.

Are you ok with that?

Thank you

Comment by Wang Shilong (Inactive) [ 29/May/19 ]

Hi,

Yup, that would be nice.

Thank you,
Shilong
