[LU-9796] Speedup file creation under heavy concurrency Created: 25/Jul/17  Updated: 10/Sep/18  Resolved: 07/May/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.5

Type: Improvement Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: patch

Attachments: PNG File LU-9796.png     Microsoft Word Performance-difference-Quota.xlsx     Microsoft Word metadata-performance-upstreamkernel.xlsx    
Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

In general, there is some metadata performance regression with quota enabled compared to without.
Now that project quota has been introduced, it is time to measure metadata performance and improve it when quota is enabled.
Also, the upstream kernel appears to have some performance optimizations for ext4 when quota is enabled; it might be possible to bring those optimizations to Lustre.



 Comments   
Comment by Shuichi Ihara (Inactive) [ 25/Jul/17 ]

https://jira.hpdd.intel.com/secure/attachment/27819/Performance-difference-Quota.xlsx
These are the metadata performance results (no quota, user quota, and user/project quota).

Comment by Shuichi Ihara (Inactive) [ 25/Jul/17 ]

https://jira.hpdd.intel.com/secure/attachment/27820/metadata-performance-upstreamkernel.xlsx
Metadata performance test results on ext4 with the upstream kernel.

Comment by Wang Shilong (Inactive) [ 25/Jul/17 ]

Hi,
We did some tracing with and without quota enabled on an ext4 filesystem; here are the top cost differences.

Without quota enabled:

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
ext4_create                    7726432              320000       24.15       
ext4_add_nondir                3504535              320000       10.95       
ext4_add_entry                 2914804              320002       9.11        
ext4_mark_inode_dirty          2419282              1286364      1.88        
ext4_lookup                    1563878              320001       4.89        
ext4_find_entry                1511832              320001       4.72        
ext4_dx_find_entry             1459249              319830       4.56        
ext4_reserve_inode_write       1362568              1286364      1.06        
ext4_bread                     1337678              1816657      0.74        
jbd2_journal_get_write_access  1286821              2582081      0.50        
ext4_getblk                    1053137              1816828      0.58        
ext4_mark_iloc_dirty           847583               1286364      0.66        
ext4_ext_tree_init             697111               320002       2.18        
ext4_map_blocks                686059               1842022      0.37  

With quota enabled:

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
ext4_create                    9890855              320000       30.91       
ext4_add_nondir                3664890              320000       11.45       
ext4_add_entry                 3059475              320002       9.56        
ext4_mark_inode_dirty          2428471              1286392      1.89        
ext4_bread                     1854161              2463799      0.75        
jbd2_journal_get_write_access  1689248              3228565      0.52        
ext4_mark_dquot_dirty          1658410              646386       2.57        
ext4_lookup                    1562224              320001       4.88        
ext4_write_dquot               1528050              646386       2.36        
ext4_find_entry                1508029              320001       4.71        
ext4_dx_find_entry             1457535              319830       4.56        
ext4_getblk                    1454993              2463970      0.59        
ext4_reserve_inode_write       1370951              1286392      1.07        
ext4_quota_write               1158875              646386       1.79        
ext4_map_blocks                957721               2486139      0.39        
ext4_mark_iloc_dirty           845593               1286392      0.66        
jbd2_journal_dirty_metadata    716382               2908737      0.25        
ext4_ext_tree_init             710756               320002       2.22     

It looks like the journaled quota writes affect performance; when we tried disabling the
ext4_mark_dquot_dirty() call, performance mostly came back.
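
For context, a minimal sketch of why that call is expensive, paraphrased from the RHEL7-era fs/ext4/super.c (exact code varies by kernel version): with journaled quota enabled, marking a dquot dirty immediately writes it through the journal instead of only setting a flag for later writeback.

/* Sketch paraphrased from RHEL7-era fs/ext4/super.c; details vary by kernel. */
static int ext4_mark_dquot_dirty(struct dquot *dquot)
{
	/* Are we journaling quotas? */
	if (EXT4_SB(dquot->dq_sb)->s_qf_names[USRQUOTA] ||
	    EXT4_SB(dquot->dq_sb)->s_qf_names[GRPQUOTA]) {
		dquot_mark_dquot_dirty(dquot);
		/* Journaled quota: every dirty-mark pays for a journal
		 * write, which shows up in the trace above as
		 * ext4_write_dquot() and ext4_quota_write(). */
		return ext4_write_dquot(dquot);
	}
	/* Non-journaled quota: only the dirty flag is set. */
	return dquot_mark_dquot_dirty(dquot);
}

This is consistent with the traced counts: the quota-enabled run adds ~646K calls each to ext4_mark_dquot_dirty() and ext4_write_dquot(), roughly two per create (one for the user dquot and one for the group dquot).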

Comment by Peter Jones [ 28/Jul/17 ]

Hongchao

Can you please assist with this one?

Thanks

Peter

Comment by Gerrit Updater [ 30/Jul/17 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/28276
Subject: LU-9796 kernel: improve metadata performaces for RHEL7
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b5595fc5ccbae605917ae77a2990a6acc4fa29a7

Comment by Shuichi Ihara (Inactive) [ 01/Aug/17 ]

Patch https://review.whamcloud.com/28276 improves performance on Lustre, but causes a performance regression with mds-survey because of a large number of _raw_spin_lock() calls.

Without patch

[root@mds04 lustre-release-ee-ddn]# tests_str="create" thrhi=48 thrlo=48 file_count=2000000 dir_count=48 mds-survey
Tue Aug  1 16:13:54 JST 2017 /usr/bin/mds-survey from mds04
mdt 1 file 2000000 dir   48 thr   48 create 119346.25 [ 38996.65, 143993.95]

With patch

[root@mds04 ~]#  tests_str="create" thrhi=48 thrlo=48 file_count=2000000 dir_count=48 mds-survey
Tue Aug  1 17:01:32 JST 2017 /usr/bin/mds-survey from mds04
mdt 1 file 2000000 dir   48 thr   48 create 35689.72 [ 24997.45, 46995.91] 

perf-tools/bin/funccost '_raw_spin_lock*,ldiskfs*,jbd2*' shows the difference in function costs with and without the patch.

without patch

perf-tools/bin/funccost  '_raw_spin_lock*,ldiskfs*,jbd2*'

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
ldiskfs_create_inode           668090967            2000016      334.04      
jbd2_journal_get_write_access  660607521            28364899     23.29       
ldiskfs_dx_add_entry           76861625             1994470      38.54       
ldiskfs_mark_inode_dirty       31383256             8050242      3.90        
ldiskfs_reserve_inode_write    30442931             12050274     2.53        
ldiskfs_mark_dquot_dirty       29298284             4051620      7.23        
ldiskfs_write_dquot            27393055             4051620      6.76        
ldiskfs_xattr_trusted_set      25413629             4000032      6.35        
ldiskfs_xattr_set              24943944             4000032      6.24        
ldiskfs_xattr_set_handle       22973138             4000032      5.74        
jbd2_journal_dirty_metadata    20879428             23877421     0.87        
ldiskfs_bread                  18514454             15808891     1.17        
ldiskfs_mark_iloc_dirty        18151829             12050274     1.51        
ldiskfs_get_inode_loc          16463367             18050473     0.91        
ldiskfs_dirty_inode            15518054             4025922      3.85        
ldiskfs_getblk                 15396652             15816550     0.97        
_raw_spin_lock_irqsave         13349523             41367053     0.32        
jbd2_journal_put_journal_head  10369726             52749533     0.20        
ldiskfs_find_entry             10361625             2002129      5.18        
ldiskfs_dx_find_entry          10064239             1994470      5.05        
ldiskfs_quota_write            9133495              3600359      2.54        
_raw_spin_lock                 8851352              104218345    0.08    

with patch

perf-tools/bin/funccost  '_raw_spin_lock*,ldiskfs*,jbd2*'

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
ldiskfs_create_inode           2510051302           2000016      1255.02     
_raw_spin_lock                 2382710363           191888590    12.42       
ldiskfs_read_inode_bitmap      57616803             2002599      28.77       
jbd2_journal_get_write_access  44399656             103374244    0.43        
ldiskfs_dx_add_entry           42191184             1994470      21.15       
ldiskfs_bread                  31150135             16197888     1.92        
ldiskfs_mark_inode_dirty       19684052             8050194      2.45        
ldiskfs_mark_dquot_dirty       18764931             4051620      4.63        
ldiskfs_write_dquot            17041111             4051620      4.21        
ldiskfs_xattr_trusted_set      14533087             4000032      3.63        
ldiskfs_reserve_inode_write    14086373             12050226     1.17        
ldiskfs_xattr_set              14036664             4000032      3.51        
ldiskfs_getblk                 13523271             16205547     0.83        
ldiskfs_xattr_set_handle       11857052             4000032      2.96        
ldiskfs_mark_iloc_dirty        11561327             12050226     0.96        
ldiskfs_get_inode_loc          11498196             18050426     0.64        
ldiskfs_find_entry             10792485             2002129      5.39    

Comment by Shuichi Ihara (Inactive) [ 01/Aug/17 ]

After more investigation, Shilong made a prototype patch to reduce contention on _raw_spin_lock(). He can submit the patch soon.
Without patch (Unique Directory)

#  for i in `seq 0 4`; do sleep 5; tests_str="create" thrhi=48 thrlo=48 file_count=2000000 \
dir_count=48 mds-survey; sleep 5; tests_str="destroy" thrhi=48 thrlo=48 file_count=2000000 \
dir_count=48 mds-survey; done
        create      destroy
1       135,864     241,262
2       137,432     245,305
3       135,347     228,004
4       135,641     223,845
5       137,300     242,719

With patch (Unique Directory)

#  for i in `seq 0 4`; do sleep 5; tests_str="create" thrhi=48 thrlo=48 file_count=2000000 \
dir_count=48 mds-survey; sleep 5; tests_str="destroy" thrhi=48 thrlo=48 file_count=2000000 \
dir_count=48 mds-survey; done
        create      destroy
1       223,178     273,882
2       181,892     269,771
3       202,841     235,435
4       195,022     220,466
5       193,564     263,998

We are getting a ~50% performance improvement for creation for now.

Comment by Wang Shilong (Inactive) [ 02/Aug/17 ]

The upstream ext4 patch can be seen here:
http://marc.info/?l=linux-ext4&m=150164892405614&w=2

Comment by Patrick Farrell (Inactive) [ 04/Aug/17 ]

Shilong,

I'm not on the ext4 list so I won't comment there, but that code should almost certainly use ext4_fs_is_busy() rather than counting attempts directly; see the sketch below.
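
For reference, ext4 already tracks group-lock contention itself; the relevant helpers, paraphrased from fs/ext4/ext4.h of that era (treat as illustrative, details vary by kernel):

/* ext4_lock_group() takes the per-group spinlock with a trylock first and
 * maintains a contention counter in sbi->s_lock_busy. */
static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
{
	spinlock_t *lock = ext4_group_lock_ptr(sb, group);

	if (spin_trylock(lock))
		/* Got the lock immediately: decay the contention counter. */
		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
	else {
		/* Lock is busy: bump the counter, then wait for the lock. */
		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
				  EXT4_MAX_CONTENTION);
		spin_lock(lock);
	}
}

/* ext4_fs_is_busy() reports sustained contention; this is the signal
 * suggested above instead of counting allocation attempts directly. */
static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
{
	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
}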

Separately, did you try that patch on the current upstream kernel?

Is the problem spinlock contention (most time spent in lock/unlock, not just waiting for the lock - a problem which is fixed in newer kernels), or is it actually waiting for the lock (most time spent waiting because the lock really is held)?

Anyway, if you haven't, you should try this on a newer kernel - spinlock contention as a performance problem is more or less fixed with queued spinlocks. (Multiple waiters for a spinlock now have minimal performance impact on lock/unlock, whereas in earlier kernels multiple waiters caused locking and unlocking to take many times longer. It won't fix the problem of having to wait for the lock, but it removes lock contention itself as the cause of performance issues.) RHEL7 doesn't have them, sadly.

If your problem really is spinlock contention, it's not going to show up on newer kernels. We might still want the patch for RHEL7.

Comment by Wang Shilong (Inactive) [ 04/Aug/17 ]

Hello Patrick,

We did run the same benchmark on RHEL7 and the latest upstream kernel, and the result is the same: we hit
the same lock contention problem, and the patch consistently improved performance. There are
some benchmark numbers in the patch changelog that you may not have noticed...

Thanks,
Shilong

Comment by Shuichi Ihara (Inactive) [ 05/Aug/17 ]

Patrick,
The queued spinlocks you mentioned are CONFIG_QUEUED_SPINLOCKS, right?
Good to know; we weren't aware of them, but they were enabled in all our upstream testing.

# grep -i spinlocks .config 
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_PARAVIRT_SPINLOCKS=y
# Lock Debugging (spinlocks, mutexes, etc...)

But we saw the same contention even with that enabled.

Comment by Patrick Farrell (Inactive) [ 07/Aug/17 ]

Ah, sorry for my misunderstanding.

So then it seems your problem is "inode stealing", where another thread uses the inode bit/number in the group before you can. So you end up contending for that lock because you have to try over and over... (Which is also why it makes some sense that you count attempts rather than looking directly at lock contention.)

I don't know what the comments on the list have been (it looks like silence so far), but it really bothers me to see "insert an arbitrary timer delay" as the solution here. That doesn't seem very future-proof. Isn't there something we can do directly about the stealing? Increase lock coverage, change how we find the bit, set it to in-use before we do all the testing and unset it if the testing fails, that sort of thing? It's hard to say exactly what would be safe.

Because it looks like the current problem is that every thread is doing work for every inode, but only one really gets to use the work it does. So perhaps we should lock around the find_next_zero_bit and set the bit there. That complicates the error path, but it seems like it would (mostly) guarantee forward progress for each thread. Perhaps that's not safe for other reasons; perhaps we can't set that bit (even temporarily) until we know the other things we check... I don't know...

But inserting a timed sleep...
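
To make that concrete, a minimal sketch of the "claim the bit under the group lock" idea (illustrative only, not the actual patch; claim_free_ino() is a hypothetical helper, though ext4_lock_group(), ext4_find_next_zero_bit(), and ext4_set_bit() are real ext4 primitives):

/* Hypothetical helper: find and claim a free inode bit in one step under
 * the per-group lock, so concurrent creators cannot "steal" the bit
 * between the search and the set. */
static int claim_free_ino(struct super_block *sb, ext4_group_t group,
			  struct buffer_head *bitmap_bh, unsigned long *ino)
{
	int ret = -ENOSPC;

	ext4_lock_group(sb, group);
	*ino = ext4_find_next_zero_bit((unsigned long *)bitmap_bh->b_data,
				       EXT4_INODES_PER_GROUP(sb), 0);
	if (*ino < EXT4_INODES_PER_GROUP(sb)) {
		/* Mark the bit in-use while still holding the lock: each
		 * thread claims a distinct inode and never repeats another
		 * thread's work.  The error path must clear the bit again
		 * if later validation fails. */
		ext4_set_bit(*ino, bitmap_bh->b_data);
		ret = 0;
	}
	ext4_unlock_group(sb, group);
	return ret;	/* -ENOSPC: group is full, move to the next group */
}

The patch that eventually went upstream takes a similar line: when ext4_test_and_set_bit() finds the bit already taken, it repeats the bitmap search while still holding the group lock instead of dropping the lock and retrying the whole loop.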

Comment by Wang Shilong (Inactive) [ 08/Aug/17 ]

Hello Patrick,

I have refreshed the patch to use a new approach, which looks better to me.
https://patchwork.ozlabs.org/patch/799014/

I don't like a timed sleep either.

Comment by Patrick Farrell (Inactive) [ 08/Aug/17 ]

Ah, I like that much better! And it looks like it's much faster too, at least for creates. Nice. (In fact, faster than the original code before the regression, isn't it?)

Comment by Gerrit Updater [ 16/Sep/17 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/29032
Subject: LU-9796 ldiskfs: improve inode allocation performace
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f94e45786dc92f9ec048908234153a547d64462

Comment by Shuichi Ihara (Inactive) [ 22/Sep/17 ]

Attached are the mds-survey test results (file creation) with and without the patch: https://jira.hpdd.intel.com/secure/attachment/28338/LU-9796.png
1 x MDS(1 x Platinum 8160 CPU, 128GB DDR4 memory)
1 x MDT(SFA7700X, RAID10 with 4 x Toshiba RI SSD)
Lustre-2.10.1CR (user/group quota enabled) on CentOS7.3

The patch helps speed up file creation, especially at high thread counts. We are now reaching ~250K ops/sec for file creation with a single MDT on the RHEL7 kernel.

Comment by Gerrit Updater [ 22/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29032/
Subject: LU-9796 ldiskfs: improve inode allocation performance
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3f0a7241c434d9556308299eea069628715816c2

Comment by Peter Jones [ 22/Nov/17 ]

Does the remaining patch need to land too or can that be abandoned and this ticket marked resolved?

Comment by Bob Glossman (Inactive) [ 27/Nov/17 ]

I see that recent mods in master have landed for this issue, but only for RHEL 7.x.
Is this not needed for RHEL 6.x, SLES 11/12, or Ubuntu?

Comment by Gerrit Updater [ 04/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28276/
Subject: LU-9796 kernel: improve metadata performaces for RHEL7
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 17fe3c192e101ace75b2f4d7f7e9ff7d8d85480e

Comment by Peter Jones [ 04/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 19/Mar/18 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/31683
Subject: Revert "LU-9796 kernel: improve metadata performaces for RHEL7"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4232b2842ff720791f2afb55b80ba6640142f624

Comment by Gerrit Updater [ 19/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31683/
Subject: Revert "LU-9796 kernel: improve metadata performaces for RHEL7"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2083ffd1bc6c772972834b50e5aef2118c88658d

Comment by Patrick Farrell (Inactive) [ 19/Mar/18 ]

Are there any details on the failures seen, like an LU tracking them or anything?

Comment by Peter Jones [ 19/Mar/18 ]

No. AFAIK these are unconfirmed suspicions from support situations. We're just erring on the side of caution for the time being.

Comment by Patrick Farrell (Inactive) [ 19/Mar/18 ]

OK, thanks!  I'll keep an eye on this bug, then.

Comment by Gerrit Updater [ 04/May/18 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/32295
Subject: LU-9796 ldiskfs: improve inode allocation performance
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: e5e6c26a0f3137f4dbd2bb300a437116c957727d

Comment by Minh Diep [ 07/May/18 ]

wangshilong, are you planning to submit any more patches since #31683 was reverted?

Comment by Wang Shilong (Inactive) [ 07/May/18 ]

Nope, you can close the ticket now.

Comment by Peter Jones [ 07/May/18 ]

ok, thanks

Comment by Gerrit Updater [ 11/Jun/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32295/
Subject: LU-9796 ldiskfs: improve inode allocation performance
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: f27584430fc8b1379a4f6f064b9b201da8deec92

Comment by Lukasz Flis [ 10/Sep/18 ]

I am sorry for commenting on a "resolved" issue.

This patch is not included in the RHEL 7.5 builds - is that intended?

# checking with sources from 3.10.0-862.9.1.el7 patched kernel SRPM
[root@kernel-builder linux-3.10.0-862.9.1.el7]# patch -p1 < ext4-reduce-lock-contention-in-__ext4_new_inode.patch 
patching file fs/ext4/ialloc.c
Hunk #1 succeeded at 698 (offset -5 lines).
Hunk #2 FAILED at 840.
Hunk #3 succeeded at 873 (offset -1 lines).
1 out of 3 hunks FAILED -- saving rejects to file fs/ext4/ialloc.c.rej

[root@kernel-builder linux-3.10.0-862.9.1.el7]# patch -p1 < ext4-cleanup-goto-next-group.patch
patching file fs/ext4/ialloc.c
Hunk #1 succeeded at 815 (offset 40 lines).
Hunk #2 succeeded at 838 (offset 40 lines).

I can't see these fixes in the 7.5 kernel changelog.

 
