  Lustre / LU-9796

Speedup file creation under heavy concurrency

Details


    Description

      In general, there is some metadata performance regression with quota enabled versus disabled.
      Now that project quota has been introduced, it is time to measure and improve metadata performance when quota is enabled.
      Also, it seems the upstream kernel has some performance optimizations for ext4 with quota enabled; it might be possible to bring those optimizations to Lustre.

      Attachments

        Activity

          [LU-9796] Speedup file creation under heavy concurrency

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28276/
          Subject: LU-9796 kernel: improve metadata performaces for RHEL7
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 17fe3c192e101ace75b2f4d7f7e9ff7d8d85480e


          bogl Bob Glossman (Inactive) added a comment -

          I see recent mods in master have landed for this issue, but only for RHEL 7.x.
          Is this not needed for RHEL 6.x, SLES 11/12, or Ubuntu?

          pjones Peter Jones added a comment -

          Does the remaining patch need to land too or can that be abandoned and this ticket marked resolved?


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29032/
          Subject: LU-9796 ldiskfs: improve inode allocation performance
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3f0a7241c434d9556308299eea069628715816c2

          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Attached are mds-survey test results (file creation) with and without the patch: https://jira.hpdd.intel.com/secure/attachment/28338/LU-9796.png
          1 x MDS(1 x Platinum 8160 CPU, 128GB DDR4 memory)
          1 x MDT(SFA7700X, RAID10 with 4 x Toshiba RI SSD)
          Lustre-2.10.1CR (user/group quota enabled) on CentOS7.3

          The patch speeds up file creation, especially at high thread counts. We are now reaching ~250K ops/sec for file creation with a single MDT on the RHEL7 kernel.


          gerrit Gerrit Updater added a comment -

          Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/29032
          Subject: LU-9796 ldiskfs: improve inode allocation performace
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 0f94e45786dc92f9ec048908234153a547d64462


          paf Patrick Farrell (Inactive) added a comment -

          Ah, I like that much better! And it looks like it's much faster too, at least for creates. Nice. (In fact, faster than the original code before the regression, isn't it?)

          wangshilong Wang Shilong (Inactive) added a comment - - edited

          Hello Patrick,

          I have refreshed the patch to use a new approach, which looks better to me.
          https://patchwork.ozlabs.org/patch/799014/

          I don't like a timed sleep either.


          paf Patrick Farrell (Inactive) added a comment -

          Ah, sorry for my misunderstanding.

          So then it seems your problem is "inode stealing", where another thread uses the inode bit/number in the group before you can do it. So you end up contending for that lock because you're having to try over and over... (Which also is why it makes some sense that you count attempts rather than look directly at lock contention.)

          I don't know what the comments on the list have been (it looks like silence so far), but it really bothers me to see "insert an arbitrary timer delay" as the solution here. That doesn't seem very future proof. Isn't there something we can do directly about the stealing? Increase lock coverage, change how we find the bit, set it to in use before we do all the testing and unset it if the testing fails, that sort of thing? It's hard to say exactly what would be safe.

          Because it looks like the current problem is every thread is doing work for every inode, but only one is really getting to use the work it does. So perhaps we should lock around the find_next_zero_bit and set the bit there. That complicates the error path, but it seems like it would (mostly) guarantee forward progress for each thread. Perhaps that's not safe for other reasons, perhaps we can't set that bit (even temporarily) until we know the other things we check... I don't know...

          But inserting a timed sleep...
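
          For illustration, a minimal sketch of the "claim the bit under the group lock" idea described above, assuming ext4/ldiskfs internals (ext4_lock_group(), ext4_find_next_zero_bit(), ext4_set_bit(), EXT4_INODES_PER_GROUP()); the helper name claim_free_inode is hypothetical, and this is not the patch that actually landed on this ticket:

          /*
           * Sketch only -- assumes the fs/ext4 internal headers (ialloc.c context).
           * Idea: search for and claim the inode bit while holding the group lock,
           * so a concurrent allocator in the same group cannot "steal" the slot
           * between the search and the set, which is what forces the retry loop
           * (and the lock contention) today.
           */
          static int claim_free_inode(struct super_block *sb, ext4_group_t group,
                                      struct buffer_head *inode_bitmap_bh,
                                      unsigned long *ret_ino)
          {
                  unsigned long max = EXT4_INODES_PER_GROUP(sb);
                  unsigned long ino;

                  ext4_lock_group(sb, group);
                  ino = ext4_find_next_zero_bit((unsigned long *)
                                                inode_bitmap_bh->b_data, max, 0);
                  if (ino >= max) {
                          ext4_unlock_group(sb, group);
                          return -ENOSPC;  /* group is full, caller tries the next group */
                  }
                  /* Claim it now; any later check that fails must clear it again under the lock. */
                  ext4_set_bit(ino, inode_bitmap_bh->b_data);
                  ext4_unlock_group(sb, group);

                  *ret_ino = ino;
                  return 0;
          }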

          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Patrick,
          The queued spinlock you mentioned is CONFIG_QUEUED_SPINLOCKS, right?
          Good to know; we weren't aware of it, but it was enabled in all of our upstream testing.

          # grep -i spinlocks .config 
          CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
          CONFIG_QUEUED_SPINLOCKS=y
          CONFIG_PARAVIRT_SPINLOCKS=y
          # Lock Debugging (spinlocks, mutexes, etc...)
          

          But we saw the same contention even with that enabled.


          wangshilong Wang Shilong (Inactive) added a comment -

          Hello Patrick,

          We did the same benchmark test on RHEL7 and the latest upstream kernel, and the result is the same: we hit
          the same lock contention problem, and the patch consistently improved the performance. At least there are
          some benchmark numbers in the patch changelog that you may not have noticed...

          Thanks,
          Shilong


          People

            hongchao.zhang Hongchao Zhang
            ihara Shuichi Ihara (Inactive)
            Votes: 0
            Watchers: 12
