[LU-17276] Use interval tree in flock to improve the scalability Created: 09/Nov/23 Updated: 07/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Yang Sheng | Assignee: | Yang Sheng |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running a test to exercise flock (N clients requesting a lock on different non-overlapping regions) showed high load on the MDS. There was previously a similar issue with regular extent locks (which are used to control data coherency for regular reads and writes), and adding an extra structure, an interval tree, reduced the load. Likely the same structure can be used for flock.

FLOCKS_TEST 5: SET write  1000 flock(s) took   0.06s  16048.30/sec
FLOCKS_TEST 5: SET write  2000 flock(s) took   0.14s  14526.76/sec
FLOCKS_TEST 5: SET write  5000 flock(s) took   0.60s   8264.82/sec
FLOCKS_TEST 5: SET write 10000 flock(s) took   2.94s   3401.57/sec
FLOCKS_TEST 5: SET write 25000 flock(s) took  36.29s    688.98/sec
FLOCKS_TEST 5: SET write 50000 flock(s) took 281.29s    177.75/sec
FLOCKS_TEST 5: SET write 75000 flock(s) took 661.73s    113.34/sec

Test case in patch https://review.whamcloud.com/53094 "LU-17276 tests: flock scalability test". |
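Editor's note: the core idea is to replace the linear scan of a resource's flock list with an interval-tree lookup, so that only locks overlapping the requested range are visited. Below is a minimal, hedged sketch using the Linux kernel's generic interval tree API; it is an illustration only, not code from the patches referenced in this ticket. The `ldlm_flock_node` wrapper and the helper names are hypothetical, and real flock ranges are 64-bit while this helper uses the API's `unsigned long` endpoints.

```c
/*
 * Sketch only: index the flock locks of one resource in an interval tree so
 * a conflict scan visits only overlapping locks (O(log n + k)) instead of
 * walking the whole lock list.  Names below are hypothetical.
 */
#include <linux/interval_tree.h>
#include <linux/kernel.h>
#include <linux/rbtree.h>
#include <linux/slab.h>

struct ldlm_flock_node {
	struct interval_tree_node fn_node;	/* [start, last] of the flock */
	void			 *fn_lock;	/* back-pointer to the lock */
};

/* Insert a lock covering [start, end] into the per-resource tree. */
static int flock_tree_insert(struct rb_root_cached *root, void *lock,
			     unsigned long start, unsigned long end)
{
	struct ldlm_flock_node *fn = kmalloc(sizeof(*fn), GFP_NOFS);

	if (!fn)
		return -ENOMEM;
	fn->fn_lock = lock;
	fn->fn_node.start = start;
	fn->fn_node.last = end;
	interval_tree_insert(&fn->fn_node, root);
	return 0;
}

/* Call @cb only for locks whose extents overlap [start, end]. */
static void flock_tree_for_each_overlap(struct rb_root_cached *root,
					unsigned long start, unsigned long end,
					void (*cb)(void *lock))
{
	struct interval_tree_node *node;

	for (node = interval_tree_iter_first(root, start, end); node;
	     node = interval_tree_iter_next(node, start, end)) {
		struct ldlm_flock_node *fn =
			container_of(node, struct ldlm_flock_node, fn_node);

		cb(fn->fn_lock);
	}
}
```

Compared with the full list walk behind the roughly quadratic slowdown in the numbers above, this keeps each conflict lookup logarithmic even with tens of thousands of locks on one resource.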
| Comments |
| Comment by Gerrit Updater [ 06/Dec/23 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53346 |
| Comment by Gerrit Updater [ 13/Dec/23 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53447 |
| Comment by Gerrit Updater [ 18/Dec/23 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53495 |
| Comment by Gerrit Updater [ 25/Dec/23 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53551 |
| Comment by Yang Sheng [ 26/Dec/23 ] |

For the record: the flock patch was split into 3 parts; test results are as below:
Note: "master" == the original Lustre; "new-wop" == patch #1 + #3 (without prealloc); "new" == patch #1 + #2 + #3. "Split" means testing the split case (enqueue a lock, then split it). We need to allocate a new lock when an existing lock is split, e.g.:
existing lock PR(1,10), enqueue lock PW(5,6)
result: lock1(1,4); lock2(5,6); lock3(7,10)
lock3 has to be newly allocated (see the sketch after this comment).
But the resource lock must be released during that allocation, so after it is re-taken we have to restart the whole process from the beginning. I was thinking of preallocating a lock before taking the resource lock, to avoid releasing it in the middle. After testing, it looks like the side effect is bigger than the benefit, so the 2nd patch is useless anyway. It is really sad. BTW: Hi Andreas, where should the flock performance test case be placed? sanity is obviously not the right place.
Thanks, YangSheng |
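Editor's note: to make the split case concrete, here is a hedged, stand-alone sketch of the split arithmetic (hypothetical names, not the ldlm code). A granted PW(5,6) request against an existing PR(1,10) lock leaves a head fragment (1,4), which can reuse the existing lock structure, and a tail fragment (7,10), which needs a freshly allocated lock; if that allocation has to drop the resource lock, the conflict scan must restart once the lock is re-taken, which is what the abandoned preallocation patch (#2) tried to avoid.

```c
/* Illustration only: compute the fragments left over when a conflicting
 * request is carved out of an existing flock range.  Hypothetical names. */
struct flock_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Split @existing around @request.  Returns the number of surviving
 * fragments (0, 1 or 2).  The head fragment, if any, stays in *existing
 * (its end is trimmed); the tail fragment, if any, is written to *tail and
 * is the one that forces a new lock allocation.
 */
static int flock_split(struct flock_range *existing,
		       const struct flock_range *request,
		       struct flock_range *tail)
{
	unsigned long orig_end = existing->end;
	int fragments = 0;

	if (existing->start < request->start) {
		/* keep the head: (1,10) vs (5,6) -> head (1,4) */
		existing->end = request->start - 1;
		fragments++;
	}
	if (orig_end > request->end) {
		/* tail fragment: (7,10); a new lock structure is needed */
		tail->start = request->end + 1;
		tail->end = orig_end;
		fragments++;
	}
	return fragments;
}
```

With existing = (1,10) and request = (5,6), flock_split() returns 2, trims the existing lock to (1,4), and fills the tail with (7,10).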
| Comment by Andreas Dilger [ 26/Dec/23 ] |

It looks like the "wop" tests are showing the best performance, which I agree is unexpected, especially in the "split" case where there will always need to be a new lock allocated each time. It might be interesting to see what happens when there are 250k or 500k locks on a resource? It might be that the overhead of allocating an extra lock is high compared to the cost of re-scanning the list if it is small, but the overhead gets higher when the list is large? It looks like the "1+2+3" times were getting closer to "1+2" with more locks? This might mean preallocating the lock when a resource has > 100k entries but not otherwise? |
| Comment by Gerrit Updater [ 27/Dec/23 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53558 |
| Comment by Andreas Dilger [ 27/Dec/23 ] |

It looks like switching to prealloc when >= 50,000 locks on a resource is probably a reasonable balance. That saves 113s for the contended/split case, while only slowing the uncontended case by 11s. With that many locks on a single object it is very likely that there will be conflicts between the locks. It might be possible to make this smarter by checking the lock type (e.g. read is unlikely to split vs. write is more likely to split) and making a prealloc decision based on that, and/or keeping a history of whether recent locks were conflicting or not. However, I expect the number of times this added complexity will be helpful is relatively low, since few applications will be flocking a single file so many times. The most important thing is to avoid the O(n^2) behavior of "master+split" that we have today. |
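Editor's note: a hedged sketch of the threshold heuristic discussed here; the constant, function name, and mode-based shortcut are hypothetical illustrations, not identifiers from the merged patch.

```c
/* Sketch only: decide whether to preallocate a spare lock before taking the
 * resource lock, so a split never has to drop it mid-scan.  The threshold
 * and the mode check are illustrative, not from the actual patch. */
#include <linux/types.h>

#define FLOCK_PREALLOC_THRESHOLD	50000

static bool flock_should_prealloc(unsigned int nr_flock_locks,
				  bool request_is_write)
{
	/* read requests rarely cause a split, so skip the speculative
	 * allocation for them even on a heavily locked resource */
	if (!request_is_write)
		return false;

	return nr_flock_locks >= FLOCK_PREALLOC_THRESHOLD;
}
```

The point of the threshold is that a mid-scan restart costs roughly one more pass over the existing locks, which is cheap on a short list and expensive only when the resource already carries tens of thousands of flocks.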
| Comment by Yang Sheng [ 27/Dec/23 ] |

Hi Andreas,
Since a split can occur in every scenario except the get_lock operation, making the decision based on the lock count is reasonable, but it is hard to settle on a specific value since it depends on the environment. I agree it is a rare case to flock one file so many times, and the split case is also not too frequent in the real world, so it does not look worth adding the complexity for it. If you think it would still be helpful in some cases, maybe we could make it an optional feature controlled via proc?
Thanks, YangSheng |
| Comment by Gerrit Updater [ 10/Jan/24 ] |

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53558/ |
| Comment by Gerrit Updater [ 15/Jan/24 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53675 |
| Comment by Gerrit Updater [ 01/Feb/24 ] |

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53881 |
| Comment by Gerrit Updater [ 07/Feb/24 ] |

"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53950 |
| Comment by Gerrit Updater [ 07/Feb/24 ] |
|
"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53951 |