[LU-4801] spin lock contention in lock_res_and_lock Created: 21/Mar/14 Updated: 22/Apr/21 Resolved: 22/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Ned Bass | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: |
lustre-2.4.0-26chaos |
| Attachments: | parallel_flock.c, parallel_flock_v2.c |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 13209 |
| Description |
|
Our MDS experienced severe lock contention in lock_res_and_lock(). This had a large impact on client responsiveness because service threads were starved for CPU time. We have not yet identified the client workload that caused this problem. All active tasks had stack traces like this, but would eventually get scheduled out.

    ...
    __spin_lock
    lock_res_and_lock
    ldlm_handle_enqueue0
    mdt_handle_common
    mds_regular_handle
    ptlrpc_server_handle_request
    ...

This raises the question of why the ldlm resource lock needs to be a spinlock. Couldn't we avoid this issue by converting it to a mutex? This question was raised in |
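For context, the contention point described above is a single per-resource spinlock that every MDS service thread takes before touching any lock belonging to that resource. The sketch below only illustrates that structure, using simplified, invented type and field names; it is not the actual Lustre ldlm code.

```c
/*
 * Hedged sketch of the construct under discussion -- NOT the actual
 * Lustre source.  The point: every enqueue/cancel/callback path for
 * locks on the same resource (e.g. the same file) funnels through one
 * per-resource spinlock, so thousands of clients working on one file
 * serialize all service threads on that lock.
 */
#include <linux/spinlock.h>

struct resource_sketch {
	spinlock_t lr_lock;	/* per-resource lock -- the contended one */
	/* granted/waiting lock lists hang off the resource in the real code */
};

struct lock_sketch {
	struct resource_sketch *l_resource;
};

/* Analogue of lock_res_and_lock(): hundreds of threads spin here. */
static struct resource_sketch *lock_res_sketch(struct lock_sketch *lock)
{
	struct resource_sketch *res = lock->l_resource;

	spin_lock(&res->lr_lock);
	return res;
}
```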
| Comments |
| Comment by Peter Jones [ 22/Mar/14 ] |
|
Oleg, could you please comment? Thanks, Peter |
| Comment by Oleg Drokin [ 23/Mar/14 ] |
|
Hm. As for why the res lock needs to be a spinlock: it was supposed to cover only very fast code paths, and a spinlock was cheaper at the time. This might have changed, and we need to have some internal discussions to see if it still makes sense or if it should really be converted to e.g. a mutex. |
| Comment by Alex Zhuravlev [ 28/Mar/14 ] |
|
If a spinlock consumes a lot of CPU, that means high contention. Under such conditions a mutex would be causing schedule() very often, which isn't good either, is it? I'd think the root cause is that contention, which needs to be understood and addressed. |
| Comment by Christopher Morrone [ 03/Apr/14 ] |
|
Calling schedule() when there is a high level of contention is good. It allows the hundreds of threads that are not holding the lock to go to sleep, which lets the one thread that can actually make progress (the one holding the lock) get the CPU and move forward with real work. A spinlock is good when contention is expected to be relatively low and the lock is held for short periods of time. |
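A user-space analogue of this trade-off, as a hypothetical microbenchmark unrelated to the Lustre code (the thread count, iteration counts, and USE_SPIN switch are invented for illustration): with hundreds of waiters, the spinlock build keeps every waiter burning a CPU, while the mutex build puts the waiters to sleep and leaves the CPUs to the lock holder.

```c
/*
 * Hypothetical microbenchmark, unrelated to the Lustre code.  Build it
 * twice, once with USE_SPIN 1 and once with USE_SPIN 0, and compare
 * wall-clock time and CPU usage (e.g. with `time` or `top`):
 *
 *   cc -O2 -pthread contention.c -o contention
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 256
#define USE_SPIN 1		/* 1 = pthread spinlock, 0 = pthread mutex */

static pthread_spinlock_t spin;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void critical_section(void)
{
	/* Some "real" work done while holding the lock. */
	for (volatile int j = 0; j < 10000; j++)
		;
	counter++;
}

static void *worker(void *arg)
{
	for (int i = 0; i < 1000; i++) {
		if (USE_SPIN) {
			pthread_spin_lock(&spin);	/* waiters burn CPU */
			critical_section();
			pthread_spin_unlock(&spin);
		} else {
			pthread_mutex_lock(&mtx);	/* waiters sleep */
			critical_section();
			pthread_mutex_unlock(&mtx);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("counter = %ld\n", counter);
	return 0;
}
```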
| Comment by Oleg Drokin [ 04/Apr/14 ] |
|
Well, it really depends on how small the locked region is. Since it was believed that the locked region in this case is quite short and small, it made no sense to use anything but spinlocks, as the schedule overhead would have been worse. There is another possible problem, btw: my understanding is that a mutex can opportunistically spin in some cases when it encounters a busy lock. So it might be trading same for same. |
| Comment by Christopher Morrone [ 16/Apr/14 ] |
|
We need to develop a plan to make progress on this issue. |
| Comment by Liang Zhen (Inactive) [ 19/Apr/14 ] |
|
I think there are two things that could be improved even without changing the lock type here, although I don't know how much they would really help:
I had a patch to improve this (LU-30); for an unknown reason that patch has been deleted from Gerrit, but I still have it in my local repository, which is based on 2.1. I will try to port it to 2.4. |
| Comment by Oleg Drokin [ 21/Apr/14 ] |
|
I have two patches here. At this point the patches have undergone about an hour of my horrific testing and seem to be doing fine. In addition to that, we still need to better understand what the workload is that causes this issue for you, if you can share that. If you can catch the problem in progress and crashdump the node so that it's possible to examine resources on the MDS server and see how long the lock lists are, what sort of locks are there, and so on - that would also be great. |
| Comment by Liang Zhen (Inactive) [ 22/Apr/14 ] |
|
I have also ported my patch to 2.4 (http://review.whamcloud.com/#/c/10031/). It includes a lot more changes and needs more effort to review, so I think Oleg's patch is the better choice to try first. |
| Comment by Ned Bass [ 25/Apr/14 ] |
|
I can sometimes reproduce this problem in our test environment. My test case is based on the observed production workload: two thousand tasks reading the same file in 256k chunks and taking a read flock on each chunk. I believe the locking may be done in the ROMIO I/O library. The production MDS is under heavy memory pressure, and when this workload runs we see many MDS service threads sleeping in cfs_alloc().

I used a modified version of IOR to simulate the workload: https://github.com/nedbass/ior/tree/flock. An option is added to do reads and/or writes under a flock. Then I create memory pressure on the MDS by allocating and writing lots of memory in user space. This doesn't work 100% of the time. I think the key is when service threads start using kmalloc() to satisfy ptlrpc send buffer allocations (as opposed to reusing buffers from the cache). Then they are forced to schedule and the load average quickly climbs to 300+. This is when I start seeing all CPUs spinning in lock_res_and_lock().

    # Precreate a 1.6TB file.
    srun -p pbatch -N 2 ./src/ior -b 800g -e -k -l -N 2 -o /p/lcraterz/bass6/iorfile -w -t 1m

    # Run a 6400-task IOR to read the file using the flock option.
    srun -p pbatch -N 100 -n 6400 ./src/ior -b 256m -e -k -l -N 6400 -o /p/lcraterz/bass6/iorfile2 -r -t 256k -L r

    # Eat up memory on the MDS.  Usage: memhog [pages to alloc] [seconds to sleep]
    ./memhog $(( 18 * 1024 * 1024 * 1024 / 4096 )) 86400 |
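For reference, the per-chunk read-under-flock pattern described above (the kind of thing ROMIO-style libraries do) boils down to roughly the hypothetical snippet below. The chunk size, default path, and loop structure are illustrative rather than taken from the modified IOR branch, and the client has to be mounted with -o flock for the fcntl locks to reach the MDS.

```c
/*
 * Hypothetical sketch of reading a file in 256 KiB chunks with a shared
 * (read) POSIX lock taken around each chunk -- not code from the IOR
 * branch itself.  Every chunk exercises the MDS flock path.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (256 * 1024)

static void lock_range(int fd, short type, off_t off, off_t len)
{
	struct flock fl = {
		.l_type = type,		/* F_RDLCK or F_UNLCK */
		.l_whence = SEEK_SET,
		.l_start = off,
		.l_len = len,
	};

	if (fcntl(fd, F_SETLKW, &fl) == -1) {	/* blocking lock request */
		perror("fcntl");
		exit(1);
	}
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "iorfile";	/* illustrative */
	char *buf = malloc(CHUNK);
	int fd = open(path, O_RDONLY);
	off_t off = 0;
	ssize_t n;

	if (fd < 0 || !buf) {
		perror("setup");
		return 1;
	}
	do {
		lock_range(fd, F_RDLCK, off, CHUNK);	/* take read flock */
		n = pread(fd, buf, CHUNK, off);
		lock_range(fd, F_UNLCK, off, CHUNK);	/* drop it again */
		off += CHUNK;
	} while (n == CHUNK);

	close(fd);
	free(buf);
	return 0;
}
```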
| Comment by Oleg Drokin [ 30/Apr/14 ] |
|
Read flock! Wow, that's really, really heavy stuff that is all computed under the res lock, including the deadlock detection. I imagine you should see quite a bit of spinning with flocks on the same resource even with no memory pressure at all. Significantly redoing the flock code quickly is kind of hard, so I wonder if replacing the spinlock with a mutex will really help you here. |
| Comment by Oleg Drokin [ 22/Jul/14 ] |
|
Ok, here's a "flock to only use mutexes" patch: http://review.whamcloud.com/11171 - this one passes my testing. Additionally, I have refreshed http://review.whamcloud.com/10039 - this patch is useful all by itself to reduce the workload on the server during lock granting.

Additionally, rereading Ned's comment on the reproducer - this entire "kmalloc kicks out the problem" thing is really strange. Even when threads block in allocation, they could not be doing it with a spinlock held, as that would cause all sorts of deadlocks. If you ever have time to play with this again, could you please take a crashdump and check whether any threads are diving into an allocation with a spinlock held? |
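On the point that threads "could not be doing it with a spinlock held": a sleeping allocation under a spinlock is a bug in its own right, which is why anything done inside the res-lock region has to be non-sleeping. A generic kernel-style illustration of that constraint (not Lustre code; the type and field names are invented):

```c
/*
 * Generic kernel-style illustration, not Lustre code.  GFP_KERNEL
 * allocations may sleep (reclaim, cond_resched), and sleeping while
 * holding a spinlock can deadlock; CONFIG_DEBUG_ATOMIC_SLEEP flags it
 * as "BUG: sleeping function called from invalid context".  So any
 * sleeping allocation has to happen before the spinlock is taken.
 */
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct res_sketch {
	spinlock_t lr_lock;
	void *payload;
};

static int attach_payload(struct res_sketch *res, size_t size)
{
	void *p = kmalloc(size, GFP_KERNEL);	/* allowed to sleep here */

	if (!p)
		return -ENOMEM;

	spin_lock(&res->lr_lock);
	/* Only non-sleeping work is legal inside this region. */
	res->payload = p;
	spin_unlock(&res->lr_lock);
	return 0;
}
```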
| Comment by Oleg Drokin [ 22/Jul/14 ] |
|
I guess an idea for a reproducer, if it's just a long flock reprocessing issue, is to have a multithreaded app run from a whole bunch of nodes where every thread tries to lock the same range (in blocking mode) in the same file, so as to accumulate a multi-thousand list of blocked locks. Make whoever gets the lock first wait a minute, or whatever amount of time is needed to ensure that all threads on all nodes have sent their flock requests and they have all blocked. |
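A hypothetical minimal version of that idea is sketched below. It is not the parallel_flock.c attached later in this ticket, and because POSIX record locks are owned by the process rather than by the thread, the sketch forks worker processes instead of spawning threads so that the lock requests actually block one another.

```c
/*
 * Hypothetical sketch of the reproducer idea above -- NOT the attached
 * parallel_flock.c.  Every worker (run many copies across many client
 * nodes) issues a blocking exclusive lock on the same byte range of the
 * same file; whoever wins holds it for a while, so thousands of blocked
 * flock requests pile up on a single MDS resource.
 *
 *   cc -O2 flock_pileup.c -o flock_pileup
 *   ./flock_pileup /path/on/lustre/testfile
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS 64		/* per node; run on many nodes at once */
#define HOLD_SECONDS 60		/* let the other requests queue up first */

static void contend(const char *path)
{
	struct flock fl = {
		.l_type = F_WRLCK,	/* exclusive lock, same range for all */
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 1,
	};
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		_exit(1);
	}
	fcntl(fd, F_SETLKW, &fl);	/* blocks until the range is free */
	sleep(HOLD_SECONDS);		/* hold it while others stay blocked */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);
	close(fd);
	_exit(0);
}

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file on lustre>\n", argv[0]);
		return 1;
	}
	/* POSIX locks are per-process, so fork workers rather than threads. */
	for (int i = 0; i < NPROCS; i++)
		if (fork() == 0)
			contend(argv[1]);
	while (wait(NULL) > 0)
		;
	return 0;
}
```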
| Comment by Oleg Drokin [ 22/Jul/14 ] |
|
The attached parallel_flock.c is my idea for a good reproducer of the flock issue. It takes three arguments:

    -f filename - filename on a Lustre fs to work on

To reproduce, run it on a machine with a lot of clients, using all available cores too. While it is running, I imagine the MDS should grind to a total halt so that even userspace barely responds, if at all. This is untested code; I just made sure it compiles. |
| Comment by Oleg Drokin [ 23/Jul/14 ] |
|
Attaching parallel_flock_v2.c - this is the same as before, only this version actually works as expected. |
| Comment by Ned Bass [ 23/Jul/14 ] |
|
Oleg, to clarify my comment regarding threads blocking in kmalloc(): I don't mean that they are doing so while holding a spinlock. My theory is that they spend so much time under the spinlock that they become eligible to be rescheduled. When they later call kmalloc() outside the spinlock, it internally calls might_sleep() and reschedules the thread. Because there are other threads actively contending in lock_res_and_lock(), those blocked threads get starved for CPU time. |
| Comment by Oleg Drokin [ 23/Jul/14 ] |
|
Ned, I see. |
| Comment by Ned Bass [ 24/Jul/14 ] |
|
Oleg, it may not be right away, but we'll get this scheduled for testing. |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
Neil pushed some patches to address this. I will push |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35483 |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35484 |
| Comment by Gerrit Updater [ 14/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35483/ |
| Comment by Gerrit Updater [ 15/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38238 |
| Comment by Gerrit Updater [ 15/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38238/ |
| Comment by Gerrit Updater [ 03/Sep/20 ] |
|
Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/39811 |
| Comment by Gerrit Updater [ 21/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39811/ |
| Comment by James A Simmons [ 22/Apr/21 ] |
|
The patch has landed. We will look to deploy this on our production systems soon. If any further work is needed we can reopen this ticket. |