[LU-4801] spin lock contention in lock_res_and_lock Created: 21/Mar/14  Updated: 22/Apr/21  Resolved: 22/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Ned Bass Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.4.0-26chaos


Attachments: File parallel_flock.c     File parallel_flock_v2.c    
Issue Links:
Related
is related to LU-3504 MDS: All cores spinning on ldlm lock ... Resolved
is related to LU-12542 LDLM improvements form linux lustre c... Resolved
Severity: 3
Rank (Obsolete): 13209

 Description   

Our MDS experienced severe lock contention in lock_res_and_lock(). This had a large impact on client responsiveness because service threads were starved for CPU time. We have not yet identified the client workload that caused this problem. All active tasks had stack traces like this, but would eventually get scheduled out.

 ...
__spin_lock
lock_res_and_lock
ldlm_handle_enqueue0
mdt_handle_common
mds_regular_handle
ptlrpc_server_handle_request
...

This raises the question of why the ldlm resource lock needs to be a spinlock. Couldn't we avoid this issue by converting it to a mutex? This question was raised in LU-3504.



 Comments   
Comment by Peter Jones [ 22/Mar/14 ]

Oleg

Could you please comment?

Thanks

Peter

Comment by Oleg Drokin [ 23/Mar/14 ]

Hm.
I guess this could happen when there's a long waiting list on the lock resource; an attempt to add to it would then take a long time iterating while holding the lock.
Typically this would happen when you have jobs that try to e.g. create the same file from many nodes (hello fortran open, I guess). Other scenarios are also possible, I imagine.
I imagine not all threads could have stack traces like that; at least one must have something else going on within the protected region.

As for the res lock needing to be a spinlock: it was supposed to cover only very fast code paths, and a spinlock was cheaper at the time. This might have changed, and we need to have some internal discussions to see if it still makes sense or if it should really be converted to e.g. a mutex.

Comment by Alex Zhuravlev [ 28/Mar/14 ]

If a spinlock consumes a lot of CPU, that means high contention. Under such conditions, a mutex would cause schedule() to be called very often, which isn't good either, is it? I'd think the root cause, the contention itself, needs to be understood and addressed.

Comment by Christopher Morrone [ 03/Apr/14 ]

Calling schedule() when there is a high level of contention is good. It allows the hundreds of threads that are not holding the lock to go to sleep, which allows the one guy who actually can make progress (the one holding the lock) to get the CPU and move forward with real work.

A spin lock is good when contention is expected to be, relatively speaking, low and for short periods of time.
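To illustrate the trade-off being discussed, a generic Linux-kernel-style sketch (not Lustre code; the lock and function names here are made up for illustration):

/* Generic illustration only, not Lustre code: contended waiters behave
 * very differently under the two primitives. */
#include <linux/spinlock.h>
#include <linux/mutex.h>

static DEFINE_SPINLOCK(res_spin);
static DEFINE_MUTEX(res_mutex);

static void touch_resource_spin(void)
{
        spin_lock(&res_spin);   /* waiters burn CPU until the holder exits */
        /* ... critical section expected to be very short ... */
        spin_unlock(&res_spin);
}

static void touch_resource_mutex(void)
{
        mutex_lock(&res_mutex); /* waiters may schedule() and free the CPU */
        /* ... critical section may be longer, e.g. walking a long queue ... */
        mutex_unlock(&res_mutex);
}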

Comment by Oleg Drokin [ 04/Apr/14 ]

Well, it really depends on how small the locked region is. Since it was believed that the locked region in this case is quite short and small, it made no sense to use anything but spinlocks, as the schedule overhead would have been worse.
Of course we know that with long lock queues it takes longer to iterate them.

There is another possible problem, btw. My understanding is that a mutex can opportunistically spin in some cases when it encounters a busy lock. So it might be trading like for like.

Comment by Christopher Morrone [ 16/Apr/14 ]

We need to develop a plan to make progress on this issue.

Comment by Liang Zhen (Inactive) [ 19/Apr/14 ]

I think there are two things that could be improved even without changing the lock type here, although I don't know how much they would really help:

  • it is unnecessary to always create an ldlm_lock in ldlm_handle_enqueue0() for the MDT stack; the MDT intent policy almost always creates its own lock. Although this redundant lock will never be granted, destroying it increases the chance of spinning on res_lock (see ldlm_lock_destroy)
  • ldlm_handle_enqueue0() will always call ldlm_reprocess_all() if ELDLM_LOCK_REPLACED is returned from ns_policy. I suspect that is unnecessary as well; I actually think the ldlm_reprocess_all() call in ldlm_handle_enqueue0() is just legacy code (for ELDLM_LOCK_CHANGED, which is not used by anyone now?) and can be removed from the common code path.

I had a patch to improve this (LU-30); for an unknown reason it was deleted from Gerrit, but I still have it in my local repository, which is based on 2.1. I will try to port it to 2.4.

Comment by Oleg Drokin [ 21/Apr/14 ]

I have two patches here.
One is purely experimental: it replaces the server-side ldlm resource lock with a mutex: http://review.whamcloud.com/10038
The other addresses item #2 from Liang's post; I think it indeed does not make sense to always reprocess at the end of a server-side enqueue. That patch is at http://review.whamcloud.com/10039

At this point the patches have undergone about an hour of my horrific testing and seem to be doing fine.

In addition to that, we still need to better understand what workload causes this issue for you, if you can share that. If you could catch the problem in progress and crashdump the node, so that it's possible to examine the resources on the MDS and see how long the lock lists are, what sort of locks are there and so on, that would also be great.

Comment by Liang Zhen (Inactive) [ 22/Apr/14 ]

I have also ported my patch to 2.4 (http://review.whamcloud.com/#/c/10031/); it includes a lot more changes and needs more effort to review, so I think Oleg's patch is the better choice to try first.

Comment by Ned Bass [ 25/Apr/14 ]

I can sometimes reproduce this problem in our test environment. My test case is based on the observed production workload: two thousand tasks reading the same file in 256k chunks and taking a read flock on each chunk. I believe the locking may be done in the ROMIO I/O library. The production MDS is under heavy memory pressure, and when this workload runs we see many MDS service threads sleeping in cfs_alloc().

I used a modified version of IOR to simulate the workload: https://github.com/nedbass/ior/tree/flock. An option is added to do reads and/or writes under an flock. Then I create memory pressure on the MDS by allocating and writing lots of memory in user space. This doesn't work 100% of the time. I think the key is when service threads start using kmalloc() to satisfy ptlrpc send buffer allocations (as opposed to reusing buffers from the cache). Then they are forced to schedule and the load average quickly climbs to 300+. This is when I start seeing all CPUs spinning in lock_res_and_lock().

# Precreate a 1.6TB file.
srun -p pbatch -N 2 ./src/ior -b 800g -e -k -l -N 2 -o /p/lcraterz/bass6/iorfile -w -t 1m

# Run a 6400 task IOR to read the file using the flock option.
srun -p pbatch -N 100 -n 6400 ./src/ior -b 256m -e -k -l -N 6400 -o /p/lcraterz/bass6/iorfile2 -r -t 256k -L r

# Eat up memory on MDS. Usage: memhog [pages to alloc] [seconds to sleep]
./memhog $(( 18 * 1024 * 1024 * 1024 / 4096 )) 86400
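
For reference, a minimal sketch of the per-chunk read-under-flock pattern described above (hypothetical; the actual modified IOR branch and the ROMIO locking may differ, and read_chunk_locked is an invented helper name):

/* Hypothetical sketch, not the actual IOR/ROMIO code: take a blocking
 * read lock covering one chunk, read it, then release the range lock. */
#include <fcntl.h>
#include <unistd.h>

static ssize_t read_chunk_locked(int fd, char *buf, size_t len, off_t off)
{
        struct flock fl = {
                .l_type   = F_RDLCK,
                .l_whence = SEEK_SET,
                .l_start  = off,
                .l_len    = (off_t)len,
        };
        ssize_t n;

        if (fcntl(fd, F_SETLKW, &fl) < 0)  /* blocking read lock on the chunk */
                return -1;

        n = pread(fd, buf, len, off);

        fl.l_type = F_UNLCK;               /* drop the range lock */
        fcntl(fd, F_SETLK, &fl);
        return n;
}
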
Comment by Oleg Drokin [ 30/Apr/14 ]

Read flock! Wow, that's really, really heavy stuff that's all computed under the res lock, including deadlock detection.
Naturally, sleeping is not done under a spinlock, so the sleeping threads should not be adding to this particular spinlock contention.

I imagine you should see quite a bit of spinning with flocks on the same resource even with no memory pressure at all.

Work in LU-1157 has reduced the flock overhead to some degree, but I imagine not by much, especially if it's all on the same file.

Significantly redoing the flock code quickly is kind of hard, so I wonder if replacing the spinlock with a mutex will really help you here.
Additionally, is the flock really needed by the app logic, or is it there on the bad assumption that it's operating on top of NFS? Could you mount with -o localflock on your clients to fake out consistent cluster-wide flocks? Benefits would include faster I/O, since flock calls would then be totally local, not loading the MDS and avoiding the extra RPC round trips to ask for and then cancel the locks.
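
For reference, a hedged example of such a client mount; the MGS NID, filesystem name, and mount point below are placeholders:

# Hypothetical names; the point is only the -o localflock option.
mount -t lustre -o localflock mgsnode@tcp:/lustre /mnt/lustre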

Comment by Oleg Drokin [ 22/Jul/14 ]

Ok, here's a "flock to only use mutexes" patch: http://review.whamcloud.com/11171 - This one passes my testing.
I guess it should help in your case too.
I think Cliff will give it a shot sometime soon with ior in flock mode to see if there's any effect.

Additionally, I have refreshed http://review.whamcloud.com/10039 - this patch is useful all by itself to reduce the workload on the server during lock granting,
and http://review.whamcloud.com/10038 - this one would be a replacement for the flock-only patch and makes the entire server-side operation rely on a mutex rather than the spinlock in the resource lock. I guess it makes sense for all kinds of locks after all, so we should give this another try just to make sure. This one also passes my testing, which is a good sign I guess.

Additionally, rereading Ned's comment on the reproducer - this entire "kmalloc kicks out the problem" thing is really strange. Even when threads block in an allocation, they could not be doing so with a spinlock held, as that would cause all sorts of deadlocks. If you ever have time to play with this again, could you please take a crashdump to check whether any threads are diving into an allocation with a spinlock held?
Also, if low memory on the MDS is absolutely a requirement, could it be that since we landed the lu_cache limiting code it's no longer a problem for you?

Comment by Oleg Drokin [ 22/Jul/14 ]

I guess an idea for a reproducer, if it's just a long flock reprocessing issue, is to have a multithreaded app run from a whole bunch of nodes, where every thread tries to lock the same range (in blocking mode) in the same file, to accumulate a multi-thousand list of blocked locks.

Make whoever gets the lock first wait a minute, or whatever amount of time is needed to ensure that all threads on all nodes have sent their flock requests and blocked.
Then release the original lock, and after that every thread that receives the lock should release it immediately as well.
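
A minimal sketch of that idea (this is not the attached parallel_flock.c, which may differ; the file path and hold time are taken from the command line):

/* Hypothetical sketch of the reproducer described above.  Every task
 * takes a blocking write lock on the same byte of the same Lustre file,
 * so the server accumulates a long list of blocked flock locks. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct flock fl = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 1,          /* every task contends on byte 0 */
        };
        const char *path;
        int hold, fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file-on-lustre> [hold-seconds]\n",
                        argv[0]);
                return 1;
        }
        path = argv[1];
        hold = argc > 2 ? atoi(argv[2]) : 0;

        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Blocks until every earlier holder has released the range. */
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
                perror("fcntl(F_SETLKW)");
                return 1;
        }

        sleep(hold);                    /* only the first task holds it long */

        fl.l_type = F_UNLCK;            /* everyone else releases right away */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
}

The job launcher would pass a long hold time to exactly one task and zero to the rest, so the first holder keeps everyone else queued up.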

Comment by Oleg Drokin [ 22/Jul/14 ]

attached parallel_flock.c is my idea for a good reproducer of the flock issue.

It takes three arguments:
  • -f filename - the file on a Lustre fs to work on
  • -n - the number of iterations
  • -s - how long the first lock is kept

To reproduce, run it from a lot of client nodes, using all available cores on each.
The default sleep time is just 6 seconds, so you possibly want more than that.
Run several iterations (the default is 10).

While running, I imagine the MDS should grind to a total halt, so that even userspace barely responds, if at all.

This is untested code, I just made sure it compiles.

Comment by Oleg Drokin [ 23/Jul/14 ]

Attaching parallel_flock_v2.c - this is the same as before, only this version actually works as expected.

Comment by Ned Bass [ 23/Jul/14 ]

Oleg, to clarify my comment regarding threads blocking in kmalloc(): I don't mean that they are doing so while holding a spinlock. My theory is that they spend so much time under the spinlock that they become eligible to be rescheduled. When they later call kmalloc() outside the spinlock, it internally calls might_sleep() and reschedules the thread. Because other threads are actively contending in lock_res_and_lock(), those blocked threads get starved for CPU time.

Comment by Oleg Drokin [ 23/Jul/14 ]

Ned, I see.
Well, I wonder if you could try my patch on your testbed, to see whether it hits similar overload issues right away by any chance?
Cliff tried it with the patch and the MDS was basically under no load during the test run, but he was then preempted by other important testing, so there was no run without the patch to confirm that the reproducer actually reproduces anything.
So if you have time for that, it might be an interesting exercise.

Comment by Ned Bass [ 24/Jul/14 ]

Oleg, it may not be right away, but we'll get this scheduled for testing.

Comment by James A Simmons [ 12/Jul/19 ]

Neil pushed some patches to address this; I will push them here.

Comment by Gerrit Updater [ 12/Jul/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35483
Subject: LU-4801 ldlm: discard l_lock from struct ldlm_lock.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c7e16e62306abdb62f1039957573400c9114ea3f

Comment by Gerrit Updater [ 12/Jul/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35484
Subject: LU-4801 ldlm: don't access l_resource when not locked.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ea026bb3eb9020233569659189850519bc99a17

Comment by Gerrit Updater [ 14/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35483/
Subject: LU-4801 ldlm: discard l_lock from struct ldlm_lock.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0584eb73dbb5b4c710a8c7eb1553ed5dad0c18d8

Comment by Gerrit Updater [ 15/Apr/20 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38238
Subject: Revert "LU-4801 ldlm: discard l_lock from struct ldlm_lock."
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 32095b5717954fa7260c9d6e369a208395bc39da

Comment by Gerrit Updater [ 15/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38238/
Subject: Revert "LU-4801 ldlm: discard l_lock from struct ldlm_lock."
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9051844cec34ab4f3427adc28bc1948706f5ffc2

Comment by Gerrit Updater [ 03/Sep/20 ]

Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/39811
Subject: LU-4801 ldlm: discard l_lock from struct ldlm_lock.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c634dbeb9e576574a903f83dfb9cf39cf8d5468b

Comment by Gerrit Updater [ 21/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39811/
Subject: LU-4801 ldlm: discard l_lock from struct ldlm_lock.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bb6edb7b8eeec65f46f8eaeb135e5dde13bf7ad8

Comment by James A Simmons [ 22/Apr/21 ]

The patch has landed. We will look to deploy this on our production systems soon. If any further work is needed, we can reopen this ticket.
