A number of updates here. Alexey L. provided a patch for Cray to test, and we confirmed it fixes the deadlock. (We've tested on Cray's version of 2.5, and on current master.)
I've attached that patch.
Essentially, in lov_lock_sub_init, it removes a for loop that carries this comment:
/*
 * Then, create sub-locks. Once at least one sub-lock was created,
 * top-lock can be reached by other threads.
 */
Here's Alexey's explanation of the deadlock, lightly edited by me (a minimal userspace sketch of the resulting lock cycle follows the quoted steps):
—
1) client1 has a full set of locks for a file { stripe1, stripe2, stripe3 }; client2 has no locks.
2) client1 finishes an IO and moves its locks to an inactive state, so client2 is able to take 2 of the 3 locks from client1 and reference them.
3) client1 enters a new IO loop (a different process) and starts to take the set of locks it needs - at top-lock allocation time it finds the stripe3 lock in cache and holds it.
4) client2 tries to take the stripe3 lock, but it is now protected by the hold flag, so the BL AST is ignored (it just marks the lock as callback pending). client2 enters the wait-for-grant loop.
5) client1 enters the lock enqueue loop to take the stripe1 lock, but client2 can't return that lock because it is referenced.
... so deadlock.
—
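Stripped of the CLIO details, the cycle above is a classic hold-and-wait deadlock: client1 holds the cached stripe3 lock while it waits to enqueue stripe1, and client2 holds a reference on stripe1 while it waits for stripe3. Below is a minimal userspace sketch of that shape in C with pthreads; it is purely an analogy, not Lustre code, and the names stripe1_lock/stripe3_lock and the client1/client2 threads are illustrative only.

/* Toy analogy of the deadlock above: two "clients" each hold one
 * resource and wait for the other's. The mutexes stand in for the
 * stripe1 and stripe3 DLM locks; this is not Lustre code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stripe1_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t stripe3_lock = PTHREAD_MUTEX_INITIALIZER;

/* client1: found stripe3 in cache at top-lock allocation time and
 * holds it (step 3), then tries to enqueue stripe1 (step 5). */
static void *client1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&stripe3_lock);   /* "hold" on the cached lock      */
	sleep(1);                            /* let client2 reference stripe1  */
	pthread_mutex_lock(&stripe1_lock);   /* blocks forever                 */
	printf("client1: got both locks\n"); /* never reached                  */
	return NULL;
}

/* client2: already references stripe1 (step 2) and now needs stripe3
 * (step 4), but the blocking AST is ignored because of the hold. */
static void *client2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&stripe1_lock);   /* referenced by its ongoing IO   */
	sleep(1);
	pthread_mutex_lock(&stripe3_lock);   /* blocks forever                 */
	printf("client2: got both locks\n"); /* never reached                  */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, client1, NULL);
	pthread_create(&t2, NULL, client2, NULL);
	sleep(3);
	/* Neither thread printed: each holds one lock and waits on the
	 * other, which is the cycle of steps 3-5 above. */
	printf("deadlocked: client1 holds stripe3 and waits for stripe1, "
	       "client2 holds stripe1 and waits for stripe3\n");
	return 0;
}

Build with cc -pthread; after a few seconds the crude watchdog in main() reports that neither thread made progress.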
This patch removes sub-lock creation from lov_lock_sub_init. If sub-locks are not
created in lov_lock_sub_init, they are created during the enqueue process instead.
This avoids the deadlock described above by eliminating step 3.
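To see why skipping step 3 is enough: if client1 holds nothing while it waits to enqueue stripe1, client2 can be granted stripe3, finish its IO, and drop its references, after which client1's enqueue completes. The same toy model with the early hold removed runs to completion (again an analogy under the same illustrative names, not the actual patch):

/* The same toy with the fix's effect: client1 no longer holds a cached
 * stripe3 lock up front, so it waits for stripe1 while holding nothing
 * and the wait cycle cannot form. Not Lustre code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stripe1_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t stripe3_lock = PTHREAD_MUTEX_INITIALIZER;

static void *client1(void *arg)
{
	(void)arg;
	sleep(1);                            /* client2's IO already in flight */
	pthread_mutex_lock(&stripe1_lock);   /* waits, but holds nothing       */
	pthread_mutex_lock(&stripe3_lock);
	printf("client1: got both locks\n");
	pthread_mutex_unlock(&stripe3_lock);
	pthread_mutex_unlock(&stripe1_lock);
	return NULL;
}

static void *client2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&stripe1_lock);   /* referenced by its ongoing IO   */
	sleep(2);
	pthread_mutex_lock(&stripe3_lock);   /* granted: nobody holds it now   */
	printf("client2: got both locks, finishing IO\n");
	pthread_mutex_unlock(&stripe3_lock);
	pthread_mutex_unlock(&stripe1_lock); /* client1 can now proceed        */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, client1, NULL);
	pthread_create(&t2, NULL, client2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

Note that the acquisition order doesn't change; what matters is that client1 no longer holds a cached sub-lock across the enqueue of the others.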
—
We've run our general Lustre IO performance test suite (multiple clients, various IO patterns, various stripe counts) and weren't able to measure any difference between runs with and without the patch.
We're currently running it through sanity and some other stress testing.
Going forward, we're wondering whether this is expected to be the final version of the patch. Essentially, it fixes the problem with this performance optimization by removing the optimization. As far as I can tell, the optimization mainly allows better re-use of cl_locks, and as I understand it, caching and re-use of cl_locks is slated for removal in the upcoming CLIO simplification anyway.
So it seems like it may be OK to leave this optimization out.
Is this the plan, or is the hope for a patch that actually fixes the performance optimization rather than removing it?
This is a duplicate of:
https://jira.hpdd.intel.com/browse/LU-4381