Details
Bug
Resolution: Duplicate
Critical
None
None
MDS node, Lustre 2.4.2-14chaos, ZFS OBD
3
15383
Description
After upgrading to Lustre 2.4.2-14chaos (see github.com/chaos/lustre), we soon hit the following assertion on one of our MDS nodes:
mdt_handler.c:3652:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed
Perhaps most significantly, this tag of our lustre tree includes the patch entitled:
LU-4584 mdt: ensure orig lock is found in hash upon resend
James Simmons reported this assertion when he tested the LU-4584 patch, but Bruno concluded that the assertion was unrelated to the patch.
Whether it is related or not, we need to fix the problem.
OK, I think I got to the root of this. I used my patch http://review.whamcloud.com/11842 to make every request trigger a resend, which is great for testing this code: it is rarely hit in our testing, but gets hit quite a bit on larger systems now that LLNL added the patch to shrink client-supplied buffers.
First of all, the code as is should have failed with this assertion on the very first resend, but did not because of a bug. That bug was fixed in our tree, but not in 2.4 (LU-3428, http://review.whamcloud.com/6511), so you need this patch first.
The assertion itself is due to a logic flaw. Then it becomes clear that the LU-4584 patch you are carrying is wrong, in particular this part of it:
The reason is that mdt_intent_lock_replace assumes the lock has already been "replaced" into the client export, so it does not need any of those references. The lock cannot go away because it is already "owned" by the (not yet aware) client.
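To make the invariant concrete, here is a minimal toy model in C of the condition behind the assertion. It is not Lustre code: struct toy_lock and every toy_* function are hypothetical names invented for illustration, and the sketch assumes that the removed part of the patch took an extra reader reference on the lock before calling the replace step, which is what the explanation above implies.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a lock; NOT the real struct ldlm_lock. */
struct toy_lock {
    int  l_readers;        /* active reader references */
    int  l_writers;        /* active writer references */
    bool owned_by_export;  /* already "replaced" into the client export */
};

/* Hypothetical helper: take a reader reference. */
void toy_lock_addref(struct toy_lock *lk)
{
    lk->l_readers++;
}

/* Hypothetical helper: drop a reader reference. */
void toy_lock_decref(struct toy_lock *lk)
{
    assert(lk->l_readers > 0);
    lk->l_readers--;
}

/* Models the condition checked by the failing assertion:
 * l_readers + l_writers == 0. */
bool toy_replace_precondition(const struct toy_lock *lk)
{
    return lk->l_readers + lk->l_writers == 0;
}

/* Assumed shape of the problematic resend path: it takes a reference
 * first, so the precondition no longer holds at the replace step. */
bool toy_resend_with_extra_ref(struct toy_lock *lk)
{
    toy_lock_addref(lk);
    return toy_replace_precondition(lk);
}

/* Resend path without the extra reference: the export already owns
 * the lock, so no reference is needed to keep it alive. */
bool toy_resend_without_extra_ref(const struct toy_lock *lk)
{
    return lk->owned_by_export && toy_replace_precondition(lk);
}
```

In this model, a lock that is already owned by the export passes the check on resend, while the variant that takes a reference first fails it until that reference is dropped again.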
With this part removed (and with the uninitialized variable fix from above), I am no longer hitting the assertion or having terrible lock deadlocks on resend from the start.
Even so, my racer testing on your tree cannot complete, because other issues long since fixed in the master tree, such as LU-4725 and LU-5144, are getting in the way.
At least this class of problems (LU-2827 related) should be extinguished for you now, and should hold you until you are ready to move to a newer release that has the more comprehensive LU-2827 patch with all of its follow-up fixes.