Details
Bug
Resolution: Duplicate
Critical
None
None
MDS node, Lustre 2.4.2-14chaos, ZFS OBD
3
15383
Description
After upgrading to Lustre 2.4.2-14chaos (see github.com/chaos/lustre), we soon hit the following assertion on one of our MDS nodes:
mdt_handler.c:3652:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed
Perhaps most significantly, this tag of our lustre tree includes the patch entitled:
LU-4584 mdt: ensure orig lock is found in hash upon resend
James Simmons reported this assertion when he tested the LU-4584 patch, but Bruno concluded that the assertion was unrelated to the patch.
Whether it is related or not, we need to fix the problem.
This is not a machine for which we can provide logs or crash dumps. No, we don't know the details of the various workloads going on at the time.
The filesystem has 768 OSTs on 768 OSS nodes. The default stripe count is 1. Some users set wider stripe counts, some over 700.
I don't know what your notes say, but I checked and we crashed the MDS fifteen times during the three days that we were running with the LU-4584 patch. We then rebooted the MDS onto Lustre version 2.4.2-14.1chaos (which does nothing but revert the LU-4584 patch), and we have not seen this ticket's assertion in over three days.
While not conclusive, the circumstantial evidence points strongly at the LU-4584 patch either introducing the bug or making it much easier to hit.
Here is the list of 2.4.2-14chaos to 2.4.2-14.1chaos changes:
You can find those tags at github.com/chaos/lustre.
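To put a rough number on the circumstantial evidence above (fifteen crashes in three days with the LU-4584 patch, none in three days after reverting it), here is a back-of-the-envelope sketch. It assumes crashes arrive as a Poisson process at a constant rate and that the workload was comparable across both periods; neither assumption is established in this ticket, and the function name `prob_no_crash` is purely illustrative.

```python
import math


def prob_no_crash(crashes_seen: int, days_observed: float, quiet_days: float) -> float:
    """Probability of seeing zero crashes over quiet_days, if crashes kept
    occurring at the rate observed earlier (assumes a Poisson process)."""
    rate_per_day = crashes_seen / days_observed
    return math.exp(-rate_per_day * quiet_days)


# 15 crashes in 3 days with the patch, then 3 quiet days after the revert.
p = prob_no_crash(crashes_seen=15, days_observed=3, quiet_days=3)
print(f"{p:.2e}")
```

Under those assumptions the chance of three quiet days at the old crash rate is on the order of 3e-7, which is consistent with the revert mattering. It says nothing about whether the patch introduces the bug or only makes it easier to hit.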