Details
Bug
Resolution: Fixed
Blocker
Lustre 2.0.0
None
3
21,122
5060
Description
Recently I hit a deadlock while running one of my tests. After analysing the log, I realized the deadlock was introduced by the patch for bug 21122. So I had to rethink that patch and work out the root cause, and I finally came up with a new fix.
Let me describe the deadlock briefly (in the code before this fix):
1. the page fault process holds the page lock and calls cl_unuse in cl_io_loop; cl_unuse then tries to take the cl_lock mutex to do its job;
2. meanwhile, if the cl_lock is being cancelled, the cancelling thread already holds the cl_lock mutex, and because the pages covered by this lock have to be evicted, it tries to grab the page lock;
3. deadlock: each side holds the lock the other one is waiting for.
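The lock-order inversion in the steps above can be replayed with a toy model. This is only a sketch: `toy_lock`, `toy_trylock` and `toy_deadlock_scenario` are made-up names, not Lustre or kernel code, and the two paths are replayed sequentially with non-blocking acquires so the example never actually hangs.

```c
#include <assert.h>

/* Toy model only: these are NOT Lustre or kernel types. */
enum { NOBODY = 0, FAULT_PATH = 1, CANCEL_PATH = 2 };

struct toy_lock {
    int owner;   /* NOBODY, or the path currently holding the lock */
};

/* Non-blocking acquire: returns 1 on success, 0 if the lock is already held. */
static int toy_trylock(struct toy_lock *l, int who)
{
    if (l->owner != NOBODY)
        return 0;
    l->owner = who;
    return 1;
}

/*
 * Replays steps 1 and 2 in an interleaved order.
 * Returns 1 if both paths end up waiting for a lock the other one holds.
 */
static int toy_deadlock_scenario(void)
{
    struct toy_lock page_lock = { NOBODY };
    struct toy_lock cl_lock_mutex = { NOBODY };
    int got, fault_blocked, cancel_blocked;

    /* Step 1, first half: the page fault path holds the page lock. */
    got = toy_trylock(&page_lock, FAULT_PATH);
    assert(got);

    /* Step 2, first half: the cancel path holds the cl_lock mutex. */
    got = toy_trylock(&cl_lock_mutex, CANCEL_PATH);
    assert(got);

    /* Step 1, second half: the fault path (in cl_unuse) wants the mutex. */
    fault_blocked = !toy_trylock(&cl_lock_mutex, FAULT_PATH);

    /* Step 2, second half: the cancel path wants the page lock to evict. */
    cancel_blocked = !toy_trylock(&page_lock, CANCEL_PATH);

    return fault_blocked && cancel_blocked;
}
```

With real blocking locks, both second-half acquires would sleep forever, which is the hang seen in the test.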
Let's go back and dig into the root cause of bug 21122:
From the log, we can see that the faulting page is actually covered by two locks, say lock A and lock B. Lock B is being cancelled while lock A is enqueued by the page fault process (this is why lock B won't be matched). However, there is a drawback in the cl_lock_page_out function:
cl_lock_at_page(env, lock->cll_descr.cld_obj,
page, lock, 0, 0);
Here the last two parameters are set to 0, which means CANCELPEND locks are not matched. Unfortunately, lock A is marked CANCELPEND because it blocks another lock. As a result, no covering lock is found and the faulting page gets truncated. Then another page fault happens and a vmpage with the same offset is created. This is why duplicate cl_pages were created and the assertion was hit.
So in the new fix, I simply revert the patch for bug 21122 and change the parameters of cl_lock_at_page to (..., 1, 0). Hopefully this will save our lives.
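To make the effect of the two flags concrete, here is a small stand-alone model of the matching rule. It is a sketch under assumptions, not the real cl_lock_at_page: the `toy_` names, the flag values and the page-index ranges are invented for illustration. It also assumes that the second-to-last argument means "also match CANCELPEND locks" (as described above) and that the last one means "also match locks that are already being cancelled".

```c
#include <stddef.h>

/* Hypothetical flag bits for the toy model (not the Lustre definitions). */
#define TOY_CANCELPEND 0x1u   /* cancellation is pending on this lock */
#define TOY_CANCELLED  0x2u   /* lock is already being cancelled */

struct toy_cl_lock {
    unsigned long start, end; /* covered page index range, inclusive */
    unsigned int flags;
};

/*
 * Return the first lock covering page `index`, skipping `except`.
 * pending == 0   : do not match CANCELPEND locks.
 * canceling == 0 : do not match locks that are being cancelled (assumed).
 */
static const struct toy_cl_lock *
toy_lock_at_page(const struct toy_cl_lock *locks, size_t n,
                 unsigned long index, const struct toy_cl_lock *except,
                 int pending, int canceling)
{
    size_t i;

    for (i = 0; i < n; i++) {
        const struct toy_cl_lock *scan = &locks[i];

        if (scan == except)
            continue;
        if (index < scan->start || index > scan->end)
            continue;
        if (!pending && (scan->flags & TOY_CANCELPEND))
            continue;
        if (!canceling && (scan->flags & TOY_CANCELLED))
            continue;
        return scan;
    }
    return NULL;
}
```

Take lock A as `{0, 15, TOY_CANCELPEND}` and lock B as `{8, 31, TOY_CANCELLED}`, with B passed as `except` for page index 10. The call with (0, 0) finds no other covering lock, so the page would be discarded. The call with (1, 0) finds lock A, so the page stays.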
Is this tricky?