[LU-122] Revert bug 21122 since it causes deadlock - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.1.0
Affects Version/s: Lustre 2.0.0
Labels:
None

Severity:
3
Bugzilla ID:
21122
Rank (Obsolete):
5060

Description

I recently hit a deadlock while running one of my tests. After analysing the log, I realized the deadlock was introduced by the patch for bug 21122. I then had to rethink that patch and work out the real root cause, and finally came up with a new fix.

Let me describe the deadlock briefly (in the code before this new fix):
1. The page fault process holds the page lock and calls cl_unuse in cl_io_loop; cl_unuse then tries to take the cl_lock mutex to do its job.
2. Meanwhile, if the cl_lock is being cancelled, the cancelling thread already holds the cl_lock mutex and has to evict the pages covered by this lock, so it tries to grab the page lock.
3. Each side now waits for the lock the other one holds: deadlock.
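
The lock-order inversion in the steps above can be reproduced in miniature. The sketch below is not Lustre code: `page_lock` and `cl_mutex` are hypothetical stand-ins for the VM page lock and the cl_lock mutex, and `pthread_mutex_trylock` is used in place of a blocking lock so that the conflict is reported instead of hanging the program.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in Lustre. */
struct demo {
    pthread_mutex_t page_lock;  /* models the VM page lock          */
    pthread_mutex_t cl_mutex;   /* models the cl_lock mutex         */
    pthread_mutex_t sync_lock;  /* protects 'stage'                 */
    pthread_cond_t  sync_cond;
    int             stage;      /* 1: canceller holds cl_mutex,
                                 * 2: fault path has made its attempt */
    int             cancel_rc;  /* result of canceller's trylock    */
    int             fault_rc;   /* result of fault path's trylock   */
};

static void set_stage(struct demo *d, int stage)
{
    pthread_mutex_lock(&d->sync_lock);
    d->stage = stage;
    pthread_cond_broadcast(&d->sync_cond);
    pthread_mutex_unlock(&d->sync_lock);
}

static void wait_stage(struct demo *d, int stage)
{
    pthread_mutex_lock(&d->sync_lock);
    while (d->stage < stage)
        pthread_cond_wait(&d->sync_cond, &d->sync_lock);
    pthread_mutex_unlock(&d->sync_lock);
}

/* Cancel path: takes the cl_lock mutex first, then wants the page lock. */
static void *cancel_path(void *arg)
{
    struct demo *d = arg;

    pthread_mutex_lock(&d->cl_mutex);
    d->cancel_rc = pthread_mutex_trylock(&d->page_lock);
    if (d->cancel_rc == 0)
        pthread_mutex_unlock(&d->page_lock);
    set_stage(d, 1);    /* tell the fault path we hold cl_mutex */
    wait_stage(d, 2);   /* keep cl_mutex held until it has tried */
    pthread_mutex_unlock(&d->cl_mutex);
    return NULL;
}

/* Fault path: takes the page lock first, then wants the cl_lock mutex.
 * Returns true when both paths found the other side's lock busy. */
static bool lock_order_inversion_detected(void)
{
    struct demo d;
    pthread_t canceller;
    int rc;

    pthread_mutex_init(&d.page_lock, NULL);
    pthread_mutex_init(&d.cl_mutex, NULL);
    pthread_mutex_init(&d.sync_lock, NULL);
    pthread_cond_init(&d.sync_cond, NULL);
    d.stage = 0;
    d.cancel_rc = 0;
    d.fault_rc = 0;

    pthread_mutex_lock(&d.page_lock);
    rc = pthread_create(&canceller, NULL, cancel_path, &d);
    assert(rc == 0);
    (void)rc;

    wait_stage(&d, 1);
    d.fault_rc = pthread_mutex_trylock(&d.cl_mutex);
    if (d.fault_rc == 0)
        pthread_mutex_unlock(&d.cl_mutex);
    set_stage(&d, 2);

    pthread_join(canceller, NULL);
    pthread_mutex_unlock(&d.page_lock);

    pthread_cond_destroy(&d.sync_cond);
    pthread_mutex_destroy(&d.sync_lock);
    pthread_mutex_destroy(&d.cl_mutex);
    pthread_mutex_destroy(&d.page_lock);

    return d.fault_rc == EBUSY && d.cancel_rc == EBUSY;
}
```

With blocking `pthread_mutex_lock` calls in place of the two `pthread_mutex_trylock` calls, neither thread could make progress, which is the situation described in step 3. Build with `-pthread`.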

Let's go back and dig into the root cause of bug 21122:
From the log, we can see that the faulting page is actually covered by two locks, say lock A and lock B. Lock B is being cancelled while lock A is queued by the page fault process (this is why lock B won't be matched). However, there is a flaw in this call in the cl_lock_page_out function:

cl_lock_at_page(env, lock->cll_descr.cld_obj,
page, lock, 0, 0);

Here the last two parameters are set to 0, which means CANCELPEND locks are not matched. Unfortunately, lock A is marked CANCELPEND because it blocks another lock. As a result, lock A is not found and the faulting page is truncated. Then another page fault happens and a vmpage with the same offset is created. This is why duplicate cl_pages were created and the assertion was hit.

So in the new fix, I simply revert the patch for bug 21122 and change the last two parameters of cl_lock_at_page to (..., 1, 0). Hopefully this will save our lives.
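
To make the effect of the two trailing parameters concrete, here is a toy model of the lookup. It is only a sketch under assumptions: `struct toy_lock`, `toy_lock_at_page` and `toy_page_discarded` are hypothetical names, the lock is reduced to a page-index extent plus two flags, and the parameter names `pending` and `canceld` are my labels for the last two arguments. It is not the real cl_lock_at_page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified lock: a page-index extent and two flags. */
struct toy_lock {
    unsigned long start;          /* first page index covered       */
    unsigned long end;            /* last page index covered        */
    bool          cancel_pending; /* models the CANCELPEND state    */
    bool          cancelled;      /* models an already-cancelled lock */
};

/* Find a lock other than 'except' that covers page 'index'.
 * 'pending' != 0: also match locks marked cancel_pending.
 * 'canceld' != 0: also match locks marked cancelled. */
static const struct toy_lock *
toy_lock_at_page(const struct toy_lock *locks, size_t nr,
                 unsigned long index, const struct toy_lock *except,
                 int pending, int canceld)
{
    assert(locks != NULL || nr == 0);

    for (size_t i = 0; i < nr; i++) {
        const struct toy_lock *l = &locks[i];

        if (l == except)
            continue;
        if (index < l->start || index > l->end)
            continue;
        if (l->cancel_pending && !pending)
            continue;
        if (l->cancelled && !canceld)
            continue;
        return l;
    }
    return NULL;
}

/* Scenario from the description: page 10 is covered by lock A
 * (cancel pending because it blocks another lock) and by lock B
 * (the lock being cancelled). Returns true if the page would be
 * discarded, i.e. no other covering lock is found. */
static bool toy_page_discarded(int pending)
{
    const struct toy_lock locks[2] = {
        { 0, 15, true,  false },  /* lock A */
        { 8, 31, false, false },  /* lock B */
    };
    const struct toy_lock *lock_b = &locks[1];

    return toy_lock_at_page(locks, 2, 10, lock_b, pending, 0) == NULL;
}
```

In this model `toy_page_discarded(0)` corresponds to the old (..., 0, 0) call, where lock A is skipped and the page is thrown away, while `toy_page_discarded(1)` corresponds to (..., 1, 0), where lock A is matched and the page stays.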

Is this tricky?

People

Assignee: Jinshan Xiong (Inactive)

Reporter: Jinshan Xiong (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 10/Mar/11 10:30 PM

Updated: 28/Jun/11 7:50 PM

Resolved: 25/Apr/11 10:55 AM