  1. Lustre
  2. LU-122

Revert bug 21122 since it causes deadlock


Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.1.0
    • Lustre 2.0.0
    • None
    • 3
    • 21122
    • 5060

    Description

      While running one of my tests recently, I hit a deadlock. After analysing the log, I realized that the deadlock was introduced by the patch for bug 21122. I then had to rethink that patch and figure out the real root cause of bug 21122, and I finally came up with a new fix.

      Let me briefly describe the deadlock (in the code before the new fix):
      1. the page fault process holds the page lock and calls cl_unuse in cl_io_loop; cl_unuse then tries to take the cl_lock mutex to do its job;
      2. meanwhile, if the cl_lock is being cancelled, the cancelling thread already holds the cl_lock mutex and has to evict the pages covered by this lock, so it tries to grab the page lock;
      3. deadlock: each side waits for the lock that the other one holds.
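The lock ordering in the three steps above can be replayed with a toy model. This is only a sketch, not Lustre code: the two "resources" stand in for the page lock and the cl_lock mutex, and every name in it (toy_res, try_acquire, replay_deadlock) is hypothetical. It records who each side ends up waiting for and reports whether the two waits form a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors; these are not Lustre identifiers. */
enum { NOBODY = -1, FAULT_THREAD = 0, CANCEL_THREAD = 1 };

/* Toy resource: remembers only its current owner. */
struct toy_res {
    int owner;
};

/*
 * Try to take a resource without blocking.
 * Returns NOBODY on success, otherwise the current owner,
 * i.e. the actor the caller would have to wait for.
 */
static int try_acquire(struct toy_res *r, int who)
{
    assert(who == FAULT_THREAD || who == CANCEL_THREAD);
    if (r->owner == NOBODY) {
        r->owner = who;
        return NOBODY;
    }
    return r->owner;
}

/* Replays steps 1 and 2; returns true if the waits form a cycle. */
static bool replay_deadlock(void)
{
    struct toy_res page_lock = { NOBODY };
    struct toy_res cl_lock_mutex = { NOBODY };
    int waits_for[2] = { NOBODY, NOBODY };

    /* Step 1: the page fault path holds the page lock first. */
    try_acquire(&page_lock, FAULT_THREAD);
    /* Step 2: the cancel path holds the cl_lock mutex first. */
    try_acquire(&cl_lock_mutex, CANCEL_THREAD);

    /* Each side now asks for the resource the other one holds. */
    waits_for[FAULT_THREAD] = try_acquire(&cl_lock_mutex, FAULT_THREAD);
    waits_for[CANCEL_THREAD] = try_acquire(&page_lock, CANCEL_THREAD);

    /* Step 3: a two-node cycle in the wait-for graph is a deadlock. */
    return waits_for[FAULT_THREAD] == CANCEL_THREAD &&
           waits_for[CANCEL_THREAD] == FAULT_THREAD;
}
```

Calling replay_deadlock() returns true for this ordering. If both paths took the two resources in the same order, the second try_acquire of one side would be the only wait and no cycle would form.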

      Now let's go back and dig into the root cause of bug 21122:
      From the log, we can see that the faulting page is actually covered by two locks, say lock A and lock B. Lock B is being cancelled while lock A is queued by the page fault process (this is why lock B won't be matched). However, there is a flaw in the cl_lock_page_out function, which calls:

      cl_lock_at_page(env, lock->cll_descr.cld_obj,
      page, lock, 0, 0);

      Here the last two parameters are set to 0, which means that CANCELPEND locks are not matched. Unfortunately, lock A is marked CANCELPEND because it blocks another lock, so it is not found as covering the page. As a result, the faulting page gets truncated. Then another page fault happens and a vmpage with the same offset is created. This is why duplicate cl_pages were created and the assertion was hit.

      So in the new fix, I simply revert the patch for bug 21122 and change the parameters of cl_lock_at_page to (..., 1, 0), so that CANCELPEND locks such as lock A are matched as well. Hopefully this fixes both the deadlock and the original assertion.
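To illustrate why the two flags matter, here is a minimal sketch of a page-coverage lookup. It is not the real cl_lock_at_page: the struct, the field names and the helper functions are hypothetical, and the assumption (taken from the description above) is that the first flag controls whether CANCELPEND locks are matched and the second controls whether cancelled locks are matched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cl_lock. */
struct toy_lock {
    unsigned long start, end; /* page index range covered, inclusive */
    bool cancel_pending;      /* models the CANCELPEND state */
    bool cancelled;           /* models an already cancelled lock */
};

/*
 * Find a lock other than `except` that covers page `index`.
 * `pending` and `canceld` say whether cancel-pending and cancelled
 * locks may be matched (assumed meaning of the last two arguments).
 */
static const struct toy_lock *
toy_lock_at_page(const struct toy_lock *locks, size_t n,
                 unsigned long index, const struct toy_lock *except,
                 int pending, int canceld)
{
    assert(locks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct toy_lock *l = &locks[i];

        if (l == except)
            continue;
        if (index < l->start || index > l->end)
            continue;
        if (l->cancel_pending && !pending)
            continue;
        if (l->cancelled && !canceld)
            continue;
        return l;
    }
    return NULL;
}

/*
 * Scenario from the description: locks A and B both cover the page,
 * B is being cancelled, A is CANCELPEND. Returns true if the page
 * would be thrown away because no other covering lock is found.
 */
static bool toy_page_discarded(int pending)
{
    struct toy_lock locks[2] = {
        { 0, 10, true,  false }, /* lock A: CANCELPEND */
        { 0, 10, false, false }, /* lock B: being cancelled */
    };
    const struct toy_lock *lock_b = &locks[1];

    return toy_lock_at_page(locks, 2, 5, lock_b, pending, 0) == NULL;
}
```

With the old arguments, toy_page_discarded(0) returns true: lock A is skipped and the page looks uncovered. With the new arguments, toy_page_discarded(1) returns false, because lock A is matched and the page is kept.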

      Is this tricky?

          People

            jay Jinshan Xiong (Inactive)