[LU-122] Revert bug 21122 since it causes deadlock
Created: 10/Mar/11  Updated: 28/Jun/11  Resolved: 25/Apr/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.1.0

Type: Bug
Priority: Blocker
Reporter: Jinshan Xiong (Inactive)
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 21122
Rank (Obsolete): 5060

 Description   

I recently hit a deadlock while running one of my tests. After analysing the log, I realized the deadlock was introduced by the patch for bug 21122. I then had to rethink that patch and work out the root cause of the original bug, and finally came up with a new fix.

Let me describe the deadlock briefly (in the code before this fix):
1. The page fault process holds the page lock and calls cl_unuse in cl_io_loop; cl_unuse then tries to take the cl_lock mutex to do its job.
2. Meanwhile, if the cl_lock is being cancelled, its mutex is already held and the pages covered by the lock are being evicted, so the cancelling thread tries to grab the page lock.
3. Deadlock: each side holds the lock the other one is waiting for.
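
The three steps above are a classic lock-order inversion. The following is a minimal sketch of that interleaving, not Lustre code: the toy_* names, the owner/waiter bookkeeping and the unlock_page_first switch are all hypothetical and only model who holds and who waits for each lock. The second variant assumes the fault path drops the page lock before calling into cl_unuse, in line with the later comment that vvp_io_fault_start has to return an unlocked page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors; these are not Lustre types. */
enum toy_thread { NOBODY = 0, FAULT_THREAD, CANCEL_THREAD };

/* A toy lock that only records its owner and the last blocked waiter. */
struct toy_lock {
    enum toy_thread owner;
    enum toy_thread waiter;
};

/* Take the lock if it is free; otherwise record the caller as waiting. */
static bool toy_try_lock(struct toy_lock *l, enum toy_thread who)
{
    if (l->owner == NOBODY) {
        l->owner = who;
        return true;
    }
    l->waiter = who;
    return false;
}

/* Release a lock; only the current owner may do so. */
static void toy_unlock(struct toy_lock *l, enum toy_thread who)
{
    assert(l->owner == who);
    l->owner = NOBODY;
}

/* A cycle exists when each lock's waiter is the owner of the other lock. */
static bool toy_deadlocked(const struct toy_lock *a, const struct toy_lock *b)
{
    return a->owner != NOBODY && b->owner != NOBODY &&
           a->waiter == b->owner && b->waiter == a->owner;
}

/* Replay the interleaving from the description; return true on deadlock. */
bool toy_fault_vs_cancel(bool unlock_page_first)
{
    struct toy_lock page_lock = { NOBODY, NOBODY };
    struct toy_lock cl_lock_mutex = { NOBODY, NOBODY };

    /* Step 1: the page fault path holds the page lock. */
    toy_try_lock(&page_lock, FAULT_THREAD);
    /* Step 2: the cancel path already holds the cl_lock mutex. */
    toy_try_lock(&cl_lock_mutex, CANCEL_THREAD);

    /* Assumed remedy: drop the page lock before calling cl_unuse. */
    if (unlock_page_first)
        toy_unlock(&page_lock, FAULT_THREAD);

    /* cl_unuse wants the mutex; page eviction wants the page lock. */
    toy_try_lock(&cl_lock_mutex, FAULT_THREAD);
    toy_try_lock(&page_lock, CANCEL_THREAD);

    return toy_deadlocked(&page_lock, &cl_lock_mutex);
}
```

In this model toy_fault_vs_cancel(false) reports the cycle, while toy_fault_vs_cancel(true) does not, because the cancel path can take the page lock and make progress.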

Let's go back and dig into the root cause of bug 21122:
From the log, we can see that the faulting page is actually covered by two locks, say lock A and lock B. Lock B is being cancelled while lock A is queued by the page fault process (this is why lock B won't be matched). However, there is a drawback in the cl_lock_page_out function, which calls:

cl_lock_at_page(env, lock->cll_descr.cld_obj,
                page, lock, 0, 0);

Here the last two parameters are set to 0, which means CANCELPEND locks are not matched. Unfortunately, lock A is marked CANCELPEND because it blocks another lock. As a result, the faulting page is truncated. Then another page fault happens and a vmpage with the same offset is created. This is why duplicate cl_pages were created and hit the assertion.
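
To make the effect of those two flags concrete, here is a simplified model of the matching rule, not the real cl_lock_at_page. The toy_* names and the struct layout are hypothetical. It assumes the fifth argument decides whether CANCELPEND locks may match, which is what the description implies; the meaning given to the sixth argument (matching locks that are already being cancelled) is a guess made only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified lock record; not the real struct cl_lock. */
struct toy_cl_lock {
    unsigned long start;    /* first page index covered */
    unsigned long end;      /* last page index covered, inclusive */
    bool cancel_pending;    /* models the CANCELPEND flag */
    bool cancelled;         /* models a lock that is being cancelled */
};

/* Return the first lock other than `except` that covers page `index`. */
const struct toy_cl_lock *
toy_lock_at_page(const struct toy_cl_lock *locks, size_t n,
                 unsigned long index, const struct toy_cl_lock *except,
                 bool pending, bool canceld)
{
    assert(locks != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct toy_cl_lock *l = &locks[i];

        if (l == except)
            continue;       /* never match the lock being cancelled */
        if (index < l->start || index > l->end)
            continue;       /* lock does not cover this page */
        if (l->cancel_pending && !pending)
            continue;       /* CANCELPEND locks skipped unless asked for */
        if (l->cancelled && !canceld)
            continue;       /* cancelled locks skipped unless asked for */
        return l;
    }
    return NULL;
}

/*
 * Scenario from the description: lock B is being cancelled, lock A covers
 * the same page but is CANCELPEND. Returns true if another lock is found,
 * i.e. the page does not have to be discarded.
 */
bool toy_page_still_covered(bool pending)
{
    struct toy_cl_lock locks[2] = {
        { 0, 15, true,  false },    /* lock A */
        { 0, 15, false, true  },    /* lock B */
    };

    return toy_lock_at_page(locks, 2, 7, &locks[1], pending, false) != NULL;
}
```

With pending set to 0 the model finds no other lock for the page, so the page would be thrown away even though lock A still covers it; with pending set to 1 lock A is matched and the page is kept.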

So in the new fix, I simply revert the patch for bug 21122 and change the last two parameters of cl_lock_at_page to (..., 1, 0), so that CANCELPEND locks are matched. Hopefully this will save our lives.

Is this tricky?



 Comments   
Comment by Build Master (Inactive) [ 11/Mar/11 ]

Integrated in reviews-centos5 #432
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : e2d57e76eaba3a975043a3e5b9eb920e8d9cec77
Files :

  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/obdclass/cl_page.c
Comment by Build Master (Inactive) [ 12/Mar/11 ]

Integrated in reviews-centos5 #440
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : e2d57e76eaba3a975043a3e5b9eb920e8d9cec77
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/obdclass/cl_page.c
  • lustre/llite/llite_mmap.c
Comment by Peter Jones [ 15/Mar/11 ]

Cliff

Can you please add this patch to the queue to test on Hyperion?

Thanks

Peter

Comment by Build Master (Inactive) [ 16/Mar/11 ]

Integrated in reviews-centos5 #489
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : 398fbf1a08b45a2292322a4e8396af5b623fbe31
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/obdclass/cl_page.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 16/Mar/11 ]

Integrated in reviews-centos5 #490
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : 026964d4ccae351e7aa5561fae976f6fe3fc2c55
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Cliff White (Inactive) [ 17/Mar/11 ]

I will be testing builds #490 and #210 (client) on Hyperion; they should be running today.

Comment by Oleg Drokin [ 18/Mar/11 ]

It would have been great if you had added your findings about the bug here as well, not just in the patch description in Gerrit.

I wonder whether your issue is rhel5-specific and clears up in rhel6 all by itself?
Regarding page locking, I actually preferred the way we had it after the patch in 21122, so that we don't need to get a locked page, unlock it and then lock it again. Can we fix this in some other way and still keep the page locked if we got it locked?

Comment by Jinshan Xiong (Inactive) [ 18/Mar/11 ]

I think the deadlock would happen on both rhel5 and rhel6.

WRT the page locking, I think we have to return an unlocked page in vvp_io_fault_start anyway, otherwise it would cause a deadlock. But we could make vvp_io_kernel_fault return a locked page (we would still have to do this tricky check in the filemap_nopage case, since it returns an unlocked page) and then unlock it in vvp_io_fault_start. That's acceptable to me if you think it would be much better.

Comment by Build Master (Inactive) [ 25/Mar/11 ]

Integrated in reviews-centos5 #570
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : cd180a0ef35d87cd4e64d71db8f52d3916b7afae
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 01/Apr/11 ]

Integrated in reviews-centos5 #636
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : a91a2a4fdd7550f08ae3b00f58f9eeec3ac3777b
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 01/Apr/11 ]

Integrated in lustre-reviews » server,el6 #51
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : a91a2a4fdd7550f08ae3b00f58f9eeec3ac3777b
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 01/Apr/11 ]

Integrated in lustre-reviews » client,el5 #51
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : a91a2a4fdd7550f08ae3b00f58f9eeec3ac3777b
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 01/Apr/11 ]

Integrated in lustre-reviews » client,el6 #51
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : a91a2a4fdd7550f08ae3b00f58f9eeec3ac3777b
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 01/Apr/11 ]

Integrated in lustre-reviews » server,el5 #51
LU-122 Revert the patch on bug 21122 and come up with a new fix

Jinshan Xiong : a91a2a4fdd7550f08ae3b00f58f9eeec3ac3777b
Files :

  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
Comment by Peter Jones [ 07/Apr/11 ]

Update from Bull: "Fix delivered, no new occurrence of the bug so far!"

Comment by Peter Jones [ 21/Apr/11 ]

Still no recurrences at CEA running with the patch.

Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/vvp_io.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Peter Jones [ 25/Apr/11 ]

Patch landed for 2.1. Please reopen if any further work is needed.

Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » i686,client,el5,ofa #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » i686,server,el5,ofa #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/obdclass/cl_lock.c
  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 25/Apr/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #43
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 27/Apr/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #45
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 27/Apr/11 ]

Integrated in lustre-master » i686,server,el5,ofa #45
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 27/Apr/11 ]

Integrated in lustre-master » i686,client,el5,ofa #45
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/llite_mmap.c
  • lustre/obdclass/cl_lock.c
  • lustre/llite/vvp_io.c
Comment by Build Master (Inactive) [ 27/Apr/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #45
LU-122 Revert the patch on bug 21122 and come up with a new fix

Oleg Drokin : 32b2ddf168b846ccf8c83329728905f6c5c8bbcb
Files :

  • lustre/llite/llite_mmap.c
  • lustre/llite/vvp_io.c
  • lustre/obdclass/cl_lock.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in hydra-server » x86_64,el5 #12
Initial 'events' interface (LU-104, LU-122)

john :
Files :

  • monitor/lib/lustre_audit.py
  • monitor/static/images/dialog-error.png
  • settings.py
  • monitor/urls.py
  • monitor/views.py
  • monitor/static/css/base.css
  • monitor/templates/events.html
  • monitor/models.py
  • monitor/bin/hydra-debug.py
  • monitor/static/images/dialog-warning.png
  • monitor/templates/base.html
  • monitor/static/images/dialog-information.png