[LU-4561] threads stuck waiting on page bit Created: 29/Jan/14 Updated: 03/Jun/14 Resolved: 03/Jun/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | MB | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12451 | ||||||||
| Description |
|
As of yesterday, when testing master with mmstress, I saw a huge number of threads stuck waiting here, with IO failing to complete:
sleep_on_page+0xe/0x20
wait_on_page_bit+0x74/0x80
vvp_io_fault_start+0x855/0xc20 [lustre]
cl_io_start+0x72/0x140 [obdclass]
cl_io_loop+0xac/0x1a0 [obdclass]
ll_page_mkwrite+0x280/0x6c0 [lustre]
__do_fault+0xe7/0x570
handle_pte_fault+0xa4/0xcc0
handle_mm_fault+0x1ae/0x240
do_page_fault+0x18f/0x420
page_fault+0x1f/0x30
0x200007ea
0xffffffffffffffff
Effectively, they seem to be unable to do page faulting. We ran a quick Cray IO regression suite on a system and many (or perhaps most) of those tests failed as well.
I looked at the list of new commits since I had last built & used master successfully, and this one jumped out at me:
Release the dir page cache in llite/lmv, so the page will be hold until entires was filled by filldir.
Signed-off-by: wang di <di.wang@intel.com>
But I reverted only this commit and problems continued. I rolled back about a week of commits to get back to something I knew was good. I rolled back everything after this and the problem went away:
Revert "
This seems to be causing issues like
Change-Id: I6066a255ded24dbdb76b4804e82a377f1069af5f |
| Comments |
| Comment by Oleg Drokin [ 29/Jan/14 ] |
|
This is the patch from Jinshan I am testing ATM to combat this (I also see it in my testing):
--- a/lustre/llite/rw26.c
+++ b/lustre/llite/rw26.c
@@ -546,7 +546,8 @@ static int ll_write_begin(struct file *file, struct address_space *mapping,
/* To avoid deadlock, try to lock page first. */
vmpage = grab_cache_page_nowait(mapping, index);
- if (unlikely(vmpage == NULL || PageDirty(vmpage))) {
+ if (unlikely(vmpage == NULL || PageDirty(vmpage) ||
+ PageWriteback(vmpage))) {
struct ccc_io *cio = ccc_env_io(env);
struct cl_page_list *plist = &cio->u.write.cui_queue;
@@ -555,7 +556,7 @@ static int ll_write_begin(struct file *file, struct address_space *mapping,
* because it holds page lock of a dirty page and request for
* more grants. It's okay for the dirty page to be the first
* one in commit page list, though. */
- if (vmpage != NULL && PageDirty(vmpage) && plist->pl_nr > 0) {
+ if (vmpage != NULL && plist->pl_nr > 0) {
unlock_page(vmpage);
page_cache_release(vmpage);
vmpage = NULL;
|
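For readers following the diff, the two condition changes in ll_write_begin() can be modelled in user space. This is only a sketch of the boolean logic as it appears in the patch hunks above. The struct page_state type and the slow_path_*/drop_page_* helpers are hypothetical names introduced for illustration and do not exist in Lustre. The flags stand in for the result of grab_cache_page_nowait(), PageDirty() and PageWriteback(), and queued stands in for plist->pl_nr.

```c
#include <stdbool.h>

/* Hypothetical model of the page state checked in ll_write_begin(). */
struct page_state {
    bool present;   /* grab_cache_page_nowait() returned a page */
    bool dirty;     /* stands in for PageDirty(vmpage) */
    bool writeback; /* stands in for PageWriteback(vmpage) */
};

/* First hunk, before the patch: missing or dirty pages take the branch. */
static bool slow_path_old(struct page_state p)
{
    return !p.present || p.dirty;
}

/* First hunk, after the patch: pages under writeback take it as well. */
static bool slow_path_new(struct page_state p)
{
    return !p.present || p.dirty || p.writeback;
}

/* Second hunk, before the patch: only a dirty page is unlocked and
 * released when other pages are already queued. */
static bool drop_page_old(struct page_state p, int queued)
{
    return p.present && p.dirty && queued > 0;
}

/* Second hunk, after the patch: any page that reached this branch is
 * unlocked and released when other pages are already queued. */
static bool drop_page_new(struct page_state p, int queued)
{
    return p.present && queued > 0;
}
```

With a page that is present, clean and under writeback, slow_path_old() returns false while slow_path_new() returns true, and drop_page_new() returns true whenever queued is greater than zero. Whether that fully accounts for the hang reported here is not established in this ticket.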
| Comment by Jinshan Xiong (Inactive) [ 30/Jan/14 ] |
|
probably a duplicate of |
| Comment by Jodi Levi (Inactive) [ 03/Jun/14 ] |
|
Reopening to remove fix version as it is a duplicate. |