Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.6.0
    • 3
    • 12451

    Description

      As of yesterday, when testing master with mmstress, I saw a huge number of threads stuck waiting here, with IO failing to complete:

      sleep_on_page+0xe/0x20;
      wait_on_page_bit+0x74/0x80;
      vvp_io_fault_start+0x855/0xc20 [lustre];
      cl_io_start+0x72/0x140 [obdclass];
      cl_io_loop+0xac/0x1a0 [obdclass];
      ll_page_mkwrite+0x280/0x6c0 [lustre];
      __do_fault+0xe7/0x570;
      handle_pte_fault+0xa4/0xcc0;
      handle_mm_fault+0x1ae/0x240;
      do_page_fault+0x18f/0x420;
      page_fault+0x1f/0x30; 0x200007ea; 0xffffffffffffffff
      

      In effect, these threads seem unable to complete page faults. We also ran a quick Cray IO regression suite on a system, and many (perhaps most) of those tests failed as well.

      I looked at the list of new commits since I had last built & used master successfully, and this one jumped out at me:

      LU-3531 mdc: release dir page cache after accessing

      Release the dir page cache in llite/lmv, so the page will be hold until entires was filled by filldir.

      Signed-off-by: wang di <di.wang@intel.com>
      Change-Id: I8b24bec74b14ff2b65130c02294821fc16ca1421
      Reviewed-on: http://review.whamcloud.com/8935
      Tested-by: Jenkins
      Reviewed-by: John L. Hammond <john.hammond@intel.com>
      Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
      Tested-by: Oleg Drokin <oleg.drokin@intel.com>

      But I reverted only this commit and problems continued.

      I then rolled back about a week of commits to get back to something I knew was good. With everything after the following commit rolled back, the problem went away:
      commit b9b4614c1e302058ed9863b1ab73b7def2c5c924
      Author: Oleg Drokin <oleg.drokin@intel.com>
      Date: Mon Jan 20 23:10:06 2014 +0000

      Revert "LU-3319 procfs: move osp proc handling to seq_files"

      This seems to be causing issues like LU-45-13 and LU-4510
      This reverts commit a97e4898ad9e0b65f457b01bdfa954f7d7cd272d.

      Change-Id: I6066a255ded24dbdb76b4804e82a377f1069af5f
      Reviewed-on: http://review.whamcloud.com/8931
      Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
      Tested-by: Oleg Drokin <oleg.drokin@intel.com>

      That puts me 11 commits behind master (or it was 11 when I last checked). I'm not sure which patch caused the problem, but current master is broken.

          Activity

            [LU-4561] threads stuck waiting on page bit

            jlevi Jodi Levi (Inactive) added a comment - Reopening to remove fix version as it is a duplicate.

            jay Jinshan Xiong (Inactive) added a comment - Probably a duplicate of LU-4540.
            green Oleg Drokin added a comment -

            This is the patch from Jinshan that I am testing at the moment to combat this (I also see it in my testing):

            --- a/lustre/llite/rw26.c
            +++ b/lustre/llite/rw26.c
            @@ -546,7 +546,8 @@ static int ll_write_begin(struct file *file, struct address_space *mapping,
             
             	/* To avoid deadlock, try to lock page first. */
             	vmpage = grab_cache_page_nowait(mapping, index);
            -	if (unlikely(vmpage == NULL || PageDirty(vmpage))) {
            +	if (unlikely(vmpage == NULL || PageDirty(vmpage) ||
            +	    PageWriteback(vmpage))) {
             		struct ccc_io *cio = ccc_env_io(env);
             		struct cl_page_list *plist = &cio->u.write.cui_queue;
             
            @@ -555,7 +556,7 @@ static int ll_write_begin(struct file *file, struct address_space *mapping,
             		 * because it holds page lock of a dirty page and request for
             		 * more grants. It's okay for the dirty page to be the first
             		 * one in commit page list, though. */
            -		if (vmpage != NULL && PageDirty(vmpage) && plist->pl_nr > 0) {
            +		if (vmpage != NULL && plist->pl_nr > 0) {
             			unlock_page(vmpage);
             			page_cache_release(vmpage);
             			vmpage = NULL;
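
To make the effect of the two hunks easier to see, here is a minimal sketch of just the changed conditions, pulled out of the kernel context. It is not Lustre code: `struct page_state` and the four helper functions are hypothetical names invented for this illustration, and the model only covers the two `if` conditions shown in the diff, not the rest of `ll_write_begin()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what ll_write_begin() sees right after
 * grab_cache_page_nowait(); not a real kernel or Lustre type. */
struct page_state {
	bool present;   /* vmpage != NULL */
	bool dirty;     /* PageDirty(vmpage) */
	bool writeback; /* PageWriteback(vmpage) */
};

/* First hunk, before the patch: only a missing or dirty page
 * enters the branch. */
static bool enters_branch_old(struct page_state p)
{
	return !p.present || p.dirty;
}

/* First hunk, after the patch: a page under writeback enters too. */
static bool enters_branch_new(struct page_state p)
{
	return !p.present || p.dirty || p.writeback;
}

/* Second hunk, before the patch: inside the branch, the page is
 * unlocked and released only if it is dirty and pages are queued. */
static bool releases_page_old(struct page_state p, int pl_nr)
{
	return enters_branch_old(p) && p.present && p.dirty && pl_nr > 0;
}

/* Second hunk, after the patch: any page that reached the branch is
 * unlocked and released when pages are queued. */
static bool releases_page_new(struct page_state p, int pl_nr)
{
	return enters_branch_new(p) && p.present && pl_nr > 0;
}

/* Walk through the cases where the two versions agree or differ. */
void check_model(void)
{
	struct page_state wb    = { .present = true, .dirty = false, .writeback = true };
	struct page_state dirty = { .present = true, .dirty = true,  .writeback = false };
	struct page_state clean = { .present = true, .dirty = false, .writeback = false };

	/* Only the writeback-but-not-dirty page is treated differently. */
	assert(!enters_branch_old(wb) && enters_branch_new(wb));
	assert(!releases_page_old(wb, 1) && releases_page_new(wb, 1));
	assert(!releases_page_new(wb, 0));

	/* Dirty and clean pages behave the same before and after. */
	assert(releases_page_old(dirty, 1) == releases_page_new(dirty, 1));
	assert(!enters_branch_old(clean) && !enters_branch_new(clean));
}
```

In this model the only behavioural difference is for a page that is under writeback but not dirty: the patched conditions send it into the branch and, when the commit page list is non-empty, unlock and release it.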
            

            People

              wc-triage WC Triage
              paf Patrick Farrell (Inactive)