
statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.5.1
    • None
    • 2
    • 14002

    Description

      Hello,

      We are seeing the following error messages on Lustre 2.5.1 clients, and they leave the system unresponsive. Multiple clients were affected by this issue.

      System Details: Lustre 2.5.1 / RHEL 6.5

      Here are the node names, timestamps, and the corresponding message for each:
      May 4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
      May 4 18:15:43 uc1n468 kernel: LustreError: 42888:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
      May 4 18:54:19 uc1n059 kernel: LustreError: 111650:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      May 9 09:21:08 uc1n129 kernel: LustreError: 93767:0:(statahead.c:1704:do_statahead_enter()) LBUG
      May 10 09:28:14 uc1n996 kernel: LustreError: 7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG
      May 15 07:50:57 uc1n198 kernel: LustreError: 25007:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:

      Attachments

        1. ddn_lustre_showall-uc1n996_2014-05-15_192030.tar.bz2
          206 kB
        2. messages_uc1n055
          22 kB
        3. messages_uc1n059
          39 kB
        4. messages_uc1n129
          11 kB
        5. messages_uc1n198
          174 kB
        6. messages_uc1n468
          31 kB

        Issue Links

          Activity

            [LU-5071] statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
            bobijam Zhenyu Xu added a comment -

            The 2.5.1 code has a glitch: kthread_run() returns a thread id which could be a big value, and IS_ERR covers only (-1000, -1), which is too narrow. The 2.5.3 code does not have this issue.

            2.5.1
            #define IS_ERR(a) ((unsigned long)(a) > (unsigned long)-1000L)
            
            2.5.3
            # define IS_ERR_VALUE(x) ((x) >= (unsigned long)-4095)
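
To make the difference between the two checks concrete, here is a minimal userspace sketch that rewrites both macros as plain C functions. The function names are hypothetical and exist only for this comparison; the comparisons themselves are copied from the two definitions quoted above.

```c
#include <stdbool.h>

/* Userspace copy of the 2.5.1 check quoted above:
 * true only for values strictly inside (-1000, 0). */
static bool is_err_2_5_1(long v)
{
	return (unsigned long)v > (unsigned long)-1000L;
}

/* Userspace copy of the 2.5.3 check quoted above:
 * true for values in [-4095, -1]. */
static bool is_err_value_2_5_3(long v)
{
	return (unsigned long)v >= (unsigned long)-4095;
}
```

With these definitions, a value such as -2816 (the rc reported in the uc1n055 log) is outside the 2.5.1 window but inside the 2.5.3 one, so the two checks disagree about it.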
            
            haasken Ryan Haasken added a comment -

            I think that the following assertion is already fixed by LU-3498:

            May  4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread, rc: -2816
            May  4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
            

            That assertion is inside an if block which is only executed when do_statahead_enter() thinks that the thread creation failed.

            Here is the relevant portion of do_statahead_enter() in version 2.5.1 (which has the LU-3498 bug):

                    rc = PTR_ERR(kthread_run(ll_statahead_thread, parent,
                                             "ll_sa_%u", plli->lli_opendir_pid));
            ...
                    if (IS_ERR_VALUE(rc)) {
            ...
                            LASSERT(lli->lli_sai == NULL);
                            RETURN(-EAGAIN);
                    }
            

            So with the fix for LU-3498, this code will not be executed unless the thread creation actually fails. If the thread creation fails, your patched code which does an extra ll_sai_put(sai) will not be executed anyway.

            Unless I'm missing something, I don't think this patch belongs to this ticket. It probably is more appropriate to link that patch against LU-5274.
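
One plausible way a successful kthread_run() could be mistaken for a failure in the 2.5.1 snippet above is truncation: if rc is a 32-bit int (an assumption; the snippet does not show its declaration), only the low 32 bits of the returned pointer survive, and those bits can land in the error range. The following userspace sketch illustrates that idea. The pointer value and all function names are hypothetical, and this is not the actual LU-3498 patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Low 32 bits of a pointer-sized value, reinterpreted as a signed int.
 * The narrowing conversion wraps on common two's-complement compilers. */
static int32_t truncated_rc(uint64_t task_ptr)
{
	return (int32_t)(uint32_t)task_ptr;
}

/* 2.5.1-style pattern: test the truncated, sign-extended rc. */
static bool looks_failed_truncated(uint64_t task_ptr)
{
	int32_t rc = truncated_rc(task_ptr);

	return (uint64_t)(int64_t)rc >= (uint64_t)-MAX_ERRNO;
}

/* Full-width pattern: test the pointer value before narrowing it. */
static bool looks_failed_full_width(uint64_t task_ptr)
{
	return task_ptr >= (uint64_t)-MAX_ERRNO;
}
```

For a hypothetical pointer value 0xffff8801fffff500, truncated_rc() yields -2816, so the truncated check reports a failure while the full-width check does not. A genuine error such as -12 is reported by both.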


            lixi Li Xi (Inactive) added a comment -

            ll_statahead_thread() calls ll_sai_get() at the very beginning, but does not call ll_sai_put() when ll_prep_md_op_data() fails. I think that might be the cause.

            Here is a patch which tries to fix this problem.
            http://review.whamcloud.com/#/c/10940/
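
The get/put imbalance described in this comment can be sketched in isolation. The struct and functions below are hypothetical stand-ins, not Lustre code and not the patch under review: sai_get()/sai_put() model ll_sai_get()/ll_sai_put(), and the prep_ok flag models whether ll_prep_md_op_data() succeeds.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the statahead info object;
 * only the reference count matters for this sketch. */
struct sai_sketch {
	int refcount;
};

static void sai_get(struct sai_sketch *sai) { sai->refcount++; }
static void sai_put(struct sai_sketch *sai) { sai->refcount--; }

/* Leaky shape: the early return on failure skips the put. */
static int thread_body_leaky(struct sai_sketch *sai, bool prep_ok)
{
	sai_get(sai);
	if (!prep_ok)
		return -1;	/* reference taken above is never dropped */
	/* ... statahead work would go here ... */
	sai_put(sai);
	return 0;
}

/* Balanced shape: every exit path goes through the put. */
static int thread_body_balanced(struct sai_sketch *sai, bool prep_ok)
{
	int rc = 0;

	sai_get(sai);
	if (!prep_ok) {
		rc = -1;
		goto out;
	}
	/* ... statahead work would go here ... */
out:
	sai_put(sai);
	return rc;
}
```

Starting from a refcount of 1, a failed run of the leaky variant leaves the count at 2, while the balanced variant returns it to 1. A leftover reference of this kind would keep the object alive after the thread exits.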

            rganesan@ddn.com Rajeshwaran Ganesan added a comment -

            Does it reappear in 2.5.1 as well?
            haasken Ryan Haasken added a comment -

            Not a lot of information here to go on. The assertion which was triggered looks like the same one as in LU-1356, but that bug was fixed way back in 2.3.0 and 2.1.4.


            rganesan@ddn.com Rajeshwaran Ganesan added a comment -

            Regarding the above comment:

            Is the above issue fixed in 2.5.2, or is it still an LBUG? Our customer saw the message once in the log.

            May 10 09:28:14 uc1n996 kernel: LustreError:7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG
            May 10 09:28:14 uc1n996 kernel: Pid: 7387, comm: less
            haasken Ryan Haasken added a comment -

            Zhenyu has identified two of the LBUGs as LU-3498 and LU-4558, and both of those bugs are fixed in b2_5 and master. Since the LBUG which is in the summary of this ticket has been fixed, should this bug be resolved?

            I suppose there is still this LBUG:

            May 10 09:28:14 uc1n996 kernel: LustreError: 7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG 
            

            But without any information other than the location of the LBUG, I think this bug isn't helpful. There is no information about that LBUG in any of the attachments either, as far as I can tell. If the bug will be kept open for the osc_lock_wait() LBUG, would it be possible to update the summary and description so that it doesn't look like LU-3498?

            pjones Peter Jones added a comment -

            Rajesh

            These are included by default. For example, http://review.whamcloud.com/#/c/10363/ has a link to the build on the Jenkins server http://build.whamcloud.com/job/lustre-reviews/23961/ Selecting the desired distro version allows you to drill into specific build artifacts - http://build.whamcloud.com/job/lustre-reviews/23961/arch=i686,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/ , say.

            Peter


            rganesan@ddn.com Rajeshwaran Ganesan added a comment -

            Could you please provide a source RPM with the patches?
            bobijam Zhenyu Xu added a comment -

            The lovsub_lock_state() LBUG was fixed in the b2_5 branch; the patch is at http://review.whamcloud.com/9881


            People

              bobijam Zhenyu Xu
              rganesan@ddn.com Rajeshwaran Ganesan