[LU-5071] statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.5.1
    • None
    • 2
    • 14002

    Description

      Hello,

      We are seeing the following error message on Lustre 2.5.1 clients, and it makes the system unresponsive. Multiple clients were affected by this issue.

      System Details: Lustre 2.5.1 / RHEL 6.5

      Here are the node names, timestamps, and the corresponding message for each:
      May 4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
      May 4 18:15:43 uc1n468 kernel: LustreError: 42888:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
      May 4 18:54:19 uc1n059 kernel: LustreError: 111650:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      May 9 09:21:08 uc1n129 kernel: LustreError: 93767:0:(statahead.c:1704:do_statahead_enter()) LBUG
      May 10 09:28:14 uc1n996 kernel: LustreError: 7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG
      May 15 07:50:57 uc1n198 kernel: LustreError: 25007:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:

      Attachments

        1. ddn_lustre_showall-uc1n996_2014-05-15_192030.tar.bz2
          206 kB
        2. messages_uc1n055
          22 kB
        3. messages_uc1n059
          39 kB
        4. messages_uc1n129
          11 kB
        5. messages_uc1n198
          174 kB
        6. messages_uc1n468
          31 kB

          Activity

            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.5.3 [ 11100 ]
            Labels Original: mq414 p4d
            bobijam Zhenyu Xu made changes -
            Fix Version/s New: Lustre 2.5.3 [ 11100 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            bobijam Zhenyu Xu added a comment -

            The 2.5.1 code has a glitch: kthread_run() returns a thread id, which could be a large value, and IS_ERR covers only the range (-1000, -1), which is too narrow. The 2.5.3 code does not have this issue.

            2.5.1
            #define IS_ERR(a) ((unsigned long)(a) > (unsigned long)-1000L)
            
            2.5.3
            # define IS_ERR_VALUE(x) ((x) >= (unsigned long)-4095)
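
To make the difference between the two checks concrete, here is a minimal user-space sketch that restates both macros as functions over a 64-bit value. The function names are hypothetical and exist only for this illustration; the comparisons are copied from the two definitions quoted above, assuming a 64-bit `unsigned long`.

```c
#include <assert.h>
#include <stdbool.h>

/* 2.5.1 check quoted above: true only for values strictly above -1000,
 * i.e. -999 .. -1 when the value is viewed as signed. */
static bool is_err_2_5_1(long long v)
{
    return (unsigned long long)v > (unsigned long long)-1000LL;
}

/* 2.5.3 check quoted above: true for -4095 .. -1. */
static bool is_err_value_2_5_3(long long v)
{
    return (unsigned long long)v >= (unsigned long long)-4095LL;
}
```

With these definitions, a value such as -2816 (the rc seen in the uc1n055 log) is outside the range accepted by the 2.5.1 check but inside the range accepted by the 2.5.3 check.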
            
            pjones Peter Jones made changes -
            Labels Original: mq314 p4d New: mq414 p4d
            orentas Oz Rentas (Inactive) made changes -
            Comment [ SR-32850 | Karlsruhe Institute of Technology ]
            haasken Ryan Haasken added a comment -

            I think that the following assertion is already fixed by LU-3498:

            May  4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread, rc: -2816
            May  4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
            

            That assertion is inside an if block which is only executed when do_statahead_enter() thinks that the thread creation failed.

            Here is the relevant portion of do_statahead_enter() in version 2.5.1 (which has the LU-3498 bug):

                    rc = PTR_ERR(kthread_run(ll_statahead_thread, parent,
                                             "ll_sa_%u", plli->lli_opendir_pid));
            ...
                    if (IS_ERR_VALUE(rc)) {
            ...
                            LASSERT(lli->lli_sai == NULL);
                            RETURN(-EAGAIN);
                    }
            

            So with the fix for LU-3498, this code will not be executed unless the thread creation actually fails. If the thread creation fails, your patched code which does an extra ll_sai_put(sai) will not be executed anyway.

            Unless I'm missing something, I don't think this patch belongs to this ticket. It probably is more appropriate to link that patch against LU-5274.
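
One way a successful kthread_run() could be mistaken for a failure is sketched below. This is an assumption made for illustration, not something confirmed in this ticket: it supposes that rc is a 32-bit int on a 64-bit kernel, so PTR_ERR() of a valid task pointer is truncated to its low 32 bits. The pointer value and the helper names are hypothetical; the value was chosen so that the truncated result matches the rc: -2816 in the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* User-space stand-in for IS_ERR_VALUE() applied to a full 64-bit value. */
static bool is_err_value64(uint64_t v)
{
    return v >= (uint64_t)-(int64_t)MAX_ERRNO;
}

/* Keep only the low 32 bits of a pointer-sized value, mimicking what
 * storing PTR_ERR(ptr) in a 32-bit int rc would do (assumption). */
static int32_t truncate_to_rc(uint64_t ptr_bits)
{
    uint32_t low = (uint32_t)ptr_bits;
    int32_t rc;

    memcpy(&rc, &low, sizeof(rc));
    return rc;
}

/* The check as applied to an int rc: rc is sign-extended to 64 bits
 * before the comparison, so a negative rc lands in the error range. */
static bool rc_looks_like_error(int32_t rc)
{
    return is_err_value64((uint64_t)(int64_t)rc);
}
```

For the hypothetical pointer 0xffff8800fffff500, is_err_value64() on the full value is false, but truncate_to_rc() yields -2816 and rc_looks_like_error(-2816) is true, so the failure branch containing the LASSERT would run even though the thread was started.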

            jhammond John Hammond made changes -
            Link New: This issue is related to LU-5274 [ LU-5274 ]
            pjones Peter Jones made changes -
            Labels Original: p4d New: mq314 p4d

            lixi Li Xi (Inactive) added a comment -

            ll_statahead_thread() calls ll_sai_get() at the very beginning, but does not call ll_sai_put() when ll_prep_md_op_data() fails. I think that might be the cause.

            Here is a patch which tries to fix this problem:
            http://review.whamcloud.com/#/c/10940/

            rganesan@ddn.com Rajeshwaran Ganesan added a comment -

            Does it reappear in 2.5.1 as well?

            People

              bobijam Zhenyu Xu
              rganesan@ddn.com Rajeshwaran Ganesan