[LU-5071] statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed: Created: 16/May/14  Updated: 06/Sep/14  Resolved: 02/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Rajeshwaran Ganesan Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Attachments: File ddn_lustre_showall-uc1n996_2014-05-15_192030.tar.bz2     HTML File messages_uc1n055     HTML File messages_uc1n059     HTML File messages_uc1n129     HTML File messages_uc1n198     HTML File messages_uc1n468    
Issue Links:
Related
is related to LU-4558 Crash in cl_lock_put on racer Resolved
is related to LU-3498 most uses of IS_ERR_VALUE() are incor... Resolved
is related to LU-5274 ll_statahead_thread() may leak parent... Resolved
Severity: 2
Rank (Obsolete): 14002

 Description   

Hello,

We are seeing the following error messages on Lustre 2.5.1 clients, and they make the system unresponsive. Multiple clients were affected by this issue.

System Details: Lustre 2.5.1 / RHEL 6.5

Here are the node names, timestamps, and the corresponding message for each node:
May 4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
May 4 18:15:43 uc1n468 kernel: LustreError: 42888:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
May 4 18:54:19 uc1n059 kernel: LustreError: 111650:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
May 9 09:21:08 uc1n129 kernel: LustreError: 93767:0:(statahead.c:1704:do_statahead_enter()) LBUG
May 10 09:28:14 uc1n996 kernel: LustreError: 7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG
May 15 07:50:57 uc1n198 kernel: LustreError: 25007:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:



 Comments   
Comment by Peter Jones [ 16/May/14 ]

Rajesh

Could you please confirm that it is vanilla 2.5.1 on both servers and clients for this cluster? Are any other Lustre versions or patches involved?

Bobijam

Does this seem related to existing tickets LU-4797/4693/4558?

Thanks

Peter

Comment by Rajeshwaran Ganesan [ 18/May/14 ]

Servers are on 2.4.3
Clients are on 2.5.1

Comment by Zhenyu Xu [ 19/May/14 ]

The do_statahead_enter() LBUG can be fixed by this backported patch: http://review.whamcloud.com/10363

Comment by Zhenyu Xu [ 19/May/14 ]

The lovsub_lock_state() LBUG was fixed in the b2_5 branch; the patch is at http://review.whamcloud.com/9881

Comment by Rajeshwaran Ganesan [ 20/May/14 ]

Could you please provide a source RPM with the patches?

Comment by Peter Jones [ 20/May/14 ]

Rajesh

These are included by default. For example, http://review.whamcloud.com/#/c/10363/ has a link to the build on the Jenkins server http://build.whamcloud.com/job/lustre-reviews/23961/ Selecting the desired distro version allows you to drill into specific build artifacts - http://build.whamcloud.com/job/lustre-reviews/23961/arch=i686,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/ , say.

Peter

Comment by Ryan Haasken [ 20/Jun/14 ]

Zhenyu has identified two of the LBUGs as LU-3498 and LU-4558, and both of those bugs are fixed in b2_5 and master. Since the LBUG which is in the summary of this ticket has been fixed, should this bug be resolved?

I suppose there is still this LBUG:

May 10 09:28:14 uc1n996 kernel: LustreError: 7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG 

But without any information other than the location of the LBUG, I think this bug isn't helpful. There is no information about that LBUG in any of the attachments either, as far as I can tell. If the bug will be kept open for the osc_lock_wait() LBUG, would it be possible to update the summary and description so that it doesn't look like LU-3498?

Comment by Rajeshwaran Ganesan [ 30/Jun/14 ]

In regards to the above comment...

Is the above issue fixed in 2.5.2, or is it still an LBUG? Our customer saw the message once in the log.

May 10 09:28:14 uc1n996 kernel: LustreError:7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG
May 10 09:28:14 uc1n996 kernel: Pid: 7387, comm: less

Comment by Ryan Haasken [ 30/Jun/14 ]

Not a lot of information here to go on. The assertion which was triggered looks like the same one as in LU-1356, but that bug was fixed way back in 2.3.0 and 2.1.4.

Comment by Rajeshwaran Ganesan [ 30/Jun/14 ]

Does it reappear in 2.5.1 as well?

Comment by Li Xi (Inactive) [ 02/Jul/14 ]

ll_statahead_thread() calls ll_sai_get() at the very beginning, but does not call ll_sai_put() when ll_prep_md_op_data() fails. I think that might be the cause.

Here is a patch which tries to fix this problem.
http://review.whamcloud.com/#/c/10940/
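
The suspected imbalance between ll_sai_get() and ll_sai_put() can be illustrated with a small user-space sketch. This is not the Lustre code: struct sai, sai_get(), sai_put(), thread_leaky(), thread_balanced(), and refs_after() are hypothetical stand-ins, and the prep_fails flag merely simulates ll_prep_md_op_data() failing. It only shows the general shape of a reference leaked on an early error return, and of a single exit path that always drops it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the statahead info object; only the
 * reference count matters for this illustration. */
struct sai {
    int refcount;
};

void sai_get(struct sai *s) { s->refcount++; }
void sai_put(struct sai *s) { s->refcount--; }

/* Leaky shape: the early return on failure skips sai_put(). */
int thread_leaky(struct sai *s, int prep_fails)
{
    sai_get(s);
    if (prep_fails)
        return -ENOMEM;     /* reference taken above is never dropped */
    /* ... the real work would happen here ... */
    sai_put(s);
    return 0;
}

/* Balanced shape: every path leaves through the same label, so the
 * reference is dropped whether or not the preparation step failed. */
int thread_balanced(struct sai *s, int prep_fails)
{
    int rc = 0;

    sai_get(s);
    if (prep_fails) {
        rc = -ENOMEM;
        goto out;
    }
    /* ... the real work would happen here ... */
out:
    sai_put(s);
    return rc;
}

/* Runs fn on an object that starts with one reference and returns the
 * reference count left behind. */
int refs_after(int (*fn)(struct sai *, int), int prep_fails)
{
    struct sai s = { 1 };

    (void)fn(&s, prep_fails);
    return s.refcount;
}

/* Usage: the leaky variant leaves an extra reference on failure. */
void self_check(void)
{
    assert(refs_after(thread_leaky, 1) == 2);
    assert(refs_after(thread_balanced, 1) == 1);
}
```

In this sketch, a leftover reference means the object is never freed, which matches the kind of leak the patch above tries to address.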

Comment by Ryan Haasken [ 02/Jul/14 ]

I think that the following assertion is already fixed by LU-3498:

May  4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread, rc: -2816
May  4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:

That assertion is inside an if block which is only executed when do_statahead_enter() thinks that the thread creation failed.

Here is the relevant portion of do_statahead_enter() in version 2.5.1 (which has the LU-3498 bug):

        rc = PTR_ERR(kthread_run(ll_statahead_thread, parent,
                                 "ll_sa_%u", plli->lli_opendir_pid));
...
        if (IS_ERR_VALUE(rc)) {
...
                LASSERT(lli->lli_sai == NULL);
                RETURN(-EAGAIN);
        }

So with the fix for LU-3498, this code will not be executed unless the thread creation actually fails. If the thread creation fails, your patched code which does an extra ll_sai_put(sai) will not be executed anyway.

Unless I'm missing something, I don't think this patch belongs to this ticket. It probably is more appropriate to link that patch against LU-5274.
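
The false positive from IS_ERR_VALUE(rc) described above can be reproduced in user space. The following is a minimal sketch, assuming rc is a 32-bit int (as the logged "rc: -2816" suggests) and pointers are 64 bits wide. The pointer value 0xffff8800fffff500 is made up for illustration, and is_err_value64(), truncate_to_int(), buggy_check(), and full_width_check() are hypothetical helpers rather than kernel or Lustre functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-in for IS_ERR_VALUE(): true when the value, viewed as
 * an unsigned 64-bit integer, lies in the top 4095 values. */
bool is_err_value64(uint64_t v)
{
    return v >= (uint64_t)-4095;
}

/* Mimics storing PTR_ERR(ptr) in an int: only the low 32 bits survive and
 * are reinterpreted as a signed value (two's complement assumed). */
int32_t truncate_to_int(uint64_t ptr_bits)
{
    uint32_t low = (uint32_t)ptr_bits;

    return (int32_t)low;
}

/* Buggy shape: test the truncated int, which is sign-extended back to
 * 64 bits before the comparison. */
bool buggy_check(uint64_t ptr_bits)
{
    int32_t rc = truncate_to_int(ptr_bits);

    return is_err_value64((uint64_t)(int64_t)rc);
}

/* Full-width shape: test the pointer bits without truncating them. */
bool full_width_check(uint64_t ptr_bits)
{
    return is_err_value64(ptr_bits);
}

/* Usage: a valid-looking pointer whose low 32 bits are 0xfffff500. */
void self_check(void)
{
    uint64_t valid_ptr = 0xffff8800fffff500ULL;   /* made-up address */

    assert(truncate_to_int(valid_ptr) == -2816);  /* matches the logged rc */
    assert(buggy_check(valid_ptr));               /* misread as an error */
    assert(!full_width_check(valid_ptr));         /* not an error pointer */
}
```

Under these assumptions, a successful thread creation can be misread as a failure whenever the low 32 bits of the returned pointer happen to fall in the error range, which is consistent with the log showing "can't start ll_sa thread" followed by the assertion.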

Comment by Zhenyu Xu [ 02/Sep/14 ]

The 2.5.1 code has a glitch: kthread_run() returns a thread id, which can be a large value, and IS_ERR covers only the range (-1000, -1), which is too narrow. The 2.5.3 code does not have this issue.

2.5.1
#define IS_ERR(a) ((unsigned long)(a) > (unsigned long)-1000L)
2.5.3
# define IS_ERR_VALUE(x) ((x) >= (unsigned long)-4095)