[LU-5071] statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed: Created: 16/May/14 Updated: 06/Sep/14 Resolved: 02/Sep/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Rajeshwaran Ganesan | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 2 |
| Rank (Obsolete): | 14002 |
| Description |
|
Hello, We are seeing the following error message on Lustre 2.5.1 clients, and it makes the system unresponsive. Multiple clients were affected by this issue. System Details: Lustre 2.5.1 / RHEL 6.5 Here are the node names, time stamps and one corresponding message: |
| Comments |
| Comment by Peter Jones [ 16/May/14 ] |
|
Rajesh, could you please confirm that it is vanilla 2.5.1 on both servers and clients for this cluster? Are any other Lustre versions or patches involved? Bobijam, does this seem related to existing tickets? Thanks, Peter |
| Comment by Rajeshwaran Ganesan [ 18/May/14 ] |
|
Servers are on 2.4.3 |
| Comment by Zhenyu Xu [ 19/May/14 ] |
|
The do_statahead_enter() LBUG can be cured by this backport patch: http://review.whamcloud.com/10363 |
| Comment by Zhenyu Xu [ 19/May/14 ] |
|
The lovsub_lock_state() LBUG was fixed in the b2_5 branch; the patch is at http://review.whamcloud.com/9881 |
| Comment by Rajeshwaran Ganesan [ 20/May/14 ] |
|
Could you please provide a source RPM with the patches? |
| Comment by Peter Jones [ 20/May/14 ] |
|
Rajesh These are included by default. For example, http://review.whamcloud.com/#/c/10363/ has a link to the build on the Jenkins server http://build.whamcloud.com/job/lustre-reviews/23961/ Selecting the desired distro version allows you to drill into specific build artifacts - http://build.whamcloud.com/job/lustre-reviews/23961/arch=i686,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/ , say. Peter |
| Comment by Ryan Haasken [ 20/Jun/14 ] |
|
Zhenyu has identified two of the LBUGs as I suppose there is still this LBUG: May 10 09:28:14 uc1n996 kernel: LustreError: 7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG But without any information other than the location of the LBUG, I think this bug isn't helpful. There is no information about that LBUG in any of the attachments either, as far as I can tell. If the bug will be kept open for the osc_lock_wait() LBUG, would it be possible to update the summary and description so that it doesn't look like |
| Comment by Rajeshwaran Ganesan [ 30/Jun/14 ] |
|
In regards to the above comment... Is the above issue fixed in 2.5.2, or is it still an LBUG? Our customer saw the message once in the log. May 10 09:28:14 uc1n996 kernel: LustreError:7387:0:(osc_lock.c:1224:osc_lock_wait()) LBUG |
| Comment by Ryan Haasken [ 30/Jun/14 ] |
|
Not a lot of information here to go on. The assertion which was triggered looks like the same one as in |
| Comment by Rajeshwaran Ganesan [ 30/Jun/14 ] |
|
Does it reappear in 2.5.1 as well? |
| Comment by Li Xi (Inactive) [ 02/Jul/14 ] |
|
ll_statahead_thread() calls ll_sai_get() at the very beginning, but does not call ll_sai_put() when ll_prep_md_op_data() fails. I think that might be the cause. Here is a patch which tries to fix this problem. |
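The missing put described above is a common reference-counting slip on an early-exit path. The following is only a minimal user-space sketch of the pattern, not the Lustre code or the actual patch: `struct sai`, `sai_get()`, `sai_put()`, `prep_op_data()` and `thread_body()` are hypothetical stand-ins, and the failure is simulated with a flag.

```c
#include <assert.h>

/* Hypothetical stand-in for the statahead info object; not the Lustre struct. */
struct sai {
    int refcount;
};

/* Take one reference on the object. */
static struct sai *sai_get(struct sai *s)
{
    s->refcount++;
    return s;
}

/* Drop one reference; dropping more than were taken is a bug. */
static void sai_put(struct sai *s)
{
    assert(s->refcount > 0);
    s->refcount--;
}

/* Hypothetical preparation step that can fail (simulated by a flag). */
static int prep_op_data(int should_fail)
{
    return should_fail ? -12 : 0;
}

/*
 * Thread body sketch: once sai_get() has been called, every exit path,
 * including the early error exit, goes through the single sai_put() below.
 */
static int thread_body(struct sai *s, int should_fail)
{
    int rc;

    sai_get(s);

    rc = prep_op_data(should_fail);
    if (rc != 0)
        goto out_put;

    /* ... normal work would go here ... */

out_put:
    sai_put(s);
    return rc;
}
```

With a single exit label, the reference count after `thread_body()` returns is the same whether the preparation step succeeded or failed, which is the property the leaked reference breaks.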
| Comment by Ryan Haasken [ 02/Jul/14 ] |
|
I think that the following assertion is already fixed by
May 4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread, rc: -2816
May 4 11:03:28 uc1n055 kernel: LustreError: 1979:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed:
That assertion is inside an if block which is only executed when do_statahead_enter() thinks that the thread creation failed. Here is the relevant portion of do_statahead_enter() in version 2.5.1 (which has the
        rc = PTR_ERR(kthread_run(ll_statahead_thread, parent,
                                 "ll_sa_%u", plli->lli_opendir_pid));
        ...
        if (IS_ERR_VALUE(rc)) {
                ...
                LASSERT(lli->lli_sai == NULL);
                RETURN(-EAGAIN);
        }
So with the fix for Unless I'm missing something, I don't think this patch belongs to this ticket. It probably is more appropriate to link that patch against |
| Comment by Zhenyu Xu [ 02/Sep/14 ] |
|
The 2.5.1 code has a glitch: kthread_run() returns a thread id, which could be a big value, and IS_ERR defines (-1000, -1), which is too narrow. The 2.5.3 code does not have this issue.
2.5.1: #define IS_ERR(a) ((unsigned long)(a) > (unsigned long)-1000L)
2.5.3: # define IS_ERR_VALUE(x) ((x) >= (unsigned long)-4095)
|
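To make the difference between the two range checks quoted above concrete, here is a small user-space sketch. The macro bodies are copied from the comment; the names `IS_ERR_NARROW` and `IS_ERR_VALUE_WIDE` and the helper functions are hypothetical, chosen only so both definitions can coexist in one file. It illustrates the two windows and is not a reproduction of the kernel code path.

```c
#include <assert.h>
#include <stdbool.h>

/* Macro bodies as quoted in the comment, under hypothetical names. */
#define IS_ERR_NARROW(a)     ((unsigned long)(a) > (unsigned long)-1000L)
#define IS_ERR_VALUE_WIDE(x) ((x) >= (unsigned long)-4095)

/* True if the 1000-wide window treats rc as an error code. */
static bool narrow_says_error(long rc)
{
    return IS_ERR_NARROW(rc);
}

/* True if the 4095-wide window treats rc as an error code. */
static bool wide_says_error(long rc)
{
    return IS_ERR_VALUE_WIDE((unsigned long)rc);
}

/* Self-check of a few sample values against both windows. */
static void self_check(void)
{
    /* A small errno-style value falls inside both windows. */
    assert(narrow_says_error(-12) && wide_says_error(-12));
    /* -2816, the rc printed in the log above, is only inside the wide one. */
    assert(!narrow_says_error(-2816) && wide_says_error(-2816));
    /* Ordinary positive values are outside both. */
    assert(!narrow_says_error(1234) && !wide_says_error(1234));
}
```

The two checks agree only on values between -999 and -1; anything from -4095 to -1000 is classified differently, so code that mixes the two definitions can disagree about whether a call failed.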