[LU-508] Kernel panic on ...BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/extents.c:1920 Created: 19/Jul/11 Updated: 27/Mar/12 Resolved: 28/Nov/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marek Magrys | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | CentOS 5.5 with: |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 4841 |
| Description |
|
Hello, We observe recurring kernel panics caused by ldiskfs. The cause is unknown; the underlying hardware looks healthy. Netconsole dumps are in the attachment. Panics happen quite often, roughly once per day per server. We are trying to trace which job might be causing it, but it's not an easy task - if we catch the user/job, I'll update the ticket. Does anyone have any idea what could be wrong? Regards, |
| Comments |
| Comment by Peter Jones [ 19/Jul/11 ] |
|
Marek Are you running build 65 in production or is this some kind of test setup? Peter |
| Comment by Marek Magrys [ 19/Jul/11 ] |
|
Hello, It used to be a test setup, but now we are running it in a production environment while waiting for a 'stable' 2.1 to appear. However, not all users have been moved to the new FS yet. Marek |
| Comment by Peter Jones [ 19/Jul/11 ] |
|
Marek That is a bold move and I would not have advised using this build for production purposes. However, this is certainly an excellent test for the code (which has been standing up well under test scenarios). I will have an engineer review the information attached. Regards Peter |
| Comment by Marek Magrys [ 19/Jul/11 ] |
|
Peter, I think it's better to catch some bugs before the main release, as we would probably face them then anyway. Marek |
| Comment by Peter Jones [ 19/Jul/11 ] |
|
Bobi Does this relate to the work already in progress for LU-216? Alex seems to think that this is the case. Peter |
| Comment by Zhenyu Xu [ 19/Jul/11 ] |
|
Yes, it's a dup of |
| Comment by Zhenyu Xu [ 20/Jul/11 ] |
|
lustre-ldiskfs-3.3.0-2.6.18_238.12.1.el5_lustre - which branch tag was this module built from? I want to know because the patch of |
| Comment by Marek Magrys [ 20/Jul/11 ] |
|
That is a package from Jenkins, lustre-master, x86_64, server, el5, inkernel build #203. I've now installed the latest lustre-master build; I just need to reload the modules and we'll see if the bug is still there. According to your information the bug should have been fixed some time ago, so probably nothing will change here. Anyway, let's wait and see what happens. |
| Comment by Peter Jones [ 25/Jul/11 ] |
|
Marek What was the outcome of this? Based on the information supplied it seemed as if the existing Thanks Peter |
| Comment by Lukasz Flis [ 25/Jul/11 ] |
|
Hello Peter, Marek is on holiday, so let me comment on this: We are now running the following versions on the server side: Since the upgrade we haven't seen the error anymore (6 days). Lukasz Flis |
| Comment by Peter Jones [ 25/Jul/11 ] |
|
Thanks, Lukasz. Then I will not worry about this issue for now. |
| Comment by Marek Magrys [ 30/Jul/11 ] |
|
Hello, The error returned today, crashing 5 servers [almost] at once, so I guess that the fix from Regards, |
| Comment by Peter Jones [ 30/Jul/11 ] |
|
Reopening this ticket for further consideration |
| Comment by Zhenyu Xu [ 01/Aug/11 ] |
|
If I understand the log correctly, the panic happens on an OST read, is that right? Would you mind grabbing the crash dump and the associated System.map and kernel image file for analysis? |
| Comment by Marek Magrys [ 01/Aug/11 ] |
|
If I understand correctly, you want us to move to the kernel-debuginfo kernel image and fetch the crash dump, right? What is the proper way of doing it (do you have a guide, or is it just fire-and-forget)? |
| Comment by Zhenyu Xu [ 01/Aug/11 ] |
|
Yes, you can refer to http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes to configure kdump; when the server panics, it will |
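For reference, the kdump setup described in that wiki page boils down to a few steps. This is a rough sketch for a RHEL 5-era system; the crashkernel reservation size and the dump path are typical defaults, not values taken from this ticket:

```shell
# Install the kexec/kdump tooling (assumption: yum-based EL5 host).
yum install -y kexec-tools

# Reserve memory for the capture kernel: append to the kernel line
# in /boot/grub/grub.conf, e.g.
#   crashkernel=128M@16M
# then reboot so the reservation takes effect.

# Enable and start the kdump service.
chkconfig kdump on
service kdump start

# After a panic, the capture kernel boots and writes the vmcore,
# typically under /var/crash/<timestamp>/, for offline analysis
# with the 'crash' utility plus System.map and the vmlinux image.
```

The commands above are a configuration outline rather than a runnable script, since the grub edit and reboot are manual steps.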
| Comment by Marek Magrys [ 17/Aug/11 ] |
|
We are waiting for a crash to occur, but it stopped crashing after we enabled kdump. |
| Comment by Marek Magrys [ 03/Oct/11 ] |
|
The bug has struck again. We may have caught the user and the job whose input causes the bug to appear, but we're still not sure it was this job; we need to wait a few hours to confirm. However, we've got the kdump, which I will pass to Peter directly by e-mail, as we don't want to make it public. We are now using Lustre 2.1 RC2, with ldiskfs 3.3.0. The kernel version is 2.6.18-238.19.1.el5_lustre.g65156ed. |
| Comment by Oleg Drokin [ 03/Oct/11 ] |
|
So were you using a Jenkins build this time too? I see you referenced RC2, but the kernel still has a git hash in the version; a proper RC2 build should not have a hash in the version. |
| Comment by Marek Magrys [ 03/Oct/11 ] |
|
Yes, it was the latest Jenkins build available at the time, build #283 from lustre-master, el5, server, inkernel ofa. |
| Comment by Alexey Lyashkov [ 07/Oct/11 ] |
|
We hit that bug with 2.0.61 and RHEL6. |
| Comment by Oleg Drokin [ 07/Oct/11 ] |
|
a quick question, how many cpu cores do you have on the crashing system? |
| Comment by Marek Magrys [ 07/Oct/11 ] |
|
A quick answer: 12 |
| Comment by Prakash Surya (Inactive) [ 12/Oct/11 ] |
|
Any update on this issue? |
| Comment by Zhenyu Xu [ 12/Oct/11 ] |
|
extents.c:1920 is "BUG_ON(end <= start);". Checking the source code of ldiskfs_ext_walk_space():

    if (!ex) {
        /* there is no extent yet, so try to allocate
         * all requested space */
        start = block;
        end = block + num;
    } else if (le32_to_cpu(ex->ee_block) > block) {
        /* need to allocate space before found extent */
        start = block;
        end = le32_to_cpu(ex->ee_block);
        if (block + num < end)
            end = block + num;
    } else if (block >= le32_to_cpu(ex->ee_block) +
                        ldiskfs_ext_get_actual_len(ex)) {
        /* need to allocate space after found extent */
        start = block;
        end = block + num;
        if (end >= next)
            end = next;
    } else if (block >= le32_to_cpu(ex->ee_block)) {
        /* some part of requested space is covered
         * by found extent */
        start = block;
        end = le32_to_cpu(ex->ee_block) +
              ldiskfs_ext_get_actual_len(ex);
        if (block + num < end)
            end = block + num;
        exists = 1;
    } else {
        BUG();
    }
    BUG_ON(end <= start);

The only branch where 'end' can come out <= 'start' is the third one, where 'end' may be assigned the 'next' value, as Bzzz commented in
The 'next' value may not be consistent with 'path'. |
| Comment by Marek Magrys [ 21/Oct/11 ] |
|
Did you guys find anything in the crash dump? The bug is triggered by a user running the Turbomole software, but it's rather hard to extract a reproducer here. Do you have any ideas for a patch? |
| Comment by Zhenyu Xu [ 21/Oct/11 ] |
|
There's a patch at http://review.whamcloud.com/1492 (ORI-291) |
| Comment by Zhenyu Xu [ 26/Oct/11 ] |
|
The master porting patch is tracked at http://review.whamcloud.com/1618 |
| Comment by Build Master (Inactive) [ 24/Nov/11 ] |
|
Integrated in Result = SUCCESS
|