[LU-508] Kernel panic on ...BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/extents.c:1920 Created: 19/Jul/11  Updated: 27/Mar/12  Resolved: 28/Nov/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.2.0, Lustre 2.1.1

Type: Bug Priority: Critical
Reporter: Marek Magrys Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 5.5 with:
kernel-2.6.18-238.12.1.el5_lustre
lustre-2.0.65-2.6.18_238.12.1.el5_lustre
lustre-ldiskfs-3.3.0-2.6.18_238.12.1.el5_lustre


Attachments: File paniclogs.tar.gz    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 4841

 Description   

Hello,

We are seeing recurring kernel panics caused by ldiskfs. The cause is unknown, and the hardware underneath looks healthy. Netconsole dumps are in the attachment. The panics happen quite often, more or less once per day per server. We are trying to trace which job might be causing them, but it's not an easy task. If we catch the user/job, I'll update the ticket. Does anyone have any idea what could be wrong?

Regards,
Marek Magrys



 Comments   
Comment by Peter Jones [ 19/Jul/11 ]

Marek

Are you running build 65 in production or is this some kind of test setup?

Peter

Comment by Marek Magrys [ 19/Jul/11 ]

Hello,

It used to be a test setup; now we are running it in a production environment, waiting for 'stable' 2.1 to appear. However, not all the users have been moved to the new FS yet.

Marek

Comment by Peter Jones [ 19/Jul/11 ]

Marek

That is a bold move and I would not have advised using this build for production purposes. However, this is certainly an excellent test for the code (which has been standing up well under test scenarios). I will have an engineer review the information attached.

Regards

Peter

Comment by Marek Magrys [ 19/Jul/11 ]

Peter,

I think it's better to catch some bugs before the main release, as we would probably face them then anyway. We are of course aware that using the 2.1 code is still a rather extravagant move.

Marek

Comment by Peter Jones [ 19/Jul/11 ]

Bobi

Does this relate to the work already in progress for LU-216? Alex seems to think that this is the case.

Peter

Comment by Zhenyu Xu [ 19/Jul/11 ]

yes, it's dup of LU-216.

Comment by Zhenyu Xu [ 20/Jul/11 ]

lustre-ldiskfs-3.3.0-2.6.18_238.12.1.el5_lustre - which branch tag does this module build upon?

I want to know this because the patch for LU-216 has already been included in the 2.0.65 build, but I cannot tell whether it's in ldiskfs-3.3.0 or not: the ldiskfs version number has not been changed for a long time.

Comment by Marek Magrys [ 20/Jul/11 ]

That is a package from Jenkins: lustre-master, x86_64,server,el5,inkernel, build #203. I've now installed the latest build for lustre-master; I just need to reload the modules and we'll see if the bug is still there. According to your information the bug should have been fixed some time ago, so probably nothing will change here. Anyway, let's wait and see what happens.

Comment by Peter Jones [ 25/Jul/11 ]

Marek

What was the outcome of this? Based on the information supplied, it seemed as if the existing LU-216 fix should already have been in place, so I would be surprised if that was the solution.

Thanks

Peter

Comment by Lukasz Flis [ 25/Jul/11 ]

Hello Peter,

Marek is on a holiday. Let me comment on this:

We are now running following versions on the server side:
lustre-2.0.65-2.6.18_238.12.1.el5_lustre_ga34dd87
lustre-modules-2.0.65-2.6.18_238.12.1.el5_lustre_ga34dd87
lustre-ldiskfs-3.3.0-2.6.18_238.12.1.el5_lustre_ga34dd87

Since upgrade we haven't seen the error anymore (6 days).
I think the ticket can be closed. If the problem reappears we will let you know.

Lukasz Flis

ACC Cyfronet AGH
Cracow, Poland

Comment by Peter Jones [ 25/Jul/11 ]

thanks Lukasz. Then I will not worry about this issue for now.

Comment by Marek Magrys [ 30/Jul/11 ]

Hello,

The error returned today, crashing 5 servers at [almost] once, so I guess that the fix from LU-216 didn't do the trick. We are still working on tracing the user who may be causing the crashes, as the correlation does not look accidental.

Regards,
Marek Magrys

Comment by Peter Jones [ 30/Jul/11 ]

Reopening this ticket for further consideration

Comment by Zhenyu Xu [ 01/Aug/11 ]

If I understand the log correctly, the panic happens on OST reads, is that right?

Would you mind grabbing the crash dump and the associated System.map and kernel image file for analysis?

Comment by Marek Magrys [ 01/Aug/11 ]

If I understand correctly, you want us to move to the kernel-debuginfo kernel image and fetch the crash dump, right? What is the proper way of doing it (do you have a guide, or is it just fire-and-forget)?
It's hard to tell whether the crash happens on reads or writes, because the workload is mixed. However, we are trying to find out which user's application (if any particular one) is causing the bug to appear.

Comment by Zhenyu Xu [ 01/Aug/11 ]

Yes, you can refer to http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes to configure kdump. When the server panics, it will boot into the capture kernel and save a kernel dump.

Comment by Marek Magrys [ 17/Aug/11 ]

We are waiting for a crash to occur, but it stopped crashing after we enabled kdump. Whatever was causing the panic will probably return. However, maybe we should close this bug and I'll ask for it to be reopened if the problem comes back; hopefully I'll be able to provide kernel dumps then.

Comment by Marek Magrys [ 03/Oct/11 ]

The bug has struck again. We might have caught the user and the job, with its input, that caused the bug to appear, but we're still not sure it was this job; we need to wait a few hours to confirm this. However, we've got the kdump, which I will pass to Peter directly by e-mail, as we don't want to make it public. We are now using Lustre 2.1RC2, with ldiskfs 3.3.0. The kernel version is 2.6.18-238.19.1.el5_lustre.g65156ed.

Comment by Oleg Drokin [ 03/Oct/11 ]

So were you using a Jenkins build this time too? I see you referenced RC2, but the kernel still had a git hash in the version. A proper RC2 build should not have a hash in the version.

Comment by Marek Magrys [ 03/Oct/11 ]

Yes, it was the latest jenkins build available at this time, build #283 from lustre-master, el5, server, inkernel ofa.

Comment by Alexey Lyashkov [ 07/Oct/11 ]

We hit that bug with 2.0.61 and RHEL6.
The patch to introduce a WALK_SPACE_HAS_DATA_SEM check already existed at that point.

Comment by Oleg Drokin [ 07/Oct/11 ]

a quick question, how many cpu cores do you have on the crashing system?

Comment by Marek Magrys [ 07/Oct/11 ]

A quick answer: 12

Comment by Prakash Surya (Inactive) [ 12/Oct/11 ]

Any update on this issue?

Comment by Zhenyu Xu [ 12/Oct/11 ]

extents.c:1920 is the line "BUG_ON(end <= start);". Checking the source code:

ldiskfs_ext_walk_space()
                if (!ex) {
                        /* there is no extent yet, so try to allocate
                         * all requested space */
                        start = block;
                        end = block + num;
                } else if (le32_to_cpu(ex->ee_block) > block) {
                        /* need to allocate space before found extent */
                        start = block;
                        end = le32_to_cpu(ex->ee_block);
                        if (block + num < end)
                                end = block + num;
                } else if (block >= le32_to_cpu(ex->ee_block)
                                        + ldiskfs_ext_get_actual_len(ex)) {
                        /* need to allocate space after found extent */
                        start = block;
                        end = block + num;
                        if (end >= next)
                                end = next;
                } else if (block >= le32_to_cpu(ex->ee_block)) {
                        /*
                         * some part of requested space is covered
                         * by found extent
                         */
                        start = block;
                        end = le32_to_cpu(ex->ee_block)
                                + ldiskfs_ext_get_actual_len(ex);
                        if (block + num < end)
                                end = block + num;
                        exists = 1;
                } else {
                        BUG();
                }
                BUG_ON(end <= start);

The only branch in which (end <= start) can occur is the 3rd if block, where 'end' may be assigned the 'next' value, as Bzzz commented in LU-216:

Alex Zhuravlev added a comment - 06/Sep/11 7:35 AM
I tend to think ldiskfs_ext_next_allocated_block() should be called under i_data_sem together with ldiskfs_ext_find_extent(), otherwise ldiskfs_ext_next_allocated_block() is working on data being modified

The 'next' value may therefore be inconsistent with 'path'.
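
To make the reasoning above concrete, here is a minimal user-space sketch of the same range computation. It is not the ldiskfs code: the struct toy_extent type and the toy_walk_range() function are hypothetical names invented for this illustration, and 'next' is simply passed in as a parameter so that a stale value can be simulated. The function returns 0 in the situation where the kernel's BUG_ON(end <= start) would fire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent: first logical block and length. */
struct toy_extent {
	uint32_t ee_block;
	uint32_t ee_len;
};

/*
 * Mirrors the branch structure quoted above. 'next' plays the role of the
 * next allocated block, which in the real code is computed separately from
 * the extent lookup. Returns 1 if the resulting range is valid (end > start)
 * and 0 if the kernel's BUG_ON(end <= start) would have triggered.
 */
static int toy_walk_range(const struct toy_extent *ex, uint32_t block,
			  uint32_t num, uint32_t next,
			  uint32_t *start, uint32_t *end)
{
	assert(num > 0);

	if (ex == NULL) {
		/* no extent yet: the whole request is a hole */
		*start = block;
		*end = block + num;
	} else if (ex->ee_block > block) {
		/* hole before the found extent: ee_block > block, so end > start */
		*start = block;
		*end = ex->ee_block;
		if (block + num < *end)
			*end = block + num;
	} else if (block >= ex->ee_block + ex->ee_len) {
		/* hole after the found extent: the only branch that trusts 'next' */
		*start = block;
		*end = block + num;
		if (*end >= next)
			*end = next;
	} else {
		/* block lies inside the extent, so end > start here as well */
		*start = block;
		*end = ex->ee_block + ex->ee_len;
		if (block + num < *end)
			*end = block + num;
	}
	return *end > *start;
}
```

For example, with an extent covering blocks 100..109, a request for 8 blocks at block 120 with next = 200 yields the range [120, 128). If 'next' is a stale value such as 115, taken from a different state of the tree than the extent that was found, 'end' is clamped to 115, which is below 'start', and the function returns 0.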

Comment by Marek Magrys [ 21/Oct/11 ]

Did you guys find anything in the crash dump? The bug is triggered by a user running the Turbomole software, but it's rather hard to extract reproducer code from it. Do you have any ideas for a patch?

Comment by Zhenyu Xu [ 21/Oct/11 ]

there's a patch at http://review.whamcloud.com/1492 (ORI-291)

Comment by Zhenyu Xu [ 26/Oct/11 ]

master porting patch tracking at http://review.whamcloud.com/1618

Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » i686,client,el5,ofa #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » i686,server,el5,ofa #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 24/Nov/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #364
LU-508 ldiskfs: fix race in ext4_ext_walk_space() (Revision 48452fbe583cf365d3c1f5be3c4272d30e198781)

Result = SUCCESS
Oleg Drokin : 48452fbe583cf365d3c1f5be3c4272d30e198781
Files :

  • lustre/lvfs/fsfilt_ext3.c
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel6.series
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-store-tree-generation-at-find.patch