[LU-3010] client crashes on RHEL6 with Lustre 1.8.8 Created: 21/Mar/13  Updated: 06/Nov/13  Resolved: 06/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Kit Westneat (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-1734 ptlrpc_master_callback() LBUG Closed
Severity: 3
Rank (Obsolete): 7331

 Description   

After upgrading the clients at NOAA to RHEL6 and Lustre 1.8.8, we're running into kernel panics. Here is the analysis from Red Hat:

I have checked the provided vmcore; my diagnosis is below. It appears that the libcfs module is causing the crash via a NULL pointer dereference. libcfs is a third-party module not shipped by Red Hat and is probably provided by Lustre. Please check with the Lustre vendor for further investigation. A more detailed analysis from the vmcore follows.

Go through the analysis below and contact the Lustre team. If any further assistance is needed on the OS side, let us know.

~~~~~~~~~~~~~~~~~~~~~~~~~~
[1] OOPS at "kiblnd_sd_04"

KERNEL: vmlinux
DUMPFILE: vmcore.flat [PARTIAL DUMP]
CPUS: 24
DATE: Thu Mar 7 07:10:37 2013
UPTIME: 1 days, 11:34:39
LOAD AVERAGE: 6.40, 6.05, 6.59
TASKS: 655
NODENAME: r10i0n5
RELEASE: 2.6.32-279.5.2.el6.x86_64
VERSION: #1 SMP Tue Aug 14 11:36:39 EDT 2012
MACHINE: x86_64 (3466 Mhz)
MEMORY: 24 GB
PANIC: "Oops: 0010 [#1] SMP " (check log for details)
PID: 2889
COMMAND: "kiblnd_sd_04"
TASK: ffff8803263b7540 [THREAD_INFO: ffff880325c14000]
CPU: 12
STATE: TASK_RUNNING (PANIC)

  • The following PID is probably the one that hit the panic.
    crash> ps | grep 2889
    > 2889 2 12 ffff8803263b7540 RU 0.0 0 0 [kiblnd_sd_04]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~

[2] There seems to have been some resource constraint for the PID in question, 2889.
crash> bt
PID: 2889 TASK: ffff8803263b7540 CPU: 12 COMMAND: "kiblnd_sd_04"
#0 [ffff880325c155a0] machine_kexec at ffffffff8103281b
#1 [ffff880325c15600] crash_kexec at ffffffff810ba8e2
#2 [ffff880325c156d0] oops_end at ffffffff81501510
#3 [ffff880325c15700] no_context at ffffffff81043bab
#4 [ffff880325c15750] __bad_area_nosemaphore at ffffffff81043e35
#5 [ffff880325c157a0] bad_area_nosemaphore at ffffffff81043f03
#6 [ffff880325c157b0] __do_page_fault at ffffffff81044661
#7 [ffff880325c158d0] do_page_fault at ffffffff815034ee
#8 [ffff880325c15900] page_fault at ffffffff815008a5
#9 [ffff880325c15a58] libcfs_debug_dumpstack at ffffffffa04808f5 [libcfs]
#10 [ffff880325c15a78] lbug_with_loc at ffffffffa0480f25 [libcfs]
#11 [ffff880325c15ac8] libcfs_assertion_failed at ffffffffa0489696 [libcfs]
#12 [ffff880325c15b18] lnet_match_md at ffffffffa04e7cdc [lnet]
#13 [ffff880325c15be8] lnet_parse at ffffffffa04ece8a [lnet]
#14 [ffff880325c15ce8] kiblnd_handle_rx at ffffffffa06ca2fb [ko2iblnd]
#15 [ffff880325c15d78] kiblnd_rx_complete at ffffffffa06caea2 [ko2iblnd]
#16 [ffff880325c15df8] kiblnd_complete at ffffffffa06cb092 [ko2iblnd]
#17 [ffff880325c15e38] kiblnd_scheduler at ffffffffa06cb3af [ko2iblnd]
#18 [ffff880325c15f48] kernel_thread at ffffffff8100c14a
~~~~~~~~~~~~~~~~~~~~~~~~~~

[3] I see a few segfaults and LustreError messages prior to the crash:

crash> log

PROLOGUE-CHKNODE Wed Mar 6 18:55:45 UTC 2013 Job 16422368: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422368.bqs1.zeus.fairmont.rdhp
cs.noaa.gov
remap_polar_net[778]: segfault at 100000009 ip 0000000000410f6c sp
00007fffffffd4f0 error 4 in remap_polar_netcdf.exe[400000+16c000]
EPILOGUE-CHKNODE Wed Mar 6 18:55:48 UTC 2013 Job 16422363: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422363.bqs1.zeus.fairmont.rdhp
cs.noaa.gov
EPILOGUE-CHKNODE Wed Mar 6 18:55:51 UTC 2013 Job 16422308: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422308.bqs1.zeus.fairmont.rdhp
cs.noaa.gov
remap_polar_net[1382]: segfault at 100000009 ip 0000000000410f6c sp
00007fffffffd530 error 4 in remap_polar_netcdf.exe[400000+16c000]
EPILOGUE-CHKNODE Wed Mar 6 18:55:56 UTC 2013 Job 16422368: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422368.bqs1.zeus.fairmont.rdhpcs.noaa.gov
.
.
.
.
PROLOGUE-CHKNODE Wed Mar 6 22:41:51 UTC 2013 Job 16433865: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16433865.bqs1.zeus.fairmont.rdhpcs.noaa.gov
LustreError: 12575:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode
406359028
LustreError: 12905:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode
406847678
EPILOGUE-CHKNODE Wed Mar 6 22:43:48 UTC 2013 Job 16433865: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16433865.bqs1.zeus.fairmont.rdhpcs.noaa.gov
.
.
.
.
PROLOGUE-CHKNODE Thu Mar 7 11:53:11 UTC 2013 Job 16479113: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16479113.bqs1.zeus.fairmont.rdhpcs.noaa.gov
LustreError: 2889:0:(lib-move.c:184:lnet_match_md()) ASSERTION(me == md->md_me)
failed
LustreError: 2889:0:(lib-move.c:184:lnet_match_md()) LBUG
Pid: 2889, comm: kiblnd_sd_04
Call Trace:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<(null)>] (null)
PGD 334320067 PUD 336739067 PMD 0
Oops: 0010 [#1] SMP
last sysfs file:
/sys/devices/pci0000:00/0000:00:09.0/0000:02:00.0/infiniband/mlx4_1/hca_type
CPU 12
Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa
mlx4_ib ib_mad iw_cxgb4 iw_cxgb3 ib_core xpmem(U) xp gru xvma(U) numatools(U)
microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma
shpchp ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs
lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb4i
cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi
scsi_transport_iscsi [last unloaded: ipmi_msghandler]

Pid: 2889, comm: kiblnd_sd_04 Not tainted 2.6.32-279.5.2.el6.x86_64 #1 SGI.COM
AltixICE8400IP105/X8DTT-HallieS
RIP: 0010:[<0000000000000000>] [<(null)>] (null)
RSP: 0018:ffff880325c159b8 EFLAGS: 00010246
RAX: ffff880325c15a1c RBX: ffff880325c15a10 RCX: ffffffffa048c320
RDX: ffff880325c15a50 RSI: ffff880325c15a10 RDI: ffff880325c14000
RBP: ffff880325c15a50 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000cbe0
R13: ffffffffa048c320 R14: 0000000000000000 R15: ffff8800282c3fc0
FS: 00007ffff7fe8700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000032c5fd000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kiblnd_sd_04 (pid: 2889, threadinfo ffff880325c14000, task ffff8803263b7540)
Stack:
ffffffff8100e520 ffff880325c15a1c ffff8803263b7540 ffffffffa04f8914
<d> 00000000a04fb8d8 ffff880325c14000 ffff880325c15fd8 ffff880325c14000
<d> 000000000000000c ffff8800282c0000 ffff880325c15a50 ffff880325c15a20
Call Trace:
[<ffffffff8100e520>] ? dump_trace+0x190/0x3b0
[<ffffffffa04808f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0480f25>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa0489696>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa04e7cdc>] lnet_match_md+0x2fc/0x350 [lnet]
[<ffffffffa048172e>] ? cfs_free+0xe/0x10 [libcfs]
[<ffffffffa04f73af>] ? lnet_nid2peer_locked+0x2f/0x540 [lnet]
[<ffffffffa04ece8a>] lnet_parse+0x108a/0x1b30 [lnet]
[<ffffffffa06ca2fb>] kiblnd_handle_rx+0x2cb/0x600 [ko2iblnd]
[<ffffffff8104f2bd>] ? check_preempt_curr+0x6d/0x90
[<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
[<ffffffffa06caea2>] kiblnd_rx_complete+0x252/0x3e0 [ko2iblnd]
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
[<ffffffffa06cb092>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
[<ffffffffa06cb3af>] kiblnd_scheduler+0x29f/0x760 [ko2iblnd]
[<ffffffff81127e40>] ? __free_pages+0x60/0xa0
[<ffffffff81060250>] ? default_wake_function+0x0/0x20
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa06cb110>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: Bad RIP value.
RIP [<(null)>] (null)
RSP <ffff880325c159b8>
CR2: 0000000000000000
~~~~~~~~~~~~~~~~~~~~~~~~~~



 Comments   
Comment by Kit Westneat (Inactive) [ 21/Mar/13 ]

Here are the messages:
http://eu.ddn.com:8080/lustre/DDN-SR19805-r7i1n8.messages.gz

Before the crash there appears to be a network problem:
Mar 11 13:39:47 r7i1n8 kernel: LustreError: 3229:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: tx_queue, 3 seconds
Mar 11 13:39:47 r7i1n8 kernel: LustreError: 3229:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 10.175.31.242@o2ib4 (65)

Here is the vmcore:
http://eu.ddn.com:8080/lustre/DDN-SR19805-r7i1n8.dump.tar.gz

Let me know if there is any more information I can provide.

Comment by Oleg Drokin [ 21/Mar/13 ]

Is this from our RPMs, or did you build it yourself? If you built it, we also need the kernel-debuginfo RPM and the lustre-modules RPM to make the crash dump useful.

Comment by Oleg Drokin [ 21/Mar/13 ]

Also, the assertion is the same as old Bugzilla 14238, which we landed a patch for quite a while ago (patch by Isaac: http://git.whamcloud.com/gitweb?p=fs/lustre-dev.git;a=commitdiff;h=85f59695534ddd167fa491c091ed64b1504cdaf7).

Comment by Doug Oucharek (Inactive) [ 22/Mar/13 ]

I'm seeing two things:

1. The "Timed out RDMA..." log indicates that we have seen no activity from 10.175.31.242@o2ib4 for 65 seconds, so we are giving up on it and closing the connection.

2. There is only one assert in lnet_match_md(), and it fires when we see something unexpected in our MD queue. I am assuming this happened as a result of the closed connection. But that is no reason to assert: if something can feasibly happen, we should not be asserting on it.

I checked the latest code tree and there seems to be some changes to how locks work in this area of code. It may be possible there is a race condition here.

I need to check with a couple of other devs more familiar with this area of code to get their opinion.

Comment by Liang Zhen (Inactive) [ 22/Mar/13 ]

First, the assertion is reasonable here: no matter what happens, @me has to equal @me->me_md->md_me unless me->me_md is NULL; that is how it is implemented. The locking changes are only in 2.3 and later, and this bug is on 1.8.8, so at the very least it is not a new race condition; and unfortunately I did not find any race even when I made those locking changes.
I believe the first time we saw this was bz11130, and there is no fix for it.
Let's see what Isaac says; my suggestion is to change the LASSERT to an LASSERTF and check whether "me", "me->me_md", and "me->me_md->md_me" have been polluted by something else.

Comment by Oleg Drokin [ 22/Mar/13 ]

Liang, there's a crashdump available, so we can check all the values there.

Comment by Kit Westneat (Inactive) [ 22/Mar/13 ]

Hi Oleg,

We built them ourselves. Here's the kernel-debug:
http://vault.centos.org/6.3/updates/x86_64/Packages/kernel-debug-2.6.32-279.5.2.el6.x86_64.rpm

Here's the lustre-modules:
http://eu.ddn.com:8080/lustre/lustre-client-modules-1.8.8-wc1_2.6.32_279.5.2.el6.x86_64_gbc88c4c.x86_64.rpm

We haven't been able to find the lustre-modules debuginfo yet; is that also needed? If so, could we try to rebuild identical modules in order to regenerate the debug info?

Thanks,
Kit

Comment by Oleg Drokin [ 22/Mar/13 ]

Thanks.
So the clients run an unmodified kernel, and we can just grab the debuginfo for it from CentOS?

As far as Lustre debug info goes, we don't strip debug symbols in 1.8 (or in 2.x prior to 2.4), so the lustre-modules RPM alone is enough; the debug symbols are embedded in the modules themselves.

Comment by Isaac Huang (Inactive) [ 22/Mar/13 ]

I believe it's a bug in the generic LNet layer, i.e. under lnet/lnet. It has happened over different LNDs in the past and shouldn't be LND-specific. My patch in Bug 14238 wasn't a real fix; it just closed a couple of cases where corruption or a dangling pointer could happen. The root cause was never found, due to lack of debug information. But this time we have a good dump.

Comment by Kit Westneat (Inactive) [ 22/Mar/13 ]

Oleg, yes, it is an unmodified kernel.

FWIW, we are also seeing this at another customer site, but only their RHEL6 frontends are affected. In that case it appears to be coincident with OOM errors, so maybe it's related to memory pressure?

Thanks

Comment by Kit Westneat (Inactive) [ 26/Mar/13 ]

I was wondering how the analysis was going. Is there anything we can do to help?

Comment by Isaac Huang (Inactive) [ 27/Mar/13 ]

Hi Kit,

The kernel-debug-2.6.32-279.5.2.el6.x86_64.rpm you pointed to is a kernel with all sorts of run-time debugging options enabled; it is not usually run on production systems. Also, when I used the vmlinux from the corresponding debuginfo RPM, crash wouldn't even start. I then tried the debuginfo RPM for the normal kernel (i.e. without the debugging options) of the same version; crash started, but with a warning about a kernel version inconsistency between the vmlinux and the dump file. Can you please verify the client kernel version and point me to where the vmlinux is?

With vmlinux from kernel-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm, I got a panic that's different from the one reported. Instead of "(lib-move.c:184:lnet_match_md()) ASSERTION(me == md->md_me) failed" in process kiblnd_sd_04, I got:
(events.c:418:ptlrpc_master_callback()) ASSERTION(callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback) failed
in kiblnd_sd_01.

Comment by Kit Westneat (Inactive) [ 27/Mar/13 ]

Hi Isaac,

It looks like the RHEL and CentOS kernel RPMs are slightly different, and they were running the RHEL one. I am fetching the RHEL debuginfo now and will update the ticket when I have uploaded it. If you happen to have an RHN account, that should work too.

Comment by Kit Westneat (Inactive) [ 28/Mar/13 ]

Here are the kerneldebug rpms:
http://eu.ddn.com:8080/lustre/kernel-debuginfo-2.6.32-279.19.1.el6.x86_64.rpm
http://eu.ddn.com:8080/lustre/kernel-debuginfo-common-x86_64-2.6.32-279.19.1.el6.x86_64.rpm

Thanks,
Kit

Comment by Isaac Huang (Inactive) [ 28/Mar/13 ]

Strange: with the RHEL debuginfo I got the same version warning, and then crash failed with other errors. I'll continue to debug with the CentOS debuginfo, which seemed to work despite the version warning. Also, it seems the original report was based on a different crash dump, because both the DATE and UPTIME in this dump differ. It would give me more data if that dump were also available; although this dump hit a different failure, it is similar in many ways.

Comment by Kit Westneat (Inactive) [ 28/Mar/13 ]

Oh doh, I uploaded the wrong debuginfo, 279.19.1 vs 279.5.2. I am uploading the correct one now and will send you the link.

Comment by Kit Westneat (Inactive) [ 28/Mar/13 ]

correct kerneldebug rpms:
http://eu.ddn.com:8080/lustre/kernel-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm
http://eu.ddn.com:8080/lustre/kernel-debuginfo-common-x86_64-2.6.32-279.5.2.el6.x86_64.rpm

Comment by Isaac Huang (Inactive) [ 28/Mar/13 ]

Thanks - this one worked perfectly, no warning at all.

Comment by Isaac Huang (Inactive) [ 29/Mar/13 ]

I've been digging through the crash dump, and it appears that the kernel's slab allocation state somehow got corrupted.

The assertion failure on the callback pointer in ptlrpc_master_callback() is an indication of memory corruption, because the 'callback' pointer in the MD object is NEVER changed by the code after initialization. It looks almost impossible for this to be the result of a race: no code changes that pointer at all. Use-after-free can also be ruled out. The MD object is bigger than a page, and the crash dump contained only the initial part of the object; the rest resided in a free page that was not included in the partial dump. If it were use-after-free, both pages should have been excluded from the dump. Moreover, the initial part of the MD object contained a correct reference counter. So I am led to believe that the SLAB allocation state got corrupted, the same chunk of memory was returned to multiple callers, and the 'callback' pointer was overwritten by someone else unknowingly.

In the past we've seen similar bugs where the root cause was never found, and the solution was to disable some experimental Lustre feature that looked suspicious. In fact, the culprit might not be Lustre at all: any kernel code is technically capable of this kind of corruption. My suggestions:

1. Kit, would it be possible to run the kernel-debug RPM (and the corresponding Lustre modules, which might need to be rebuilt) on a couple of clients where the error has been observed? The kernel-debug kernel has many SLAB and VM debugging options enabled, and I think that would catch any SLAB/VM issues much earlier and move us closer to the root cause.

2. Doug, I think it would make sense to have some Lustre folks double-check the errors at the file-system level. I was glancing over the code, and some of it didn't look good, e.g. in ll_file_join():
tail = igrab(tail_filp->f_dentry->d_inode);
igrab() can return NULL, yet the code doesn't check for that. Maybe ll_file_join() isn't even compiled; it was just something that raised my eyebrows.

Comment by Kit Westneat (Inactive) [ 01/Apr/13 ]

Hi Isaac,

Thanks for the analysis. I will try to get the debug kernel installed, but it might take a little time. Would the full vmcore be of any use? Also, this seems to appear fairly consistently on RHEL6 clients. Do you think it would be possible to look at differences in the kernel functions the Lustre client calls? Perhaps there are too many, but if at all possible it would be good to keep looking for the cause.

Comment by Oleg Drokin [ 01/Apr/13 ]

Isaac: ll_file_join is (was) a known problematic area that was removed in later versions.
It's not used by anybody anyway.

Comment by Isaac Huang (Inactive) [ 01/Apr/13 ]

Hi Kit, a vmcore with free pages included would be a bit more helpful. It also spares kdump from having to walk the data structures that track free pages, so if those structures are themselves corrupted, kdump can still create the dump; kdump may hang if asked to exclude free pages while those structures are corrupted. At this point I think kernel-debug is the best way to go: we'd be extremely lucky to find something useful by going through that much code almost blindly, like trying to walk out of a dark rain forest with no guide. The callback pointer corruption is a good indication that somebody else stepped on our toes, because our code doesn't change it at all, and I've already double-checked all the pointer arithmetic in LNet to make sure we didn't shoot ourselves in the foot.
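For reference, a hedged sketch of how the free-pages request maps onto kdump configuration, assuming the stock RHEL 6 setup in /etc/kdump.conf; the makedumpfile dump level is a bitmask, and the value 16 bit means "exclude free pages":

```shell
# /etc/kdump.conf (sketch, RHEL 6 conventions assumed).
# Dump level 31 = exclude zero, cache, user, and free pages.
# Dump level 15 = the same, but KEEP free pages in the vmcore
# (the "free pages" bit, value 16, is dropped), at the cost of
# a larger dump file.
core_collector makedumpfile -c -d 15
```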

Oleg, thanks for the comment on file join. Before the assertion happened, I saw errors like:

LustreError: 12905:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode 406847678

Does that ring any alarm bells for you?

Comment by Oleg Drokin [ 01/Apr/13 ]

Isaac, this error is a fairly common race. What it means is that the client believes a particular name is valid, yet when it tries to get the inode attributes, the file is already gone. It can happen frequently in rm vs. find/ls workloads.

Comment by Kit Westneat (Inactive) [ 09/Apr/13 ]

Isaac,

I've gotten a lot more stack traces from 1.8.9 clients. Some of them are only in the IB functions, not Lustre-related, which I find interesting. Are you aware of any memory-corruption bugs in the RHEL6 line of RDMA modules? I spent some time looking but couldn't find anything definitive. There is:

BZ#873949
Previously, the IP over Infiniband (IPoIB) driver maintained state information about neighbors on the network by attaching it to the core network's neighbor structure. However, due to a race condition between the freeing of the core network neighbor struct and the freeing of the IPoIB network struct, a use after free condition could happen, resulting in either a kernel oops or 4 or 8 bytes of kernel memory being zeroed when it was not supposed to be. These patches decouple the IPoIB neighbor struct from the core networking stack's neighbor struct so that there is no race between the freeing of one and the freeing of the other.

There is also a crash in Lustre code that doesn't seem to be related to IB:
IP: [<ffffffffa0761569>] lov_change_cbdata+0xd9/0x780 [lov]
which translates to this line:
rc = obd_change_cbdata(lov->lov_tgts[loi->loi_ost_idx]->ltd_exp,
                       &submd, it, data);

So it's very confusing. Here are all the stack traces if you would like to take a look at them:
ftp://shell.sgi.com/collect/NOAADDN

Our next step, I think, is to put the debug kernel on some of the clients and see what happens. It has been impossible to find any pattern in what triggers the crashes, so we will only reproduce it by random chance. Let me know if you see anything in the stack traces, or if you would like to see a vmcore.

Do you know if there are already any prebuilt client RPMs against the kernel-debug RPM that we could use?

Thanks.

Comment by Isaac Huang (Inactive) [ 10/Apr/13 ]

Hi, I've looked at the crashes, and they all appear to be memory corruption: bad pointer dereferences, inconsistent data structures caught by assertions, and even a BUG in mm/slab.c. They seem to have been triggered by memory pressure and the increased chance of races across the 24 CPUs.

The crashes in IB had nothing to do with Lustre: the o2iblnd never uses the IB functions involved in those crashes. In the past there had been a few memory-corruption issues fixed in OFED, like the one you pointed out, but so far no clue points at OFED; anything in kernel space could be the culprit. I still believe kernel-debug would shed some light and move us closer to the root cause. Unfortunately we don't build RPMs for kernel-debug.

Another option would be to upgrade some clients to RHEL 6.4, which includes the fix for the IB memory corruption you mentioned, but that is likely more work than trying the kernel-debug kernel on 6.3.
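The kernel-debug route could look roughly like this on a RHEL 6.3 client. The package and boot-entry names below are assumptions based on the versions in this ticket, and the Lustre configure path is hypothetical; it would need to match the installed debug kernel headers:

```shell
# Install the debug variant of the running kernel (sketch).
yum install kernel-debug-2.6.32-279.5.2.el6.x86_64

# Make the .debug entry the default boot kernel.
grubby --set-default=/boot/vmlinuz-2.6.32-279.5.2.el6.x86_64.debug

# Rebuild the Lustre client modules against the debug kernel source,
# e.g. (hypothetical path):
#   ./configure --with-linux=/usr/src/kernels/2.6.32-279.5.2.el6.x86_64.debug
#   make rpms
```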

Comment by Andreas Dilger [ 06/Nov/13 ]

Per Isaac's comments, this appears to be some form of memory corruption, possibly related to the IB code in the RHEL kernel.

Generated at Sat Feb 10 01:30:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.