Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 1.8.8
-
None
-
3
-
7331
Description
After upgrading the clients at NOAA to RHEL6 and Lustre 1.8.8, we're running into an issue where we are seeing kernel panics. Here is the analysis by Redhat:
I have checked the provided vmcore and below is my diagnosis on same . . It
seems like libcfs module is leading to crash due to NULL pointer dereference.
libcfs is a 3rd party module not shipped by Red Hat and probably it is provided
by Lustre. Please check with Lustre application vendor for further investigation
on this. Below is more Detail analysis from vmcore . .
Go through below shared analysis and contact Lustre team . . If there is any
further assistance required from OS side do let us know. .
~~~~~~~~~~~~~~~~~~~~~~~~~~
[1] OOPS at "kiblnd_sd_04"
KERNEL: vmlinux
DUMPFILE: vmcore.flat [PARTIAL DUMP]
CPUS: 24
DATE: Thu Mar 7 07:10:37 2013
UPTIME: 1 days, 11:34:39
LOAD AVERAGE: 6.40, 6.05, 6.59
TASKS: 655
NODENAME: r10i0n5
RELEASE: 2.6.32-279.5.2.el6.x86_64
VERSION: #1 SMP Tue Aug 14 11:36:39 EDT 2012
MACHINE: x86_64 (3466 Mhz)
MEMORY: 24 GB
PANIC: "Oops: 0010 1 SMP " (check log for details)
PID: 2889
COMMAND: "kiblnd_sd_04"
TASK: ffff8803263b7540 [THREAD_INFO: ffff880325c14000]
CPU: 12
STATE: TASK_RUNNING (PANIC)
- Following PID was the one which probably HIT the panic.
crash> ps | grep 2889
> 2889 2 12 ffff8803263b7540 RU 0.0 0 0 [kiblnd_sd_04]
~~~~~~~~~~~~~~~~~~~~~~~~~~
[2] Seems like there was some resource constraint for the concern PID 2889.
crash> bt
PID: 2889 TASK: ffff8803263b7540 CPU: 12 COMMAND: "kiblnd_sd_04"
#0 [ffff880325c155a0] machine_kexec at ffffffff8103281b
#1 [ffff880325c15600] crash_kexec at ffffffff810ba8e2
#2 [ffff880325c156d0] oops_end at ffffffff81501510
#3 [ffff880325c15700] no_context at ffffffff81043bab
#4 [ffff880325c15750] __bad_area_nosemaphore at ffffffff81043e35
#5 [ffff880325c157a0] bad_area_nosemaphore at ffffffff81043f03
#6 [ffff880325c157b0] __do_page_fault at ffffffff81044661
#7 [ffff880325c158d0] do_page_fault at ffffffff815034ee
#8 [ffff880325c15900] page_fault at ffffffff815008a5
#9 [ffff880325c15a58] libcfs_debug_dumpstack at ffffffffa04808f5 [libcfs]
#10 [ffff880325c15a78] lbug_with_loc at ffffffffa0480f25 [libcfs]
#11 [ffff880325c15ac8] libcfs_assertion_failed at ffffffffa0489696 [libcfs]
#12 [ffff880325c15b18] lnet_match_md at ffffffffa04e7cdc [lnet]
#13 [ffff880325c15be8] lnet_parse at ffffffffa04ece8a [lnet]
#14 [ffff880325c15ce8] kiblnd_handle_rx at ffffffffa06ca2fb [ko2iblnd]
#15 [ffff880325c15d78] kiblnd_rx_complete at ffffffffa06caea2 [ko2iblnd]
#16 [ffff880325c15df8] kiblnd_complete at ffffffffa06cb092 [ko2iblnd]
#17 [ffff880325c15e38] kiblnd_scheduler at ffffffffa06cb3af [ko2iblnd]
#18 [ffff880325c15f48] kernel_thread at ffffffff8100c14a
~~~~~~~~~~~~~~~~~~~~~~~~~~
[3] I see few segfaults and LustreError prior to crash . .
crash> log
PROLOGUE-CHKNODE Wed Mar 6 18:55:45 UTC 2013 Job 16422368: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422368.bqs1.zeus.fairmont.rdhp
cs.noaa.gov
remap_polar_net[778]: segfault at 100000009 ip 0000000000410f6c sp
00007fffffffd4f0 error 4 in remap_polar_netcdf.exe[400000+16c000]
EPILOGUE-CHKNODE Wed Mar 6 18:55:48 UTC 2013 Job 16422363: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422363.bqs1.zeus.fairmont.rdhp
cs.noaa.gov
EPILOGUE-CHKNODE Wed Mar 6 18:55:51 UTC 2013 Job 16422308: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422308.bqs1.zeus.fairmont.rdhp
cs.noaa.gov
remap_polar_net[1382]: segfault at 100000009 ip 0000000000410f6c sp
00007fffffffd530 error 4 in remap_polar_netcdf.exe[400000+16c000]
EPILOGUE-CHKNODE Wed Mar 6 18:55:56 UTC 2013 Job 16422368: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16422368.bqs1.zeus.fairmont.rdhpcs.noaa.gov
.
.
.
.
PROLOGUE-CHKNODE Wed Mar 6 22:41:51 UTC 2013 Job 16433865: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16433865.bqs1.zeus.fairmont.rdhpcs.noaa.gov
LustreError: 12575:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode
406359028
LustreError: 12905:0:(file.c:3331:ll_inode_revalidate_fini()) failure -2 inode
406847678
EPILOGUE-CHKNODE Wed Mar 6 22:43:48 UTC 2013 Job 16433865: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16433865.bqs1.zeus.fairmont.rdhpcs.noaa.gov
.
.
.
.
PROLOGUE-CHKNODE Thu Mar 7 11:53:11 UTC 2013 Job 16479113: chk_node.pl
2.3-ad-2012.06.22. Starting Checks ... 16479113.bqs1.zeus.fairmont.rdhpcs.noaa.gov
LustreError: 2889:0:(lib-move.c:184:lnet_match_md()) ASSERTION(me == md->md_me)
failed
LustreError: 2889:0:(lib-move.c:184:lnet_match_md()) LBUG
Pid: 2889, comm: kiblnd_sd_04
Call Trace:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<(null)>] (null)
PGD 334320067 PUD 336739067 PMD 0
Oops: 0010 1 SMP
last sysfs file:
/sys/devices/pci0000:00/0000:00:09.0/0000:02:00.0/infiniband/mlx4_1/hca_type
CPU 12
Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa
mlx4_ib ib_mad iw_cxgb4 iw_cxgb3 ib_core xpmem(U) xp gru xvma(U) numatools(U)
microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma
shpchp ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs
lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb4i
cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi
scsi_transport_iscsi [last unloaded: ipmi_msghandler]
Pid: 2889, comm: kiblnd_sd_04 Not tainted 2.6.32-279.5.2.el6.x86_64 #1 SGI.COM
AltixICE8400IP105/X8DTT-HallieS
RIP: 0010:[<0000000000000000>] [<(null)>] (null)
RSP: 0018:ffff880325c159b8 EFLAGS: 00010246
RAX: ffff880325c15a1c RBX: ffff880325c15a10 RCX: ffffffffa048c320
RDX: ffff880325c15a50 RSI: ffff880325c15a10 RDI: ffff880325c14000
RBP: ffff880325c15a50 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000cbe0
R13: ffffffffa048c320 R14: 0000000000000000 R15: ffff8800282c3fc0
FS: 00007ffff7fe8700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000032c5fd000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kiblnd_sd_04 (pid: 2889, threadinfo ffff880325c14000, task ffff8803263b7540)
Stack:
ffffffff8100e520 ffff880325c15a1c ffff8803263b7540 ffffffffa04f8914
<d> 00000000a04fb8d8 ffff880325c14000 ffff880325c15fd8 ffff880325c14000
<d> 000000000000000c ffff8800282c0000 ffff880325c15a50 ffff880325c15a20
Call Trace:
[<ffffffff8100e520>] ? dump_trace+0x190/0x3b0
[<ffffffffa04808f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0480f25>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa0489696>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa04e7cdc>] lnet_match_md+0x2fc/0x350 [lnet]
[<ffffffffa048172e>] ? cfs_free+0xe/0x10 [libcfs]
[<ffffffffa04f73af>] ? lnet_nid2peer_locked+0x2f/0x540 [lnet]
[<ffffffffa04ece8a>] lnet_parse+0x108a/0x1b30 [lnet]
[<ffffffffa06ca2fb>] kiblnd_handle_rx+0x2cb/0x600 [ko2iblnd]
[<ffffffff8104f2bd>] ? check_preempt_curr+0x6d/0x90
[<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
[<ffffffffa06caea2>] kiblnd_rx_complete+0x252/0x3e0 [ko2iblnd]
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff8104e309>] ? __wake_up_common+0x59/0x90
[<ffffffffa06cb092>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
[<ffffffffa06cb3af>] kiblnd_scheduler+0x29f/0x760 [ko2iblnd]
[<ffffffff81127e40>] ? __free_pages+0x60/0xa0
[<ffffffff81060250>] ? default_wake_function+0x0/0x20
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa06cb110>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: Bad RIP value.
RIP [<(null)>] (null)
RSP <ffff880325c159b8>
CR2: 0000000000000000
~~~~~~~~~~~~~~~~~~~~~~~~~~
Attachments
Issue Links
- is duplicated by
-
LU-1734 ptlrpc_master_callback() LBUG
-
- Closed
-
Per Isaac's comments, this appears to be some form of memory corruption, possibly related to the IB code in the RHEL kernel.