Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- None
- lola
  build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
- 3
- 9223372036854775807
Description
The error happened during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727).
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled, HSM/robinhood enabled and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes configured in active-active HA configuration
(nodes lola-[8,9] form a failover cluster)
The issue is possibly a duplicate of https://jira.hpdd.intel.com/browse/LU-8165.
Sequence of events:
- 2016-08-01 09:20:35,183:fsmgmt.fsmgmt:INFO triggering fault mds_failover
- 2016-08-01 09:27:04,811:fsmgmt.fsmgmt:INFO lola-8 is up!!!
- 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9
- 2016-08-01 10:04:00 During the mount of the MDT on the secondary node, the (secondary) MDS crashed with a kernel panic:
ds. I think it's dead, and I am evicting it. exp ffff88081e31e800, cur 1470071025 expire 1470070875 last 1470070794
LustreError: 6208:0:(tgt_lastrcvd.c:1053:tgt_client_del()) soaked-MDT0001: client 4294967295: bit already clear in bitmap!!
LustreError: 6208:0:(tgt_lastrcvd.c:1054:tgt_client_del()) LBUG
Pid: 6208, comm: ll_evictor

Call Trace:
 [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0b6cb62>] tgt_client_del+0x5f2/0x600 [ptlrpc]
 [<ffffffffa11f4d8e>] mdt_obd_disconnect+0x48e/0x570 [mdt]
 [<ffffffffa08e538d>] class_fail_export+0x23d/0x530 [obdclass]
 [<ffffffffa0b23485>] ping_evictor_main+0x245/0x650 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 6208, comm: ll_evictor Tainted: P           --  ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81539407>] ? panic+0xa7/0x16f
 [<ffffffffa07fcecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0b6cb62>] ? tgt_client_del+0x5f2/0x600 [ptlrpc]
 [<ffffffffa11f4d8e>] ? mdt_obd_disconnect+0x48e/0x570 [mdt]
 [<ffffffffa08e538d>] ? class_fail_export+0x23d/0x530 [obdclass]
 [<ffffffffa0b23485>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
 [<ffffffff810a138e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
I couldn't extract the Lustre debug log from the kernel dump:
      KERNEL: usr/lib/debug/lib/modules/2.6.32-573.26.1.el6_lustre.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2016-08-01-10:04:00/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Aug 1 10:03:45 2016
      UPTIME: 2 days, 19:40:09
LOAD AVERAGE: 16.98, 16.25, 16.97
       TASKS: 1536
    NODENAME: lola-9.lola.whamcloud.com
     RELEASE: 2.6.32-573.26.1.el6_lustre.x86_64
     VERSION: #1 SMP Tue Jul 26 04:04:13 PDT 2016
     MACHINE: x86_64  (2693 Mhz)
      MEMORY: 31.9 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 6208
     COMMAND: "ll_evictor"
        TASK: ffff880413fe2040  [THREAD_INFO: ffff880413fec000]
         CPU: 25
       STATE: TASK_RUNNING (PANIC)

crash> extend /scratch/crash_lustre/lustre.so
/scratch/crash_lustre/lustre.so: shared object loaded
crash> lustre -l /scratch/lola-9-latest-crash.bin
lustre_walk_cpus(0, 5, 1)
cmd p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages // p (*cfs_trace_data[0])[0].tcd.tcd_pages.next
lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
Attached files:
message, console, vmcore-dmesg.txt of node lola-9.