Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Labels: None
- Environment: lola, build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
- Severity: 3
- Rank: 9223372036854775807
Description
The error happened during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727).
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled; HSM/Robinhood is enabled and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes are configured in an active-active HA configuration
(Nodes lola-[8,9] from a failover cluster)
The issue may be a duplicate of https://jira.hpdd.intel.com/browse/LU-8165
Sequence of events:
- 2016-08-01 09:20:35,183:fsmgmt.fsmgmt:INFO triggering fault mds_failover
- 2016-08-01 09:27:04,811:fsmgmt.fsmgmt:INFO lola-8 is up!!!
- 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9
- 2016-08-01 10:04:00 During the mount of the MDT on the secondary node, the (secondary) MDS crashed with a kernel panic:
ds. I think it's dead, and I am evicting it. exp ffff88081e31e800, cur 1470071025 expire 1470070875 last 1470070794
<3>LustreError: 6208:0:(tgt_lastrcvd.c:1053:tgt_client_del()) soaked-MDT0001: client 4294967295: bit already clear in bitmap!!
<0>LustreError: 6208:0:(tgt_lastrcvd.c:1054:tgt_client_del()) LBUG
<4>Pid: 6208, comm: ll_evictor
<4>Call Trace:
<4> [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0b6cb62>] tgt_client_del+0x5f2/0x600 [ptlrpc]
<4> [<ffffffffa11f4d8e>] mdt_obd_disconnect+0x48e/0x570 [mdt]
<4> [<ffffffffa08e538d>] class_fail_export+0x23d/0x530 [obdclass]
<4> [<ffffffffa0b23485>] ping_evictor_main+0x245/0x650 [ptlrpc]
<4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
<4> [<ffffffff810a138e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<0>Kernel panic - not syncing: LBUG
<4>Pid: 6208, comm: ll_evictor Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81539407>] ? panic+0xa7/0x16f
<4> [<ffffffffa07fcecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0b6cb62>] ? tgt_client_del+0x5f2/0x600 [ptlrpc]
<4> [<ffffffffa11f4d8e>] ? mdt_obd_disconnect+0x48e/0x570 [mdt]
<4> [<ffffffffa08e538d>] ? class_fail_export+0x23d/0x530 [obdclass]
<4> [<ffffffffa0b23485>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
<4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
<4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
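The assertion fires in tgt_client_del() while the ping evictor tears down the client export, and the reported client id 4294967295 is (__u32)-1, which points to a last_rcvd slot index that was never assigned or had already been released. Below is a minimal, hypothetical sketch in standalone C (not the actual Lustre code in lustre/target/tgt_lastrcvd.c; the structure and function names are illustrative only) of the invariant being enforced: each in-use client owns exactly one set bit in the target's client bitmap, and deleting a client must find that bit set.
{noformat}
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified model of a target's last_rcvd client slots.
 * This is NOT the Lustre implementation, just an illustration of the
 * invariant whose violation produces the LBUG seen above. */
#define MAX_CLIENTS 128

struct tgt_model {
        unsigned char client_bitmap[MAX_CLIENTS / 8];
};

static int bit_test_and_clear(unsigned char *map, unsigned int idx)
{
        unsigned char mask = 1u << (idx % 8);
        int was_set = (map[idx / 8] & mask) != 0;

        map[idx / 8] &= ~mask;
        return was_set;
}

/* Delete a client from the target's slot bitmap.  The slot must still be
 * marked in use; an index of (__u32)-1 (printed as "client 4294967295")
 * can never satisfy that, so the equivalent check fails and asserts. */
static void tgt_client_del_model(struct tgt_model *tgt, unsigned int lr_idx)
{
        if (lr_idx >= MAX_CLIENTS ||
            !bit_test_and_clear(tgt->client_bitmap, lr_idx)) {
                fprintf(stderr, "client %u: bit already clear in bitmap!!\n",
                        lr_idx);
                assert(0);      /* corresponds to the LBUG */
        }
}

int main(void)
{
        struct tgt_model tgt;

        memset(&tgt, 0, sizeof(tgt));
        tgt.client_bitmap[0] = 0x01;            /* slot 0 in use */
        tgt_client_del_model(&tgt, 0);          /* fine: slot was set */
        tgt_client_del_model(&tgt, UINT_MAX);   /* reproduces the failure mode */
        return 0;
}
{noformat}
In this model, the second call hits the same "bit already clear in bitmap" path, which matches the double-delete or uninitialized-index state the server asserted on.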
I couldn't extract the Lustre debug log from the kernel dump:
      KERNEL: usr/lib/debug/lib/modules/2.6.32-573.26.1.el6_lustre.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2016-08-01-10:04:00/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Aug 1 10:03:45 2016
      UPTIME: 2 days, 19:40:09
LOAD AVERAGE: 16.98, 16.25, 16.97
       TASKS: 1536
    NODENAME: lola-9.lola.whamcloud.com
     RELEASE: 2.6.32-573.26.1.el6_lustre.x86_64
     VERSION: #1 SMP Tue Jul 26 04:04:13 PDT 2016
     MACHINE: x86_64 (2693 Mhz)
      MEMORY: 31.9 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 6208
     COMMAND: "ll_evictor"
        TASK: ffff880413fe2040 [THREAD_INFO: ffff880413fec000]
         CPU: 25
       STATE: TASK_RUNNING (PANIC)

crash> extend /scratch/crash_lustre/lustre.so
/scratch/crash_lustre/lustre.so: shared object loaded
crash> lustre -l /scratch/lola-9-latest-crash.bin
lustre_walk_cpus(0, 5, 1)
cmd p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages
// p (*cfs_trace_data[0])[0].tcd.tcd_pages.next
lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
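For context on what the failed gdb expression is trying to do: cfs_trace_data is libcfs's per-CPU array of trace buffers, and the extension walks each CPU's tcd_pages list (using tcd_cur_pages as the page count) to reassemble the debug log from the vmcore. The following is a minimal, hypothetical C sketch of that kind of walk over simplified stand-in structures; the field names tcd_pages and tcd_cur_pages are taken from the expression above, but the layout is illustrative only and not the real libcfs definitions.
{noformat}
#include <stdio.h>

/* Simplified, illustrative stand-ins for the libcfs trace structures.
 * Only the two fields referenced by the failed gdb expression are modelled. */
struct trace_page {
        struct trace_page *next;        /* stand-in for a list_head link */
        char data[64];                  /* fragment of debug-log text */
};

struct cfs_trace_cpu_data {
        struct trace_page *tcd_pages;   /* head of this CPU's page list */
        unsigned long tcd_cur_pages;    /* number of pages on the list */
};

/* Walk one CPU's trace page list and dump the text, roughly the way a
 * crash extension would reassemble the debug log from a memory dump. */
static void dump_cpu_trace(const struct cfs_trace_cpu_data *tcd, int cpu)
{
        const struct trace_page *pg = tcd->tcd_pages;
        unsigned long n;

        for (n = 0; pg != NULL && n < tcd->tcd_cur_pages; n++, pg = pg->next)
                printf("cpu%d page%lu: %s\n", cpu, n, pg->data);
}

int main(void)
{
        struct trace_page p1 = { NULL, "second fragment" };
        struct trace_page p0 = { &p1, "first fragment" };
        struct cfs_trace_cpu_data tcd = { &p0, 2 };

        dump_cpu_trace(&tcd, 0);
        return 0;
}
{noformat}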
Attached files:
messages, console, and vmcore-dmesg.txt logs of node lola-9.
Attachments
Activity
Resolution | New: Cannot Reproduce [ 5 ]
Status | Original: Open [ 1 ] | New: Resolved [ 5 ]
Link | New: This issue is related to ST-66 [ ST-66 ]
Priority | Original: Blocker [ 1 ] | New: Major [ 3 ]
Assignee | Original: WC Triage [ wc-triage ] | New: Mikhail Pershin [ tappro ]
Fix Version/s | New: Lustre 2.9.0 [ 11891 ]
Description | Edited: the sequence-of-events entry "* 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO * 2016-08-01 11:28:20,152:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-9 (NOTE: this took 2 hours)" was changed to "* 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9"; the rest of the description is identical to the current description above.
Attachment | New: messages-lola-9.log.bz2 [ 22440 ]
Attachment | New: console-lola-9.log.bz2 [ 22441 ]
Attachment | New: vmcore-dmesg.txt.bz2 [ 22442 ]
OK, then let's close out the ticket for now and reopen it if the issue ever reoccurs.