Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Labels: None
- Environment: lola, build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
- Severity: 3
- Rank: 9223372036854775807
Description
The error happened during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727).
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled; HSM/Robinhood is enabled and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes are configured in an active-active HA configuration
(Nodes lola-[8,9] from a failover cluster)
The issue may be a duplicate of https://jira.hpdd.intel.com/browse/LU-8165
Sequence of events:
- 2016-08-01 09:20:35,183:fsmgmt.fsmgmt:INFO triggering fault mds_failover
- 2016-08-01 09:27:04,811:fsmgmt.fsmgmt:INFO lola-8 is up!!!
- 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9
- 2016-08-01 10:04:00 During the mount of the MDT on the secondary node, the (secondary) MDS crashed with a kernel panic:
ds. I think it's dead, and I am evicting it. exp ffff88081e31e800, cur 1470071025 expire 1470070875 last 1470070794
<3>LustreError: 6208:0:(tgt_lastrcvd.c:1053:tgt_client_del()) soaked-MDT0001: client 4294967295: bit already clear in bitmap!!
<0>LustreError: 6208:0:(tgt_lastrcvd.c:1054:tgt_client_del()) LBUG
<4>Pid: 6208, comm: ll_evictor
<4>Call Trace:
<4> [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0b6cb62>] tgt_client_del+0x5f2/0x600 [ptlrpc]
<4> [<ffffffffa11f4d8e>] mdt_obd_disconnect+0x48e/0x570 [mdt]
<4> [<ffffffffa08e538d>] class_fail_export+0x23d/0x530 [obdclass]
<4> [<ffffffffa0b23485>] ping_evictor_main+0x245/0x650 [ptlrpc]
<4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
<4> [<ffffffff810a138e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<0>Kernel panic - not syncing: LBUG
<4>Pid: 6208, comm: ll_evictor Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81539407>] ? panic+0xa7/0x16f
<4> [<ffffffffa07fcecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0b6cb62>] ? tgt_client_del+0x5f2/0x600 [ptlrpc]
<4> [<ffffffffa11f4d8e>] ? mdt_obd_disconnect+0x48e/0x570 [mdt]
<4> [<ffffffffa08e538d>] ? class_fail_export+0x23d/0x530 [obdclass]
<4> [<ffffffffa0b23485>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
<4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
<4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
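The assertion fires in tgt_client_del() while the ping evictor tears down the client export, and the reported client id 4294967295 is (__u32)-1, which points to a last_rcvd slot index that was never assigned or had already been released. Below is a minimal, hypothetical sketch in standalone C (not the actual Lustre code in lustre/target/tgt_lastrcvd.c; the structure and function names are illustrative only) of the invariant being enforced: each in-use client owns exactly one set bit in the target's client bitmap, and deleting a client must find that bit set.
{noformat}
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified model of a target's last_rcvd client slots.
 * This is NOT the Lustre implementation, just an illustration of the
 * invariant whose violation produces the LBUG seen above. */
#define MAX_CLIENTS 128

struct tgt_model {
        unsigned char client_bitmap[MAX_CLIENTS / 8];
};

static int bit_test_and_clear(unsigned char *map, unsigned int idx)
{
        unsigned char mask = 1u << (idx % 8);
        int was_set = (map[idx / 8] & mask) != 0;

        map[idx / 8] &= ~mask;
        return was_set;
}

/* Delete a client from the target's slot bitmap.  The slot must still be
 * marked in use; an index of (__u32)-1 (printed as "client 4294967295")
 * can never satisfy that, so the equivalent check fails and asserts. */
static void tgt_client_del_model(struct tgt_model *tgt, unsigned int lr_idx)
{
        if (lr_idx >= MAX_CLIENTS ||
            !bit_test_and_clear(tgt->client_bitmap, lr_idx)) {
                fprintf(stderr, "client %u: bit already clear in bitmap!!\n",
                        lr_idx);
                assert(0);      /* corresponds to the LBUG */
        }
}

int main(void)
{
        struct tgt_model tgt;

        memset(&tgt, 0, sizeof(tgt));
        tgt.client_bitmap[0] = 0x01;            /* slot 0 in use */
        tgt_client_del_model(&tgt, 0);          /* fine: slot was set */
        tgt_client_del_model(&tgt, UINT_MAX);   /* reproduces the failure mode */
        return 0;
}
{noformat}
In this model, the second call hits the same "bit already clear in bitmap" path, which matches the double-delete or uninitialized-index state the server asserted on.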
I couldn't extract the Lustre debug log from the kernel dump:
      KERNEL: usr/lib/debug/lib/modules/2.6.32-573.26.1.el6_lustre.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2016-08-01-10:04:00/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Aug 1 10:03:45 2016
      UPTIME: 2 days, 19:40:09
LOAD AVERAGE: 16.98, 16.25, 16.97
       TASKS: 1536
    NODENAME: lola-9.lola.whamcloud.com
     RELEASE: 2.6.32-573.26.1.el6_lustre.x86_64
     VERSION: #1 SMP Tue Jul 26 04:04:13 PDT 2016
     MACHINE: x86_64 (2693 Mhz)
      MEMORY: 31.9 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 6208
     COMMAND: "ll_evictor"
        TASK: ffff880413fe2040 [THREAD_INFO: ffff880413fec000]
         CPU: 25
       STATE: TASK_RUNNING (PANIC)

crash> extend /scratch/crash_lustre/lustre.so
/scratch/crash_lustre/lustre.so: shared object loaded
crash> lustre -l /scratch/lola-9-latest-crash.bin
lustre_walk_cpus(0, 5, 1)
cmd p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages
// p (*cfs_trace_data[0])[0].tcd.tcd_pages.next
lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
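For context on what the failed gdb expression is trying to do: cfs_trace_data is libcfs's per-CPU array of trace buffers, and the extension walks each CPU's tcd_pages list (using tcd_cur_pages as the page count) to reassemble the debug log from the vmcore. The following is a minimal, hypothetical C sketch of that kind of walk over simplified stand-in structures; the field names tcd_pages and tcd_cur_pages are taken from the expression above, but the layout is illustrative only and not the real libcfs definitions.
{noformat}
#include <stdio.h>

/* Simplified, illustrative stand-ins for the libcfs trace structures.
 * Only the two fields referenced by the failed gdb expression are modelled. */
struct trace_page {
        struct trace_page *next;        /* stand-in for a list_head link */
        char data[64];                  /* fragment of debug-log text */
};

struct cfs_trace_cpu_data {
        struct trace_page *tcd_pages;   /* head of this CPU's page list */
        unsigned long tcd_cur_pages;    /* number of pages on the list */
};

/* Walk one CPU's trace page list and dump the text, roughly the way a
 * crash extension would reassemble the debug log from a memory dump. */
static void dump_cpu_trace(const struct cfs_trace_cpu_data *tcd, int cpu)
{
        const struct trace_page *pg = tcd->tcd_pages;
        unsigned long n;

        for (n = 0; pg != NULL && n < tcd->tcd_cur_pages; n++, pg = pg->next)
                printf("cpu%d page%lu: %s\n", cpu, n, pg->data);
}

int main(void)
{
        struct trace_page p1 = { NULL, "second fragment" };
        struct trace_page p0 = { &p1, "first fragment" };
        struct cfs_trace_cpu_data tcd = { &p0, 2 };

        dump_cpu_trace(&tcd, 0);
        return 0;
}
{noformat}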
Attached files:
messages, console, and vmcore-dmesg.txt logs of node lola-9.
Attachments
Activity
Resolution | New: Cannot Reproduce [ 5 ]
Status | Original: Open [ 1 ] | New: Resolved [ 5 ]
Link | New: This issue is related to ST-66 [ ST-66 ]
Priority | Original: Blocker [ 1 ] | New: Major [ 3 ]
Assignee | Original: WC Triage [ wc-triage ] | New: Mikhail Pershin [ tappro ]
Fix Version/s | New: Lustre 2.9.0 [ 11891 ]
Description | Edited: the sequence-of-events entry "* 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO * 2016-08-01 11:28:20,152:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-9 (NOTE: this took 2 hours)" was changed to "* 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9"; the rest of the description is identical to the current description above.
Attachment | New: messages-lola-9.log.bz2 [ 22440 ]
Attachment | New: console-lola-9.log.bz2 [ 22441 ]
Attachment | New: vmcore-dmesg.txt.bz2 [ 22442 ]
OK, then let's close out the ticket for now and reopen it if the issue ever reoccurs.