[LU-8467] MDS crashed with (tgt_lastrcvd.c:1054:tgt_client_del()) LBUG Created: 02/Aug/16  Updated: 17/Oct/16  Resolved: 17/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Frank Heckes (Inactive) Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

lola
build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168


Attachments: File console-lola-9.log.bz2     File messages-lola-9.log.bz2     File vmcore-dmesg.txt.bz2    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happened during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled, HSM/Robinhood enabled and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes configured in active-active HA configuration
(Nodes lola-[8,9] form a failover cluster)

This issue is possibly a duplicate of https://jira.hpdd.intel.com/browse/LU-8165.

Sequence of events:

  • 2016-08-01 09:20:35,183:fsmgmt.fsmgmt:INFO triggering fault mds_failover
  • 2016-08-01 09:27:04,811:fsmgmt.fsmgmt:INFO lola-8 is up!!!
  • 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9
  • 2016-08-01 10:04:00 During the mount of the MDT on the secondary node, the (secondary) MDS crashed with a kernel panic (the first console line below is truncated; a simplified sketch of the failing check follows the trace):
    ds. I think it's dead, and I am evicting it. exp ffff88081e31e800, cur 1470071025 expire 1470070875 last 1470070794
    <3>LustreError: 6208:0:(tgt_lastrcvd.c:1053:tgt_client_del()) soaked-MDT0001: client 4294967295: bit already clear in bitmap!!
    <0>LustreError: 6208:0:(tgt_lastrcvd.c:1054:tgt_client_del()) LBUG
    <4>Pid: 6208, comm: ll_evictor
    <4>
    <4>Call Trace:
    <4> [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    <4> [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
    <4> [<ffffffffa0b6cb62>] tgt_client_del+0x5f2/0x600 [ptlrpc]
    <4> [<ffffffffa11f4d8e>] mdt_obd_disconnect+0x48e/0x570 [mdt]
    <4> [<ffffffffa08e538d>] class_fail_export+0x23d/0x530 [obdclass]
    <4> [<ffffffffa0b23485>] ping_evictor_main+0x245/0x650 [ptlrpc]
    <4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
    <4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
    <4> [<ffffffff810a138e>] kthread+0x9e/0xc0
    <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
    <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
    <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
    <4>
    <0>Kernel panic - not syncing: LBUG
    <4>Pid: 6208, comm: ll_evictor Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
    <4>Call Trace:
    <4> [<ffffffff81539407>] ? panic+0xa7/0x16f
    <4> [<ffffffffa07fcecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
    <4> [<ffffffffa0b6cb62>] ? tgt_client_del+0x5f2/0x600 [ptlrpc]
    <4> [<ffffffffa11f4d8e>] ? mdt_obd_disconnect+0x48e/0x570 [mdt]
    <4> [<ffffffffa08e538d>] ? class_fail_export+0x23d/0x530 [obdclass]
    <4> [<ffffffffa0b23485>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
    <4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
    <4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
    <4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
    <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
    <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
    <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
    
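The stack shows the ping-evictor path (ping_evictor_main → class_fail_export → mdt_obd_disconnect → tgt_client_del), and the LBUG fires because the export's bit in the target's last_rcvd client bitmap is already clear at disconnect time. Note that the reported client index 4294967295 is (__u32)-1, which suggests the export's slot index was unassigned (or already torn down) when the evictor disconnected it during the failover mount. Below is a minimal, self-contained C sketch of this slot-bookkeeping pattern; it is illustrative only, not the Lustre source, and all names apart from tgt_client_del()/LBUG() are made up:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define LR_MAX_CLIENTS  4096
    #define BITS_PER_LONG   (8 * sizeof(unsigned long))
    #define LR_IDX_NONE     ((uint32_t)-1)      /* prints as 4294967295 with %u */

    /* One bit per connected client, as in the target's last_rcvd bitmap. */
    static unsigned long client_bitmap[LR_MAX_CLIENTS / BITS_PER_LONG];

    /* Mimics the kernel's test_and_clear_bit(): clears the bit and
     * returns its previous value. */
    static int test_and_clear_slot(uint32_t idx)
    {
            unsigned long *word = &client_bitmap[idx / BITS_PER_LONG];
            unsigned long mask  = 1UL << (idx % BITS_PER_LONG);
            int old = (*word & mask) != 0;

            *word &= ~mask;
            return old;
    }

    /* Simplified stand-in for tgt_client_del(): each disconnect must free
     * a slot that is currently allocated; anything else is a logic error. */
    static void client_del(uint32_t idx)
    {
            if (idx == LR_IDX_NONE || !test_and_clear_slot(idx)) {
                    /* This is the condition behind the crash: the slot is
                     * already free (or was never assigned), so deleting it
                     * again is fatal. */
                    fprintf(stderr, "client %u: bit already clear in bitmap!!\n",
                            idx);
                    abort();    /* stands in for LBUG() */
            }
    }

    int main(void)
    {
            uint32_t idx = 42;

            client_bitmap[idx / BITS_PER_LONG] |= 1UL << (idx % BITS_PER_LONG);
            client_del(idx);    /* first disconnect: fine */
            client_del(idx);    /* double disconnect: triggers the "LBUG" */
            return 0;
    }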

I couldn't extract the Lustre debug log from the kernel dump:

      KERNEL: usr/lib/debug/lib/modules/2.6.32-573.26.1.el6_lustre.x86_64/vmlinux
    DUMPFILE: 127.0.0.1-2016-08-01-10:04:00/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Aug  1 10:03:45 2016
      UPTIME: 2 days, 19:40:09
LOAD AVERAGE: 16.98, 16.25, 16.97
       TASKS: 1536
    NODENAME: lola-9.lola.whamcloud.com
     RELEASE: 2.6.32-573.26.1.el6_lustre.x86_64
     VERSION: #1 SMP Tue Jul 26 04:04:13 PDT 2016
     MACHINE: x86_64  (2693 Mhz)
      MEMORY: 31.9 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 6208
     COMMAND: "ll_evictor"
        TASK: ffff880413fe2040  [THREAD_INFO: ffff880413fec000]
         CPU: 25
       STATE: TASK_RUNNING (PANIC)

crash> extend /scratch/crash_lustre/lustre.so
/scratch/crash_lustre/lustre.so: shared object loaded

crash> lustre -l /scratch/lola-9-latest-crash.bin
lustre_walk_cpus(0, 5, 1)
cmd p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages // p (*cfs_trace_data[0])[0].tcd.tcd_pages.next 
lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
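The "gdb request failed" message typically means crash's embedded gdb has no debug symbols for libcfs, so cfs_trace_data cannot be resolved. A possible next step (a sketch only; the debuginfo path is a placeholder, assuming a lustre-debuginfo package that ships libcfs.ko.debug) would be to load the module's symbols with crash's mod command and retry:

crash> mod -s libcfs <path-to-debuginfo>/libcfs.ko.debug
crash> lustre -l /scratch/lola-9-latest-crash.bin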

Attached files:
messages, console, and vmcore-dmesg.txt of node lola-9.



 Comments   
Comment by Frank Heckes (Inactive) [ 02/Aug/16 ]

Crash file has been saved to lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-8467/lola-9/127.0.0.1-2016-08-01-10\:04\:00/

Comment by Peter Jones [ 10/Sep/16 ]

Does this issue still occur now LU-8165 has landed?

Comment by Cliff White (Inactive) [ 17/Oct/16 ]

We have not seen this issue reappear since running the tip of 2.9; we are continuing to test.

Comment by Peter Jones [ 17/Oct/16 ]

OK, then let's close out the ticket for now and reopen it if it ever reoccurs.
