Lustre / LU-8467

MDS crashed with (tgt_lastrcvd.c:1054:tgt_client_del()) LBUG

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • Lustre 2.9.0
    • None
    • lola
      build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
    • 3
    • 9223372036854775807

    Description

      The error occurred during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
      OSTs formatted with ZFS, MDSs formatted with ldiskfs
      DNE is enabled, HSM/robinhood enabled and integrated
      4 MDSs with 1 MDT / MDS
      6 OSSs with 4 OSTs / OSS
      Server nodes configured in active-active HA configuration
      (Nodes lola-[8,9] form a failover cluster)

      The issue is possibly a duplicate of https://jira.hpdd.intel.com/browse/LU-8165

      Sequence of events:

      • 2016-08-01 09:20:35,183:fsmgmt.fsmgmt:INFO triggering fault mds_failover
      • 2016-08-01 09:27:04,811:fsmgmt.fsmgmt:INFO lola-8 is up!!!
      • 2016-08-01 09:27:15,825:fsmgmt.fsmgmt:INFO started mount of MDT0000 on lola-9
      • 2016-08-01 10:04:00 While the MDT was being mounted on the secondary node, the secondary MDS crashed with a kernel panic:
        ds. I think it's dead, and I am evicting it. exp ffff88081e31e800, cur 1470071025 expire 1470070875 last 1470070794
        <3>LustreError: 6208:0:(tgt_lastrcvd.c:1053:tgt_client_del()) soaked-MDT0001: client 4294967295: bit already clear in bitmap!!
        <0>LustreError: 6208:0:(tgt_lastrcvd.c:1054:tgt_client_del()) LBUG
        <4>Pid: 6208, comm: ll_evictor
        <4>
        <4>Call Trace:
        <4> [<ffffffffa07fc875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        <4> [<ffffffffa07fce77>] lbug_with_loc+0x47/0xb0 [libcfs]
        <4> [<ffffffffa0b6cb62>] tgt_client_del+0x5f2/0x600 [ptlrpc]
        <4> [<ffffffffa11f4d8e>] mdt_obd_disconnect+0x48e/0x570 [mdt]
        <4> [<ffffffffa08e538d>] class_fail_export+0x23d/0x530 [obdclass]
        <4> [<ffffffffa0b23485>] ping_evictor_main+0x245/0x650 [ptlrpc]
        <4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
        <4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
        <4> [<ffffffff810a138e>] kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        <4>
        <0>Kernel panic - not syncing: LBUG
        <4>Pid: 6208, comm: ll_evictor Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
        <4>Call Trace:
        <4> [<ffffffff81539407>] ? panic+0xa7/0x16f
        <4> [<ffffffffa07fcecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
        <4> [<ffffffffa0b6cb62>] ? tgt_client_del+0x5f2/0x600 [ptlrpc]
        <4> [<ffffffffa11f4d8e>] ? mdt_obd_disconnect+0x48e/0x570 [mdt]
        <4> [<ffffffffa08e538d>] ? class_fail_export+0x23d/0x530 [obdclass]
        <4> [<ffffffffa0b23485>] ? ping_evictor_main+0x245/0x650 [ptlrpc]
        <4> [<ffffffff81067650>] ? default_wake_function+0x0/0x20
        <4> [<ffffffffa0b23240>] ? ping_evictor_main+0x0/0x650 [ptlrpc]
        <4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
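
The two LustreError lines above show the evictor thread tripping an assertion: the slot of the client being removed was already clear in the target's client bitmap. The following is a simplified Python model of that kind of test-and-clear invariant, not the Lustre source; `ClientBitmap`, its methods and the slot count are hypothetical names chosen for illustration. It also shows that the reported client index 4294967295 equals 2^32 - 1, the value a 32-bit unsigned field holds when set to -1. That may hint that the slot index was never assigned, but this is an interpretation and is not confirmed in this ticket.

```python
class ClientBitmap:
    """Toy model of a per-target bitmap of in-use client slots (hypothetical)."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.bits = 0  # bit i set => slot i is in use

    def add(self, idx):
        self.bits |= 1 << idx

    def test_and_clear(self, idx):
        """Clear bit idx and return True if it was set before the call."""
        mask = 1 << idx
        was_set = bool(self.bits & mask)
        self.bits &= ~mask
        return was_set

    def delete_client(self, idx):
        # Deleting a slot that is not marked in use breaks the invariant.
        # An out-of-range index is treated like a clear bit in this model.
        # The server log shows a fatal LBUG here; the model raises instead.
        if idx >= self.nslots or not self.test_and_clear(idx):
            raise AssertionError(f"client {idx}: bit already clear in bitmap!!")


# -1 stored in a 32-bit unsigned field, as printed in the log line
UINT32_MINUS_ONE = (-1) & 0xFFFFFFFF

bitmap = ClientBitmap(nslots=1024)  # slot count is an arbitrary assumption
bitmap.add(7)
bitmap.delete_client(7)             # fine: bit 7 was set

try:
    bitmap.delete_client(7)         # second delete: bit is already clear
    double_delete_failed = False
except AssertionError as err:
    double_delete_failed = True
    print(err)

print(UINT32_MINUS_ONE)
```

In this model a double delete and a delete with an unassigned index fail the same way, which is why the log message alone does not tell the two cases apart.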
        

      I couldn't extract the Lustre debug log from the kernel dump:

            KERNEL: usr/lib/debug/lib/modules/2.6.32-573.26.1.el6_lustre.x86_64/vmlinux
          DUMPFILE: 127.0.0.1-2016-08-01-10:04:00/vmcore  [PARTIAL DUMP]
              CPUS: 32
              DATE: Mon Aug  1 10:03:45 2016
            UPTIME: 2 days, 19:40:09
      LOAD AVERAGE: 16.98, 16.25, 16.97
             TASKS: 1536
          NODENAME: lola-9.lola.whamcloud.com
           RELEASE: 2.6.32-573.26.1.el6_lustre.x86_64
           VERSION: #1 SMP Tue Jul 26 04:04:13 PDT 2016
           MACHINE: x86_64  (2693 Mhz)
            MEMORY: 31.9 GB
             PANIC: "Kernel panic - not syncing: LBUG"
               PID: 6208
           COMMAND: "ll_evictor"
              TASK: ffff880413fe2040  [THREAD_INFO: ffff880413fec000]
               CPU: 25
             STATE: TASK_RUNNING (PANIC)
      
      crash> extend /scratch/crash_lustre/lustre.so
      /scratch/crash_lustre/lustre.so: shared object loaded
      
      crash> lustre -l /scratch/lola-9-latest-crash.bin
      lustre_walk_cpus(0, 5, 1)
      cmd p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages // p (*cfs_trace_data[0])[0].tcd.tcd_pages.next 
      lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
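
As a cross-check between the console log and the dump header, the epoch values in the eviction message can be compared with the DATE field reported by crash. The sketch below assumes the node's local time was PDT (UTC-7), matching the time zone in the kernel build stamp; that assumption and the variable names are illustrative only. The meaning of the `cur`, `expire` and `last` fields is also assumed from their names, so only raw differences are printed.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the eviction message in the console log.
cur, expire, last = 1470071025, 1470070875, 1470070794

# Assumption: node local time is PDT (UTC-7), as in the kernel build stamp.
PDT = timezone(timedelta(hours=-7), "PDT")

# Convert the "cur" epoch value to local time for comparison with DATE.
cur_local = datetime.fromtimestamp(cur, PDT)
print("cur as local time:", cur_local.isoformat())

# Raw differences between the three timestamps, in seconds.
print("cur - last   =", cur - last)
print("cur - expire =", cur - expire)
print("expire - last =", expire - last)
```

Under the UTC-7 assumption, `cur` converts to 2016-08-01 10:03:45, the same instant as the DATE field of the dump, so the eviction message and the panic appear to belong to the same second.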
      

      Attached files:
      message, console, vmcore-dmesg.txt of node lola-9.

          People

            tappro Mikhail Pershin
            heckes Frank Heckes (Inactive)