Lustre / LU-6326

list_del corruption causes MDS crash

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 17712

    Description

      Hi,

      While unmounting an MDT, we hit the following bug, which crashed the server:

      <4>------------[ cut here ]------------
      <4>WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
      <4>Hardware name: bullx
      <4>list_del corruption. prev->next should be ffff88030b092a90, but was 303054534f2d7077
      <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf ipmi_si ipmi_msghandler acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_ib(U) ib_sa(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin scsi_dh_emc dm_multipath mic(U) uinput raid0 ses enclosure serio_raw compat(U) cxgb3 mdio lpfc scsi_transport_fc scsi_tgt igb i2c_algo_bit i2c_core ptp pps_core sg lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom aacraid ahci usb_storage ata_generic pata_jmicron dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
      <4>Pid: 24459, comm: umount Not tainted 2.6.32-504.8.1.el6.Bull.70.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
      <4> [<ffffffff81074ee6>] ? warn_slowpath_fmt+0x46/0x50
      <4> [<ffffffff8129f0de>] ? list_del+0x6e/0xa0
      <4> [<ffffffff811a8372>] ? d_kill+0x22/0x60
      <4> [<ffffffff811a8716>] ? __shrink_dcache_sb+0x366/0x3c0
      <4> [<ffffffff811a8962>] ? shrink_dcache_sb+0x12/0x20
      <4> [<ffffffffa0e53f76>] ? osd_umount+0x56/0x150 [osd_ldiskfs]
      <4> [<ffffffffa0e5b2f7>] ? osd_device_fini+0x147/0x190 [osd_ldiskfs]
      <4> [<ffffffffa0745a73>] ? class_cleanup+0x573/0xd30 [obdclass]
      <4> [<ffffffffa071c0a6>] ? class_name2dev+0x56/0xe0 [obdclass]
      <4> [<ffffffffa074779a>] ? class_process_config+0x156a/0x1ad0 [obdclass]
      <4> [<ffffffffa07408f3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
      <4> [<ffffffffa0747e79>] ? class_manual_cleanup+0x179/0x6f0 [obdclass]
      <4> [<ffffffffa071a57b>] ? class_export_put+0x10b/0x2c0 [obdclass]
      <4> [<ffffffffa0e5d8d5>] ? osd_obd_disconnect+0x1c5/0x1d0 [osd_ldiskfs]
      <4> [<ffffffffa074a41b>] ? lustre_put_lsi+0x1ab/0x11a0 [obdclass]
      <4> [<ffffffffa07529c8>] ? lustre_common_put_super+0x5d8/0xbf0 [obdclass]
      <4> [<ffffffffa07830dd>] ? server_put_super+0x1bd/0xf60 [obdclass]
      <4> [<ffffffff811909bb>] ? generic_shutdown_super+0x5b/0xe0
      <4> [<ffffffff81190aa6>] ? kill_anon_super+0x16/0x60
      <4> [<ffffffffa0749d26>] ? lustre_kill_super+0x36/0x60 [obdclass]
      <4> [<ffffffff81191247>] ? deactivate_super+0x57/0x80
      <4> [<ffffffff811b0e7f>] ? mntput_no_expire+0xbf/0x110
      <4> [<ffffffff811b19cb>] ? sys_umount+0x7b/0x3a0
      <4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      <4>---[ end trace 08665e30da4e39f3 ]---
      <4>------------[ cut here ]------------
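The warning above comes from the kernel's list debugging code (lib/list_debug.c): before unlinking an entry, list_del() checks that both neighbours still point back at it, and "prev->next should be X, but was Y" means the previous node's next pointer no longer does. Below is a minimal userspace sketch of that neighbour check, assuming a simplified model of struct list_head; it is not the kernel's code, and the helpers list_init, checked_list_del, decode_le64 and demo are hypothetical names for illustration. It also decodes the bogus value 303054534f2d7077 from the log into the 8 bytes it occupies in memory on a little-endian x86_64 host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified userspace model of the kernel's struct list_head
 * (an intrusive, circular, doubly linked list node). */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical helper: make `head` an empty circular list. */
static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

/* Insert `entry` right after `head` (same semantics as the kernel's list_add). */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Models the neighbour checks behind the "list_del corruption" warning:
 * both neighbours of `entry` must still point back at it.
 * Returns 0 after unlinking, or -1 (without unlinking) on corruption. */
static int checked_list_del(struct list_head *entry)
{
    if (entry->prev->next != entry) {
        printf("list_del corruption. prev->next should be %p, but was %p\n",
               (void *)entry, (void *)entry->prev->next);
        return -1;
    }
    if (entry->next->prev != entry) {
        printf("list_del corruption. next->prev should be %p, but was %p\n",
               (void *)entry, (void *)entry->next->prev);
        return -1;
    }
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = NULL;
    return 0;
}

/* Writes the 8 bytes of `value` in little-endian memory order (as laid out
 * on x86_64) into `out`, followed by a terminating NUL. */
static void decode_le64(uint64_t value, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((value >> (8 * i)) & 0xff);
    out[8] = '\0';
}

/* Builds head -> a -> b -> head; if `corrupt` is set, overwrites a.next
 * (which is b's prev->next) with a stray string, then tries to delete b.
 * Returns the result of checked_list_del(). */
static int demo(int corrupt)
{
    struct list_head head, a, b;

    list_init(&head);
    list_add(&b, &head);   /* head -> b */
    list_add(&a, &head);   /* head -> a -> b */
    if (corrupt)
        memcpy(&a.next, "wp-OST00", sizeof a.next);

    int rc = checked_list_del(&b);
    if (rc == 0)
        assert(head.next == &a && a.next == &head && head.prev == &a);
    return rc;
}
```

With this sketch, decode_le64(0x303054534f2d7077ULL, buf) yields the printable string "wp-OST00", and on a little-endian 64-bit host demo(1) reports "but was 0x303054534f2d7077", the same value as in the log. That the overwriting bytes are ASCII resembling a fragment of a Lustre OST target name hints that a string was written over a neighbouring list node, but this is only an observation on the dump, not a confirmed root cause.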
      

      Here is the backtrace of the faulty process:

      crash> bt
      PID: 24459 TASK: ffff880338436040 CPU: 1 COMMAND: "umount"
       #0 [ffff8800282239c0] machine_kexec at ffffffff8103b71b
       #1 [ffff880028223a20] crash_kexec at ffffffff810c9852
       #2 [ffff880028223af0] oops_end at ffffffff8152ec30
       #3 [ffff880028223b20] no_context at ffffffff8104c80b
       #4 [ffff880028223b70] __bad_area_nosemaphore at ffffffff8104ca95
       #5 [ffff880028223bc0] bad_area_nosemaphore at ffffffff8104cb63
       #6 [ffff880028223bd0] __do_page_fault at ffffffff8104d2bf
       #7 [ffff880028223cf0] do_page_fault at ffffffff81530b7e
       #8 [ffff880028223d20] page_fault at ffffffff8152df35
          [exception RIP: _spin_lock+14]
          RIP: ffffffff8152da1e RSP: ffff880028223dd0 RFLAGS: 00010002
          RAX: 0000000000010000 RBX: 0000000000006450 RCX: 000000000019ae76
          RDX: 0000000000005444 RSI: ffff88030b092740 RDI: 0000000000000040
          RBP: ffff880028223dd0 R8: ffff88002822d140 R9: 0001407b60dde621
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000282
          R13: ffff88030b092740 R14: ffff88063cd50480 R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
       #9 [ffff880028223dd8] kmem_cache_free at ffffffff81175b0f
      #10 [ffff880028223e48] __d_free at ffffffff811a82cf
      #11 [ffff880028223e68] d_callback at ffffffff811a8c12
      #12 [ffff880028223e78] __rcu_process_callbacks at ffffffff810f05f5
      #13 [ffff880028223ec8] rcu_process_callbacks at ffffffff810f083b
      #14 [ffff880028223ed8] __do_softirq at ffffffff8107d8b1
      #15 [ffff880028223f48] call_softirq at ffffffff8100c30c
      #16 [ffff880028223f60] do_softirq at ffffffff8100fb55
      #17 [ffff880028223f80] irq_exit at ffffffff8107d765
      #18 [ffff880028223f90] smp_apic_timer_interrupt at ffffffff815347ca
      #19 [ffff880028223fb0] apic_timer_interrupt at ffffffff8100bb93
      --- <IRQ stack> ---
      #20 [ffff8802326db798] apic_timer_interrupt at ffffffff8100bb93
          [exception RIP: vprintk+593]
          RIP: ffffffff81075d51 RSP: ffff8802326db848 RFLAGS: 00000246
          RAX: 0000000000010500 RBX: ffff8802326db8d8 RCX: 00000000000049b5
          RDX: ffff880028220000 RSI: 0000000000000046 RDI: 0000000000000246
          RBP: ffffffff8100bb8e R8: 0000000000000000 R9: ffffffff81649020
          R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81074f95
          R13: ffff8802326db7d8 R14: 0000000000000046 R15: 0000000000000025
          ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
      #21 [ffff8802326db8e0] printk at ffffffff81529f80
      #22 [ffff8802326db940] warn_slowpath_common at ffffffff81074dae
      #23 [ffff8802326db980] warn_slowpath_fmt at ffffffff81074ee6
      #24 [ffff8802326db9e0] list_del at ffffffff8129f0de
      #25 [ffff8802326dba00] d_kill at ffffffff811a8372
      #26 [ffff8802326dba20] __shrink_dcache_sb at ffffffff811a8716
      #27 [ffff8802326dbac0] shrink_dcache_sb at ffffffff811a8962
      #28 [ffff8802326dbad0] osd_umount at ffffffffa0e53f76 [osd_ldiskfs]
      #29 [ffff8802326dbaf0] osd_device_fini at ffffffffa0e5b2f7 [osd_ldiskfs]
      #30 [ffff8802326dbb10] class_cleanup at ffffffffa0745a73 [obdclass]
      #31 [ffff8802326dbb90] class_process_config at ffffffffa074779a [obdclass]
      #32 [ffff8802326dbc20] class_manual_cleanup at ffffffffa0747e79 [obdclass]
      #33 [ffff8802326dbce0] osd_obd_disconnect at ffffffffa0e5d8d5 [osd_ldiskfs]
      #34 [ffff8802326dbd20] lustre_put_lsi at ffffffffa074a41b [obdclass]
      #35 [ffff8802326dbd50] lustre_common_put_super at ffffffffa07529c8 [obdclass]
      #36 [ffff8802326dbdc0] server_put_super at ffffffffa07830dd [obdclass]
      #37 [ffff8802326dbe30] generic_shutdown_super at ffffffff811909bb
      #38 [ffff8802326dbe50] kill_anon_super at ffffffff81190aa6
      #39 [ffff8802326dbe70] lustre_kill_super at ffffffffa0749d26 [obdclass]
      #40 [ffff8802326dbe90] deactivate_super at ffffffff81191247
      #41 [ffff8802326dbeb0] mntput_no_expire at ffffffff811b0e7f
      #42 [ffff8802326dbee0] sys_umount at ffffffff811b19cb
      #43 [ffff8802326dbf80] system_call_fastpath at ffffffff8100b072
          RIP: 00007f42273d0997 RSP: 00007fff3f71d3a8 RFLAGS: 00010202
          RAX: 00000000000000a6 RBX: ffffffff8100b072 RCX: 00000000fbad2498
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f4229e83be0
          RBP: 00007f4229e83bc0 R8: 00007f4229e83c00 R9: 0000000000000000
          R10: 00007fff3f71d140 R11: 0000000000000246 R12: 0000000000000000
          R13: 0000000000000000 R14: 0000000000000000 R15: 00007f4229e83c70
          ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b
      

      Sebastien.

          People

            Assignee: Bruno Faccini (Inactive)
            Reporter: Sebastien Buisson (Inactive)