  1. Lustre
  2. LU-1086

Several crashes triggered in key_fini, related to list corruption

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • None
    • lustre 2.1
    • 3
    • 6466

    Description

      Over the past 3 days, we hit several crashes with the following backtraces on 2 different MDSes:

       
      crash> bt
      PID: 13838  TASK: ffff88107c3ea0c0  CPU: 0   COMMAND: "jbd2/dm-1-8"
       #0 [ffff880fd3a4b740] machine_kexec at ffffffff81027a4b
       #1 [ffff880fd3a4b7a0] crash_kexec at ffffffff810a2db2
       #2 [ffff880fd3a4b870] oops_end at ffffffff81481730
       #3 [ffff880fd3a4b8a0] no_context at ffffffff81031d1b
       #4 [ffff880fd3a4b8f0] __bad_area_nosemaphore at ffffffff81031fa5
       #5 [ffff880fd3a4b940] bad_area_nosemaphore at ffffffff81032073
       #6 [ffff880fd3a4b950] __do_page_fault at ffffffff810326fd
       #7 [ffff880fd3a4ba70] do_page_fault at ffffffff8148373e
       #8 [ffff880fd3a4baa0] page_fault at ffffffff81480ac5
          [exception RIP: kmem_cache_free+123]
          RIP: ffffffff81146c5b  RSP: ffff880fd3a4bb50  RFLAGS: 00010086
          RAX: ffffeae3808b7d30  RBX: ffff88085645f000  RCX: 0000000000000000
          RDX: ffffea0000000000  RSI: ffffc90027daa01c  RDI: ffffc90027daa01c
          RBP: ffff880fd3a4bbb0   R8: 0000000000000000   R9: 5a5a5a5a5a5a5a5a
          R10: 5a5a5a5a5a5a5a5a  R11: 5a5a5a5a5a5a5a5a  R12: 0000000000000286
          R13: ffffc90027daa01c  R14: ffff88185d934500  R15: ffff880802c85560
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880fd3a4bbb8] cfs_mem_cache_free at ffffffffa054887e [libcfs]
      #10 [ffff880fd3a4bbc8] osc_key_fini at ffffffffa0876e11 [osc]
      #11 [ffff880fd3a4bc18] key_fini at ffffffffa0610b89 [obdclass]
      #12 [ffff880fd3a4bc48] keys_fini at ffffffffa0610ccf [obdclass]
      #13 [ffff880fd3a4bc98] lu_context_fini at ffffffffa0610ddd [obdclass]
      #14 [ffff880fd3a4bcb8] osd_trans_commit_cb at ffffffffa0aab6c2 [osd_ldiskfs]
      #15 [ffff880fd3a4bd18] jbd2_journal_commit_transaction at ffffffffa00693a3 [jbd2]
      #16 [ffff880fd3a4be68] kjournald2 at ffffffffa006ec28 [jbd2]
      #17 [ffff880fd3a4bee8] kthread at ffffffff81079f36
      #18 [ffff880fd3a4bf48] kernel_thread at ffffffff810041aa
      

      or

      crash> bt
      PID: 18628  TASK: ffff88085b3a1180  CPU: 0   COMMAND: "jbd2/dm-19-8"
       #0 [ffff8807fd4e7740] machine_kexec at ffffffff81027a2b
       #1 [ffff8807fd4e77a0] crash_kexec at ffffffff810a3a52
       #2 [ffff8807fd4e7870] oops_end at ffffffff8147f680
       #3 [ffff8807fd4e78a0] no_context at ffffffff81031ddb
       #4 [ffff8807fd4e78f0] __bad_area_nosemaphore at ffffffff81032065
       #5 [ffff8807fd4e7940] bad_area_nosemaphore at ffffffff81032133
       #6 [ffff8807fd4e7950] __do_page_fault at ffffffff810327bd
       #7 [ffff8807fd4e7a70] do_page_fault at ffffffff8148168e
       #8 [ffff8807fd4e7aa0] page_fault at ffffffff8147ea15
          [exception RIP: kmem_cache_free+123]
          RIP: ffffffff811465eb  RSP: ffff8807fd4e7b50  RFLAGS: 00010086
          RAX: ffffeae380a76130  RBX: ffff88073ce4c000  RCX: 0000000000000000
          RDX: ffffea0000000000  RSI: ffffc9002fd2a01c  RDI: ffffc9002fd2a01c
          RBP: ffff8807fd4e7bb0   R8: 0000000000000000   R9: 5a5a5a5a5a5a5a5a
          R10: 5a5a5a5a5a5a5a5a  R11: 5a5a5a5a5a5a5a5a  R12: 0000000000000286
          R13: ffffc9002fd2a01c  R14: ffff882059a850c0  R15: ffff8817d6207c90
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff8807fd4e7bb8] cfs_mem_cache_free at ffffffffa04e087e [libcfs]
      #10 [ffff8807fd4e7bc8] lov_key_fini at ffffffffa090f811 [lov]
      #11 [ffff8807fd4e7c18] key_fini at ffffffffa05a7a39 [obdclass]
      #12 [ffff8807fd4e7c48] keys_fini at ffffffffa05a7b7f [obdclass]
      #13 [ffff8807fd4e7c98] lu_context_fini at ffffffffa05a7c8d [obdclass]
      #14 [ffff8807fd4e7cb8] osd_trans_commit_cb at ffffffffa0a406c2 [osd_ldiskfs]
      #15 [ffff8807fd4e7d18] jbd2_journal_commit_transaction at ffffffffa005927b [jbd2]
      #16 [ffff8807fd4e7e68] kjournald2 at ffffffffa005eb48 [jbd2]
      #17 [ffff8807fd4e7ee8] kthread at ffffffff8107ad36
      #18 [ffff8807fd4e7f48] kernel_thread at ffffffff810041aa
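
The two backtraces can be compared frame by frame. The snippet below is a minimal sketch, not part of the original report: it parses frames #9 to #14 as copied from the `crash> bt` output above and lists the frames whose function or module differs. The names `FRAME_RE`, `frames` and `differing` are illustrative only.

```python
import re

# Frames #9 to #14 copied verbatim from the two backtraces above.
BT_FIRST = """
 #9 [ffff880fd3a4bbb8] cfs_mem_cache_free at ffffffffa054887e [libcfs]
#10 [ffff880fd3a4bbc8] osc_key_fini at ffffffffa0876e11 [osc]
#11 [ffff880fd3a4bc18] key_fini at ffffffffa0610b89 [obdclass]
#12 [ffff880fd3a4bc48] keys_fini at ffffffffa0610ccf [obdclass]
#13 [ffff880fd3a4bc98] lu_context_fini at ffffffffa0610ddd [obdclass]
#14 [ffff880fd3a4bcb8] osd_trans_commit_cb at ffffffffa0aab6c2 [osd_ldiskfs]
"""
BT_SECOND = """
 #9 [ffff8807fd4e7bb8] cfs_mem_cache_free at ffffffffa04e087e [libcfs]
#10 [ffff8807fd4e7bc8] lov_key_fini at ffffffffa090f811 [lov]
#11 [ffff8807fd4e7c18] key_fini at ffffffffa05a7a39 [obdclass]
#12 [ffff8807fd4e7c48] keys_fini at ffffffffa05a7b7f [obdclass]
#13 [ffff8807fd4e7c98] lu_context_fini at ffffffffa05a7c8d [obdclass]
#14 [ffff8807fd4e7cb8] osd_trans_commit_cb at ffffffffa0a406c2 [osd_ldiskfs]
"""

# One frame line: "#N [stack addr] function at text addr [module]".
FRAME_RE = re.compile(r"#\d+ \[[0-9a-f]+\] (\w+) at [0-9a-f]+(?: \[(\w+)\])?")


def frames(bt_text):
    """Return (function, module) pairs in the order they appear."""
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(bt_text)]


first, second = frames(BT_FIRST), frames(BT_SECOND)

# Keep only the positions where the two traces disagree.
differing = [(a, b) for a, b in zip(first, second) if a != b]
print(differing)
```

On these frames the only difference is the key destructor at frame #10 (`osc_key_fini` in one crash, `lov_key_fini` in the other); the path through `key_fini`, `keys_fini`, `lu_context_fini` and `osd_trans_commit_cb` is the same in both.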
      

      In the second case, there are many __list_add corruption warnings in the dmesg log.

      The first one (as far as I can see in the dmesg log buffer):

      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: G        W  ---------------- T)
      Hardware name: bullx super-node
      list_add corruption. prev->next should be next (ffffc9003197e01c), but was ffff880583a54ab8. (prev=ffff880583a54ab8).
      Modules linked in: iptable_filter ip_tables cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fid(U) fld(U) lov(U) lquota(U) osc(U) fsfilt_ldiskfs(U) exportfs mgc(U) ldiskfs(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin dm_multipath usbhid hid ghes i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma lpfc scsi_transport_fc scsi_tgt hed sg igb dca ext4 jbd2 sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]
      Pid: 18758, comm: mdt_63 Tainted: G        W  ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
      Call Trace:
       [<ffffffff810540b7>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff810541a6>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffff81267d3f>] ? __list_add+0x8f/0xa0
       [<ffffffffa05a9111>] ? lu_object_put+0x161/0x1f0 [obdclass]
       [<ffffffffa09e5c08>] ? mdt_getattr_name_lock+0xf08/0x1a40 [mdt]
       [<ffffffffa06c75bb>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
       [<ffffffffa069bb54>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
       [<ffffffffa09e6cfa>] ? mdt_intent_getattr+0x32a/0x500 [mdt]
       [<ffffffffa09e01e7>] ? mdt_unpack_req_pack_rep+0x297/0x5d0 [mdt]
       [<ffffffffa04ef625>] ? cfs_hash_bd_lookup_intent+0xe5/0x130 [libcfs]
       [<ffffffffa069cf50>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
       [<ffffffffa09e4790>] ? mdt_intent_policy+0x3c0/0x6b0 [mdt]
       [<ffffffff81042890>] ? fair_enqueue_task_fair+0x190/0x350
       [<ffffffffa0587521>] ? class_handle_hash+0xa1/0x280 [obdclass]
       [<ffffffffa0654afa>] ? ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
       [<ffffffffa0673305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
       [<ffffffffa04ee692>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
       [<ffffffffa067b227>] ? ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
       [<ffffffffa09dffa1>] ? mdt_unpack_req_pack_rep+0x51/0x5d0 [mdt]
       [<ffffffffa09e430a>] ? mdt_enqueue+0x4a/0x110 [mdt]
       [<ffffffffa09e0df5>] ? mdt_handle_common+0x8d5/0x1810 [mdt]
       [<ffffffffa06992d4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
       [<ffffffffa09e1e05>] ? mdt_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa06aa019>] ? ptlrpc_main+0xc79/0x19d0 [ptlrpc]
       [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
       [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
       [<ffffffff810041aa>] ? child_rip+0xa/0x20
       [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
       [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
       [<ffffffff810041a0>] ? child_rip+0x0/0x20
      ---[ end trace b8f1465c05250f4c ]---
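
For readers unfamiliar with this warning, the following is a small Python model of the sanity check that the debug version of `__list_add` performs before linking a new entry. It is a sketch based on an assumption about how the 2.6.32-era `lib/list_debug.c` behaves, not a copy of the kernel code; `ListHead` and `list_add_checked` are hypothetical names. The demo re-initializes a node that is still linked, which is only one possible way to end up with `prev->next` pointing back at `prev` as in the warning above, and is not a diagnosis of this bug.

```python
class ListHead:
    """Stand-in for the kernel's struct list_head (self-linked when empty)."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def list_add_checked(new, prev, nxt):
    """Link `new` between `prev` and `nxt`, reporting inconsistencies first."""
    warnings = []
    if nxt.prev is not prev:
        warnings.append(
            f"list_add corruption. next->prev should be prev ({prev.name}), "
            f"but was {nxt.prev.name}. (next={nxt.name})."
        )
    if prev.next is not nxt:
        warnings.append(
            f"list_add corruption. prev->next should be next ({nxt.name}), "
            f"but was {prev.next.name}. (prev={prev.name})."
        )
    # The insertion proceeds even after a warning, as a WARN does not abort.
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new
    return warnings


head = ListHead("head")
a = ListHead("a")

# Tail insertion on a consistent list: no warning expected.
clean = list_add_checked(a, head.prev, head)

# Hypothetical corruption: `a` is re-initialized while still on the list.
a.next = a
a.prev = a

b = ListHead("b")
warns = list_add_checked(b, head.prev, head)
print(warns)
```

In this model the second insertion reports that `prev->next` was `a` with `prev=a`, the same shape as the first warning in the log, where the reported value and `prev` are both ffff880583a54ab8.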
      

      The last one, just before the oops:

      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: G        W  ---------------- T)
      Hardware name: bullx super-node
      list_add corruption. prev->next should be next (ffffc9002fd2a01c), but was (null). (prev=ffff88179c9cd1b8).
      Modules linked in: iptable_filter ip_tables cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fid(U) fld(U) lov(U) lquota(U) osc(U) fsfilt_ldiskfs(U) exportfs mgc(U) ldiskfs(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin dm_multipath usbhid hid ghes i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma lpfc scsi_transport_fc scsi_tgt hed sg igb dca ext4 jbd2 sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]
      Pid: 18750, comm: mdt_55 Tainted: G        W  ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
      Call Trace:
       [<ffffffff810540b7>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff810541a6>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffff81267d3f>] ? __list_add+0x8f/0xa0
       [<ffffffffa05a9111>] ? lu_object_put+0x161/0x1f0 [obdclass]
       [<ffffffffa09e5c08>] ? mdt_getattr_name_lock+0xf08/0x1a40 [mdt]
       [<ffffffffa06c75bb>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
       [<ffffffffa069bb54>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
       [<ffffffffa09e6cfa>] ? mdt_intent_getattr+0x32a/0x500 [mdt]
       [<ffffffffa09e01e7>] ? mdt_unpack_req_pack_rep+0x297/0x5d0 [mdt]
       [<ffffffffa04ef5ab>] ? cfs_hash_bd_lookup_intent+0x6b/0x130 [libcfs]
       [<ffffffffa069cf50>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
       [<ffffffffa09e4790>] ? mdt_intent_policy+0x3c0/0x6b0 [mdt]
       [<ffffffff81042890>] ? fair_enqueue_task_fair+0x190/0x350
       [<ffffffffa0587521>] ? class_handle_hash+0xa1/0x280 [obdclass]
       [<ffffffffa0654afa>] ? ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
       [<ffffffffa0673305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
       [<ffffffffa04ee692>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
       [<ffffffffa067b227>] ? ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
       [<ffffffffa09dffa1>] ? mdt_unpack_req_pack_rep+0x51/0x5d0 [mdt]
       [<ffffffffa09e430a>] ? mdt_enqueue+0x4a/0x110 [mdt]
       [<ffffffffa09e0df5>] ? mdt_handle_common+0x8d5/0x1810 [mdt]
       [<ffffffffa06992d4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
       [<ffffffffa09e1e05>] ? mdt_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa06aa019>] ? ptlrpc_main+0xc79/0x19d0 [ptlrpc]
       [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
       [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
       [<ffffffff810041aa>] ? child_rip+0xa/0x20
       [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
       [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
       [<ffffffff810041a0>] ? child_rip+0x0/0x20
      ---[ end trace b8f1465c05250f81 ]---
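
The addresses in the two warnings can be cross-checked against the register dump of the second crash. The snippet below is a rough sketch, not part of the original report: it parses the two `list_add corruption` lines and the registers copied from the `jbd2/dm-19-8` backtrace, then reports which registers hold the `next` address and whether `prev->next` pointed back at `prev`. All names in it are illustrative.

```python
import re

# The two warning lines quoted above.
WARNINGS = [
    "list_add corruption. prev->next should be next (ffffc9003197e01c), "
    "but was ffff880583a54ab8. (prev=ffff880583a54ab8).",
    "list_add corruption. prev->next should be next (ffffc9002fd2a01c), "
    "but was (null). (prev=ffff88179c9cd1b8).",
]

# General-purpose registers from the second crash (jbd2/dm-19-8).
REGISTERS = """
RAX: ffffeae380a76130  RBX: ffff88073ce4c000  RCX: 0000000000000000
RDX: ffffea0000000000  RSI: ffffc9002fd2a01c  RDI: ffffc9002fd2a01c
RBP: ffff8807fd4e7bb0   R8: 0000000000000000   R9: 5a5a5a5a5a5a5a5a
R10: 5a5a5a5a5a5a5a5a  R11: 5a5a5a5a5a5a5a5a  R12: 0000000000000286
R13: ffffc9002fd2a01c  R14: ffff882059a850c0  R15: ffff8817d6207c90
"""

WARN_RE = re.compile(
    r"prev->next should be next \(([0-9a-f]+)\), "
    r"but was (\(null\)|[0-9a-f]+)\. \(prev=([0-9a-f]+)\)"
)
REG_RE = re.compile(r"\b(R\w+): ([0-9a-f]{16})")


def parse_warning(line):
    """Extract the expected next, the observed prev->next, and prev."""
    nxt, was, prev = WARN_RE.search(line).groups()
    return {"next": nxt, "was": None if was == "(null)" else was, "prev": prev}


regs = dict(REG_RE.findall(REGISTERS))

report = []
for w in map(parse_warning, WARNINGS):
    holders = [name for name, value in regs.items() if value == w["next"]]
    self_linked = w["was"] == w["prev"]  # prev->next pointed back at prev
    report.append((w["next"], holders, self_linked))

for entry in report:
    print(entry)
```

On this data the `next` address of the last warning (ffffc9002fd2a01c) is the value held in RSI, RDI and R13 when `kmem_cache_free` faulted, while the first warning's `next` does not appear in the registers and its `prev->next` equals `prev`. This only establishes that the addresses coincide; what that means for the root cause is left open.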
      

      Alex.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: louveta Alexandre Louvet (Inactive)