Details
- Bug
- Resolution: Duplicate
- Critical
- None
- None
- None
- lustre 2.1
- 3
- 6466
Description
Over the past 3 days, we hit several crashes on 2 different MDSes, with the following backtraces:
crash> bt
PID: 13838  TASK: ffff88107c3ea0c0  CPU: 0  COMMAND: "jbd2/dm-1-8"
 #0 [ffff880fd3a4b740] machine_kexec at ffffffff81027a4b
 #1 [ffff880fd3a4b7a0] crash_kexec at ffffffff810a2db2
 #2 [ffff880fd3a4b870] oops_end at ffffffff81481730
 #3 [ffff880fd3a4b8a0] no_context at ffffffff81031d1b
 #4 [ffff880fd3a4b8f0] __bad_area_nosemaphore at ffffffff81031fa5
 #5 [ffff880fd3a4b940] bad_area_nosemaphore at ffffffff81032073
 #6 [ffff880fd3a4b950] __do_page_fault at ffffffff810326fd
 #7 [ffff880fd3a4ba70] do_page_fault at ffffffff8148373e
 #8 [ffff880fd3a4baa0] page_fault at ffffffff81480ac5
    [exception RIP: kmem_cache_free+123]
    RIP: ffffffff81146c5b  RSP: ffff880fd3a4bb50  RFLAGS: 00010086
    RAX: ffffeae3808b7d30  RBX: ffff88085645f000  RCX: 0000000000000000
    RDX: ffffea0000000000  RSI: ffffc90027daa01c  RDI: ffffc90027daa01c
    RBP: ffff880fd3a4bbb0   R8: 0000000000000000   R9: 5a5a5a5a5a5a5a5a
    R10: 5a5a5a5a5a5a5a5a  R11: 5a5a5a5a5a5a5a5a  R12: 0000000000000286
    R13: ffffc90027daa01c  R14: ffff88185d934500  R15: ffff880802c85560
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880fd3a4bbb8] cfs_mem_cache_free at ffffffffa054887e [libcfs]
#10 [ffff880fd3a4bbc8] osc_key_fini at ffffffffa0876e11 [osc]
#11 [ffff880fd3a4bc18] key_fini at ffffffffa0610b89 [obdclass]
#12 [ffff880fd3a4bc48] keys_fini at ffffffffa0610ccf [obdclass]
#13 [ffff880fd3a4bc98] lu_context_fini at ffffffffa0610ddd [obdclass]
#14 [ffff880fd3a4bcb8] osd_trans_commit_cb at ffffffffa0aab6c2 [osd_ldiskfs]
#15 [ffff880fd3a4bd18] jbd2_journal_commit_transaction at ffffffffa00693a3 [jbd2]
#16 [ffff880fd3a4be68] kjournald2 at ffffffffa006ec28 [jbd2]
#17 [ffff880fd3a4bee8] kthread at ffffffff81079f36
#18 [ffff880fd3a4bf48] kernel_thread at ffffffff810041aa
or
crash> bt
PID: 18628  TASK: ffff88085b3a1180  CPU: 0  COMMAND: "jbd2/dm-19-8"
 #0 [ffff8807fd4e7740] machine_kexec at ffffffff81027a2b
 #1 [ffff8807fd4e77a0] crash_kexec at ffffffff810a3a52
 #2 [ffff8807fd4e7870] oops_end at ffffffff8147f680
 #3 [ffff8807fd4e78a0] no_context at ffffffff81031ddb
 #4 [ffff8807fd4e78f0] __bad_area_nosemaphore at ffffffff81032065
 #5 [ffff8807fd4e7940] bad_area_nosemaphore at ffffffff81032133
 #6 [ffff8807fd4e7950] __do_page_fault at ffffffff810327bd
 #7 [ffff8807fd4e7a70] do_page_fault at ffffffff8148168e
 #8 [ffff8807fd4e7aa0] page_fault at ffffffff8147ea15
    [exception RIP: kmem_cache_free+123]
    RIP: ffffffff811465eb  RSP: ffff8807fd4e7b50  RFLAGS: 00010086
    RAX: ffffeae380a76130  RBX: ffff88073ce4c000  RCX: 0000000000000000
    RDX: ffffea0000000000  RSI: ffffc9002fd2a01c  RDI: ffffc9002fd2a01c
    RBP: ffff8807fd4e7bb0   R8: 0000000000000000   R9: 5a5a5a5a5a5a5a5a
    R10: 5a5a5a5a5a5a5a5a  R11: 5a5a5a5a5a5a5a5a  R12: 0000000000000286
    R13: ffffc9002fd2a01c  R14: ffff882059a850c0  R15: ffff8817d6207c90
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8807fd4e7bb8] cfs_mem_cache_free at ffffffffa04e087e [libcfs]
#10 [ffff8807fd4e7bc8] lov_key_fini at ffffffffa090f811 [lov]
#11 [ffff8807fd4e7c18] key_fini at ffffffffa05a7a39 [obdclass]
#12 [ffff8807fd4e7c48] keys_fini at ffffffffa05a7b7f [obdclass]
#13 [ffff8807fd4e7c98] lu_context_fini at ffffffffa05a7c8d [obdclass]
#14 [ffff8807fd4e7cb8] osd_trans_commit_cb at ffffffffa0a406c2 [osd_ldiskfs]
#15 [ffff8807fd4e7d18] jbd2_journal_commit_transaction at ffffffffa005927b [jbd2]
#16 [ffff8807fd4e7e68] kjournald2 at ffffffffa005eb48 [jbd2]
#17 [ffff8807fd4e7ee8] kthread at ffffffff8107ad36
#18 [ffff8807fd4e7f48] kernel_thread at ffffffff810041aa
In the second case, there are many __list_add corruption warnings in the dmesg log.
The first one (as far as I can see in the dmesg log buffer):
------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: G W ---------------- T)
Hardware name: bullx super-node
list_add corruption. prev->next should be next (ffffc9003197e01c), but was ffff880583a54ab8. (prev=ffff880583a54ab8).
Modules linked in: iptable_filter ip_tables cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fid(U) fld(U) lov(U) lquota(U) osc(U) fsfilt_ldiskfs(U) exportfs mgc(U) ldiskfs(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin dm_multipath usbhid hid ghes i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma lpfc scsi_transport_fc scsi_tgt hed sg igb dca ext4 jbd2 sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]
Pid: 18758, comm: mdt_63 Tainted: G W ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
Call Trace:
 [<ffffffff810540b7>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff810541a6>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff81267d3f>] ? __list_add+0x8f/0xa0
 [<ffffffffa05a9111>] ? lu_object_put+0x161/0x1f0 [obdclass]
 [<ffffffffa09e5c08>] ? mdt_getattr_name_lock+0xf08/0x1a40 [mdt]
 [<ffffffffa06c75bb>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
 [<ffffffffa069bb54>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
 [<ffffffffa09e6cfa>] ? mdt_intent_getattr+0x32a/0x500 [mdt]
 [<ffffffffa09e01e7>] ? mdt_unpack_req_pack_rep+0x297/0x5d0 [mdt]
 [<ffffffffa04ef625>] ? cfs_hash_bd_lookup_intent+0xe5/0x130 [libcfs]
 [<ffffffffa069cf50>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
 [<ffffffffa09e4790>] ? mdt_intent_policy+0x3c0/0x6b0 [mdt]
 [<ffffffff81042890>] ? fair_enqueue_task_fair+0x190/0x350
 [<ffffffffa0587521>] ? class_handle_hash+0xa1/0x280 [obdclass]
 [<ffffffffa0654afa>] ? ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
 [<ffffffffa0673305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
 [<ffffffffa04ee692>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
 [<ffffffffa067b227>] ? ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
 [<ffffffffa09dffa1>] ? mdt_unpack_req_pack_rep+0x51/0x5d0 [mdt]
 [<ffffffffa09e430a>] ? mdt_enqueue+0x4a/0x110 [mdt]
 [<ffffffffa09e0df5>] ? mdt_handle_common+0x8d5/0x1810 [mdt]
 [<ffffffffa06992d4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
 [<ffffffffa09e1e05>] ? mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa06aa019>] ? ptlrpc_main+0xc79/0x19d0 [ptlrpc]
 [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
 [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
 [<ffffffff810041aa>] ? child_rip+0xa/0x20
 [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
 [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
 [<ffffffff810041a0>] ? child_rip+0x0/0x20
---[ end trace b8f1465c05250f4c ]---
The last one, just before the oops:
------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: G W ---------------- T)
Hardware name: bullx super-node
list_add corruption. prev->next should be next (ffffc9002fd2a01c), but was (null). (prev=ffff88179c9cd1b8).
Modules linked in: iptable_filter ip_tables cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fid(U) fld(U) lov(U) lquota(U) osc(U) fsfilt_ldiskfs(U) exportfs mgc(U) ldiskfs(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin dm_multipath usbhid hid ghes i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma lpfc scsi_transport_fc scsi_tgt hed sg igb dca ext4 jbd2 sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]
Pid: 18750, comm: mdt_55 Tainted: G W ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
Call Trace:
 [<ffffffff810540b7>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff810541a6>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff81267d3f>] ? __list_add+0x8f/0xa0
 [<ffffffffa05a9111>] ? lu_object_put+0x161/0x1f0 [obdclass]
 [<ffffffffa09e5c08>] ? mdt_getattr_name_lock+0xf08/0x1a40 [mdt]
 [<ffffffffa06c75bb>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
 [<ffffffffa069bb54>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
 [<ffffffffa09e6cfa>] ? mdt_intent_getattr+0x32a/0x500 [mdt]
 [<ffffffffa09e01e7>] ? mdt_unpack_req_pack_rep+0x297/0x5d0 [mdt]
 [<ffffffffa04ef5ab>] ? cfs_hash_bd_lookup_intent+0x6b/0x130 [libcfs]
 [<ffffffffa069cf50>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
 [<ffffffffa09e4790>] ? mdt_intent_policy+0x3c0/0x6b0 [mdt]
 [<ffffffff81042890>] ? fair_enqueue_task_fair+0x190/0x350
 [<ffffffffa0587521>] ? class_handle_hash+0xa1/0x280 [obdclass]
 [<ffffffffa0654afa>] ? ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
 [<ffffffffa0673305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
 [<ffffffffa04ee692>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
 [<ffffffffa067b227>] ? ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
 [<ffffffffa09dffa1>] ? mdt_unpack_req_pack_rep+0x51/0x5d0 [mdt]
 [<ffffffffa09e430a>] ? mdt_enqueue+0x4a/0x110 [mdt]
 [<ffffffffa09e0df5>] ? mdt_handle_common+0x8d5/0x1810 [mdt]
 [<ffffffffa06992d4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
 [<ffffffffa09e1e05>] ? mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa06aa019>] ? ptlrpc_main+0xc79/0x19d0 [ptlrpc]
 [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
 [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
 [<ffffffff810041aa>] ? child_rip+0xa/0x20
 [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
 [<ffffffffa06a93a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
 [<ffffffff810041a0>] ? child_rip+0x0/0x20
---[ end trace b8f1465c05250f81 ]---
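For readers unfamiliar with these warnings, here is a rough Python model of the consistency check that the kernel's list debugging performs before inserting into a doubly linked list. It is a sketch only: the names Node, checked_list_add and list_add_tail are hypothetical stand-ins, and the scenario (a node re-initialized while still linked) is just one assumed way to get a message of the same shape as the first warning, where the reported prev->next value equals prev itself. It is not a diagnosis of this crash.

```python
class Node:
    """A list node with kernel-style circular next/prev links (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.next = self  # an initialized, empty node points at itself
        self.prev = self


def checked_list_add(new, prev, nxt):
    """Insert `new` between `prev` and `nxt`, collecting corruption warnings.

    The neighbours are checked first; the insert then proceeds even if a
    check failed, which is why corruption can keep spreading after a warning.
    """
    warnings = []
    if nxt.prev is not prev:
        warnings.append(
            "list_add corruption. next->prev should be prev "
            f"({prev.name}), but was {nxt.prev.name}. (next={nxt.name})."
        )
    if prev.next is not nxt:
        warnings.append(
            "list_add corruption. prev->next should be next "
            f"({nxt.name}), but was {prev.next.name}. (prev={prev.name})."
        )
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new
    return warnings


def list_add_tail(new, head):
    """Append `new` at the tail, i.e. between head.prev and head."""
    return checked_list_add(new, head.prev, head)


head, a, b = Node("head"), Node("a"), Node("b")
clean = list_add_tail(a, head)   # healthy insert, no warnings expected

# Assumed fault: `a` is re-initialized without being unlinked from `head`.
a.next = a.prev = a
warnings = list_add_tail(b, head)
for line in warnings:
    print(line)
```

In this model the stale node still sits at head.prev, so the next tail insert finds a prev whose next pointer no longer leads back to the list head.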
Alex.
Issue Links
- duplicates
- LU-1013 recovery-mds lu_object.c:116:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed
- Resolved