Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.12.3
-
None
-
CentOS 7.6, Lustre 2.12.3_4
-
3
-
9223372036854775807
Description
Happy New Year
We hit the following 2.12.3 MDS crash on MDT0000 while rebooting the robinhood server:
[2206088.009705] Lustre: 109638:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0000: catlog [0x5:0xa:0x0] crosses index zero [2206091.272923] general protection fault: 0000 [#1] SMP [2206091.278116] Modules linked in: osp(OE) mdd(OE) mdt(OE) lustre(OE) mdc(OE) mgs(OE) lod(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lmv(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel dcdbas ipmi_si ses ipmi_devintf aesni_intel lrw gf128mul glue_helper enclosure ablk_helper pcspkr cryptd sg dm_multipath ipmi_msghandler ccp acpi_power_meter dm_mod i2c_piix4 k10temp ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE) [2206091.351292] ib_uverbs(OE) ib_core(OE) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm mlx5_core(OE) libahci mlxfw(OE) devlink crct10dif_pclmul mpt3sas(OE) drm tg3 crct10dif_common mlx_compat(OE) libata raid_class crc32c_intel megaraid_sas ptp scsi_transport_sas drm_panel_orientation_quirks pps_core [last unloaded: osp] [2206091.382061] CPU: 36 PID: 109638 Comm: mdt00_030 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 [2206091.395007] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019 [2206091.402832] task: ffff8865c0ae5140 ti: ffff886229cd4000 task.ti: ffff886229cd4000 [2206091.410484] RIP: 0010:[<ffffffffc0c12784>] [<ffffffffc0c12784>] llog_osd_next_block+0x964/0xbc0 [obdclass] [2206091.420443] RSP: 0018:ffff886229cd7668 EFLAGS: 00010246 [2206091.425928] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000002000 RCX: 0000000000063c58 [2206091.433235] RDX: 0000000000000000 RSI: ffffffffc0c87f20 RDI: ffffffffc0cbd320 [2206091.440539] RBP: ffff886229cd7730 R08: 0000000000000001 R09: 0000000000000000 [2206091.447845] R10: 0000000000000000 R11: 0000000000000000 R12: ffff886229cd77e0 [2206091.455152] R13: ffff888080b9b000 R14: ffff8867bb808c00 R15: 000000000000fc8d [2206091.462458] FS: 00007fa4a0ec4700(0000) GS:ffff885bff040000(0000) knlGS:0000000000000000 [2206091.470717] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2206091.476636] CR2: 00007fa4ae229000 CR3: 000000318c610000 CR4: 00000000003407e0 [2206091.483944] Call Trace: [2206091.486596] [<ffffffffc0c08737>] ? llog_cat_cancel_records+0x1e7/0x3c0 [obdclass] [2206091.494348] [<ffffffffc1487814>] ? osd_trunc_unlock_all+0xf4/0x160 [osd_ldiskfs] [2206091.502011] [<ffffffffc0c021e0>] llog_process_thread+0x330/0x18e0 [obdclass] [2206091.509329] [<ffffffffc17f5160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd] [2206091.516564] [<ffffffffc0c0384c>] llog_process_or_fork+0xbc/0x450 [obdclass] [2206091.523800] [<ffffffffc0c08b49>] llog_cat_process_cb+0x239/0x250 [obdclass] [2206091.531038] [<ffffffffc0c026df>] llog_process_thread+0x82f/0x18e0 [obdclass] [2206091.538361] [<ffffffffc0c08910>] ? llog_cat_cancel_records+0x3c0/0x3c0 [obdclass] [2206091.546116] [<ffffffffc0c0384c>] llog_process_or_fork+0xbc/0x450 [obdclass] [2206091.553353] [<ffffffffc0c08910>] ? llog_cat_cancel_records+0x3c0/0x3c0 [obdclass] [2206091.561111] [<ffffffffc0c0554e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass] [2206091.568786] [<ffffffffc0c2dba1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass] [2206091.576093] [<ffffffffc17f5160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd] [2206091.583322] [<ffffffffc0c0575e>] llog_cat_process+0x2e/0x30 [obdclass] [2206091.590116] [<ffffffffc17f1a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd] [2206091.597682] [<ffffffff8c703612>] ? ktime_get+0x52/0xe0 [2206091.603089] [<ffffffffc17f3b60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd] [2206091.610310] [<ffffffffc17f6c13>] mdd_changelog_clear+0x503/0x690 [mdd] [2206091.617105] [<ffffffffc17f9cf3>] mdd_iocontrol+0x163/0x540 [mdd] [2206091.623396] [<ffffffffc0c4d663>] ? lu_context_init+0xd3/0x1f0 [obdclass] [2206091.630367] [<ffffffffc172e8cc>] mdt_iocontrol+0x5ec/0xb00 [mdt] [2206091.636643] [<ffffffffc172f264>] mdt_set_info+0x484/0x490 [mdt] [2206091.642883] [<ffffffffc0f8f36a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [2206091.649965] [<ffffffffc0f6ada1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [2206091.657712] [<ffffffffc0ab2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [2206091.664964] [<ffffffffc0f3624b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [2206091.672824] [<ffffffffc0f31805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [2206091.679791] [<ffffffff8c6cfeb4>] ? __wake_up+0x44/0x50 [2206091.685217] [<ffffffffc0f39bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [2206091.691689] [<ffffffffc0f39080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [2206091.699255] [<ffffffff8c6c2e81>] kthread+0xd1/0xe0 [2206091.704303] [<ffffffff8c6c2db0>] ? insert_kthread_work+0x40/0x40 [2206091.710573] [<ffffffff8cd77c24>] ret_from_fork_nospec_begin+0xe/0x21 [2206091.717183] [<ffffffff8c6c2db0>] ? insert_kthread_work+0x40/0x40 [2206091.723447] Code: 05 d6 ab 0a 00 00 00 02 00 48 c7 c7 20 d3 cb c0 48 8b 40 08 48 c7 05 a8 ab 0a 00 b0 76 c8 c0 45 8b 4e 54 45 8b 46 50 49 8b 4e 48 <48> 8b 50 28 49 8b 04 24 89 5c 24 10 48 89 44 24 08 41 8b 46 58 [2206091.744144] RIP [<ffffffffc0c12784>] llog_osd_next_block+0x964/0xbc0 [obdclass] [2206091.751763] RSP <ffff886229cd7668>
Crash shows 2 running threads doing both mdd_changelog_llog_cancel:
crash> ps | grep ">" > 0 0 0 ffffffff8d218480 RU 0.0 0 0 [swapper/0] > 0 0 1 ffff885ca9f38000 RU 0.0 0 0 [swapper/1] > 0 0 2 ffff886ca9400000 RU 0.0 0 0 [swapper/2] > 0 0 3 ffff887ca9eae180 RU 0.0 0 0 [swapper/3] > 0 0 4 ffff884d29b96180 RU 0.0 0 0 [swapper/4] > 0 0 5 ffff885ca9f3e180 RU 0.0 0 0 [swapper/5] > 0 0 6 ffff886ca9406180 RU 0.0 0 0 [swapper/6] > 0 0 7 ffff887ca9eaa080 RU 0.0 0 0 [swapper/7] > 0 0 8 ffff884d29b930c0 RU 0.0 0 0 [swapper/8] > 0 0 9 ffff885ca9f39040 RU 0.0 0 0 [swapper/9] > 0 0 10 ffff886ca9401040 RU 0.0 0 0 [swapper/10] > 0 0 11 ffff887ca9ead140 RU 0.0 0 0 [swapper/11] > 0 0 12 ffff884d29b95140 RU 0.0 0 0 [swapper/12] > 0 0 13 ffff885ca9f3d140 RU 0.0 0 0 [swapper/13] > 0 0 14 ffff886ca9405140 RU 0.0 0 0 [swapper/14] > 0 0 15 ffff887ca9eab0c0 RU 0.0 0 0 [swapper/15] > 0 0 16 ffff884d29b94100 RU 0.0 0 0 [swapper/16] > 0 0 17 ffff885ca9f3a080 RU 0.0 0 0 [swapper/17] > 0 0 18 ffff886ca9402080 RU 0.0 0 0 [swapper/18] > 0 0 19 ffff887ca9eac100 RU 0.0 0 0 [swapper/19] > 0 0 20 ffff884d29bb8000 RU 0.0 0 0 [swapper/20] > 0 0 21 ffff885ca9f3c100 RU 0.0 0 0 [swapper/21] > 0 0 22 ffff886ca9404100 RU 0.0 0 0 [swapper/22] > 0 0 23 ffff887ca9f20000 RU 0.0 0 0 [swapper/23] > 0 0 24 ffff884d29bbe180 RU 0.0 0 0 [swapper/24] > 0 0 25 ffff885ca9f3b0c0 RU 0.0 0 0 [swapper/25] > 0 0 26 ffff886ca94030c0 RU 0.0 0 0 [swapper/26] > 0 0 27 ffff887ca9f26180 RU 0.0 0 0 [swapper/27] > 0 0 28 ffff884d29bb9040 RU 0.0 0 0 [swapper/28] > 0 0 29 ffff885ca9f78000 RU 0.0 0 0 [swapper/29] > 0 0 30 ffff886ca9430000 RU 0.0 0 0 [swapper/30] > 0 0 31 ffff887ca9f21040 RU 0.0 0 0 [swapper/31] > 0 0 32 ffff884d29bbd140 RU 0.0 0 0 [swapper/32] > 0 0 33 ffff885ca9f7e180 RU 0.0 0 0 [swapper/33] > 0 0 34 ffff886ca9436180 RU 0.0 0 0 [swapper/34] > 0 0 35 ffff887ca9f25140 RU 0.0 0 0 [swapper/35] > 0 0 37 ffff885ca9f79040 RU 0.0 0 0 [swapper/37] > 0 0 38 ffff886ca9431040 RU 0.0 0 0 [swapper/38] > 0 0 39 ffff887ca9f22080 RU 0.0 0 0 [swapper/39] > 0 0 40 ffff884d29bbc100 RU 0.0 0 0 [swapper/40] > 0 0 41 ffff885ca9f7d140 RU 0.0 0 0 [swapper/41] > 0 0 42 ffff886ca9435140 RU 0.0 0 0 [swapper/42] > 0 0 43 ffff887ca9f24100 RU 0.0 0 0 [swapper/43] > 0 0 45 ffff885ca9f7a080 RU 0.0 0 0 [swapper/45] > 0 0 46 ffff886ca9432080 RU 0.0 0 0 [swapper/46] > 0 0 47 ffff887ca9f230c0 RU 0.0 0 0 [swapper/47] > 43092 2 44 ffff885bf9c6c100 RU 0.0 0 0 [mdt00_080] > 109638 2 36 ffff8865c0ae5140 RU 0.0 0 0 [mdt00_030] crash> bt 43092 PID: 43092 TASK: ffff885bf9c6c100 CPU: 44 COMMAND: "mdt00_080" #0 [ffff885bff0c8e48] crash_nmi_callback at ffffffff8c656027 #1 [ffff885bff0c8e58] nmi_handle at ffffffff8cd6f91c #2 [ffff885bff0c8eb0] do_nmi at ffffffff8cd6fb3d #3 [ffff885bff0c8ef0] end_repeat_nmi at ffffffff8cd6ed89 [exception RIP: osd_idc_add+6] RIP: ffffffffc1459f36 RSP: ffff88517f577510 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff8853762ac800 RCX: 0000000000000002 RDX: ffff884f2a223a40 RSI: ffff888b8fa7e000 RDI: ffff88517f577c20 RBP: ffff88517f577548 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000006 R11: ffff8877d65d7a00 R12: ffff884f2a223a40 R13: ffff888b8fa7e000 R14: ffff88517f577c20 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #4 [ffff88517f577510] osd_idc_add at ffffffffc1459f36 [osd_ldiskfs] #5 [ffff88517f577518] osd_idc_find_and_init at ffffffffc145f360 [osd_ldiskfs] #6 [ffff88517f577550] osd_declare_destroy at ffffffffc1461deb [osd_ldiskfs] #7 [ffff88517f5775a8] llog_osd_declare_destroy at ffffffffc0c15ab1 [obdclass] #8 [ffff88517f5775e8] llog_declare_destroy at ffffffffc0bfea04 [obdclass] #9 [ffff88517f577610] llog_cancel_rec at ffffffffc0c019c6 [obdclass] #10 [ffff88517f577670] llog_cat_cancel_records at ffffffffc0c08720 [obdclass] #11 [ffff88517f5776d8] llog_changelog_cancel_cb at ffffffffc17f5264 [mdd] #12 [ffff88517f577738] llog_process_thread at ffffffffc0c026df [obdclass] #13 [ffff88517f577848] llog_process_or_fork at ffffffffc0c0384c [obdclass] #14 [ffff88517f577890] llog_cat_process_cb at ffffffffc0c08b49 [obdclass] #15 [ffff88517f5778e0] llog_process_thread at ffffffffc0c026df [obdclass] #16 [ffff88517f5779f0] llog_process_or_fork at ffffffffc0c0384c [obdclass] #17 [ffff88517f577a38] llog_cat_process_or_fork at ffffffffc0c0554e [obdclass] #18 [ffff88517f577ab0] llog_cat_process at ffffffffc0c0575e [obdclass] #19 [ffff88517f577ad0] llog_changelog_cancel at ffffffffc17f1a34 [mdd] #20 [ffff88517f577b10] mdd_changelog_llog_cancel at ffffffffc17f3b60 [mdd] #21 [ffff88517f577b60] mdd_changelog_clear at ffffffffc17f6c13 [mdd] #22 [ffff88517f577bb8] mdd_iocontrol at ffffffffc17f9cf3 [mdd] #23 [ffff88517f577bf8] mdt_iocontrol at ffffffffc172e8cc [mdt] #24 [ffff88517f577c90] mdt_set_info at ffffffffc172f264 [mdt] #25 [ffff88517f577cd0] tgt_request_handle at ffffffffc0f8f36a [ptlrpc] #26 [ffff88517f577d58] ptlrpc_server_handle_request at ffffffffc0f3624b [ptlrpc] #27 [ffff88517f577df8] ptlrpc_main at ffffffffc0f39bac [ptlrpc] #28 [ffff88517f577ec8] kthread at ffffffff8c6c2e81 #29 [ffff88517f577f50] ret_from_fork_nospec_begin at ffffffff8cd77c24 crash> bt 109638 PID: 109638 TASK: ffff8865c0ae5140 CPU: 36 COMMAND: "mdt00_030" #0 [ffff886229cd73d8] machine_kexec at ffffffff8c663b34 #1 [ffff886229cd7438] __crash_kexec at ffffffff8c71e242 #2 [ffff886229cd7508] crash_kexec at ffffffff8c71e330 #3 [ffff886229cd7520] oops_end at ffffffff8cd6f778 #4 [ffff886229cd7548] die at ffffffff8c62f95b #5 [ffff886229cd7578] do_general_protection at ffffffff8cd6f072 #6 [ffff886229cd75b0] general_protection at ffffffff8cd6e708 [exception RIP: llog_osd_next_block+2404] RIP: ffffffffc0c12784 RSP: ffff886229cd7668 RFLAGS: 00010246 RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000002000 RCX: 0000000000063c58 RDX: 0000000000000000 RSI: ffffffffc0c87f20 RDI: ffffffffc0cbd320 RBP: ffff886229cd7730 R8: 0000000000000001 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff886229cd77e0 R13: ffff888080b9b000 R14: ffff8867bb808c00 R15: 000000000000fc8d ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff886229cd7660] llog_osd_next_block at ffffffffc0c1218a [obdclass] #8 [ffff886229cd7738] llog_process_thread at ffffffffc0c021e0 [obdclass] #9 [ffff886229cd7848] llog_process_or_fork at ffffffffc0c0384c [obdclass] #10 [ffff886229cd7890] llog_cat_process_cb at ffffffffc0c08b49 [obdclass] #11 [ffff886229cd78e0] llog_process_thread at ffffffffc0c026df [obdclass] #12 [ffff886229cd79f0] llog_process_or_fork at ffffffffc0c0384c [obdclass] #13 [ffff886229cd7a38] llog_cat_process_or_fork at ffffffffc0c0554e [obdclass] #14 [ffff886229cd7ab0] llog_cat_process at ffffffffc0c0575e [obdclass] #15 [ffff886229cd7ad0] llog_changelog_cancel at ffffffffc17f1a34 [mdd] #16 [ffff886229cd7b10] mdd_changelog_llog_cancel at ffffffffc17f3b60 [mdd] #17 [ffff886229cd7b60] mdd_changelog_clear at ffffffffc17f6c13 [mdd] #18 [ffff886229cd7bb8] mdd_iocontrol at ffffffffc17f9cf3 [mdd] #19 [ffff886229cd7bf8] mdt_iocontrol at ffffffffc172e8cc [mdt] #20 [ffff886229cd7c90] mdt_set_info at ffffffffc172f264 [mdt] #21 [ffff886229cd7cd0] tgt_request_handle at ffffffffc0f8f36a [ptlrpc] #22 [ffff886229cd7d58] ptlrpc_server_handle_request at ffffffffc0f3624b [ptlrpc] #23 [ffff886229cd7df8] ptlrpc_main at ffffffffc0f39bac [ptlrpc] #24 [ffff886229cd7ec8] kthread at ffffffff8c6c2e81 #25 [ffff886229cd7f50] ret_from_fork_nospec_begin at ffffffff8cd77c24
- vmcore available as vmcore-fir-md1-s1_2020-01-04-18-45-26 (md5sum is 5921fdd25b1aada8f52dbdc79f1f4ba1)
- vmcore dmesg available as vmcore-dmesg_fir-md1-s1_2020-01-04-18-45-26.txt
Attachments
Issue Links
- is duplicated by
-
LU-15769 llog_osd_next_block()): invalid llog tail at log id ...
- Resolved