[LU-13113] MDT changelogs crash RIP: llog_osd_next_block Created: 07/Jan/20  Updated: 15/Dec/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6, Lustre 2.12.3_4


Attachments: Text File vmcore-dmesg_fir-md1-s1_2020-01-04-18-45-26.txt    
Issue Links:
Duplicate
is duplicated by LU-15769 llog_osd_next_block()): invalid llog ... In Progress
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Happy New Year

We hit the following 2.12.3 MDS crash on MDT0000 while rebooting the robinhood server:

[2206088.009705] Lustre: 109638:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0000: catlog [0x5:0xa:0x0] crosses index zero
[2206091.272923] general protection fault: 0000 [#1] SMP 
[2206091.278116] Modules linked in: osp(OE) mdd(OE) mdt(OE) lustre(OE) mdc(OE) mgs(OE) lod(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lmv(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel dcdbas ipmi_si ses ipmi_devintf aesni_intel lrw gf128mul glue_helper enclosure ablk_helper pcspkr cryptd sg dm_multipath ipmi_msghandler ccp acpi_power_meter dm_mod i2c_piix4 k10temp ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE)
[2206091.351292]  ib_uverbs(OE) ib_core(OE) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm mlx5_core(OE) libahci mlxfw(OE) devlink crct10dif_pclmul mpt3sas(OE) drm tg3 crct10dif_common mlx_compat(OE) libata raid_class crc32c_intel megaraid_sas ptp scsi_transport_sas drm_panel_orientation_quirks pps_core [last unloaded: osp]
[2206091.382061] CPU: 36 PID: 109638 Comm: mdt00_030 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[2206091.395007] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
[2206091.402832] task: ffff8865c0ae5140 ti: ffff886229cd4000 task.ti: ffff886229cd4000
[2206091.410484] RIP: 0010:[<ffffffffc0c12784>]  [<ffffffffc0c12784>] llog_osd_next_block+0x964/0xbc0 [obdclass]
[2206091.420443] RSP: 0018:ffff886229cd7668  EFLAGS: 00010246
[2206091.425928] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000002000 RCX: 0000000000063c58
[2206091.433235] RDX: 0000000000000000 RSI: ffffffffc0c87f20 RDI: ffffffffc0cbd320
[2206091.440539] RBP: ffff886229cd7730 R08: 0000000000000001 R09: 0000000000000000
[2206091.447845] R10: 0000000000000000 R11: 0000000000000000 R12: ffff886229cd77e0
[2206091.455152] R13: ffff888080b9b000 R14: ffff8867bb808c00 R15: 000000000000fc8d
[2206091.462458] FS:  00007fa4a0ec4700(0000) GS:ffff885bff040000(0000) knlGS:0000000000000000
[2206091.470717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2206091.476636] CR2: 00007fa4ae229000 CR3: 000000318c610000 CR4: 00000000003407e0
[2206091.483944] Call Trace:
[2206091.486596]  [<ffffffffc0c08737>] ? llog_cat_cancel_records+0x1e7/0x3c0 [obdclass]
[2206091.494348]  [<ffffffffc1487814>] ? osd_trunc_unlock_all+0xf4/0x160 [osd_ldiskfs]
[2206091.502011]  [<ffffffffc0c021e0>] llog_process_thread+0x330/0x18e0 [obdclass]
[2206091.509329]  [<ffffffffc17f5160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
[2206091.516564]  [<ffffffffc0c0384c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[2206091.523800]  [<ffffffffc0c08b49>] llog_cat_process_cb+0x239/0x250 [obdclass]
[2206091.531038]  [<ffffffffc0c026df>] llog_process_thread+0x82f/0x18e0 [obdclass]
[2206091.538361]  [<ffffffffc0c08910>] ? llog_cat_cancel_records+0x3c0/0x3c0 [obdclass]
[2206091.546116]  [<ffffffffc0c0384c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[2206091.553353]  [<ffffffffc0c08910>] ? llog_cat_cancel_records+0x3c0/0x3c0 [obdclass]
[2206091.561111]  [<ffffffffc0c0554e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
[2206091.568786]  [<ffffffffc0c2dba1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[2206091.576093]  [<ffffffffc17f5160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
[2206091.583322]  [<ffffffffc0c0575e>] llog_cat_process+0x2e/0x30 [obdclass]
[2206091.590116]  [<ffffffffc17f1a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
[2206091.597682]  [<ffffffff8c703612>] ? ktime_get+0x52/0xe0
[2206091.603089]  [<ffffffffc17f3b60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
[2206091.610310]  [<ffffffffc17f6c13>] mdd_changelog_clear+0x503/0x690 [mdd]
[2206091.617105]  [<ffffffffc17f9cf3>] mdd_iocontrol+0x163/0x540 [mdd]
[2206091.623396]  [<ffffffffc0c4d663>] ? lu_context_init+0xd3/0x1f0 [obdclass]
[2206091.630367]  [<ffffffffc172e8cc>] mdt_iocontrol+0x5ec/0xb00 [mdt]
[2206091.636643]  [<ffffffffc172f264>] mdt_set_info+0x484/0x490 [mdt]
[2206091.642883]  [<ffffffffc0f8f36a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2206091.649965]  [<ffffffffc0f6ada1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[2206091.657712]  [<ffffffffc0ab2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[2206091.664964]  [<ffffffffc0f3624b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2206091.672824]  [<ffffffffc0f31805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[2206091.679791]  [<ffffffff8c6cfeb4>] ? __wake_up+0x44/0x50
[2206091.685217]  [<ffffffffc0f39bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[2206091.691689]  [<ffffffffc0f39080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[2206091.699255]  [<ffffffff8c6c2e81>] kthread+0xd1/0xe0
[2206091.704303]  [<ffffffff8c6c2db0>] ? insert_kthread_work+0x40/0x40
[2206091.710573]  [<ffffffff8cd77c24>] ret_from_fork_nospec_begin+0xe/0x21
[2206091.717183]  [<ffffffff8c6c2db0>] ? insert_kthread_work+0x40/0x40
[2206091.723447] Code: 05 d6 ab 0a 00 00 00 02 00 48 c7 c7 20 d3 cb c0 48 8b 40 08 48 c7 05 a8 ab 0a 00 b0 76 c8 c0 45 8b 4e 54 45 8b 46 50 49 8b 4e 48 <48> 8b 50 28 49 8b 04 24 89 5c 24 10 48 89 44 24 08 41 8b 46 58 
[2206091.744144] RIP  [<ffffffffc0c12784>] llog_osd_next_block+0x964/0xbc0 [obdclass]
[2206091.751763]  RSP <ffff886229cd7668>
 

Crash shows 2 running threads doing both mdd_changelog_llog_cancel:

crash> ps | grep ">"
>     0      0   0  ffffffff8d218480  RU   0.0       0      0  [swapper/0]
>     0      0   1  ffff885ca9f38000  RU   0.0       0      0  [swapper/1]
>     0      0   2  ffff886ca9400000  RU   0.0       0      0  [swapper/2]
>     0      0   3  ffff887ca9eae180  RU   0.0       0      0  [swapper/3]
>     0      0   4  ffff884d29b96180  RU   0.0       0      0  [swapper/4]
>     0      0   5  ffff885ca9f3e180  RU   0.0       0      0  [swapper/5]
>     0      0   6  ffff886ca9406180  RU   0.0       0      0  [swapper/6]
>     0      0   7  ffff887ca9eaa080  RU   0.0       0      0  [swapper/7]
>     0      0   8  ffff884d29b930c0  RU   0.0       0      0  [swapper/8]
>     0      0   9  ffff885ca9f39040  RU   0.0       0      0  [swapper/9]
>     0      0  10  ffff886ca9401040  RU   0.0       0      0  [swapper/10]
>     0      0  11  ffff887ca9ead140  RU   0.0       0      0  [swapper/11]
>     0      0  12  ffff884d29b95140  RU   0.0       0      0  [swapper/12]
>     0      0  13  ffff885ca9f3d140  RU   0.0       0      0  [swapper/13]
>     0      0  14  ffff886ca9405140  RU   0.0       0      0  [swapper/14]
>     0      0  15  ffff887ca9eab0c0  RU   0.0       0      0  [swapper/15]
>     0      0  16  ffff884d29b94100  RU   0.0       0      0  [swapper/16]
>     0      0  17  ffff885ca9f3a080  RU   0.0       0      0  [swapper/17]
>     0      0  18  ffff886ca9402080  RU   0.0       0      0  [swapper/18]
>     0      0  19  ffff887ca9eac100  RU   0.0       0      0  [swapper/19]
>     0      0  20  ffff884d29bb8000  RU   0.0       0      0  [swapper/20]
>     0      0  21  ffff885ca9f3c100  RU   0.0       0      0  [swapper/21]
>     0      0  22  ffff886ca9404100  RU   0.0       0      0  [swapper/22]
>     0      0  23  ffff887ca9f20000  RU   0.0       0      0  [swapper/23]
>     0      0  24  ffff884d29bbe180  RU   0.0       0      0  [swapper/24]
>     0      0  25  ffff885ca9f3b0c0  RU   0.0       0      0  [swapper/25]
>     0      0  26  ffff886ca94030c0  RU   0.0       0      0  [swapper/26]
>     0      0  27  ffff887ca9f26180  RU   0.0       0      0  [swapper/27]
>     0      0  28  ffff884d29bb9040  RU   0.0       0      0  [swapper/28]
>     0      0  29  ffff885ca9f78000  RU   0.0       0      0  [swapper/29]
>     0      0  30  ffff886ca9430000  RU   0.0       0      0  [swapper/30]
>     0      0  31  ffff887ca9f21040  RU   0.0       0      0  [swapper/31]
>     0      0  32  ffff884d29bbd140  RU   0.0       0      0  [swapper/32]
>     0      0  33  ffff885ca9f7e180  RU   0.0       0      0  [swapper/33]
>     0      0  34  ffff886ca9436180  RU   0.0       0      0  [swapper/34]
>     0      0  35  ffff887ca9f25140  RU   0.0       0      0  [swapper/35]
>     0      0  37  ffff885ca9f79040  RU   0.0       0      0  [swapper/37]
>     0      0  38  ffff886ca9431040  RU   0.0       0      0  [swapper/38]
>     0      0  39  ffff887ca9f22080  RU   0.0       0      0  [swapper/39]
>     0      0  40  ffff884d29bbc100  RU   0.0       0      0  [swapper/40]
>     0      0  41  ffff885ca9f7d140  RU   0.0       0      0  [swapper/41]
>     0      0  42  ffff886ca9435140  RU   0.0       0      0  [swapper/42]
>     0      0  43  ffff887ca9f24100  RU   0.0       0      0  [swapper/43]
>     0      0  45  ffff885ca9f7a080  RU   0.0       0      0  [swapper/45]
>     0      0  46  ffff886ca9432080  RU   0.0       0      0  [swapper/46]
>     0      0  47  ffff887ca9f230c0  RU   0.0       0      0  [swapper/47]
> 43092      2  44  ffff885bf9c6c100  RU   0.0       0      0  [mdt00_080]
> 109638      2  36  ffff8865c0ae5140  RU   0.0       0      0  [mdt00_030]
crash> bt 43092
PID: 43092  TASK: ffff885bf9c6c100  CPU: 44  COMMAND: "mdt00_080"
 #0 [ffff885bff0c8e48] crash_nmi_callback at ffffffff8c656027
 #1 [ffff885bff0c8e58] nmi_handle at ffffffff8cd6f91c
 #2 [ffff885bff0c8eb0] do_nmi at ffffffff8cd6fb3d
 #3 [ffff885bff0c8ef0] end_repeat_nmi at ffffffff8cd6ed89
    [exception RIP: osd_idc_add+6]
    RIP: ffffffffc1459f36  RSP: ffff88517f577510  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff8853762ac800  RCX: 0000000000000002
    RDX: ffff884f2a223a40  RSI: ffff888b8fa7e000  RDI: ffff88517f577c20
    RBP: ffff88517f577548   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000006  R11: ffff8877d65d7a00  R12: ffff884f2a223a40
    R13: ffff888b8fa7e000  R14: ffff88517f577c20  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff88517f577510] osd_idc_add at ffffffffc1459f36 [osd_ldiskfs]
 #5 [ffff88517f577518] osd_idc_find_and_init at ffffffffc145f360 [osd_ldiskfs]
 #6 [ffff88517f577550] osd_declare_destroy at ffffffffc1461deb [osd_ldiskfs]
 #7 [ffff88517f5775a8] llog_osd_declare_destroy at ffffffffc0c15ab1 [obdclass]
 #8 [ffff88517f5775e8] llog_declare_destroy at ffffffffc0bfea04 [obdclass]
 #9 [ffff88517f577610] llog_cancel_rec at ffffffffc0c019c6 [obdclass]
#10 [ffff88517f577670] llog_cat_cancel_records at ffffffffc0c08720 [obdclass]
#11 [ffff88517f5776d8] llog_changelog_cancel_cb at ffffffffc17f5264 [mdd]
#12 [ffff88517f577738] llog_process_thread at ffffffffc0c026df [obdclass]
#13 [ffff88517f577848] llog_process_or_fork at ffffffffc0c0384c [obdclass]
#14 [ffff88517f577890] llog_cat_process_cb at ffffffffc0c08b49 [obdclass]
#15 [ffff88517f5778e0] llog_process_thread at ffffffffc0c026df [obdclass]
#16 [ffff88517f5779f0] llog_process_or_fork at ffffffffc0c0384c [obdclass]
#17 [ffff88517f577a38] llog_cat_process_or_fork at ffffffffc0c0554e [obdclass]
#18 [ffff88517f577ab0] llog_cat_process at ffffffffc0c0575e [obdclass]
#19 [ffff88517f577ad0] llog_changelog_cancel at ffffffffc17f1a34 [mdd]
#20 [ffff88517f577b10] mdd_changelog_llog_cancel at ffffffffc17f3b60 [mdd]
#21 [ffff88517f577b60] mdd_changelog_clear at ffffffffc17f6c13 [mdd]
#22 [ffff88517f577bb8] mdd_iocontrol at ffffffffc17f9cf3 [mdd]
#23 [ffff88517f577bf8] mdt_iocontrol at ffffffffc172e8cc [mdt]
#24 [ffff88517f577c90] mdt_set_info at ffffffffc172f264 [mdt]
#25 [ffff88517f577cd0] tgt_request_handle at ffffffffc0f8f36a [ptlrpc]
#26 [ffff88517f577d58] ptlrpc_server_handle_request at ffffffffc0f3624b [ptlrpc]
#27 [ffff88517f577df8] ptlrpc_main at ffffffffc0f39bac [ptlrpc]
#28 [ffff88517f577ec8] kthread at ffffffff8c6c2e81
#29 [ffff88517f577f50] ret_from_fork_nospec_begin at ffffffff8cd77c24
crash> bt 109638
PID: 109638  TASK: ffff8865c0ae5140  CPU: 36  COMMAND: "mdt00_030"
 #0 [ffff886229cd73d8] machine_kexec at ffffffff8c663b34
 #1 [ffff886229cd7438] __crash_kexec at ffffffff8c71e242
 #2 [ffff886229cd7508] crash_kexec at ffffffff8c71e330
 #3 [ffff886229cd7520] oops_end at ffffffff8cd6f778
 #4 [ffff886229cd7548] die at ffffffff8c62f95b
 #5 [ffff886229cd7578] do_general_protection at ffffffff8cd6f072
 #6 [ffff886229cd75b0] general_protection at ffffffff8cd6e708
    [exception RIP: llog_osd_next_block+2404]
    RIP: ffffffffc0c12784  RSP: ffff886229cd7668  RFLAGS: 00010246
    RAX: 5a5a5a5a5a5a5a5a  RBX: 0000000000002000  RCX: 0000000000063c58
    RDX: 0000000000000000  RSI: ffffffffc0c87f20  RDI: ffffffffc0cbd320
    RBP: ffff886229cd7730   R8: 0000000000000001   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff886229cd77e0
    R13: ffff888080b9b000  R14: ffff8867bb808c00  R15: 000000000000fc8d
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff886229cd7660] llog_osd_next_block at ffffffffc0c1218a [obdclass]
 #8 [ffff886229cd7738] llog_process_thread at ffffffffc0c021e0 [obdclass]
 #9 [ffff886229cd7848] llog_process_or_fork at ffffffffc0c0384c [obdclass]
#10 [ffff886229cd7890] llog_cat_process_cb at ffffffffc0c08b49 [obdclass]
#11 [ffff886229cd78e0] llog_process_thread at ffffffffc0c026df [obdclass]
#12 [ffff886229cd79f0] llog_process_or_fork at ffffffffc0c0384c [obdclass]
#13 [ffff886229cd7a38] llog_cat_process_or_fork at ffffffffc0c0554e [obdclass]
#14 [ffff886229cd7ab0] llog_cat_process at ffffffffc0c0575e [obdclass]
#15 [ffff886229cd7ad0] llog_changelog_cancel at ffffffffc17f1a34 [mdd]
#16 [ffff886229cd7b10] mdd_changelog_llog_cancel at ffffffffc17f3b60 [mdd]
#17 [ffff886229cd7b60] mdd_changelog_clear at ffffffffc17f6c13 [mdd]
#18 [ffff886229cd7bb8] mdd_iocontrol at ffffffffc17f9cf3 [mdd]
#19 [ffff886229cd7bf8] mdt_iocontrol at ffffffffc172e8cc [mdt]
#20 [ffff886229cd7c90] mdt_set_info at ffffffffc172f264 [mdt]
#21 [ffff886229cd7cd0] tgt_request_handle at ffffffffc0f8f36a [ptlrpc]
#22 [ffff886229cd7d58] ptlrpc_server_handle_request at ffffffffc0f3624b [ptlrpc]
#23 [ffff886229cd7df8] ptlrpc_main at ffffffffc0f39bac [ptlrpc]
#24 [ffff886229cd7ec8] kthread at ffffffff8c6c2e81
#25 [ffff886229cd7f50] ret_from_fork_nospec_begin at ffffffff8cd77c24


 Comments   
Comment by Peter Jones [ 07/Jan/20 ]

Yang Sheng

Could you please investigate?

Thanks

Peter

Comment by Yang Sheng [ 15/Jan/20 ]

Hi, Stephane,

I encountered a error while decode symbol in crash:

crash> rd 0xffff8861e76ffd00
rd: page excluded: kernel virtual address: ffff8861e76ffd00  type: "64-bit KVADDR"

It is really wired, The kernel image should be matched exactly. Do you deploy lustre by some own building? If so, could you please share your lustre-debuginfo rpm to me?

Thanks,
Yangsheng

Comment by Stephane Thiell [ 15/Jan/20 ]

Hi YangSheng,

Yes, we build our own. I just uploaded lustre-debuginfo-2.12.3_4_g142b4d4-1.el7.x86_64.rpm to the Whamcloud FTP. The Kernel debuginfo packages are also available there:

  • kernel-debuginfo-3.10.0-957.27.2.el7_lustre.pl2.x86_64.rpm
  • kernel-debuginfo-common-x86_64-3.10.0-957.27.2.el7_lustre.pl2.x86_64.rpm

Let me know if you need anything else.
Thanks!

Generated at Sat Feb 10 02:58:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.