Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13113

MDT changelogs crash RIP: llog_osd_next_block

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.3
    • None
    • CentOS 7.6, Lustre 2.12.3_4
    • 3
    • 9223372036854775807

    Description

      Happy New Year

      We hit the following 2.12.3 MDS crash on MDT0000 while rebooting the robinhood server:

      [2206088.009705] Lustre: 109638:0:(llog_cat.c:894:llog_cat_process_or_fork()) fir-MDD0000: catlog [0x5:0xa:0x0] crosses index zero
      [2206091.272923] general protection fault: 0000 [#1] SMP 
      [2206091.278116] Modules linked in: osp(OE) mdd(OE) mdt(OE) lustre(OE) mdc(OE) mgs(OE) lod(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lmv(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel dcdbas ipmi_si ses ipmi_devintf aesni_intel lrw gf128mul glue_helper enclosure ablk_helper pcspkr cryptd sg dm_multipath ipmi_msghandler ccp acpi_power_meter dm_mod i2c_piix4 k10temp ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE)
      [2206091.351292]  ib_uverbs(OE) ib_core(OE) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm mlx5_core(OE) libahci mlxfw(OE) devlink crct10dif_pclmul mpt3sas(OE) drm tg3 crct10dif_common mlx_compat(OE) libata raid_class crc32c_intel megaraid_sas ptp scsi_transport_sas drm_panel_orientation_quirks pps_core [last unloaded: osp]
      [2206091.382061] CPU: 36 PID: 109638 Comm: mdt00_030 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
      [2206091.395007] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
      [2206091.402832] task: ffff8865c0ae5140 ti: ffff886229cd4000 task.ti: ffff886229cd4000
      [2206091.410484] RIP: 0010:[<ffffffffc0c12784>]  [<ffffffffc0c12784>] llog_osd_next_block+0x964/0xbc0 [obdclass]
      [2206091.420443] RSP: 0018:ffff886229cd7668  EFLAGS: 00010246
      [2206091.425928] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000002000 RCX: 0000000000063c58
      [2206091.433235] RDX: 0000000000000000 RSI: ffffffffc0c87f20 RDI: ffffffffc0cbd320
      [2206091.440539] RBP: ffff886229cd7730 R08: 0000000000000001 R09: 0000000000000000
      [2206091.447845] R10: 0000000000000000 R11: 0000000000000000 R12: ffff886229cd77e0
      [2206091.455152] R13: ffff888080b9b000 R14: ffff8867bb808c00 R15: 000000000000fc8d
      [2206091.462458] FS:  00007fa4a0ec4700(0000) GS:ffff885bff040000(0000) knlGS:0000000000000000
      [2206091.470717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [2206091.476636] CR2: 00007fa4ae229000 CR3: 000000318c610000 CR4: 00000000003407e0
      [2206091.483944] Call Trace:
      [2206091.486596]  [<ffffffffc0c08737>] ? llog_cat_cancel_records+0x1e7/0x3c0 [obdclass]
      [2206091.494348]  [<ffffffffc1487814>] ? osd_trunc_unlock_all+0xf4/0x160 [osd_ldiskfs]
      [2206091.502011]  [<ffffffffc0c021e0>] llog_process_thread+0x330/0x18e0 [obdclass]
      [2206091.509329]  [<ffffffffc17f5160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
      [2206091.516564]  [<ffffffffc0c0384c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      [2206091.523800]  [<ffffffffc0c08b49>] llog_cat_process_cb+0x239/0x250 [obdclass]
      [2206091.531038]  [<ffffffffc0c026df>] llog_process_thread+0x82f/0x18e0 [obdclass]
      [2206091.538361]  [<ffffffffc0c08910>] ? llog_cat_cancel_records+0x3c0/0x3c0 [obdclass]
      [2206091.546116]  [<ffffffffc0c0384c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      [2206091.553353]  [<ffffffffc0c08910>] ? llog_cat_cancel_records+0x3c0/0x3c0 [obdclass]
      [2206091.561111]  [<ffffffffc0c0554e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
      [2206091.568786]  [<ffffffffc0c2dba1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
      [2206091.576093]  [<ffffffffc17f5160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
      [2206091.583322]  [<ffffffffc0c0575e>] llog_cat_process+0x2e/0x30 [obdclass]
      [2206091.590116]  [<ffffffffc17f1a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
      [2206091.597682]  [<ffffffff8c703612>] ? ktime_get+0x52/0xe0
      [2206091.603089]  [<ffffffffc17f3b60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
      [2206091.610310]  [<ffffffffc17f6c13>] mdd_changelog_clear+0x503/0x690 [mdd]
      [2206091.617105]  [<ffffffffc17f9cf3>] mdd_iocontrol+0x163/0x540 [mdd]
      [2206091.623396]  [<ffffffffc0c4d663>] ? lu_context_init+0xd3/0x1f0 [obdclass]
      [2206091.630367]  [<ffffffffc172e8cc>] mdt_iocontrol+0x5ec/0xb00 [mdt]
      [2206091.636643]  [<ffffffffc172f264>] mdt_set_info+0x484/0x490 [mdt]
      [2206091.642883]  [<ffffffffc0f8f36a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [2206091.649965]  [<ffffffffc0f6ada1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [2206091.657712]  [<ffffffffc0ab2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [2206091.664964]  [<ffffffffc0f3624b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [2206091.672824]  [<ffffffffc0f31805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [2206091.679791]  [<ffffffff8c6cfeb4>] ? __wake_up+0x44/0x50
      [2206091.685217]  [<ffffffffc0f39bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [2206091.691689]  [<ffffffffc0f39080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [2206091.699255]  [<ffffffff8c6c2e81>] kthread+0xd1/0xe0
      [2206091.704303]  [<ffffffff8c6c2db0>] ? insert_kthread_work+0x40/0x40
      [2206091.710573]  [<ffffffff8cd77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [2206091.717183]  [<ffffffff8c6c2db0>] ? insert_kthread_work+0x40/0x40
      [2206091.723447] Code: 05 d6 ab 0a 00 00 00 02 00 48 c7 c7 20 d3 cb c0 48 8b 40 08 48 c7 05 a8 ab 0a 00 b0 76 c8 c0 45 8b 4e 54 45 8b 46 50 49 8b 4e 48 <48> 8b 50 28 49 8b 04 24 89 5c 24 10 48 89 44 24 08 41 8b 46 58 
      [2206091.744144] RIP  [<ffffffffc0c12784>] llog_osd_next_block+0x964/0xbc0 [obdclass]
      [2206091.751763]  RSP <ffff886229cd7668>
       

      Crash shows 2 running threads doing both mdd_changelog_llog_cancel:

      crash> ps | grep ">"
      >     0      0   0  ffffffff8d218480  RU   0.0       0      0  [swapper/0]
      >     0      0   1  ffff885ca9f38000  RU   0.0       0      0  [swapper/1]
      >     0      0   2  ffff886ca9400000  RU   0.0       0      0  [swapper/2]
      >     0      0   3  ffff887ca9eae180  RU   0.0       0      0  [swapper/3]
      >     0      0   4  ffff884d29b96180  RU   0.0       0      0  [swapper/4]
      >     0      0   5  ffff885ca9f3e180  RU   0.0       0      0  [swapper/5]
      >     0      0   6  ffff886ca9406180  RU   0.0       0      0  [swapper/6]
      >     0      0   7  ffff887ca9eaa080  RU   0.0       0      0  [swapper/7]
      >     0      0   8  ffff884d29b930c0  RU   0.0       0      0  [swapper/8]
      >     0      0   9  ffff885ca9f39040  RU   0.0       0      0  [swapper/9]
      >     0      0  10  ffff886ca9401040  RU   0.0       0      0  [swapper/10]
      >     0      0  11  ffff887ca9ead140  RU   0.0       0      0  [swapper/11]
      >     0      0  12  ffff884d29b95140  RU   0.0       0      0  [swapper/12]
      >     0      0  13  ffff885ca9f3d140  RU   0.0       0      0  [swapper/13]
      >     0      0  14  ffff886ca9405140  RU   0.0       0      0  [swapper/14]
      >     0      0  15  ffff887ca9eab0c0  RU   0.0       0      0  [swapper/15]
      >     0      0  16  ffff884d29b94100  RU   0.0       0      0  [swapper/16]
      >     0      0  17  ffff885ca9f3a080  RU   0.0       0      0  [swapper/17]
      >     0      0  18  ffff886ca9402080  RU   0.0       0      0  [swapper/18]
      >     0      0  19  ffff887ca9eac100  RU   0.0       0      0  [swapper/19]
      >     0      0  20  ffff884d29bb8000  RU   0.0       0      0  [swapper/20]
      >     0      0  21  ffff885ca9f3c100  RU   0.0       0      0  [swapper/21]
      >     0      0  22  ffff886ca9404100  RU   0.0       0      0  [swapper/22]
      >     0      0  23  ffff887ca9f20000  RU   0.0       0      0  [swapper/23]
      >     0      0  24  ffff884d29bbe180  RU   0.0       0      0  [swapper/24]
      >     0      0  25  ffff885ca9f3b0c0  RU   0.0       0      0  [swapper/25]
      >     0      0  26  ffff886ca94030c0  RU   0.0       0      0  [swapper/26]
      >     0      0  27  ffff887ca9f26180  RU   0.0       0      0  [swapper/27]
      >     0      0  28  ffff884d29bb9040  RU   0.0       0      0  [swapper/28]
      >     0      0  29  ffff885ca9f78000  RU   0.0       0      0  [swapper/29]
      >     0      0  30  ffff886ca9430000  RU   0.0       0      0  [swapper/30]
      >     0      0  31  ffff887ca9f21040  RU   0.0       0      0  [swapper/31]
      >     0      0  32  ffff884d29bbd140  RU   0.0       0      0  [swapper/32]
      >     0      0  33  ffff885ca9f7e180  RU   0.0       0      0  [swapper/33]
      >     0      0  34  ffff886ca9436180  RU   0.0       0      0  [swapper/34]
      >     0      0  35  ffff887ca9f25140  RU   0.0       0      0  [swapper/35]
      >     0      0  37  ffff885ca9f79040  RU   0.0       0      0  [swapper/37]
      >     0      0  38  ffff886ca9431040  RU   0.0       0      0  [swapper/38]
      >     0      0  39  ffff887ca9f22080  RU   0.0       0      0  [swapper/39]
      >     0      0  40  ffff884d29bbc100  RU   0.0       0      0  [swapper/40]
      >     0      0  41  ffff885ca9f7d140  RU   0.0       0      0  [swapper/41]
      >     0      0  42  ffff886ca9435140  RU   0.0       0      0  [swapper/42]
      >     0      0  43  ffff887ca9f24100  RU   0.0       0      0  [swapper/43]
      >     0      0  45  ffff885ca9f7a080  RU   0.0       0      0  [swapper/45]
      >     0      0  46  ffff886ca9432080  RU   0.0       0      0  [swapper/46]
      >     0      0  47  ffff887ca9f230c0  RU   0.0       0      0  [swapper/47]
      > 43092      2  44  ffff885bf9c6c100  RU   0.0       0      0  [mdt00_080]
      > 109638      2  36  ffff8865c0ae5140  RU   0.0       0      0  [mdt00_030]
      crash> bt 43092
      PID: 43092  TASK: ffff885bf9c6c100  CPU: 44  COMMAND: "mdt00_080"
       #0 [ffff885bff0c8e48] crash_nmi_callback at ffffffff8c656027
       #1 [ffff885bff0c8e58] nmi_handle at ffffffff8cd6f91c
       #2 [ffff885bff0c8eb0] do_nmi at ffffffff8cd6fb3d
       #3 [ffff885bff0c8ef0] end_repeat_nmi at ffffffff8cd6ed89
          [exception RIP: osd_idc_add+6]
          RIP: ffffffffc1459f36  RSP: ffff88517f577510  RFLAGS: 00000246
          RAX: 0000000000000000  RBX: ffff8853762ac800  RCX: 0000000000000002
          RDX: ffff884f2a223a40  RSI: ffff888b8fa7e000  RDI: ffff88517f577c20
          RBP: ffff88517f577548   R8: 0000000000000000   R9: 0000000000000000
          R10: 0000000000000006  R11: ffff8877d65d7a00  R12: ffff884f2a223a40
          R13: ffff888b8fa7e000  R14: ffff88517f577c20  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #4 [ffff88517f577510] osd_idc_add at ffffffffc1459f36 [osd_ldiskfs]
       #5 [ffff88517f577518] osd_idc_find_and_init at ffffffffc145f360 [osd_ldiskfs]
       #6 [ffff88517f577550] osd_declare_destroy at ffffffffc1461deb [osd_ldiskfs]
       #7 [ffff88517f5775a8] llog_osd_declare_destroy at ffffffffc0c15ab1 [obdclass]
       #8 [ffff88517f5775e8] llog_declare_destroy at ffffffffc0bfea04 [obdclass]
       #9 [ffff88517f577610] llog_cancel_rec at ffffffffc0c019c6 [obdclass]
      #10 [ffff88517f577670] llog_cat_cancel_records at ffffffffc0c08720 [obdclass]
      #11 [ffff88517f5776d8] llog_changelog_cancel_cb at ffffffffc17f5264 [mdd]
      #12 [ffff88517f577738] llog_process_thread at ffffffffc0c026df [obdclass]
      #13 [ffff88517f577848] llog_process_or_fork at ffffffffc0c0384c [obdclass]
      #14 [ffff88517f577890] llog_cat_process_cb at ffffffffc0c08b49 [obdclass]
      #15 [ffff88517f5778e0] llog_process_thread at ffffffffc0c026df [obdclass]
      #16 [ffff88517f5779f0] llog_process_or_fork at ffffffffc0c0384c [obdclass]
      #17 [ffff88517f577a38] llog_cat_process_or_fork at ffffffffc0c0554e [obdclass]
      #18 [ffff88517f577ab0] llog_cat_process at ffffffffc0c0575e [obdclass]
      #19 [ffff88517f577ad0] llog_changelog_cancel at ffffffffc17f1a34 [mdd]
      #20 [ffff88517f577b10] mdd_changelog_llog_cancel at ffffffffc17f3b60 [mdd]
      #21 [ffff88517f577b60] mdd_changelog_clear at ffffffffc17f6c13 [mdd]
      #22 [ffff88517f577bb8] mdd_iocontrol at ffffffffc17f9cf3 [mdd]
      #23 [ffff88517f577bf8] mdt_iocontrol at ffffffffc172e8cc [mdt]
      #24 [ffff88517f577c90] mdt_set_info at ffffffffc172f264 [mdt]
      #25 [ffff88517f577cd0] tgt_request_handle at ffffffffc0f8f36a [ptlrpc]
      #26 [ffff88517f577d58] ptlrpc_server_handle_request at ffffffffc0f3624b [ptlrpc]
      #27 [ffff88517f577df8] ptlrpc_main at ffffffffc0f39bac [ptlrpc]
      #28 [ffff88517f577ec8] kthread at ffffffff8c6c2e81
      #29 [ffff88517f577f50] ret_from_fork_nospec_begin at ffffffff8cd77c24
      crash> bt 109638
      PID: 109638  TASK: ffff8865c0ae5140  CPU: 36  COMMAND: "mdt00_030"
       #0 [ffff886229cd73d8] machine_kexec at ffffffff8c663b34
       #1 [ffff886229cd7438] __crash_kexec at ffffffff8c71e242
       #2 [ffff886229cd7508] crash_kexec at ffffffff8c71e330
       #3 [ffff886229cd7520] oops_end at ffffffff8cd6f778
       #4 [ffff886229cd7548] die at ffffffff8c62f95b
       #5 [ffff886229cd7578] do_general_protection at ffffffff8cd6f072
       #6 [ffff886229cd75b0] general_protection at ffffffff8cd6e708
          [exception RIP: llog_osd_next_block+2404]
          RIP: ffffffffc0c12784  RSP: ffff886229cd7668  RFLAGS: 00010246
          RAX: 5a5a5a5a5a5a5a5a  RBX: 0000000000002000  RCX: 0000000000063c58
          RDX: 0000000000000000  RSI: ffffffffc0c87f20  RDI: ffffffffc0cbd320
          RBP: ffff886229cd7730   R8: 0000000000000001   R9: 0000000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff886229cd77e0
          R13: ffff888080b9b000  R14: ffff8867bb808c00  R15: 000000000000fc8d
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff886229cd7660] llog_osd_next_block at ffffffffc0c1218a [obdclass]
       #8 [ffff886229cd7738] llog_process_thread at ffffffffc0c021e0 [obdclass]
       #9 [ffff886229cd7848] llog_process_or_fork at ffffffffc0c0384c [obdclass]
      #10 [ffff886229cd7890] llog_cat_process_cb at ffffffffc0c08b49 [obdclass]
      #11 [ffff886229cd78e0] llog_process_thread at ffffffffc0c026df [obdclass]
      #12 [ffff886229cd79f0] llog_process_or_fork at ffffffffc0c0384c [obdclass]
      #13 [ffff886229cd7a38] llog_cat_process_or_fork at ffffffffc0c0554e [obdclass]
      #14 [ffff886229cd7ab0] llog_cat_process at ffffffffc0c0575e [obdclass]
      #15 [ffff886229cd7ad0] llog_changelog_cancel at ffffffffc17f1a34 [mdd]
      #16 [ffff886229cd7b10] mdd_changelog_llog_cancel at ffffffffc17f3b60 [mdd]
      #17 [ffff886229cd7b60] mdd_changelog_clear at ffffffffc17f6c13 [mdd]
      #18 [ffff886229cd7bb8] mdd_iocontrol at ffffffffc17f9cf3 [mdd]
      #19 [ffff886229cd7bf8] mdt_iocontrol at ffffffffc172e8cc [mdt]
      #20 [ffff886229cd7c90] mdt_set_info at ffffffffc172f264 [mdt]
      #21 [ffff886229cd7cd0] tgt_request_handle at ffffffffc0f8f36a [ptlrpc]
      #22 [ffff886229cd7d58] ptlrpc_server_handle_request at ffffffffc0f3624b [ptlrpc]
      #23 [ffff886229cd7df8] ptlrpc_main at ffffffffc0f39bac [ptlrpc]
      #24 [ffff886229cd7ec8] kthread at ffffffff8c6c2e81
      #25 [ffff886229cd7f50] ret_from_fork_nospec_begin at ffffffff8cd77c24
      

      Attachments

        Issue Links

          Activity

            People

              ys Yang Sheng
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: