Lustre / LU-13054

MDS kernel BUG at ldiskfs/htree_lock.c:429!

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.8
    • Affects Version/s: Lustre 2.12.3
    • Severity: 3

    Description

      We hit this crash for the first time last night on one of Fir's MDS nodes (fir-md1-s3, serving fir-MDT0002):

      [2786965.963124] ------------[ cut here ]------------
      [2786965.967920] kernel BUG at /tmp/rpmbuild-lustre-sthiell-Xc32PcQQ/BUILD/lustre-2.12.3_2_gb033996/ldiskfs/htree_lock.c:429!
      [2786965.978953] invalid opcode: 0000 [#1] SMP 
      [2786965.983276] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ses enclosure ghash_clmulni_intel dcdbas aesni_intel lrw gf128mul glue_helper ablk_helper ipmi_si cryptd sg ipmi_devintf pcspkr ccp ipmi_msghandler i2c_piix4 k10temp dm_multipath acpi_power_meter dm_mod ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE)
      [2786966.055730]  ib_uverbs(OE) ib_core(OE) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt mlx5_core(OE) fb_sys_fops ttm mlxfw(OE) devlink ahci libahci mpt3sas(OE) drm tg3 crct10dif_pclmul mlx_compat(OE) crct10dif_common raid_class crc32c_intel libata ptp megaraid_sas scsi_transport_sas drm_panel_orientation_quirks pps_core [last unloaded: libcfs]
      [2786966.086761] CPU: 1 PID: 68784 Comm: mdt01_110 Kdump: loaded Tainted: G           OEL ------------   3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1
      [2786966.099526] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
      [2786966.107352] task: ffff9a6086df9040 ti: ffff9a7916f4c000 task.ti: ffff9a7916f4c000
      [2786966.115003] RIP: 0010:[<ffffffffc15b7b24>]  [<ffffffffc15b7b24>] htree_node_unlock+0x4b4/0x4c0 [ldiskfs]
      [2786966.124694] RSP: 0018:ffff9a7916f4f8b0  EFLAGS: 00010246
      [2786966.130180] RAX: ffff9a57f63e7000 RBX: 0000000000000001 RCX: ffff9a6611112490
      [2786966.137487] RDX: 00000000000000c8 RSI: 0000000000000001 RDI: 0000000000000000
      [2786966.144792] RBP: ffff9a7916f4f928 R08: ffff9a7720ec6b60 R09: ffff9a610b87c100
      [2786966.152098] R10: 0000000000000000 R11: ffff9a709075811f R12: ffff9a66111124d8
      [2786966.159403] R13: 0000000000000000 R14: ffff9a6fcf88d040 R15: ffff9a70907580fc
      [2786966.166711] FS:  00007f32e0150700(0000) GS:ffff9a71bf600000(0000) knlGS:0000000000000000
      [2786966.174970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [2786966.180890] CR2: 00007f32e0224000 CR3: 0000002035ab2000 CR4: 00000000003407e0
      [2786966.188196] Call Trace:
      [2786966.190835]  [<ffffffffc15b7d0a>] htree_node_release_all+0x5a/0x80 [ldiskfs]
      [2786966.198061]  [<ffffffffc15b7d52>] htree_unlock+0x22/0x70 [ldiskfs]
      [2786966.204423]  [<ffffffffc168ba9e>] osd_index_ea_delete+0x30e/0xb10 [osd_ldiskfs]
      [2786966.211917]  [<ffffffffc18f59e8>] lod_sub_delete+0x1c8/0x460 [lod]
      [2786966.218281]  [<ffffffffc159c1b9>] ? __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
      [2786966.226026]  [<ffffffffc18d0aa4>] lod_delete+0x24/0x30 [lod]
      [2786966.231872]  [<ffffffffc19457b4>] __mdd_index_delete_only+0x194/0x250 [mdd]
      [2786966.239007]  [<ffffffffc1948d46>] __mdd_index_delete+0x46/0x290 [mdd]
      [2786966.245631]  [<ffffffffc1955cf8>] mdd_unlink+0x5f8/0xaa0 [mdd]
      [2786966.251658]  [<ffffffffc1818f03>] mdo_unlink+0x46/0x48 [mdt]
      [2786966.257502]  [<ffffffffc17dcfed>] mdt_reint_unlink+0xbed/0x14b0 [mdt]
      [2786966.264131]  [<ffffffffc17e1693>] mdt_reint_rec+0x83/0x210 [mdt]
      [2786966.270317]  [<ffffffffc17be1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [2786966.277027]  [<ffffffffc17c63d4>] ? mdt_thread_info_init+0xa4/0x1e0 [mdt]
      [2786966.283994]  [<ffffffffc17c9567>] mdt_reint+0x67/0x140 [mdt]
      [2786966.289890]  [<ffffffffc121936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [2786966.296973]  [<ffffffffc11f4da1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [2786966.304723]  [<ffffffffc0de1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [2786966.311982]  [<ffffffffc11c024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [2786966.319841]  [<ffffffffc11bb805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [2786966.326802]  [<ffffffffb3ecfeb4>] ? __wake_up+0x44/0x50
      [2786966.332241]  [<ffffffffc11c3bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [2786966.338715]  [<ffffffffc11c3080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [2786966.346283]  [<ffffffffb3ec2e81>] kthread+0xd1/0xe0
      [2786966.351335]  [<ffffffffb3ec2db0>] ? insert_kthread_work+0x40/0x40
      [2786966.357604]  [<ffffffffb4577c24>] ret_from_fork_nospec_begin+0xe/0x21
      [2786966.364214]  [<ffffffffb3ec2db0>] ? insert_kthread_work+0x40/0x40
      [2786966.370479] Code: 0f 0b 48 8b 45 90 8b 55 8c f3 90 0f a3 10 19 c9 85 c9 75 f5 f0 0f ab 10 19 c9 85 c9 0f 84 a4 fb ff ff eb e5 0f 1f 00 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 89 f0 48 
      [2786966.391175] RIP  [<ffffffffc15b7b24>] htree_node_unlock+0x4b4/0x4c0 [ldiskfs]
      [2786966.398516]  RSP <ffff9a7916f4f8b0>
      
            KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 48
              DATE: Fri Dec  6 00:01:09 2019
            UPTIME: 32 days, 06:08:13
      LOAD AVERAGE: 28.61, 38.89, 22.90
             TASKS: 1817
          NODENAME: fir-md1-s3
           RELEASE: 3.10.0-957.27.2.el7_lustre.pl1.x86_64
           VERSION: #1 SMP Mon Aug 5 15:28:37 PDT 2019
           MACHINE: x86_64  (1996 Mhz)
            MEMORY: 255.6 GB
             PANIC: "kernel BUG at /tmp/rpmbuild-lustre-sthiell-Xc32PcQQ/BUILD/lustre-2.12.3_2_gb033996/ldiskfs/htree_lock.c:429!"
               PID: 68784
           COMMAND: "mdt01_110"
              TASK: ffff9a6086df9040  [THREAD_INFO: ffff9a7916f4c000]
               CPU: 1
             STATE: TASK_RUNNING (PANIC)
      
      crash> kmem -i
                       PAGES        TOTAL      PERCENTAGE
          TOTAL MEM  65891108     251.4 GB         ----
               FREE  30206180     115.2 GB   45% of TOTAL MEM
               USED  35684928     136.1 GB   54% of TOTAL MEM
             SHARED  28095095     107.2 GB   42% of TOTAL MEM
            BUFFERS  30333796     115.7 GB   46% of TOTAL MEM
             CACHED   247597     967.2 MB    0% of TOTAL MEM
               SLAB  4284394      16.3 GB    6% of TOTAL MEM
      
         TOTAL HUGE        0            0         ----
          HUGE FREE        0            0    0% of TOTAL HUGE
      
         TOTAL SWAP  1048575         4 GB         ----
          SWAP USED        0            0    0% of TOTAL SWAP
          SWAP FREE  1048575         4 GB  100% of TOTAL SWAP
      
       COMMIT LIMIT  33994129     129.7 GB         ----
          COMMITTED   178287     696.4 MB    0% of TOTAL LIMIT
      

      Attaching:

      I also uploaded the vmcore to the WC FTP as vmcore_fir-md1-s3_2019_12_06.

      Hope that helps find the root cause!
      Stephane
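
For anyone triaging the oops above: frames marked with `?` in a kernel call trace are addresses found on the stack that the unwinder could not confirm as part of the call chain, so it can help to filter them out before reading the path into `htree_node_unlock`. The following is a minimal sketch of that filtering, not a tool used in this report; the names `FRAME_RE` and `reliable_frames` are illustrative, and the sample input is just a few lines copied from the trace.

```python
import re

# A few frames copied verbatim from the call trace above (timestamps kept).
OOPS_EXCERPT = """\
[2786966.190835]  [<ffffffffc15b7d0a>] htree_node_release_all+0x5a/0x80 [ldiskfs]
[2786966.198061]  [<ffffffffc15b7d52>] htree_unlock+0x22/0x70 [ldiskfs]
[2786966.204423]  [<ffffffffc168ba9e>] osd_index_ea_delete+0x30e/0xb10 [osd_ldiskfs]
[2786966.211917]  [<ffffffffc18f59e8>] lod_sub_delete+0x1c8/0x460 [lod]
[2786966.218281]  [<ffffffffc159c1b9>] ? __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[2786966.226026]  [<ffffffffc18d0aa4>] lod_delete+0x24/0x30 [lod]
"""

# Matches "[<addr>] [? ]func+0xoff/0xsize [module]"; the module part is
# optional because core-kernel symbols carry no "[module]" suffix.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(text):
    """Return (function, module) pairs for frames not marked with '?'."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group("guess"):
            # Label core-kernel symbols "vmlinux" (an illustrative choice).
            frames.append((m.group("func"), m.group("module") or "vmlinux"))
    return frames


if __name__ == "__main__":
    for func, module in reliable_frames(OOPS_EXCERPT):
        print(f"{module:12s} {func}")
```

Applied to the full trace, this leaves the chain from `ptlrpc_main` down through `mdt_reint_unlink`, `mdd_unlink` and `osd_index_ea_delete` to `htree_unlock`, which is the unlink path that hit the BUG.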

            People

              Assignee: Yang Sheng (ys)
              Reporter: Stephane Thiell (sthiell)