Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.12.3
-
3
Description
We hit this crash for the first time last night on one of Fir's MDS (fir-md1-s3, serving fir-MDT0002):
[2786965.963124] ------------[ cut here ]------------
[2786965.967920] kernel BUG at /tmp/rpmbuild-lustre-sthiell-Xc32PcQQ/BUILD/lustre-2.12.3_2_gb033996/ldiskfs/htree_lock.c:429!
[2786965.978953] invalid opcode: 0000 [#1] SMP
[2786965.983276] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ses enclosure ghash_clmulni_intel dcdbas aesni_intel lrw gf128mul glue_helper ablk_helper ipmi_si cryptd sg ipmi_devintf pcspkr ccp ipmi_msghandler i2c_piix4 k10temp dm_multipath acpi_power_meter dm_mod ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE)
[2786966.055730] ib_uverbs(OE) ib_core(OE) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt mlx5_core(OE) fb_sys_fops ttm mlxfw(OE) devlink ahci libahci mpt3sas(OE) drm tg3 crct10dif_pclmul mlx_compat(OE) crct10dif_common raid_class crc32c_intel libata ptp megaraid_sas scsi_transport_sas drm_panel_orientation_quirks pps_core [last unloaded: libcfs]
[2786966.086761] CPU: 1 PID: 68784 Comm: mdt01_110 Kdump: loaded Tainted: G OEL ------------ 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1
[2786966.099526] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
[2786966.107352] task: ffff9a6086df9040 ti: ffff9a7916f4c000 task.ti: ffff9a7916f4c000
[2786966.115003] RIP: 0010:[<ffffffffc15b7b24>] [<ffffffffc15b7b24>] htree_node_unlock+0x4b4/0x4c0 [ldiskfs]
[2786966.124694] RSP: 0018:ffff9a7916f4f8b0 EFLAGS: 00010246
[2786966.130180] RAX: ffff9a57f63e7000 RBX: 0000000000000001 RCX: ffff9a6611112490
[2786966.137487] RDX: 00000000000000c8 RSI: 0000000000000001 RDI: 0000000000000000
[2786966.144792] RBP: ffff9a7916f4f928 R08: ffff9a7720ec6b60 R09: ffff9a610b87c100
[2786966.152098] R10: 0000000000000000 R11: ffff9a709075811f R12: ffff9a66111124d8
[2786966.159403] R13: 0000000000000000 R14: ffff9a6fcf88d040 R15: ffff9a70907580fc
[2786966.166711] FS: 00007f32e0150700(0000) GS:ffff9a71bf600000(0000) knlGS:0000000000000000
[2786966.174970] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2786966.180890] CR2: 00007f32e0224000 CR3: 0000002035ab2000 CR4: 00000000003407e0
[2786966.188196] Call Trace:
[2786966.190835] [<ffffffffc15b7d0a>] htree_node_release_all+0x5a/0x80 [ldiskfs]
[2786966.198061] [<ffffffffc15b7d52>] htree_unlock+0x22/0x70 [ldiskfs]
[2786966.204423] [<ffffffffc168ba9e>] osd_index_ea_delete+0x30e/0xb10 [osd_ldiskfs]
[2786966.211917] [<ffffffffc18f59e8>] lod_sub_delete+0x1c8/0x460 [lod]
[2786966.218281] [<ffffffffc159c1b9>] ? __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[2786966.226026] [<ffffffffc18d0aa4>] lod_delete+0x24/0x30 [lod]
[2786966.231872] [<ffffffffc19457b4>] __mdd_index_delete_only+0x194/0x250 [mdd]
[2786966.239007] [<ffffffffc1948d46>] __mdd_index_delete+0x46/0x290 [mdd]
[2786966.245631] [<ffffffffc1955cf8>] mdd_unlink+0x5f8/0xaa0 [mdd]
[2786966.251658] [<ffffffffc1818f03>] mdo_unlink+0x46/0x48 [mdt]
[2786966.257502] [<ffffffffc17dcfed>] mdt_reint_unlink+0xbed/0x14b0 [mdt]
[2786966.264131] [<ffffffffc17e1693>] mdt_reint_rec+0x83/0x210 [mdt]
[2786966.270317] [<ffffffffc17be1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[2786966.277027] [<ffffffffc17c63d4>] ? mdt_thread_info_init+0xa4/0x1e0 [mdt]
[2786966.283994] [<ffffffffc17c9567>] mdt_reint+0x67/0x140 [mdt]
[2786966.289890] [<ffffffffc121936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2786966.296973] [<ffffffffc11f4da1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[2786966.304723] [<ffffffffc0de1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[2786966.311982] [<ffffffffc11c024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2786966.319841] [<ffffffffc11bb805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[2786966.326802] [<ffffffffb3ecfeb4>] ? __wake_up+0x44/0x50
[2786966.332241] [<ffffffffc11c3bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[2786966.338715] [<ffffffffc11c3080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[2786966.346283] [<ffffffffb3ec2e81>] kthread+0xd1/0xe0
[2786966.351335] [<ffffffffb3ec2db0>] ? insert_kthread_work+0x40/0x40
[2786966.357604] [<ffffffffb4577c24>] ret_from_fork_nospec_begin+0xe/0x21
[2786966.364214] [<ffffffffb3ec2db0>] ? insert_kthread_work+0x40/0x40
[2786966.370479] Code: 0f 0b 48 8b 45 90 8b 55 8c f3 90 0f a3 10 19 c9 85 c9 75 f5 f0 0f ab 10 19 c9 85 c9 0f 84 a4 fb ff ff eb e5 0f 1f 00 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 89 f0 48
[2786966.391175] RIP [<ffffffffc15b7b24>] htree_node_unlock+0x4b4/0x4c0 [ldiskfs]
[2786966.398516] RSP <ffff9a7916f4f8b0>
KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl1.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Fri Dec 6 00:01:09 2019
UPTIME: 32 days, 06:08:13
LOAD AVERAGE: 28.61, 38.89, 22.90
TASKS: 1817
NODENAME: fir-md1-s3
RELEASE: 3.10.0-957.27.2.el7_lustre.pl1.x86_64
VERSION: #1 SMP Mon Aug 5 15:28:37 PDT 2019
MACHINE: x86_64 (1996 Mhz)
MEMORY: 255.6 GB
PANIC: "kernel BUG at /tmp/rpmbuild-lustre-sthiell-Xc32PcQQ/BUILD/lustre-2.12.3_2_gb033996/ldiskfs/htree_lock.c:429!"
PID: 68784
COMMAND: "mdt01_110"
TASK: ffff9a6086df9040 [THREAD_INFO: ffff9a7916f4c000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65891108     251.4 GB         ----
         FREE  30206180     115.2 GB   45% of TOTAL MEM
         USED  35684928     136.1 GB   54% of TOTAL MEM
       SHARED  28095095     107.2 GB   42% of TOTAL MEM
      BUFFERS  30333796     115.7 GB   46% of TOTAL MEM
       CACHED    247597     967.2 MB    0% of TOTAL MEM
         SLAB   4284394      16.3 GB    6% of TOTAL MEM

   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE

   TOTAL SWAP   1048575         4 GB         ----
    SWAP USED         0            0    0% of TOTAL SWAP
    SWAP FREE   1048575         4 GB  100% of TOTAL SWAP

 COMMIT LIMIT  33994129     129.7 GB         ----
    COMMITTED    178287     696.4 MB    0% of TOTAL LIMIT
Attaching:
- the output of "dumpe2fs -h /dev/mapper/md1-rbod2-mdt2" as dumpe2fs_fir-MDT0002.txt
- vmcore-dmesg.txt as vmcore-dmesg_fir-md1-s3_2019_12_06.txt
- output of crash foreach bt as foreach_bt_fir-md1-s3_2019_12_06.txt
Also uploaded the vmcore to the WC FTP as vmcore_fir-md1-s3_2019_12_06
Hope that helps in finding the root cause!
Stephane