Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.16.0
None
2.16.0-rc4
3
Description
Two clients (wr-es-31 and wr-es-29) crashed during soak testing after soak had been running for about 4 days.
Server and clients are running
version=2.16.0_RC4
vmcore can be found on
wr-es-29: 127.0.0.1-2024-10-27-21:29:21
wr-es-31: 127.0.0.1-2024-10-28-13:32:16
[351552.564424] Lustre: 24975:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:172.25.80.53@tcp:/sfa18k03/fid: [0x28003b737:0x2f97:0x0]// may get corrupted (rc -108)
[351552.564435] Lustre: 24975:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:172.25.80.53@tcp:/sfa18k03/fid: [0x28003b737:0x10a0:0x0]// may get corrupted (rc -108)
[351612.548777] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[351612.674645] Lustre: sfa18k03-MDT0001-mdc-ffff94d2b0acc000: Connection restored to 172.25.80.52@tcp (at 172.25.80.52@tcp)
[351612.677153] rcu: 30-...!: (1 GPs behind) idle=a5e/1/0x4000000000000002 softirq=25673874/25673874 fqs=24
[351630.520318] NMI watchdog: Watchdog detected hard LOCKUP on cpu 61
Modules linked in: mgc(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sunrpc bridge stp llc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd amd_energy kvm_amd rdma_ucm(OE) kvm rdma_cm(OE) iw_cm(OE) irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ib_ipoib(OE) rapl pcspkr sp5100_tco ib_cm(OE) ccp ptdma k10temp i2c_piix4 acpi_ipmi ipmi_si ib_umad(OE) ipmi_devintf ipmi_msghandler acpi_cpufreq ext4 mbcache jbd2 mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg crc32c_intel mlx5_core(OE) mlxfw(OE) pci_hyperv_intf tls ahci libahci psample mlxdevm(OE) igb libata i2c_algo_bit mlx_compat(OE) dca
[351630.520379] CPU: 61 PID: 14 Comm: rcu_sched Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_9.x86_64 #1
[351630.520380] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
[351630.520381] RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520383] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 e9 a9 68 ba 00 8b 37 81 fe 00
[351630.520385] RSP: 0018:ffffb57ac01efe60 EFLAGS: 00000002
[351630.520386] RAX: 0000000000000101 RBX: 0000000000000246 RCX: dead000000000200
[351630.520387] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffacbe5100
[351630.520389] RBP: ffffffffacc2e120 R08: ffff95106df73bb8 R09: 0000000000000384
[351630.520390] R10: 0000000000000001 R11: ffff95106df71dc4 R12: 0000000000000000
[351630.520391] R13: 0000000000000000 R14: ffffffffaaf800f0 R15: ffffb57ac01efec8
[351630.520392] FS: 0000000000000000(0000) GS:ffff95106df40000(0000) knlGS:0000000000000000
[351630.520393] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[351630.520394] CR2: 0000148ce0023008 CR3: 0000002167410002 CR4: 0000000000770ee0
[351630.520395] PKRU: 55555554
[351630.520396] Call Trace:
[351630.520396] <NMI>
[351630.520397] ? watchdog_overflow_callback.cold.7+0x1e/0x70
[351630.520398] ? __perf_event_overflow+0x52/0x100
[351630.520399] ? x86_pmu_handle_irq+0x12f/0x190
[351630.520400] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520401] ? __set_pte_vaddr+0x32/0x50
[351630.520402] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520403] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520404] ? ghes_copy_tofrom_phys+0xf9/0x250
[351630.520405] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520406] ? amd_pmu_handle_irq+0x46/0xc0
[351630.520407] ? perf_event_nmi_handler+0x2d/0x50
[351630.520408] ? nmi_handle+0x63/0x110
[351630.520409] ? default_do_nmi+0x49/0x110
[351630.520410] ? do_nmi+0x1af/0x220
[351630.520411] ? end_repeat_nmi+0x16/0x69
[351630.520412] ? rcu_exp_handler+0x70/0x70
[351630.520413] ? native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520414] ? native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520415] ? native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520416] </NMI>
[351630.520417] _raw_spin_lock_irqsave+0x34/0x40
[351630.520418] force_qs_rnp+0x87/0x1d0
[351630.520419] rcu_gp_kthread+0x66e/0x8a0
[351630.520420] ? rcu_gp_cleanup+0x3b0/0x3b0
[351630.520421] kthread+0x134/0x150
[351630.520422] ? set_kthread_struct+0x50/0x50
[351630.520423] ret_from_fork+0x35/0x40
[351630.520424] Kernel panic - not syncing: Hard LOCKUP
[351630.520425] CPU: 61 PID: 14 Comm: rcu_sched Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_9.x86_64 #1
[351630.520426] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
[351630.520428] Call Trace:
[351630.520428] <NMI>
[351630.520429] dump_stack+0x41/0x60
[351630.520430] panic+0xe7/0x2ac
[351630.520431] ? __switch_to_asm+0x11/0x80
[351630.520432] nmi_panic.cold.11+0xc/0xc
[351630.520433] watchdog_overflow_callback.cold.7+0x5c/0x70
[351630.520434] __perf_event_overflow+0x52/0x100
[351630.520435] x86_pmu_handle_irq+0x12f/0x190
[351630.520436] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520437] ? __set_pte_vaddr+0x32/0x50
[351630.520438] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520439] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520440] ? ghes_copy_tofrom_phys+0xf9/0x250
[351630.520441] ? srso_alias_return_thunk+0x5/0xfcdfd
[351630.520442] amd_pmu_handle_irq+0x46/0xc0
[351630.520443] perf_event_nmi_handler+0x2d/0x50
[351630.520444] nmi_handle+0x63/0x110
[351630.520445] default_do_nmi+0x49/0x110
[351630.520446] do_nmi+0x1af/0x220
[351630.520447] end_repeat_nmi+0x16/0x69
[351630.520448] RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520449] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 e9 a9 68 ba 00 8b 37 81 fe 00
[351630.520451] RSP: 0018:ffffb57ac01efe60 EFLAGS: 00000002
[351630.520453] RAX: 0000000000000101 RBX: 0000000000000246 RCX: dead000000000200
[351630.520454] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffacbe5100
[351630.520455] RBP: ffffffffacc2e120 R08: ffff95106df73bb8 R09: 0000000000000384
[351630.520456] R10: 0000000000000001 R11: ffff95106df71dc4 R12: 0000000000000000
[351630.520457] R13: 0000000000000000 R14: ffffffffaaf800f0 R15: ffffb57ac01efec8
[351630.520458] ? rcu_exp_handler+0x70/0x70
[351630.520459] ? native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520460] ? native_queued_spin_lock_slowpath+0x61/0x1c0
[351630.520461] </NMI>
[351630.520462] _raw_spin_lock_irqsave+0x34/0x40
[351630.520463] force_qs_rnp+0x87/0x1d0
[351630.520464] rcu_gp_kthread+0x66e/0x8a0
[351630.520465] ? rcu_gp_cleanup+0x3b0/0x3b0
[351630.520466] kthread+0x134/0x150
[351630.520467] ? set_kthread_struct+0x50/0x50
[351630.520468] ret_from_fork+0x35/0x40