Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
Lustre 2.16.0
-
None
-
3
-
9223372036854775807
Description
1 client hit LBUG in 2.16.0-rc4 soak testing when soak has been running for about 4 days
[351555.810525] Lustre: 24940:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:1 72.25.80.53@tcp:/sfa18k03/fid: [0x28003b736:0x8cbb:0x0]// may get corrupted (rc -108) [351555.810535] Lustre: 24940:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:1 72.25.80.53@tcp:/sfa18k03/fid: [0x28003b736:0x6fd1:0x0]// may get corrupted (rc -108) [351555.946431] LustreError: 207604:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x28003b719:0x11158:0x0] error -108. [351556.013419] Lustre: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.52@tcp (at 172.25.80.52@tcp) [351556.031730] LustreError: 207604:0:(vvp_io.c:1921:vvp_io_init()) Skipped 33 previous similar messages [351615.967171] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [351649.110938] rcu: 39-...0: (1 GPs behind) idle=daa/1/0x4000000000000002 softirq=12024628/12024629 fqs=6631 [351649.120913] (detected by 39, t=93156 jiffies, g=52770953, q=137201) [351649.127464] NMI backtrace for cpu 39 [351649.131225] CPU: 39 PID: 0 Comm: swapper/39 Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_9.x86_64 #1 [351649.143279] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023 [351649.150986] Call Trace: [351649.153621] <IRQ> [351649.155824] dump_stack+0x41/0x60 [351649.159331] nmi_cpu_backtrace.cold.8+0x13/0x4f [351649.164055] ? lapic_can_unplug_cpu.cold.30+0x43/0x43 [351649.169300] nmi_trigger_cpumask_backtrace+0xe9/0xee [351649.174454] rcu_dump_cpu_stacks+0xc8/0xfc [351649.178743] rcu_sched_clock_irq.cold.101+0x33/0x215 [351649.183903] ? tick_sched_do_timer+0x50/0x50 [351649.188365] update_process_times+0x55/0x80 [351649.192740] tick_sched_handle+0x22/0x60 [351649.196848] tick_sched_timer+0x37/0x80 [351649.200871] __hrtimer_run_queues+0x101/0x280 [351649.205420] hrtimer_interrupt+0x100/0x220 [351649.209709] ? srso_alias_return_thunk+0x5/0xfcdfd [351649.214694] smp_apic_timer_interrupt+0x6a/0x130 [351649.219503] apic_timer_interrupt+0xf/0x20 [351649.223795] </IRQ> [351649.226082] RIP: 0010:cpuidle_enter_state+0xda/0x470 [351649.231238] Code: e8 bb 1c 98 ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 4b 03 00 00 31 ff e8 3d 38 9f ff fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 24 01 00 00 49 63 d6 48 8b 4c 24 10 48 2b 0c 24 48 [351649.250278] RSP: 0018:ffffa1b50c8d3e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [351649.258071] RAX: ffff8caead9f3180 RBX: ffffffff854d1798 RCX: 000000000000001f [351649.265426] RDX: 00013fcaf8d20a6c RSI: 000000002db6db6d RDI: 0000000000000000 [351649.272785] RBP: ffff8c70b1d96c00 R08: 0000000000000002 R09: 0000000000032940 [351649.280143] R10: 00037fe4cad7b104 R11: ffff8caead9f1dc4 R12: 0000000000000002 [351649.287501] R13: ffffffff854d16a0 R14: 0000000000000002 R15: 0000000000000002 [351649.294862] ? cpuidle_enter_state+0xb5/0x470 [351649.299412] cpuidle_enter+0x2c/0x40 [351649.303179] do_idle+0x26f/0x2d0 [351649.306596] cpu_startup_entry+0x6f/0x80 [351649.310712] start_secondary+0x187/0x1d0 [351649.314829] secondary_startup_64_no_verify+0xd2/0xdb [351877.620015] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: operation ldlm_enqueue to node 172.25.80.53@tcp failed: rc = -107 [351877.620321] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete [351877.631905] LustreError: Skipped 1 previous similar message [351877.650117] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: This client was evicted by sfa18k03-MDT0003; in progress operations using this service will fail. [351877.670534] LustreError: 217484:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x2c003de64:0x158ce:0x0] error -108. [351877.676425] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp) [354508.337347] LustreError: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: operation ost_read to node 172.25.80.52@tcp failed: rc = -116 [365488.929944] LustreError: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: operation ldlm_enqueue to node 172.25.80.52@tcp failed: rc = -107 [365488.941844] Lustre: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: Connection to sfa18k03-MDT0001 (at 172.25.80.52@tcp) was lost; in progress operations using this service will wait for recovery to complete [365488.960111] LustreError: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: This client was evicted by sfa18k03-MDT0001; in progress operations using this service will fail. [365488.975053] LustreError: 237455:0:(llite_lib.c:2018:ll_md_setattr()) md_setattr fails: rc = -108 [365488.975096] LustreError: 237462:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x28003b753:0x1e7b9:0x0] error -108. [365488.976372] LustreError: 235100:0:(file.c:262:ll_close_inode_openhandle()) sfa18k03-clilmv-ffff8c70b6dd8800: inode [0x28003b753:0x1e7f3:0x0] mdc close failed: rc = -108 [365488.978006] LustreError: 237557:0:(ldlm_resource.c:1149:ldlm_resource_complain()) sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: namespace resource [0x28003b753:0x1e7b9:0x0].0x0 (00000000adbbab83) refcount nonzero (1) after lock cleanup; forcing cleanup. [365488.982357] Lustre: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.52@tcp (at 172.25.80.52@tcp) [365489.044666] LustreError: 237462:0:(vvp_io.c:1921:vvp_io_init()) Skipped 2 previous similar messages [366333.556010] LustreError: sfa18k03-MDT0000-mdc-ffff8c70b6dd8800: operation ost_read to node 172.25.80.50@tcp failed: rc = -116 [366520.381859] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: operation ldlm_enqueue to node 172.25.80.53@tcp failed: rc = -107 [366520.381878] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete [366520.393781] LustreError: Skipped 6 previous similar messages [366520.412207] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: This client was evicted by sfa18k03-MDT0003; in progress operations using this service will fail. [366520.432818] LustreError: 246109:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x2c003deab:0x986f:0x0] error -108. [366520.432966] LustreError: 244654:0:(llite_lib.c:2018:ll_md_setattr()) md_setattr fails: rc = -108 [366520.434530] LustreError: 246176:0:(file.c:262:ll_close_inode_openhandle()) sfa18k03-clilmv-ffff8c70b6dd8800: inode [0x2c003deab:0x9703:0x0] mdc close failed: rc = -108 [366520.436661] LustreError: 246176:0:(ldlm_resource.c:1149:ldlm_resource_complain()) sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: namespace resource [0x2c003deab:0x9735:0x0].0x0 (0000000076151be1) refcount nonzero (1) after lock cleanup; forcing cleanup. [366520.436668] LustreError: 246176:0:(ldlm_resource.c:1149:ldlm_resource_complain()) Skipped 1 previous similar message [366520.444818] LustreError: 246109:0:(vvp_io.c:1921:vvp_io_init()) Skipped 49 previous similar messages [366520.445680] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp) [366520.453850] LustreError: 244654:0:(llite_lib.c:2018:ll_md_setattr()) Skipped 1 previous similar message [366522.165525] LustreError: 24960:0:(osc_cache.c:1098:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: last_oap_count -61 [366522.177881] LustreError: 24960:0:(osc_cache.c:1098:osc_extent_make_ready()) LBUG [366522.185536] CPU: 11 PID: 24960 Comm: ptlrpcd_05_00 Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_9.x86_64 #1 [366522.198214] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023 [366522.205928] Call Trace: [366522.208576] dump_stack+0x41/0x60 [366522.212091] lbug_with_loc.cold.8+0x5/0x58 [libcfs] [366522.217189] osc_extent_make_ready+0xb8c/0xd00 [osc] [366522.222383] osc_check_rpcs+0x1487/0x1980 [osc] [366522.227139] ? srso_alias_return_thunk+0x5/0xfcdfd [366522.232128] ? set_next_entity+0xb5/0x1e0 [366522.236342] ? srso_alias_return_thunk+0x5/0xfcdfd [366522.241331] ? __switch_to+0x10c/0x430 [366522.245277] ? srso_alias_return_thunk+0x5/0xfcdfd [366522.250266] ? __schedule+0x2d9/0x870 [366522.254125] osc_io_unplug0+0xc0/0x110 [osc] [366522.258614] brw_queue_work+0x2e/0xd0 [osc] [366522.263015] work_interpreter+0x32/0x110 [ptlrpc] [366522.267981] ptlrpc_check_set+0x494/0x1ed0 [ptlrpc] [366522.273126] ptlrpcd+0x85b/0xa70 [ptlrpc] [366522.277405] ? do_wait_intr_irq+0xa0/0xa0 [366522.281618] ? ptlrpcd_add_req+0x2f0/0x2f0 [ptlrpc] [366522.286763] kthread+0x134/0x150 [366522.290184] ? set_kthread_struct+0x50/0x50 [366522.294571] ret_from_fork+0x35/0x40 [366528.300246] Kernel panic - not syncing: LBUG [366528.304722] CPU: 42 PID: 24960 Comm: ptlrpcd_05_00 Kdump: loaded Tainted: G OE --------- - - 4.18.0-513.24.1.el8_9.x86_64 #1 [366528.326846] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023 [366528.334559] Call Trace: [366528.337201] dump_stack+0x41/0x60 [366528.340724] panic+0xe7/0x2ac [366528.343890] lbug_with_loc.cold.8+0x2f/0x58 [libcfs] [366528.349069] osc_extent_make_ready+0xb8c/0xd00 [osc] [366528.354266] osc_check_rpcs+0x1487/0x1980 [osc] [366528.359020] ? srso_alias_return_thunk+0x5/0xfcdfd [366528.364012] ? set_next_entity+0xb5/0x1e0 [366528.368224] ? srso_alias_return_thunk+0x5/0xfcdfd [366528.373214] ? __switch_to+0x10c/0x430 [366528.377167] ? srso_alias_return_thunk+0x5/0xfcdfd [366528.382158] ? __schedule+0x2d9/0x870 [366528.386016] osc_io_unplug0+0xc0/0x110 [osc] [366528.390506] brw_queue_work+0x2e/0xd0 [osc] [366528.394908] work_interpreter+0x32/0x110 [ptlrpc] [366528.399873] ptlrpc_check_set+0x494/0x1ed0 [ptlrpc] [366528.405017] ptlrpcd+0x85b/0xa70 [ptlrpc] [366528.409294] ? do_wait_intr_irq+0xa0/0xa0 [366528.413508] ? ptlrpcd_add_req+0x2f0/0x2f0 [ptlrpc] [366528.418646] kthread+0x134/0x150 [366528.422078] ? set_kthread_struct+0x50/0x50 [366528.426461] ret_from_fork+0x35/0x40
server and client version: 2.16.0-RC4
coredump can be found on
[root@wr-es-33 127.0.0.1-2024-10-28-01:37:41]# pwd /var/crash/127.0.0.1-2024-10-28-01:37:41