Details
Description
One of our production filesystems running Lustre-2.12.5 crashed 3 times over the weekend. The first two events happened late at night, so I didn't capture a crash dump. For both of those outages, the server would crash again if I allowed recovery to take place, so I had to abort recovery. For the third crash, I allowed a crash dump to complete. For all 3 cases I have a stack trace (see below). The first crash appeared to just be heavy load (though I'm not an expert on the traces) -- see Trace #1 below. The second time, however, the trace mentioned grant issues (see Trace #2 below). There were also messages in the log like the following:
Nov 21 02:29:04 f2-oss1a6 kernel: LustreError: 20860:0:(tgt_grant.c:758:tgt_grant_check()) f2-OST001e: cli a3dbc252-8c91-c91f-230d-2a2f81a4a245 claims 1703936 GRANT, real grant 0
James mentioned last week that there is a grant patch needed for both the servers and clients in Lustre-2.12.5, but that the problem we are seeing may or may not be related, and that it would be a good idea to create a ticket. We are not sure whether these different crashes are all related.
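For context on the log message above: the server keeps a per-client record of how much write grant it has handed out, and `tgt_grant_check()` complains when a client's claimed grant exceeds that record. The toy model below is only an illustration of that invariant under assumed, simplified semantics; the class and method names are made up, and the real logic lives in `tgt_grant.c`.

```python
class GrantTable:
    """Toy server-side record of space granted to each client.

    This is an illustrative sketch, not Lustre code: the real check is
    tgt_grant_check() in tgt_grant.c and tracks much more state.
    """

    def __init__(self):
        # client UUID -> bytes the server believes it has granted
        self.granted = {}

    def grant(self, client, nbytes):
        """Record nbytes of additional grant handed to a client."""
        self.granted[client] = self.granted.get(client, 0) + nbytes

    def check_claim(self, client, claimed):
        """Return a mismatch message if the client claims more grant than
        the server has on record (the condition logged above), else None."""
        real = self.granted.get(client, 0)
        if claimed > real:
            return f"cli {client} claims {claimed} GRANT, real grant {real}"
        return None


table = GrantTable()
# A client for which the server has no grant on record (e.g. state lost
# across a reconnect) claims 1703936 bytes, as in the log message:
msg = table.check_claim("a3dbc252-8c91-c91f-230d-2a2f81a4a245", 1703936)
```

A claim that matches the server's record would pass the check and return `None`; the logged condition only fires when the client's view of its grant is larger than the server's.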
Can we please get another set of eyes on the issue so we can work towards a resolution? These 3 outages were impactful to our users over the weekend.
Thanks,
Dustin
############# TRACE #1 ###############
[2974912.706991] Kernel panic - not syncing: Hard LOCKUP
[2974912.706993] CPU: 22 PID: 0 Comm: swapper/22 Kdump: loaded Tainted: P OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[2974912.706994] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
[2974912.706994] Call Trace:
[2974912.706997] <NMI> [<ffffffffba17ac23>] dump_stack+0x19/0x1b
[2974912.706999] [<ffffffffba174967>] panic+0xe8/0x21f
[2974912.707006] [<ffffffffb9a9b6af>] nmi_panic+0x3f/0x40
[2974912.707009] [<ffffffffb9b4eb81>] watchdog_overflow_callback+0x121/0x140
[2974912.707011] [<ffffffffb9ba8277>] __perf_event_overflow+0x57/0x100
[2974912.707014] [<ffffffffb9bb1a14>] perf_event_overflow+0x14/0x20
[2974912.707016] [<ffffffffb9a0ac70>] handle_pmi_common+0x1a0/0x250
[2974912.707028] [<ffffffffb9a0af4f>] intel_pmu_handle_irq+0xcf/0x1d0
[2974912.707030] [<ffffffffba184031>] perf_event_nmi_handler+0x31/0x50
[2974912.707032] [<ffffffffba18593c>] nmi_handle.isra.0+0x8c/0x150
[2974912.707034] [<ffffffffba185c18>] do_nmi+0x218/0x460
[2974912.707036] [<ffffffffba184d9c>] end_repeat_nmi+0x1e/0x81
[2974912.707043] <EOE> <IRQ> [<ffffffffba17544a>] queued_spin_lock_slowpath+0xb/0xf
[2974912.707044] [<ffffffffba1833b7>] _raw_spin_lock_irqsave+0x37/0x40
[2974912.707046] [<ffffffffb9b5b769>] rcu_check_callbacks+0x549/0x70
[2974912.707050] [<ffffffffb9aaf536>] update_process_times+0x46/0x80
[2974912.707051] [<ffffffffb9b0fcb0>] tick_sched_handle+0x30/0x70
[2974912.707052] [<ffffffffb9b0ff79>] tick_sched_timer+0x39/0x80
[2974912.707054] [<ffffffffb9aca5ee>] __hrtimer_run_queues+0x10e/0x270
[2974912.707056] [<ffffffffb9acab4f>] hrtimer_interrupt+0xaf/0x1d0
[2974912.707058] [<ffffffffb9a5c60b>] local_apic_timer_interrupt+0x3b/0x60
[2974912.707060] [<ffffffffba1929d3>] smp_apic_timer_interrupt+0x43/0x60
[2974912.707061] [<ffffffffba18eefa>] apic_timer_interrupt+0x16a/0x170
[2974912.707064] [<ffffffffb9fc1b5e>] cpuidle_idle_call+0xde/0x230
[2974912.707066] [<ffffffffb9a37c6e>] arch_cpu_idle+0xe/0xc0
[2974912.707067] [<ffffffffb9b0158a>] cpu_startup_entry+0x14a/0x1e0
[2974912.707069] [<ffffffffb9a5a0b7>] start_secondary+0x1f7/0x270
[2974912.707070] [<ffffffffb9a000d5>] start_cpu+0x5/0x14

####################### TRACE #2 ########################
[ 5378.059559] Kernel panic - not syncing: LBUG
[ 5378.059561] Pid: 44556, comm: ll_ost01_079 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019
[ 5378.059563] Call Trace:
[ 5378.059582] [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5378.059585] [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5378.059659] [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc]
[ 5378.059668] [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd]
[ 5378.059685] [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
[ 5378.059711] [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc]
[ 5378.059735] [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[ 5378.059755] [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5378.059775] [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[ 5378.059779] [<ffffffff8e6c61f1>] kthread+0xd1/0xe0
[ 5378.059784] [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 5378.164169] CPU: 10 PID: 23550 Comm: ll_ost00_022 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[ 5378.177532] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
[ 5378.185781] Call Trace:
[ 5378.188925] [<ffffffff8ed7ac23>] dump_stack+0x19/0x1b
[ 5378.194743] [<ffffffff8ed74967>] panic+0xe8/0x21f
[ 5378.200189] [<ffffffffc0b9c8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 5378.207052] [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc]
[ 5378.214824] [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd]
[ 5378.237175] [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
[ 5378.245008] [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc]
[ 5378.251976] [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[ 5378.275204] [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5378.296541] [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[ 5378.317273] [<ffffffff8e6c61f1>] kthread+0xd1/0xe0
[ 5378.329273] [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21

####################### TRACE #3 ########################
[59064.267110] Kernel panic - not syncing: Hard LOCKUP
[59064.267111] CPU: 15 PID: 21200 Comm: ll_ost01_015 Kdump: loaded Tainted: P OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[59064.267112] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
[59064.267112] Call Trace:
[59064.267115] <NMI> [<ffffffffb857ac23>] dump_stack+0x19/0x1b
[59064.267117] [<ffffffffb8574967>] panic+0xe8/0x21f
[59064.267122] [<ffffffffb7e9b6af>] nmi_panic+0x3f/0x40
[59064.267124] [<ffffffffb7f4eb81>] watchdog_overflow_callback+0x121/0x140
[59064.267127] [<ffffffffb7fa8277>] __perf_event_overflow+0x57/0x100
[59064.267129] [<ffffffffb7fb1a14>] perf_event_overflow+0x14/0x20
[59064.267131] [<ffffffffb7e0ac70>] handle_pmi_common+0x1a0/0x250
[59064.267142] [<ffffffffb7e0af4f>] intel_pmu_handle_irq+0xcf/0x1d0
[59064.267144] [<ffffffffb8584031>] perf_event_nmi_handler+0x31/0x50
[59064.267146] [<ffffffffb858593c>] nmi_handle.isra.0+0x8c/0x150
[59064.267147] [<ffffffffb8585c18>] do_nmi+0x218/0x460
[59064.267149] [<ffffffffb8584d9c>] end_repeat_nmi+0x1e/0x81
[59064.267155] <EOE> <IRQ> [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf
[59064.267156] [<ffffffffb85833b7>] _raw_spin_lock_irqsave+0x37/0x40
[59064.267158] [<ffffffffb7f57f5b>] rcu_dump_cpu_stacks+0x4b/0xd0
[59064.267160] [<ffffffffb7f5b662>] rcu_check_callbacks+0x442/0x730
[59064.267164] [<ffffffffb7eaf536>] update_process_times+0x46/0x80
[59064.267166] [<ffffffffb7f0fcb0>] tick_sched_handle+0x30/0x70
[59064.267167] [<ffffffffb7f0ff79>] tick_sched_timer+0x39/0x80
[59064.267169] [<ffffffffb7eca5ee>] __hrtimer_run_queues+0x10e/0x270
[59064.267171] [<ffffffffb7ecab4f>] hrtimer_interrupt+0xaf/0x1d0
[59064.267173] [<ffffffffb7e5c60b>] local_apic_timer_interrupt+0x3b/0x60
[59064.267175] [<ffffffffb85929d3>] smp_apic_timer_interrupt+0x43/0x60
[59064.267176] [<ffffffffb858eefa>] apic_timer_interrupt+0x16a/0x170
[59064.267180] [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf
[59064.267181] [<ffffffffb8583330>] _raw_spin_lock+0x20/0x30
[59064.267194] [<ffffffffc173902c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
[59064.267208] [<ffffffffc1740c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
[59064.267245] [<ffffffffc1769516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[59064.267288] [<ffffffffc17f1bd2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[59064.267311] [<ffffffffc17f89ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[59064.267356] [<ffffffffc179d48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[59064.267396] [<ffffffffc17a0df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[59064.267417] [<ffffffffb7ec61f1>] kthread+0xd1/0xe0
Olaf,
Looks like one of them displayed what you suggested just prior to the LBUG:
Another one seems to have run out of grant space and then gone into a soft CPU lockup that progressed into a hard lockup, causing the crash.
The third one maybe had some network issues; the only grant messages appear hours prior to the crash.
These are running 2.12.5 plus patches, which should be LU-12136-38428 and
LU-11276. James Simmons can provide better details on the exact patches in that build, but I think that's accurate.