Details
Description
One of our production filesystems that is running Lustre-2.12.5 crashed 3 times over the weekend. The first two events happened late at night, so I didn't capture a crash dump. For both of these outages, the server would crash again if I allowed recovery to take place, so I had to abort recovery. For the third crash, I allowed a crash dump to complete. For all 3 of the cases I have a stack trace (see below). The first problem appeared to just be heavy load (though I’m not an expert on the traces) – see Trace #1 below. However, the second time, the trace had mention of grant issues (see trace #2 below). There were also messages in the log like the following:
Nov 21 02:29:04 f2-oss1a6 kernel: LustreError: 20860:0:(tgt_grant.c:758:tgt_grant_check()) f2-OST001e: cli a3dbc252-8c91-c91f-230d-2a2f81a4a245 claims 1703936 GRANT, real grant 0
James had mentioned last week that there was a grant patch that was needed for both the servers and clients in Lustre-2.12.5, but James mentioned that this problem we were seeing may or may not be related, and it would be a good idea to create a ticket. We are not sure if these different crashes are all related or not...
Can we please get another set of eyes on the issue so we can work towards a resolution? These 3 outages were impactful to our users over the weekend.
Thanks,
Dustin
############# TRACE #1 ############### [2974912.706991] Kernel panic - not syncing: Hard LOCKUP [2974912.706993] CPU: 22 PID: 0 Comm: swapper/22 Kdump: loaded Tainted: P OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1 [2974912.706994] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018 [2974912.706994] Call Trace: [2974912.706997] <NMI> [<ffffffffba17ac23>] dump_stack+0x19/0x1b [2974912.706999] [<ffffffffba174967>] panic+0xe8/0x21f [2974912.707006] [<ffffffffb9a9b6af>] nmi_panic+0x3f/0x40 [2974912.707009] [<ffffffffb9b4eb81>] watchdog_overflow_callback+0x121/0x140 [2974912.707011] [<ffffffffb9ba8277>] __perf_event_overflow+0x57/0x100 [2974912.707014] [<ffffffffb9bb1a14>] perf_event_overflow+0x14/0x20 [2974912.707016] [<ffffffffb9a0ac70>] handle_pmi_common+0x1a0/0x250 [2974912.707028] [<ffffffffb9a0af4f>] intel_pmu_handle_irq+0xcf/0x1d0 [2974912.707030] [<ffffffffba184031>] perf_event_nmi_handler+0x31/0x50 [2974912.707032] [<ffffffffba18593c>] nmi_handle.isra.0+0x8c/0x150 [2974912.707034] [<ffffffffba185c18>] do_nmi+0x218/0x460 [2974912.707036] [<ffffffffba184d9c>] end_repeat_nmi+0x1e/0x81 [2974912.707043] <EOE> <IRQ> [<ffffffffba17544a>] queued_spin_lock_slowpath+0xb/0xf [2974912.707044] [<ffffffffba1833b7>] _raw_spin_lock_irqsave+0x37/0x40 [2974912.707046] [<ffffffffb9b5b769>] rcu_check_callbacks+0x549/0x70 [2974912.707050] [<ffffffffb9aaf536>] update_process_times+0x46/0x80 [2974912.707051] [<ffffffffb9b0fcb0>] tick_sched_handle+0x30/0x70 [2974912.707052] [<ffffffffb9b0ff79>] tick_sched_timer+0x39/0x80 [2974912.707054] [<ffffffffb9aca5ee>] __hrtimer_run_queues+0x10e/0x270 [2974912.707056] [<ffffffffb9acab4f>] hrtimer_interrupt+0xaf/0x1d0 [2974912.707058] [<ffffffffb9a5c60b>] local_apic_timer_interrupt+0x3b/0x60 [2974912.707060] [<ffffffffba1929d3>] smp_apic_timer_interrupt+0x43/0x60 [2974912.707061] [<ffffffffba18eefa>] apic_timer_interrupt+0x16a/0x170 [2974912.707064] [<ffffffffb9fc1b5e>] cpuidle_idle_call+0xde/0x230 [2974912.707066] [<ffffffffb9a37c6e>] arch_cpu_idle+0xe/0xc0 [2974912.707067] [<ffffffffb9b0158a>] cpu_startup_entry+0x14a/0x1e0 [2974912.707069] [<ffffffffb9a5a0b7>] start_secondary+0x1f7/0x270 [2974912.707070] [<ffffffffb9a000d5>] start_cpu+0x5/0x14 ####################### TRACE #2 ######################## [ 5378.059559] Kernel panic - not syncing: LBUG [ 5378.059561] Pid: 44556, comm: ll_ost01_079 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 [ 5378.059563] Call Trace: [ 5378.059582] [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 5378.059585] [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 5378.059659] [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc] [ 5378.059668] [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd] [ 5378.059685] [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc] [ 5378.059711] [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc] [ 5378.059735] [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc] [ 5378.059755] [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 5378.059775] [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [ 5378.059779] [<ffffffff8e6c61f1>] kthread+0xd1/0xe0 [ 5378.059784] [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21 [ 5378.164169] CPU: 10 PID: 23550 Comm: ll_ost00_022 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1062.9.1.el7.x86_64 #1 [ 5378.177532] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018 [ 5378.185781] Call Trace: [ 5378.188925] [<ffffffff8ed7ac23>] dump_stack+0x19/0x1b [ 5378.194743] [<ffffffff8ed74967>] panic+0xe8/0x21f [ 5378.200189] [<ffffffffc0b9c8cb>] lbug_with_loc+0x9b/0xa0 [libcfs] [ 5378.207052] [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc] [ 5378.214824] [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd] [ 5378.237175] [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc] [ 5378.245008] [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc] [ 5378.251976] [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc] [ 5378.275204] [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 5378.296541] [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [ 5378.317273] [<ffffffff8e6c61f1>] kthread+0xd1/0xe0 [ 5378.329273] [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21 ####################### TRACE #3 ######################## [59064.267110] Kernel panic - not syncing: Hard LOCKUP [59064.267111] CPU: 15 PID: 21200 Comm: ll_ost01_015 Kdump: loaded Tainted: P OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1 [59064.267112] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018 [59064.267112] Call Trace: [59064.267115] <NMI> [<ffffffffb857ac23>] dump_stack+0x19/0x1b [59064.267117] [<ffffffffb8574967>] panic+0xe8/0x21f [59064.267122] [<ffffffffb7e9b6af>] nmi_panic+0x3f/0x40 [59064.267124] [<ffffffffb7f4eb81>] watchdog_overflow_callback+0x121/0x140 [59064.267127] [<ffffffffb7fa8277>] __perf_event_overflow+0x57/0x100 [59064.267129] [<ffffffffb7fb1a14>] perf_event_overflow+0x14/0x20 [59064.267131] [<ffffffffb7e0ac70>] handle_pmi_common+0x1a0/0x250 [59064.267142] [<ffffffffb7e0af4f>] intel_pmu_handle_irq+0xcf/0x1d0 [59064.267144] [<ffffffffb8584031>] perf_event_nmi_handler+0x31/0x50 [59064.267146] [<ffffffffb858593c>] nmi_handle.isra.0+0x8c/0x150 [59064.267147] [<ffffffffb8585c18>] do_nmi+0x218/0x460 [59064.267149] [<ffffffffb8584d9c>] end_repeat_nmi+0x1e/0x81 [59064.267155] <EOE> <IRQ> [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf [59064.267156] [<ffffffffb85833b7>] _raw_spin_lock_irqsave+0x37/0x40 [59064.267158] [<ffffffffb7f57f5b>] rcu_dump_cpu_stacks+0x4b/0xd0 [59064.267160] [<ffffffffb7f5b662>] rcu_check_callbacks+0x442/0x730 [59064.267164] [<ffffffffb7eaf536>] update_process_times+0x46/0x80 [59064.267166] [<ffffffffb7f0fcb0>] tick_sched_handle+0x30/0x70 [59064.267167] [<ffffffffb7f0ff79>] tick_sched_timer+0x39/0x80 [59064.267169] [<ffffffffb7eca5ee>] __hrtimer_run_queues+0x10e/0x270 [59064.267171] [<ffffffffb7ecab4f>] hrtimer_interrupt+0xaf/0x1d0 [59064.267173] [<ffffffffb7e5c60b>] local_apic_timer_interrupt+0x3b/0x60 [59064.267175] [<ffffffffb85929d3>] smp_apic_timer_interrupt+0x43/0x60 [59064.267176] [<ffffffffb858eefa>] apic_timer_interrupt+0x16a/0x170 [59064.267180] [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf [59064.267181] [<ffffffffb8583330>] _raw_spin_lock+0x20/0x30 [59064.267194] [<ffffffffc173902c>] lock_res_and_lock+0x2c/0x50 [ptlrpc] [59064.267208] [<ffffffffc1740c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc] [59064.267245] [<ffffffffc1769516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] [59064.267288] [<ffffffffc17f1bd2>] tgt_enqueue+0x62/0x210 [ptlrpc] [59064.267311] [<ffffffffc17f89ea>] tgt_request_handle+0xada/0x1570 [ptlrpc] [59064.267356] [<ffffffffc179d48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [59064.267396] [<ffffffffc17a0df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [59064.267417] [<ffffffffb7ec61f1>] kthread+0xd1/0xe0