[LU-14341] hard LOCKUP lustre servers with kernel-3.10.0-1160.11.1 Created: 20/Jan/21 Updated: 31/Dec/23 Resolved: 05/May/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Olaf Faaland | Assignee: | Peter Jones |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | llnl | ||
| Environment: |
kernel-3.10.0-1160.11.1 |
||
| Severity: | 3 |
| Description |
|
Hard LOCKUP and panic. Most frequently observed on OSTs after mdtest completes or after an OST mount, a few seconds after the "deleting orphan objects from" console log messages. This appears to be due to kernel timer behavior changes introduced between kernel-3.10.0-1160.6.1 and kernel-3.10.0-1160.11.1. A fix is in progress; see https://bugzilla.redhat.com/show_bug.cgi?id=1914011

For brevity, only the bottoms of the stacks are listed below.

Kernel panic - not syncing: Hard LOCKUP
CPU: 14 PID: 0 Comm: swapper/14 Kdump: loaded Tainted: P W OE ------------ 3.10.0-1160.11.1.1chaos.ch6.x86_64 #1
...
Call Trace:
<NMI> [<ffffffffa47ae072>] dump_stack+0x19/0x1b
[<ffffffffa47a71e7>] panic+0xe8/0x21f
...
[<ffffffffa40b1edc>] ? run_timer_softirq+0xbc/0x370
<EOE> <IRQ> [<ffffffffa40a82fd>] __do_softirq+0xfd/0x2c0
[<ffffffffa47c56ec>] call_softirq+0x1c/0x30
[<ffffffffa4030995>] do_softirq+0x65/0xa0
[<ffffffffa40a86d5>] irq_exit+0x105/0x110
[<ffffffffa47c6c88>] smp_apic_timer_interrupt+0x48/0x60
[<ffffffffa47c31ba>] apic_timer_interrupt+0x16a/0x170
<EOI> [<ffffffffa40b3113>] ? get_next_timer_interrupt+0x103/0x270
[<ffffffffa45eace7>] ? cpuidle_enter_state+0x57/0xd0
[<ffffffffa45eae3e>] cpuidle_idle_call+0xde/0x270
[<ffffffffa403919e>] arch_cpu_idle+0xe/0xc0
[<ffffffffa410856a>] cpu_startup_entry+0x14a/0x1e0
[<ffffffffa405cbb7>] start_secondary+0x207/0x280
[<ffffffffa40000d5>] start_cpu+0x5/0x14

Another one that we see is quite similar to what is happening on cpu 21 in the original BZ.

Call Trace:
<NMI> [<ffffffff85fae072>] dump_stack+0x19/0x1b
[<ffffffff85fa71e7>] panic+0xe8/0x21f
...
[<ffffffff8591f4e8>] ? native_queued_spin_lock_slowpath+0x158/0x200
<EOE> [<ffffffff85fa7dd2>] queued_spin_lock_slowpath+0xb/0xf
[<ffffffff85fb7197>] _raw_spin_lock_irqsave+0x47/0x50
[<ffffffff858b1b8b>] lock_timer_base.isra.38+0x2b/0x50
[<ffffffff858b244f>] try_to_del_timer_sync+0x2f/0x90
[<ffffffff858b2502>] del_timer_sync+0x52/0x60
[<ffffffff85fb1920>] schedule_timeout+0x180/0x320
[<ffffffff858b1870>] ? requeue_timers+0x1f0/0x1f0 |
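For context on the failure mode: in the 3.10 timer wheel, the per-CPU base's clock (base->timer_jiffies) must be advanced one jiffy at a time so that the cascade buckets are processed in order. The sketch below is heavily condensed from the 3.10-era kernel/timer.c (cascade_timers() and expire_timers() are simplified stand-ins for inline code; this is not the exact RHEL source) and shows why a base whose clock has fallen far behind spins once per missed jiffy, with the base lock held and interrupts off, until the NMI watchdog declares a hard LOCKUP:

```c
/* Condensed sketch of the 3.10-era timer wheel catch-up loop;
 * cascade_timers() and expire_timers() are placeholder names for
 * logic that is inlined in the real kernel/timer.c. */
static void __run_timers(struct tvec_base *base)
{
	spin_lock_irq(&base->lock);
	/* One iteration per elapsed jiffy.  If base->timer_jiffies is
	 * very far behind jiffies (the condition RHBZ#1849716 tried to
	 * address), this loop runs millions of times without releasing
	 * the lock or re-enabling interrupts. */
	while (time_after_eq(jiffies, base->timer_jiffies)) {
		int index = base->timer_jiffies & TVR_MASK;

		if (!index)
			cascade_timers(base);	/* refill from outer wheels */
		base->timer_jiffies++;
		expire_timers(base, index);	/* run whatever is due */
	}
	spin_unlock_irq(&base->lock);
}
```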
| Comments |
| Comment by Olaf Faaland [ 20/Jan/21 ] |
|
For my reference, my local ticket is TOSS-4985 |
| Comment by Olaf Faaland [ 20/Jan/21 ] |
|
Red Hat has a fix, which we tested successfully on our system. The patch is moving through their process. See their Bugzilla for details. |
| Comment by Stephane Thiell [ 20/Jan/21 ] |
|
We are seeing the same hard lockups on new servers (Dell R6525) with Lustre 2.12.6 on CentOS 7.9, using kernel 3.10.0-1160.11.1.el7 plus the Lustre patches for rhel7.9. It is very easy to trigger: a single client doing moderate I/O is enough. I was just about to report it too! Olaf was faster

Example of crash dump:

crash> ps | grep ">"
...
> 4588 2 51 ffff9c8d1dbed280 RU 0.0 0 0 [ll_ost01_000]
> 4589 2 28 ffff9c8d1dbee300 RU 0.0 0 0 [ll_ost01_001]
> 4590 2 60 ffff9c8d2dfc8000 RU 0.0 0 0 [ll_ost01_002]
> 4651 2 50 ffff9c8d15ea5280 RU 0.0 0 0 [ll_ost01_005]
> 4655 2 18 ffff9c73e47f8000 RU 0.0 0 0 [ll_ost01_008]
> 4656 2 17 ffff9c73e47fd280 RU 0.0 0 0 [ll_ost01_009]
> 4663 2 16 ffff9c8d349ab180 RU 0.0 0 0 [ll_ost_io01_006]

I don't have access to this Red Hat bz, but I can also see "lock_timer_base" in one of the backtraces (for 4588, ll_ost01_000):

crash> bt 4588
PID: 4588 TASK: ffff9c8d1dbed280 CPU: 51 COMMAND: "ll_ost01_000"
#0 [ffff9cb1ff0c8b58] crash_nmi_callback at ffffffffad85855b
#1 [ffff9cb1ff0c8b68] nmi_panic_self_stop at ffffffffad858c86
#2 [ffff9cb1ff0c8b80] nmi_panic at ffffffffad89aefb
#3 [ffff9cb1ff0c8b90] watchdog_overflow_callback at ffffffffad94ef81
#4 [ffff9cb1ff0c8ba8] __perf_event_overflow at ffffffffad9a8b07
#5 [ffff9cb1ff0c8be0] perf_event_overflow at ffffffffad9b2304
#6 [ffff9cb1ff0c8bf0] x86_pmu_handle_irq at ffffffffad805595
#7 [ffff9cb1ff0c8e20] amd_pmu_handle_irq at ffffffffad8066f5
#8 [ffff9cb1ff0c8e38] perf_event_nmi_handler at ffffffffadf8b031
#9 [ffff9cb1ff0c8e58] nmi_handle at ffffffffadf8c93c
#10 [ffff9cb1ff0c8eb0] do_nmi at ffffffffadf8cb5d
#11 [ffff9cb1ff0c8ef0] end_repeat_nmi at ffffffffadf8bd9c
[exception RIP: native_queued_spin_lock_slowpath+464]
RIP: ffffffffad917920 RSP: ffff9c8d2dfbf6e8 RFLAGS: 00000002
RAX: 0000000000810101 RBX: 0000000000000282 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9cb1ff013940
RBP: ffff9c8d2dfbf6e8 R8: 0000000000000101 R9: 00000000000000c0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cb1ff013940
R13: ffffffffc158a138 R14: ffff9c8d2dfbf758 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff9c8d2dfbf6e8] native_queued_spin_lock_slowpath at ffffffffad917920
#13 [ffff9c8d2dfbf6f0] queued_spin_lock_slowpath at ffffffffadf7bfe5
#14 [ffff9c8d2dfbf700] _raw_spin_lock_irqsave at ffffffffadf8a777
#15 [ffff9c8d2dfbf718] lock_timer_base at ffffffffad8adbeb
#16 [ffff9c8d2dfbf748] mod_timer at ffffffffad8aed14
#17 [ffff9c8d2dfbf798] __ldlm_add_waiting_lock at ffffffffc146fac7 [ptlrpc]
#18 [ffff9c8d2dfbf7c0] ldlm_add_waiting_lock at ffffffffc146ffb3 [ptlrpc]
#19 [ffff9c8d2dfbf7e8] ldlm_server_blocking_ast at ffffffffc147533c [ptlrpc]
#20 [ffff9c8d2dfbf828] tgt_blocking_ast at ffffffffc15018a9 [ptlrpc]
#21 [ffff9c8d2dfbf8e8] ldlm_work_bl_ast_lock at ffffffffc144a26c [ptlrpc]
#22 [ffff9c8d2dfbf970] ptlrpc_set_wait at ffffffffc1490f42 [ptlrpc]
#23 [ffff9c8d2dfbfa18] ldlm_run_ast_work at ffffffffc144e145 [ptlrpc]
#24 [ffff9c8d2dfbfa48] ldlm_handle_conflict_lock at ffffffffc144ef50 [ptlrpc]
#25 [ffff9c8d2dfbfa80] ldlm_lock_enqueue at ffffffffc144f5e0 [ptlrpc]
#26 [ffff9c8d2dfbfaf8] ldlm_cli_enqueue_local at ffffffffc146a227 [ptlrpc]
#27 [ffff9c8d2dfbfb98] ofd_destroy_by_fid at ffffffffc1bafb21 [ofd]
#28 [ffff9c8d2dfbfc70] ofd_destroy_hdl at ffffffffc1ba6797 [ofd]
#29 [ffff9c8d2dfbfcd0] tgt_request_handle at ffffffffc1506f1a [ptlrpc]
#30 [ffff9c8d2dfbfd58] ptlrpc_server_handle_request at ffffffffc14ab88b [ptlrpc]
#31 [ffff9c8d2dfbfdf8] ptlrpc_main at ffffffffc14af1f4 [ptlrpc]
#32 [ffff9c8d2dfbfec8] kthread at ffffffffad8c5e71
#33 [ffff9c8d2dfbff50] ret_from_fork_nospec_begin at ffffffffadf94ddd
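The reason every OST thread piles up in this backtrace is that mod_timer() (reached from __ldlm_add_waiting_lock() when the server arms a lock-callback timeout) must take the same per-CPU timer-base spinlock that the timer softirq holds for the entire catch-up loop sketched in the description. Below is a condensed sketch of the 3.10-era locking path; the real lock_timer_base() also masks a deferrable-flag bit out of timer->base, omitted here for clarity:

```c
/* Condensed from the 3.10-era kernel/timer.c: every mod_timer() /
 * del_timer_sync() on a pending timer funnels through this. */
static struct tvec_base *lock_timer_base(struct timer_list *timer,
					 unsigned long *flags)
{
	for (;;) {
		struct tvec_base *base = timer->base;

		if (base) {
			/* This is where the backtraces sit in
			 * native_queued_spin_lock_slowpath: the lock is
			 * held by the CPU stuck in __run_timers(). */
			spin_lock_irqsave(&base->lock, *flags);
			if (base == timer->base)
				return base;
			/* Timer migrated to another CPU's base; retry. */
			spin_unlock_irqrestore(&base->lock, *flags);
		}
		cpu_relax();
	}
}
```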
|
| Comment by Olaf Faaland [ 20/Jan/21 ] |
|
Related to |
| Comment by Peter Jones [ 20/Jan/21 ] |
|
Olaf, do I read this correctly that you are mostly just opening this ticket for the benefit of others, and not expecting action on our part at this time? Peter |
| Comment by Olaf Faaland [ 20/Jan/21 ] |
|
Peter, Sorry I didn't see your comment earlier. Yes, this ticket is for the benefit of others. |
| Comment by Stephane Thiell [ 08/Feb/21 ] |
|
FYI, I tried Lustre (server) with kernel-3.10.0-1160.15.2.el7 and immediately got the same hard lockups, so it looks like this issue hasn't been fixed by Red Hat yet... Does anyone know if this incompatibility with Lustre since 3.10.0-1160.11.1 was introduced by the following patch?

Loop in __run_timers() because base->timer_jiffies is very far behind causes a lockup condition. (BZ#1849716) |
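The RHEL changelog wording suggests the change avoids the per-jiffy catch-up by jumping the wheel's clock forward and requeueing the pending timers, consistent with the requeue_timers symbol that appears in the second backtrace in the description and is named in the eventual 1160.20.1 fix. The actual Red Hat patch is not public in this ticket; the following is a purely conceptual sketch of that approach, with hypothetical helpers, not the real code:

```c
/* Conceptual only -- NOT the RHEL patch.  If base->timer_jiffies is
 * jumped forward in one step, every pending timer was bucketed
 * relative to the OLD clock and must be re-hashed. */
static void requeue_timers_concept(struct tvec_base *base)
{
	LIST_HEAD(pending);
	struct timer_list *timer, *tmp;

	/* collect_pending_timers() is a hypothetical helper that walks
	 * all wheel levels and splices every queued timer onto 'pending'. */
	collect_pending_timers(base, &pending);

	base->timer_jiffies = jiffies;	/* skip the catch-up loop */

	list_for_each_entry_safe(timer, tmp, &pending, entry) {
		list_del(&timer->entry);
		internal_add_timer(base, timer); /* re-hash vs. the new clock */
	}
	/* Any window in which timer->base or the bucket lists are
	 * inconsistent while another CPU spins in lock_timer_base() is
	 * the sort of bug the later "timer: Fix potential bug in
	 * requeue_timers()" change addresses. */
}
```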
| Comment by Jeremy Filizetti [ 08/Feb/21 ] |
|
My investigation found it was introduced in 3.10.0-1160.8.1.el7 by
|
| Comment by Andreas Dilger [ 23/Feb/21 ] |
|
What is the commit hash for that patch? I can't find it in the upstream kernel. |
| Comment by Jeremy Filizetti [ 23/Feb/21 ] |
|
In the digging I did, I found no indication of an upstream patch for that. It looks like it was Red Hat-only, to address some customer issue. |
| Comment by Andreas Dilger [ 23/Feb/21 ] |
|
Is there a copy of that patch somewhere? I was going to revert it and see if that fixes the problem. |
| Comment by Jeremy Filizetti [ 27/Feb/21 ] |
|
This is what I believe the patch was, based on diffing the kernels. I have a hotfix kernel from Red Hat that is supposed to fix the problem, but I have not tested it yet. |
| Comment by Jeremy Filizetti [ 02/Mar/21 ] |
|
FWIW I have not seen any issues since upgrading to the hotfix kernel. |
| Comment by James A Simmons [ 02/Mar/21 ] |
|
Which kernel did this start showing up in? |
| Comment by Jeremy Filizetti [ 02/Mar/21 ] |
|
3.10.0-1160.8.1.el7 |
| Comment by Gerrit Updater [ 03/Mar/21 ] |
|
The patch for the master branch needs to be incorporated into https://review.whamcloud.com/41822. |
| Comment by Karsten Weiss [ 04/Mar/21 ] |
|
I've noticed that |
| Comment by Andreas Dilger [ 04/Mar/21 ] |
|
knweiss, those users will have to patch their kernel anyway, or stick with 3.10.0-1160.6.1.el7 or earlier until RHEL fixes the bug. There isn't anything that can be done in Lustre to avoid this, since it is in a core part of the kernel. Since the affected code was working fine for many years without the 1160.8.1 change, there is no reason to expect that reverting it will cause any problems. |
| Comment by Karsten Weiss [ 05/May/21 ] |
|
Olaf, since a couple of weeks have passed, may I ask: did anything happen on Red Hat's side since you opened Red Hat bug #1914011? Unfortunately, I'm not authorized to access it myself. Is Red Hat aware that their timer change causes this regression? Andreas, wouldn't it be a good idea to at least mention |
| Comment by Andreas Dilger [ 05/May/21 ] |
|
Karsten, feel free to submit a patch to that effect. Unlike some issues, I don't think there is anything we can do in the Lustre code to avoid the use of kernel timers completely (e.g. make our own "cfs_*" version of some function and call that instead) because the timers are used all over the kernel and the lockups don't appear to be in code that Lustre is calling directly. |
| Comment by Olaf Faaland [ 05/May/21 ] |
|
Karsten, it looks to me like this is fixed in kernel-3.10.0-1160.20.1.el7 |
| Comment by Jian Yu [ 05/May/21 ] |
|
Yes, the issue was fixed in kernel 3.10.0-1160.20.1.el7:

- [kernel] timer: Fix potential bug in requeue_timers() (Waiman Long) [1914011]

And in https://review.whamcloud.com/42089 for Lustre 2.12.7, the revert-fix-lockup-in-run_timers-large-jiffies-delta.patch was removed.
|
| Comment by Andreas Dilger [ 05/May/21 ] |
|
Removal of the revert patch was landed as part of |
| Comment by Karsten Weiss [ 06/May/21 ] |
|
Awesome, thank you all! (Sorry for the confusion, but for whatever reason my b2_12 checkout was not up-to-date when I checked the status yesterday, and thus I missed the decisive commit. I'm looking forward to testing Lustre 2.12.7 with the (patchless) kernel 3.10.0-1160.21.1.el7 next!) |