LU-14341: hard LOCKUP lustre servers with kernel-3.10.0-1160.11.1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Environment: kernel-3.10.0-1160.11.1
      lustre-2.12.6_2.llnl-1.ch6.x86_64
    • Severity: 3

    Description

      Hard LOCKUP and panic. Most frequently observed on OSTs after mdtest completes or after an OST mount, a few seconds after "deleting orphan objects from" console log messages.

      This appeared to be due to kernel timer behavior changes introduced between kernel-3.10.0-1160.6.1 and kernel-3.10.0-1160.11.1. 

      Fix in progress.  See https://bugzilla.redhat.com/show_bug.cgi?id=1914011

      For brevity, only the bottoms of the stacks are listed below.

      Kernel panic - not syncing: Hard LOCKUP
      CPU: 14 PID: 0 Comm: swapper/14 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-1160.11.1.1chaos.ch6.x86_64 #1
      ...
      Call Trace:
       <NMI>  [<ffffffffa47ae072>] dump_stack+0x19/0x1b
       [<ffffffffa47a71e7>] panic+0xe8/0x21f
      ...
       [<ffffffffa40b1edc>] ? run_timer_softirq+0xbc/0x370
       <EOE>  <IRQ>  [<ffffffffa40a82fd>] __do_softirq+0xfd/0x2c0
       [<ffffffffa47c56ec>] call_softirq+0x1c/0x30
       [<ffffffffa4030995>] do_softirq+0x65/0xa0
       [<ffffffffa40a86d5>] irq_exit+0x105/0x110
       [<ffffffffa47c6c88>] smp_apic_timer_interrupt+0x48/0x60
       [<ffffffffa47c31ba>] apic_timer_interrupt+0x16a/0x170
       <EOI>  [<ffffffffa40b3113>] ? get_next_timer_interrupt+0x103/0x270
       [<ffffffffa45eace7>] ? cpuidle_enter_state+0x57/0xd0
       [<ffffffffa45eae3e>] cpuidle_idle_call+0xde/0x270
       [<ffffffffa403919e>] arch_cpu_idle+0xe/0xc0
       [<ffffffffa410856a>] cpu_startup_entry+0x14a/0x1e0
       [<ffffffffa405cbb7>] start_secondary+0x207/0x280
       [<ffffffffa40000d5>] start_cpu+0x5/0x14
       
      Another stack that we see is quite similar to what is happening on CPU 21 in the original BZ.
      
      Call Trace:
       <NMI>  [<ffffffff85fae072>] dump_stack+0x19/0x1b
       [<ffffffff85fa71e7>] panic+0xe8/0x21f
      ...
       [<ffffffff8591f4e8>] ? native_queued_spin_lock_slowpath+0x158/0x200
       <EOE>  [<ffffffff85fa7dd2>] queued_spin_lock_slowpath+0xb/0xf
       [<ffffffff85fb7197>] _raw_spin_lock_irqsave+0x47/0x50
       [<ffffffff858b1b8b>] lock_timer_base.isra.38+0x2b/0x50
       [<ffffffff858b244f>] try_to_del_timer_sync+0x2f/0x90
       [<ffffffff858b2502>] del_timer_sync+0x52/0x60
       [<ffffffff85fb1920>] schedule_timeout+0x180/0x320
       [<ffffffff858b1870>] ? requeue_timers+0x1f0/0x1f0 

    Activity

            jfilizetti Jeremy Filizetti added a comment - 3.10.0-1160.8.1.el7

            simmonsja James A Simmons added a comment - Which kernel did this start showing up in?

            jfilizetti Jeremy Filizetti added a comment - FWIW I have not seen any issues since upgrading to the hotfix kernel.

            jfilizetti Jeremy Filizetti added a comment -

            This is what I believe the patch was, from diffing between the kernels. I have a hotfix kernel from RHEL which is supposed to fix the problem but have not tested it yet.
            timer.patch

            adilger Andreas Dilger added a comment - Is there a copy of that patch somewhere? I was going to revert it and see if that fixes the problem.

            jfilizetti Jeremy Filizetti added a comment - In the digging I did, I found no indication of an upstream patch for that. It looks like it was Red Hat-only, to address some customer issue.

            adilger Andreas Dilger added a comment - What is the commit hash for that patch? I can't find it in the upstream kernel.

            jfilizetti Jeremy Filizetti added a comment -

            My investigation found it was introduced in 3.10.0-1160.8.1.el7 by:

            • [kernel] timer: Fix lockup in __run_timers() caused by large jiffies/timer_jiffies delta (Waiman Long) [1849716]
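
            The changelog line above only names the change. Purely as a hypothetical illustration of the general shape of such a mitigation (invented names, not the actual Red Hat code), a large jiffies/timer_jiffies delta is typically avoided by letting the wheel snap straight to the current jiffy when it has no timers queued:

            /*
             * Hypothetical sketch only -- not the RHEL patch for [1849716].
             * Invented names; shows one common way to skip the per-tick
             * catch-up loop when a CPU's wheel is empty.
             */
            #include <stdio.h>

            struct wheel_sketch {
                    unsigned long timer_jiffies;    /* last tick this wheel processed */
                    unsigned long all_timers;       /* timers currently queued on it */
            };

            /* Returns 1 if the wheel could jump straight to 'now', avoiding the
             * per-tick catch-up loop entirely. */
            static int catch_up_timer_jiffies_sketch(struct wheel_sketch *base,
                                                     unsigned long now)
            {
                    if (!base->all_timers) {
                            base->timer_jiffies = now;  /* nothing queued: skip missed ticks */
                            return 1;
                    }
                    return 0;                           /* timers queued: walk each tick */
            }

            int main(void)
            {
                    struct wheel_sketch idle_wheel = { .timer_jiffies = 0, .all_timers = 0 };
                    int skipped = catch_up_timer_jiffies_sketch(&idle_wheel, 600000UL);

                    printf("skipped ahead: %d, timer_jiffies now %lu\n",
                           skipped, idle_wheel.timer_jiffies);
                    return 0;
            }

            Whether the actual RHEL change takes this form is not established in this ticket; the sketch only illustrates the kind of timer_jiffies adjustment the changelog title describes.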

            sthiell Stephane Thiell added a comment -

            FYI I tried Lustre (server) with kernel-3.10.0-1160.15.2.el7 and immediately got the same hard lockups, so it looks like this issue hasn't been fixed by Red Hat yet...

            Does anyone know if this incompatibility with Lustre since 3.10.0-1160.11.1 was introduced by the following patch?

            Loop in __run_timers() because base->timer_jiffies is very far behind causes a lockup condition. (BZ#1849716)
            ofaaland Olaf Faaland added a comment -

            Peter,

            Sorry I didn't see your comment earlier.  Yes, this ticket is for the benefit of others.

            pjones Peter Jones added a comment -

            Olaf

            Do I read this correctly that you are mostly just opening this ticket for the benefit of others and not expecting action on our part at this time?

            Peter


            People

              Assignee: pjones Peter Jones
              Reporter: ofaaland Olaf Faaland
              Votes: 1
              Watchers: 17