LU-14341: hard LOCKUP on Lustre servers with kernel-3.10.0-1160.11.1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Environment: kernel-3.10.0-1160.11.1
      lustre-2.12.6_2.llnl-1.ch6.x86_64
    • Severity: 3

    Description

      hard LOCKUP and panic. Most frequently observed on OSTs after mdtest completes or after an OST mount, a few seconds after the "deleting orphan objects from" console log messages.

      This appeared to be due to kernel timer behavior changes introduced between kernel-3.10.0-1160.6.1 and kernel-3.10.0-1160.11.1. 

      Fix in progress.  See https://bugzilla.redhat.com/show_bug.cgi?id=1914011

      For brevity, only the bottoms of the stacks are listed below.

      Kernel panic - not syncing: Hard LOCKUP
      CPU: 14 PID: 0 Comm: swapper/14 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-1160.11.1.1chaos.ch6.x86_64 #1
      ...
      Call Trace:
       <NMI>  [<ffffffffa47ae072>] dump_stack+0x19/0x1b
       [<ffffffffa47a71e7>] panic+0xe8/0x21f
      ...
       [<ffffffffa40b1edc>] ? run_timer_softirq+0xbc/0x370
       <EOE>  <IRQ>  [<ffffffffa40a82fd>] __do_softirq+0xfd/0x2c0
       [<ffffffffa47c56ec>] call_softirq+0x1c/0x30
       [<ffffffffa4030995>] do_softirq+0x65/0xa0
       [<ffffffffa40a86d5>] irq_exit+0x105/0x110
       [<ffffffffa47c6c88>] smp_apic_timer_interrupt+0x48/0x60
       [<ffffffffa47c31ba>] apic_timer_interrupt+0x16a/0x170
       <EOI>  [<ffffffffa40b3113>] ? get_next_timer_interrupt+0x103/0x270
       [<ffffffffa45eace7>] ? cpuidle_enter_state+0x57/0xd0
       [<ffffffffa45eae3e>] cpuidle_idle_call+0xde/0x270
       [<ffffffffa403919e>] arch_cpu_idle+0xe/0xc0
       [<ffffffffa410856a>] cpu_startup_entry+0x14a/0x1e0
       [<ffffffffa405cbb7>] start_secondary+0x207/0x280
       [<ffffffffa40000d5>] start_cpu+0x5/0x14
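
      In this first flavor, the NMI watchdog fires while the CPU is stuck servicing the timer softirq. As a rough illustration of why a timer-wheel bookkeeping bug can show up this way: the 3.10-era __run_timers() (a condensed paraphrase of upstream kernel/timer.c is sketched below; the actual RHEL source may differ) exits its outer loop only once the per-base jiffies counter catches up with the global jiffies, so broken accounting can leave run_timer_softirq() spinning until the watchdog panics the machine.

      /*
       * Condensed paraphrase of the 3.10 kernel/timer.c:__run_timers()
       * loop, not the exact RHEL source; shown only to illustrate where
       * run_timer_softirq() can spin if timer bookkeeping goes wrong.
       */
      static inline void __run_timers(struct tvec_base *base)
      {
              spin_lock_irq(&base->lock);
              /*
               * Runs until the per-base counter catches up with jiffies.
               * If base->timer_jiffies is mis-accounted, this loop (and
               * with it the softirq) does not terminate in time.
               */
              while (time_after_eq(jiffies, base->timer_jiffies)) {
                      int index = base->timer_jiffies & TVR_MASK;

                      /* cascade tv2..tv5 down into tv1 when index wraps */
                      ++base->timer_jiffies;
                      /* detach and run every expired timer on tv1[index],
                       * dropping and retaking base->lock around each
                       * callback */
              }
              spin_unlock_irq(&base->lock);
      }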
       
      Another stack trace we see is quite similar to what is happening on CPU 21 in the original BZ.
      
      Call Trace:
       <NMI>  [<ffffffff85fae072>] dump_stack+0x19/0x1b
       [<ffffffff85fa71e7>] panic+0xe8/0x21f
      ...
       [<ffffffff8591f4e8>] ? native_queued_spin_lock_slowpath+0x158/0x200
       <EOE>  [<ffffffff85fa7dd2>] queued_spin_lock_slowpath+0xb/0xf
       [<ffffffff85fb7197>] _raw_spin_lock_irqsave+0x47/0x50
       [<ffffffff858b1b8b>] lock_timer_base.isra.38+0x2b/0x50
       [<ffffffff858b244f>] try_to_del_timer_sync+0x2f/0x90
       [<ffffffff858b2502>] del_timer_sync+0x52/0x60
       [<ffffffff85fb1920>] schedule_timeout+0x180/0x320
       [<ffffffff858b1870>] ? requeue_timers+0x1f0/0x1f0 
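
      Here the spin is in lock_timer_base() underneath del_timer_sync(). For reference, a paraphrase of the 3.10-era lock_timer_base() from upstream kernel/timer.c (again, the RHEL tree may differ slightly): it loops until it can take the base lock and observe a stable timer->base, so a base lock that is never released, or a timer that keeps migrating between bases, pins the caller here indefinitely.

      /* Paraphrase of 3.10 kernel/timer.c:lock_timer_base() */
      static struct tvec_base *lock_timer_base(struct timer_list *timer,
                                               unsigned long *flags)
      {
              struct tvec_base *base;

              for (;;) {
                      struct tvec_base *prelock_base = timer->base;
                      base = tbase_get_base(prelock_base);
                      if (likely(base != NULL)) {
                              /* The _raw_spin_lock_irqsave frame in the
                               * trace above is this acquisition. */
                              spin_lock_irqsave(&base->lock, *flags);
                              if (likely(prelock_base == timer->base))
                                      return base;
                              /* The timer migrated; drop the lock, retry. */
                              spin_unlock_irqrestore(&base->lock, *flags);
                      }
                      cpu_relax();
              }
      }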


    Activity


      knweiss Karsten Weiss added a comment -

      Olaf, since a couple of weeks have passed, may I ask: did anything happen on Red Hat's side since you opened Red Hat bug #1914011? Unfortunately, I'm not authorized to access it myself. Is Red Hat aware that their timer change causes this regression?

      Andreas, wouldn't it be a good idea to at least mention LU-14341 (i.e. this regression regarding patchless kernels) in the lustre/ChangeLog for the coming Lustre 2.12.7? AFAICS the timer revert is still necessary.

      adilger Andreas Dilger added a comment -

      knweiss those users will have to patch their kernel anyway, or stick with 3.10.0-1160.6.1.el7 or earlier until RHEL fixes the bug. There isn't anything that can be done in Lustre to avoid this, since it is in a core part of the kernel. Since the affected code was working fine for many years without the 1160.8.1 change, there is no reason to expect that reverting it will cause any problems.

      knweiss Karsten Weiss added a comment -

      I've noticed that LU-14395 (kernel: kernel update RHEL7.9 [3.10.0-1160.15.2.el7]) was merged in the b2_12 branch. It contains a patch that reverts the timer patch that was introduced upstream in kernel 3.10.0-1160.8.1.el7. How is this issue going to be fixed for Lustre servers with patchless kernels?
      gerrit Gerrit Updater added a comment - edited

      The patch for master branch needs to be incorporated into https://review.whamcloud.com/41822.

      jfilizetti Jeremy Filizetti added a comment -

      3.10.0-1160.8.1.el7

      simmonsja James A Simmons added a comment -

      Which kernel did this start showing up in?

      jfilizetti Jeremy Filizetti added a comment -

      FWIW I have not seen any issues since upgrading to the hotfix kernel.

      jfilizetti Jeremy Filizetti added a comment -

      This is what I believe the patch was, from diffing between the kernels. I have a hotfix kernel from RHEL which is supposed to fix the problem but have not tested it yet.

      timer.patch

      adilger Andreas Dilger added a comment -

      Is there a copy of that patch somewhere? I was going to revert it and see if that fixes the problem.

      jfilizetti Jeremy Filizetti added a comment -

      In the digging I did, I found no indication of an upstream patch for that. It looks like it was Red Hat only, to address some customer issue.

      adilger Andreas Dilger added a comment -

      What is the commit hash for that patch? I can't find it in the upstream kernel.

    People

      Assignee: pjones Peter Jones
      Reporter: ofaaland Olaf Faaland
      Votes: 1
      Watchers: 17
