LU-14341: hard LOCKUP on Lustre servers with kernel-3.10.0-1160.11.1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Environment: kernel-3.10.0-1160.11.1
      lustre-2.12.6_2.llnl-1.ch6.x86_64
    • Severity: 3

    Description

      hard LOCKUP and panic. Most frequently observed on OSTs after mdtest completes or after an OST mount, a few seconds after the "deleting orphan objects from" console log messages.

      This appeared to be due to kernel timer behavior changes introduced between kernel-3.10.0-1160.6.1 and kernel-3.10.0-1160.11.1. 

      Fix in progress.  See https://bugzilla.redhat.com/show_bug.cgi?id=1914011

      For brevity, only the bottoms of the stacks are listed below.

      Kernel panic - not syncing: Hard LOCKUP
      CPU: 14 PID: 0 Comm: swapper/14 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-1160.11.1.1chaos.ch6.x86_64 #1
      ...
      Call Trace:
       <NMI>  [<ffffffffa47ae072>] dump_stack+0x19/0x1b
       [<ffffffffa47a71e7>] panic+0xe8/0x21f
      ...
       [<ffffffffa40b1edc>] ? run_timer_softirq+0xbc/0x370
       <EOE>  <IRQ>  [<ffffffffa40a82fd>] __do_softirq+0xfd/0x2c0
       [<ffffffffa47c56ec>] call_softirq+0x1c/0x30
       [<ffffffffa4030995>] do_softirq+0x65/0xa0
       [<ffffffffa40a86d5>] irq_exit+0x105/0x110
       [<ffffffffa47c6c88>] smp_apic_timer_interrupt+0x48/0x60
       [<ffffffffa47c31ba>] apic_timer_interrupt+0x16a/0x170
       <EOI>  [<ffffffffa40b3113>] ? get_next_timer_interrupt+0x103/0x270
       [<ffffffffa45eace7>] ? cpuidle_enter_state+0x57/0xd0
       [<ffffffffa45eae3e>] cpuidle_idle_call+0xde/0x270
       [<ffffffffa403919e>] arch_cpu_idle+0xe/0xc0
       [<ffffffffa410856a>] cpu_startup_entry+0x14a/0x1e0
       [<ffffffffa405cbb7>] start_secondary+0x207/0x280
       [<ffffffffa40000d5>] start_cpu+0x5/0x14
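
      In this first flavor, the NMI watchdog fires while the CPU is stuck servicing the timer softirq. As a rough illustration of why a timer-wheel bookkeeping bug can show up this way: the 3.10-era __run_timers() (a condensed paraphrase of upstream kernel/timer.c is sketched below; the actual RHEL source may differ) exits its outer loop only once the per-base jiffies counter catches up with the global jiffies, so broken accounting can leave run_timer_softirq() spinning until the watchdog panics the machine.

      /*
       * Condensed paraphrase of the 3.10 kernel/timer.c:__run_timers()
       * loop, not the exact RHEL source; shown only to illustrate where
       * run_timer_softirq() can spin if timer bookkeeping goes wrong.
       */
      static inline void __run_timers(struct tvec_base *base)
      {
              spin_lock_irq(&base->lock);
              /*
               * Runs until the per-base counter catches up with jiffies.
               * If base->timer_jiffies is mis-accounted, this loop (and
               * with it the softirq) does not terminate in time.
               */
              while (time_after_eq(jiffies, base->timer_jiffies)) {
                      int index = base->timer_jiffies & TVR_MASK;

                      /* cascade tv2..tv5 down into tv1 when index wraps */
                      ++base->timer_jiffies;
                      /* detach and run every expired timer on tv1[index],
                       * dropping and retaking base->lock around each
                       * callback */
              }
              spin_unlock_irq(&base->lock);
      }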
       
      Another stack trace we see is quite similar to what is happening on CPU 21 in the original BZ.
      
      Call Trace:
       <NMI>  [<ffffffff85fae072>] dump_stack+0x19/0x1b
       [<ffffffff85fa71e7>] panic+0xe8/0x21f
      ...
       [<ffffffff8591f4e8>] ? native_queued_spin_lock_slowpath+0x158/0x200
       <EOE>  [<ffffffff85fa7dd2>] queued_spin_lock_slowpath+0xb/0xf
       [<ffffffff85fb7197>] _raw_spin_lock_irqsave+0x47/0x50
       [<ffffffff858b1b8b>] lock_timer_base.isra.38+0x2b/0x50
       [<ffffffff858b244f>] try_to_del_timer_sync+0x2f/0x90
       [<ffffffff858b2502>] del_timer_sync+0x52/0x60
       [<ffffffff85fb1920>] schedule_timeout+0x180/0x320
       [<ffffffff858b1870>] ? requeue_timers+0x1f0/0x1f0 
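
      Here the spin is in lock_timer_base() underneath del_timer_sync(). For reference, a paraphrase of the 3.10-era lock_timer_base() from upstream kernel/timer.c (again, the RHEL tree may differ slightly): it loops until it can take the base lock and observe a stable timer->base, so a base lock that is never released, or a timer that keeps migrating between bases, pins the caller here indefinitely.

      /* Paraphrase of 3.10 kernel/timer.c:lock_timer_base() */
      static struct tvec_base *lock_timer_base(struct timer_list *timer,
                                               unsigned long *flags)
      {
              struct tvec_base *base;

              for (;;) {
                      struct tvec_base *prelock_base = timer->base;
                      base = tbase_get_base(prelock_base);
                      if (likely(base != NULL)) {
                              /* The _raw_spin_lock_irqsave frame in the
                               * trace above is this acquisition. */
                              spin_lock_irqsave(&base->lock, *flags);
                              if (likely(prelock_base == timer->base))
                                      return base;
                              /* The timer migrated; drop the lock, retry. */
                              spin_unlock_irqrestore(&base->lock, *flags);
                      }
                      cpu_relax();
              }
      }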


    Activity


      knweiss Karsten Weiss added a comment -

      Olaf, since a couple of weeks have passed, may I ask: did anything happen on Red Hat's side since you opened Red Hat bug #1914011? Unfortunately, I'm not authorized to access it myself. Is Red Hat aware that their timer change causes this regression?

      Andreas, wouldn't it be a good idea to at least mention LU-14341 (i.e. this regression regarding patchless kernels) in the lustre/ChangeLog for the coming Lustre 2.12.7? AFAICS the timer revert is still necessary.

      adilger Andreas Dilger added a comment -

      knweiss those users will have to patch their kernel anyway, or stick with 3.10.0-1160.6.1.el7 or earlier until RHEL fixes the bug. There isn't anything that can be done in Lustre to avoid this, since it is in a core part of the kernel. Since the affected code was working fine for many years without the 1160.8.1 change, there is no reason to expect that reverting it will cause any problems.

      knweiss Karsten Weiss added a comment -

      I've noticed that LU-14395 (kernel: kernel update RHEL7.9 [3.10.0-1160.15.2.el7]) was merged in the b2_12 branch. It contains a patch that reverts the timer patch that was introduced upstream in kernel 3.10.0-1160.8.1.el7. How is this issue going to be fixed for Lustre servers with patchless kernels?
      gerrit Gerrit Updater added a comment - edited

      The patch for master branch needs to be incorporated into https://review.whamcloud.com/41822.

      jfilizetti Jeremy Filizetti added a comment -

      3.10.0-1160.8.1.el7

      simmonsja James A Simmons added a comment -

      Which kernel did this start showing up in?

      jfilizetti Jeremy Filizetti added a comment -

      FWIW I have not seen any issues since upgrading to the hotfix kernel.

      jfilizetti Jeremy Filizetti added a comment -

      This is what I believe the patch was, from diffing between the kernels. I have a hotfix kernel from RHEL which is supposed to fix the problem but have not tested it yet.

      timer.patch

      adilger Andreas Dilger added a comment -

      Is there a copy of that patch somewhere? I was going to revert it and see if that fixes the problem.

      jfilizetti Jeremy Filizetti added a comment -

      In the digging I did, I found no indication of an upstream patch for that. It looks like it was Red Hat only, to address some customer issue.

      adilger Andreas Dilger added a comment -

      What is the commit hash for that patch? I can't find it in the upstream kernel.

    People

      Assignee: pjones Peter Jones
      Reporter: ofaaland Olaf Faaland
      Votes: 1
      Watchers: 17
