Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14341

hard LOCKUP lustre servers with kernel-3.10.0-1160.11.1

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • None
    • kernel-3.10.0-1160.11.1
      lustre-2.12.6_2.llnl-1.ch6.x86_64
    • 3
    • 9223372036854775807

    Description

      hard LOCKUP and panic.  Most frequently observed on OSTs after mdtest completes or after OST mount , a few seconds after "deleting orphan objects from" console log messages. 

      This appeared to be due to kernel timer behavior changes introduced between kernel-3.10.0-1160.6.1 and kernel-3.10.0-1160.11.1. 

      Fix in progress.  See https://bugzilla.redhat.com/show_bug.cgi?id=1914011

      For brevity, only the bottoms of the stacks, are listed below.

      Kernel panic - not syncing: Hard LOCKUP
      CPU: 14 PID: 0 Comm: swapper/14 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-1160.11.1.1chaos.ch6.x86_64 #1
      ...
      Call Trace:
       <NMI>  [<ffffffffa47ae072>] dump_stack+0x19/0x1b
       [<ffffffffa47a71e7>] panic+0xe8/0x21f
      ...
       [<ffffffffa40b1edc>] ? run_timer_softirq+0xbc/0x370
       <EOE>  <IRQ>  [<ffffffffa40a82fd>] __do_softirq+0xfd/0x2c0
       [<ffffffffa47c56ec>] call_softirq+0x1c/0x30
       [<ffffffffa4030995>] do_softirq+0x65/0xa0
       [<ffffffffa40a86d5>] irq_exit+0x105/0x110
       [<ffffffffa47c6c88>] smp_apic_timer_interrupt+0x48/0x60
       [<ffffffffa47c31ba>] apic_timer_interrupt+0x16a/0x170
       <EOI>  [<ffffffffa40b3113>] ? get_next_timer_interrupt+0x103/0x270
       [<ffffffffa45eace7>] ? cpuidle_enter_state+0x57/0xd0
       [<ffffffffa45eae3e>] cpuidle_idle_call+0xde/0x270
       [<ffffffffa403919e>] arch_cpu_idle+0xe/0xc0
       [<ffffffffa410856a>] cpu_startup_entry+0x14a/0x1e0
       [<ffffffffa405cbb7>] start_secondary+0x207/0x280
       [<ffffffffa40000d5>] start_cpu+0x5/0x14
       
      Another one that we see is quite similar to what is happening on cpu 21 in the original BZ.
      
      Call Trace:
       <NMI>  [<ffffffff85fae072>] dump_stack+0x19/0x1b
       [<ffffffff85fa71e7>] panic+0xe8/0x21f
      ...
       [<ffffffff8591f4e8>] ? native_queued_spin_lock_slowpath+0x158/0x200
       <EOE>  [<ffffffff85fa7dd2>] queued_spin_lock_slowpath+0xb/0xf
       [<ffffffff85fb7197>] _raw_spin_lock_irqsave+0x47/0x50
       [<ffffffff858b1b8b>] lock_timer_base.isra.38+0x2b/0x50
       [<ffffffff858b244f>] try_to_del_timer_sync+0x2f/0x90
       [<ffffffff858b2502>] del_timer_sync+0x52/0x60
       [<ffffffff85fb1920>] schedule_timeout+0x180/0x320
       [<ffffffff858b1870>] ? requeue_timers+0x1f0/0x1f0 

      Attachments

        Issue Links

          Activity

            People

              pjones Peter Jones
              ofaaland Olaf Faaland
              Votes:
              1 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: