Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.3.0
    • Lustre 2.2.0, Lustre 2.3.0
    • None
    • 3
    • 4631

    Description

      ORNL reported the following hard lockup:

      [  446.217881] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      [  446.217881] Pid: 9709, comm: kiblnd_sd_05 Not tainted 2.6.32-220.el6.wc.x86_64 #1
      [  446.217881] Call Trace:
      [  446.217881]  <NMI>  [<ffffffff814ec681>] ? panic+0x78/0x143
      [  446.217881]  [<ffffffff810d8fad>] ? watchdog_overflow_callback+0xcd/0xd0
      [  446.217881]  [<ffffffff8110a89d>] ? __perf_event_overflow+0x9d/0x230
      [  446.217881]  [<ffffffff8110ae54>] ? perf_event_overflow+0x14/0x20
      [  446.217881]  [<ffffffff8101e096>] ? intel_pmu_handle_irq+0x336/0x550
      [  446.217881]  [<ffffffff814f2256>] ? kprobe_exceptions_notify+0x16/0x430
      [  446.217881]  [<ffffffff814f0d39>] ? perf_event_nmi_handler+0x39/0xb0
      [  446.217881]  [<ffffffff814f2885>] ? notifier_call_chain+0x55/0x80
      [  446.217881]  [<ffffffff814f28ea>] ? atomic_notifier_call_chain+0x1a/0x20
      [  446.217881]  [<ffffffff81096bce>] ? notify_die+0x2e/0x30
      [  446.217881]  [<ffffffff814f0503>] ? do_nmi+0x173/0x2b0
      [  446.217881]  [<ffffffff814efe10>] ? nmi+0x20/0x30
      [  446.217881]  [<ffffffff814ef67c>] ? _spin_lock+0x1c/0x30
      [  446.217881]  <<EOE>>  [<ffffffffa07473bf>] ? cfs_trace_lock_tcd+0x2f/0x90 [libcfs]
      [  446.217881]  [<ffffffffa0750b0a>] ? libcfs_debug_vmsg2+0xda/0xb60 [libcfs]
      [  446.217881]  [<ffffffffa07515d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [  446.217881]  [<ffffffffa07515d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [  446.217881]  [<ffffffffa0b5768d>] ? kiblnd_scheduler+0x16d/0x620 [ko2iblnd]
      [  446.217881]  [<ffffffff8105fa50>] ? default_wake_function+0x0/0x20
      [  446.217881]  [<ffffffffa0b57520>] ? kiblnd_scheduler+0x0/0x620 [ko2iblnd]
      [  446.217881]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [  446.217881]  [<ffffffffa0b57520>] ? kiblnd_scheduler+0x0/0x620 [ko2iblnd]
      [  446.217881]  [<ffffffffa0b57520>] ? kiblnd_scheduler+0x0/0x620 [ko2iblnd]
      [  446.217881]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      The other thread, ktracefiled on CPU 0, is in IRQ context:

      PID: 9855   TASK: ffff880431898b40  CPU: 0   COMMAND: "ktracefiled"
       #0 [ffff880028207e90] crash_nmi_callback at ffffffff81029716
       #1 [ffff880028207ea0] notifier_call_chain at ffffffff814f2885
       #2 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff814f28ea
       #3 [ffff880028207ef0] notify_die at ffffffff81096bce
       #4 [ffff880028207f20] do_nmi at ffffffff814f0503
       #5 [ffff880028207f50] nmi at ffffffff814efe10
          [exception RIP: _spin_lock_irqsave+47]
          RIP: ffffffff814ef56f  RSP: ffff880028203de8  RFLAGS: 00000097
          RAX: 00000000000055c2  RBX: ffff88040df2b800  RCX: 00000000000055c1
          RDX: 0000000000000046  RSI: ffff88040df2b800  RDI: ffffffffa0b71c30
          RBP: ffff880028203de8   R8: ffff8803fbb04000   R9: ffff8803fbb05de0
          R10: ffff880028403a40  R11: 0000000000000000  R12: 0000000000000087
          R13: 0000000000000000  R14: ffff8804318da000  R15: ffff8803fbb05f00
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff880028203de8] _spin_lock_irqsave at ffffffff814ef56f
       #7 [ffff880028203df0] kiblnd_cq_completion at ffffffffa0b51548 [ko2iblnd]   <== this is line 3277 (cfs_spin_lock_irqsave(&kiblnd_data.kib_sched_lock, flags);) in o2iblnd_cb.c
       #8 [ffff880028203e10] mthca_cq_completion at ffffffffa019ca4a [ib_mthca]
       #9 [ffff880028203e30] mthca_eq_int at ffffffffa019c422 [ib_mthca]
      #10 [ffff880028203eb0] mthca_arbel_msi_x_interrupt at ffffffffa019c694 [ib_mthca]
      #11 [ffff880028203ed0] handle_IRQ_event at ffffffff810d94f0
      #12 [ffff880028203f20] handle_edge_irq at ffffffff810dbc2e
      #13 [ffff880028203f60] handle_irq at ffffffff8100df09
      #14 [ffff880028203f80] do_IRQ at ffffffff814f504c
      --- <IRQ stack> ---
      #15 [ffff8803fbb05d38] ret_from_intr at ffffffff8100ba53
          [exception RIP: put_pages_on_tcd_daemon_list+255]
          RIP: ffffffffa075072f  RSP: ffff8803fbb05de0  RFLAGS: 00000246
          RAX: ffffea000c9a82a8  RBX: ffff8803fbb05e20  RCX: ffff880390407b88
          RDX: ffff88040d080128  RSI: ffff880390407b48  RDI: ffff880390407b68
          RBP: ffffffff8100ba4e   R8: ffff88040d080128   R9: 0000000000000000
          R10: ffff880028403a40  R11: 0000000000000000  R12: ffff8803fbb05d70
          R13: ffff8803fbb05f00  R14: ffff8803fbb05ef0  R15: 0000000000000286
          ORIG_RAX: ffffffffffffff46  CS: 0010  SS: 0018
      #16 [ffff8803fbb05e28] put_pages_on_daemon_list at ffffffffa075094b [libcfs]  <== this is line 575 (cfs_tcd_for_each_type_lock) in tracefile.c
      #17 [ffff8803fbb05e78] tracefiled at ffffffffa0751d4f [libcfs]
      #18 [ffff8803fbb05f48] kernel_thread at ffffffff8100c14a
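
      Read together, the two traces look like a lock inversion across IRQ context. The following is a minimal user-space sketch of that reading, not Lustre code. It rests on one assumption the traces do not show directly: that kiblnd_scheduler on CPU 2 holds kib_sched_lock when it calls libcfs_debug_msg. All type, function and enum names below are hypothetical; only the two lock roles (the tcd lock and kib_sched_lock) come from the traces.

```c
#include <stdbool.h>

/* Hypothetical model of the two traces; none of these names exist in Lustre. */
enum lock_id { TCD_LOCK, KIB_SCHED_LOCK, NUM_LOCKS };
enum cpu_id  { CPU0_KTRACEFILED, CPU2_KIBLND_SD, NUM_CPUS };

#define NO_CPU  (-1)
#define NO_LOCK (-1)

struct lock_state {
    int owner[NUM_LOCKS];      /* CPU holding each lock, or NO_CPU   */
    int waiting_for[NUM_CPUS]; /* lock each CPU spins on, or NO_LOCK */
};

/* Follow the wait-for chain from 'start'; returning to 'start' is a deadlock. */
static bool deadlocked(const struct lock_state *s, int start)
{
    int cpu = start;

    for (int hops = 0; hops < NUM_CPUS; hops++) {
        int lock = s->waiting_for[cpu];

        if (lock == NO_LOCK)
            return false;   /* this CPU is not spinning, it can progress */
        cpu = s->owner[lock];
        if (cpu == NO_CPU)
            return false;   /* lock is free, the spinner will acquire it */
        if (cpu == start)
            return true;    /* cycle in the wait-for graph */
    }
    return false;
}

/*
 * Build the state described by the traces.
 * irqs_off_under_tcd: whether the tcd lock holder runs with IRQs disabled.
 */
static struct lock_state build_scenario(bool irqs_off_under_tcd)
{
    struct lock_state s = {
        .owner       = { NO_CPU, NO_CPU },
        .waiting_for = { NO_LOCK, NO_LOCK },
    };

    /* CPU 0: ktracefiled takes the tcd lock in put_pages_on_daemon_list. */
    s.owner[TCD_LOCK] = CPU0_KTRACEFILED;

    /* CPU 2 (ASSUMPTION): the scheduler holds kib_sched_lock and logs
     * under it, so it spins on the tcd lock in cfs_trace_lock_tcd. */
    s.owner[KIB_SCHED_LOCK] = CPU2_KIBLND_SD;
    s.waiting_for[CPU2_KIBLND_SD] = TCD_LOCK;

    /* CPU 0: with IRQs enabled, the completion interrupt runs on top of
     * the lock holder and spins on kib_sched_lock in kiblnd_cq_completion. */
    if (!irqs_off_under_tcd)
        s.waiting_for[CPU0_KTRACEFILED] = KIB_SCHED_LOCK;

    return s;
}
```

      In this model, build_scenario(false) yields a cycle from either CPU, while build_scenario(true) does not, because the interrupt can no longer preempt the tcd lock holder. Not emitting debug messages while holding kib_sched_lock would break the same cycle from the other side. The model does not say which approach the actual fix took.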
      

          People

            green Oleg Drokin
            green Oleg Drokin