LU-7886: Hard lockup from debug logging


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.5
    • Labels: None
    • Environment: 2.5.5-g1241c21-CHANGED-2.6.32-573.12.1.el6.atlas.x86_64
    • Severity: 3

    Description

      An OSS panicked because a watchdog detected a hard lockup.

      [1643703.194240] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      [1643703.194243] Pid: 42640, comm: libcfs_debug_du Tainted: G           --L------------    2.6.32-573.12.1.el6.atlas.x86_64 #1
      [1643703.194245] Call Trace:
      [1643703.194246]  <NMI>  [<ffffffff81538ae1>] ? panic+0xa7/0x16f
      [1643703.194257]  [<ffffffff810149c9>] ? sched_clock+0x9/0x10
      [1643703.194261]  [<ffffffff810ed8ad>] ? watchdog_overflow_callback+0xcd/0xd0
      [1643703.194264]  [<ffffffff81124427>] ? __perf_event_overflow+0xa7/0x240
      [1643703.194267]  [<ffffffff8101dc54>] ? x86_perf_event_set_period+0xf4/0x180
      [1643703.194270]  [<ffffffff81124a74>] ? perf_event_overflow+0x14/0x20
      [1643703.194272]  [<ffffffff81024a02>] ? intel_pmu_handle_irq+0x202/0x3f0
      [1643703.194276]  [<ffffffff8153dde9>] ? perf_event_nmi_handler+0x39/0xb0
      [1643703.194278]  [<ffffffff8153f8a5>] ? notifier_call_chain+0x55/0x80
      [1643703.194284]  [<ffffffff8153f90a>] ? atomic_notifier_call_chain+0x1a/0x20
      [1643703.194287]  [<ffffffff810a783e>] ? notify_die+0x2e/0x30
      [1643703.194289]  [<ffffffff8153d563>] ? do_nmi+0x1c3/0x350
      [1643703.194292]  [<ffffffff8153ce20>] ? nmi+0x20/0x30
      [1643703.194295]  [<ffffffff8153c5b8>] ? _spin_lock_irq+0x28/0x40
      [1643703.194296]  <<EOE>>  [<ffffffffa0461445>] ? cfs_trace_lock_tcd+0x95/0xa0 [libcfs]
      [1643703.194318]  [<ffffffffa047202b>] ? collect_pages+0x25b/0x290 [libcfs]
      [1643703.194327]  [<ffffffffa047270c>] ? cfs_tracefile_dump_all_pages+0x5c/0x2e0 [libcfs]
      [1643703.194330]  [<ffffffff81538bea>] ? printk+0x41/0x47
      [1643703.194338]  [<ffffffffa046c720>] ? libcfs_debug_dumplog_thread+0x0/0x30 [libcfs]
      [1643703.194347]  [<ffffffffa046c705>] ? libcfs_debug_dumplog_internal+0xb5/0xd0 [libcfs]
      [1643703.194356]  [<ffffffffa046c72e>] ? libcfs_debug_dumplog_thread+0xe/0x30 [libcfs]
      [1643703.194360]  [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
      [1643703.194363]  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      [1643703.194365]  [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      [1643703.194367]  [<ffffffff8100c280>] ? child_rip+0x0/0x20
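
      For context, cfs_trace_lock_tcd picks its locking primitive based on the TCD type and on whether the caller is walking every CPU's buffer, as collect_pages does. Roughly, paraphrasing the 2.5-era libcfs tracefile code (a sketch, not verbatim source):

      /* Paraphrase of libcfs's cfs_trace_lock_tcd; not verbatim. */
      int cfs_trace_lock_tcd(struct cfs_trace_cpu_data *tcd, int walking)
      {
              if (tcd->tcd_type == CFS_TCD_TYPE_IRQ)
                      spin_lock_irqsave(&tcd->tcd_lock, tcd->tcd_lock_flags);
              else if (tcd->tcd_type == CFS_TCD_TYPE_SOFTIRQ)
                      spin_lock_bh(&tcd->tcd_lock);
              else if (walking)       /* dumping other CPUs' buffers */
                      spin_lock_irq(&tcd->tcd_lock);
              else                    /* ordinary logging on this CPU */
                      spin_lock(&tcd->tcd_lock);
              return 1;
      }

      So a dump thread stuck in _spin_lock_irq must be in the walking branch, on a CFS_TCD_TYPE_PROC (type 0) buffer.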
      

      CPU2 was spinning on a tcd spinlock. We captured a crash dump; here's the register dump:

      #12 [ffff880044646f50] nmi at ffffffff8153ce20
          [exception RIP: _spin_lock_irq+40]
          RIP: ffffffff8153c5b8  RSP: ffff8803e8187da0  RFLAGS: 00000097
          RAX: 00000000000003b1  RBX: ffff88080ce90600  RCX: 0000000000000600
          RDX: 00000000000003b0  RSI: 0000000000000001  RDI: ffff88080ce90600
          RBP: ffff8803e8187da0   R8: ffffffff81c17c00   R9: 0000000000000000
          R10: 0000000000000010  R11: 0000000000000001  R12: 0000000000000001
          R13: ffff88080ce90000  R14: ffffffffa0497e88  R15: ffff88080ce90600
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
      #13 [ffff8803e8187da0] _spin_lock_irq at ffffffff8153c5b8
      

      Which tcd lock was it spinning on? It's in spin_lock_irq, so per the cfs_trace_lock_tcd sketch above, the TCD type was 0 (CFS_TCD_TYPE_PROC). Let's grab our 16 lock addresses:

      crash> p &cfs_trace_data[0][0][0]->tcd.tcd_lock
      $21 = (spinlock_t *) 0xffff88080ce90000
      crash> p &cfs_trace_data[0][0][1]->tcd.tcd_lock
      $22 = (spinlock_t *) 0xffff88080ce90080
      crash> p &cfs_trace_data[0][0][2]->tcd.tcd_lock
      $23 = (spinlock_t *) 0xffff88080ce90100
      crash> p &cfs_trace_data[0][0][3]->tcd.tcd_lock
      $24 = (spinlock_t *) 0xffff88080ce90180
      crash> p &cfs_trace_data[0][0][4]->tcd.tcd_lock
      $25 = (spinlock_t *) 0xffff88080ce90200
      crash> p &cfs_trace_data[0][0][5]->tcd.tcd_lock
      $26 = (spinlock_t *) 0xffff88080ce90280
      crash> p &cfs_trace_data[0][0][6]->tcd.tcd_lock
      $27 = (spinlock_t *) 0xffff88080ce90300
      crash> p &cfs_trace_data[0][0][7]->tcd.tcd_lock
      $28 = (spinlock_t *) 0xffff88080ce90380
      crash> p &cfs_trace_data[0][0][8]->tcd.tcd_lock
      $29 = (spinlock_t *) 0xffff88080ce90400
      crash> p &cfs_trace_data[0][0][9]->tcd.tcd_lock
      $30 = (spinlock_t *) 0xffff88080ce90480
      crash> p &cfs_trace_data[0][0][10]->tcd.tcd_lock
      $31 = (spinlock_t *) 0xffff88080ce90500
      crash> p &cfs_trace_data[0][0][11]->tcd.tcd_lock
      $32 = (spinlock_t *) 0xffff88080ce90580
      crash> p &cfs_trace_data[0][0][12]->tcd.tcd_lock
      $33 = (spinlock_t *) 0xffff88080ce90600
      crash> p &cfs_trace_data[0][0][13]->tcd.tcd_lock
      $34 = (spinlock_t *) 0xffff88080ce90680
      crash> p &cfs_trace_data[0][0][14]->tcd.tcd_lock
      $35 = (spinlock_t *) 0xffff88080ce90700
      crash> p &cfs_trace_data[0][0][15]->tcd.tcd_lock
      $36 = (spinlock_t *) 0xffff88080ce90780
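
      As a quick sanity check on the offset arithmetic (a standalone snippet; the two addresses are copied from the dump above, and the 0x80 stride is just the observed spacing of the per-CPU unions):

      #include <assert.h>
      #include <stdint.h>

      int main(void)
      {
              uintptr_t base = 0xffff88080ce90000ULL;  /* cfs_trace_data[0][0][0] */
              uintptr_t hot  = 0xffff88080ce90600ULL;  /* address in the registers */

              assert((hot - base) % 0x80 == 0);   /* lands on a tcd_lock slot */
              assert((hot - base) / 0x80 == 12);  /* ...and it's slot 12: CPU12 */
              return 0;
      }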
      

      The lock address for CPU12 (0xffff88080ce90600) shows up in several registers (RBX, RDI, and R15), so it seems a likely culprit. What's he doing?

      PID: 13581  TASK: ffff880e348baab0  CPU: 12  COMMAND: "ll_ost_io03_058"
       #0 [ffff88089c486e90] crash_nmi_callback at ffffffff81033cf6
       #1 [ffff88089c486ea0] notifier_call_chain at ffffffff8153f8a5
       #2 [ffff88089c486ee0] atomic_notifier_call_chain at ffffffff8153f90a
       #3 [ffff88089c486ef0] notify_die at ffffffff810a783e
       #4 [ffff88089c486f20] do_nmi at ffffffff8153d563
       #5 [ffff88089c486f50] nmi at ffffffff8153ce20
          [exception RIP: native_read_tsc+6]
          RIP: ffffffff810156e6  RSP: ffff880e332b3720  RFLAGS: 00000046
          RAX: 00000000433eaf0e  RBX: 000000000000000c  RCX: 00000000433eaed6
          RDX: 000000000020692a  RSI: 000000000027aacd  RDI: 0000000000000a28
          RBP: ffff880e332b3720   R8: 0000000000eb986a   R9: ffff8800000bd000
          R10: 0000000000000000  R11: 0000000000000198  R12: 00000000433ea79a
          R13: 000000000000000c  R14: 0000000000000a28  R15: ffffffff8136aec0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff880e332b3720] native_read_tsc at ffffffff810156e6
       #7 [ffff880e332b3728] delay_tsc at ffffffff8129d2ba
       #8 [ffff880e332b3758] __const_udelay at ffffffff8129d266
       #9 [ffff880e332b3768] wait_for_xmitr at ffffffff8136ae42
      #10 [ffff880e332b3798] serial8250_console_putchar at ffffffff8136aee6
      #11 [ffff880e332b37b8] uart_console_write at ffffffff8136688e
      #12 [ffff880e332b37f8] serial8250_console_write at ffffffff8136b24d
      #13 [ffff880e332b3848] __call_console_drivers at ffffffff810775f5
      #14 [ffff880e332b3878] _call_console_drivers at ffffffff8107765a
      #15 [ffff880e332b3898] release_console_sem at ffffffff81077ca8
      #16 [ffff880e332b38d8] vprintk at ffffffff810783a8
      #17 [ffff880e332b3978] printk at ffffffff81538bea
      #18 [ffff880e332b39d8] cfs_print_to_console at ffffffffa0461114 [libcfs]
      #19 [ffff880e332b3a08] libcfs_debug_vmsg2 at ffffffffa04716ae [libcfs]
      #20 [ffff880e332b3b78] _debug_req at ffffffffa0753823 [ptlrpc]
      #21 [ffff880e332b3da8] ptlrpc_main at ffffffffa0764b42 [ptlrpc]
      #22 [ffff880e332b3ee8] kthread at ffffffff810a0fce
      #23 [ffff880e332b3f48] kernel_thread at ffffffff8100c28a
      

      CPU12 is running libcfs_debug_vmsg2, so he would be holding a local tcd lock. He's sending data out on the serial console (it's serial-over-LAN through the BMC). Running 'dmesg' inside crash shows the panic happened at 1643703, but 1611 messages were logged at time 1643640, and that's the last entry we see on our central logging server. Was the log volume so high, or the serial console so slow, that it denied access to the spinlock for the full 10-second watchdog threshold?
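
      A back-of-the-envelope drain time makes that plausible. Assuming the SOL console runs at 115200 baud with 8N1 framing and an average console line of ~150 bytes (both assumptions, not measured values):

      #include <stdio.h>

      int main(void)
      {
              const double baud     = 115200.0;    /* assumed SOL line rate  */
              const double byte_sec = baud / 10.0; /* 8N1: ~10 bits per byte */
              const long   nmsgs    = 1611;        /* from dmesg in the dump */
              const long   avg_len  = 150;         /* assumed bytes per line */

              /* ~21 seconds, comfortably past the 10s hard-lockup threshold */
              printf("drain time: ~%.0f s\n", nmsgs * avg_len / byte_sec);
              return 0;
      }

      At slower SOL rates (9600 baud is common), the drain time would be measured in minutes.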

      Note that we had console_ratelimit disabled due to LU-7867, to make sure we see all the problematic objects. If the above hypothesis is true, then enabling console_ratelimit would likely have prevented the crash. Is disabling it just unsafe, or is there something we can do to improve it?
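
      For reference on what enabling it buys us: libcfs console ratelimiting is not a blanket drop; per call site it widens a quiet window while messages keep arriving and resets once the site goes quiet. A minimal sketch of that idea (a hypothetical simplification, not the actual libcfs cdls code):

      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      /* Hypothetical per-callsite limiter in the spirit of libcfs's cdls:
       * print at most one message per window, doubling the window while
       * the site stays noisy and resetting it once the site goes quiet. */
      struct ratelimit {
              time_t next_ok; /* earliest time the next message may print */
              int    delay;   /* current window, seconds */
              int    skipped; /* messages suppressed since the last print */
      };

      static int console_allowed(struct ratelimit *rl)
      {
              time_t now = time(NULL);

              if (now < rl->next_ok) {
                      rl->skipped++;          /* suppress inside the window */
                      return 0;
              }
              if (rl->skipped) {
                      printf("(skipped %d similar messages)\n", rl->skipped);
                      rl->delay = rl->delay * 2 > 600 ? 600 : rl->delay * 2;
              } else {
                      rl->delay = 1;          /* quiet site: reset backoff */
              }
              rl->skipped = 0;
              rl->next_ok = now + rl->delay;
              return 1;
      }

      int main(void)
      {
              struct ratelimit rl = { .next_ok = 0, .delay = 1, .skipped = 0 };

              for (int i = 0; i < 30; i++) {
                      if (console_allowed(&rl))
                              printf("message %d\n", i);
                      usleep(100000);         /* offer 10 messages/sec */
              }
              return 0;
      }

      A limiter like this caps how long a noisy site can monopolize a slow console, at the cost of dropping detail, which is exactly the trade-off LU-7867 ran into.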

          People

            Assignee: Jian Yu (yujian)
            Reporter: Matt Ezell (ezell)
