LU-7886: Hard lockup from debug logging


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.5
    • Labels: None
    • Environment: 2.5.5-g1241c21-CHANGED-2.6.32-573.12.1.el6.atlas.x86_64
    • Severity: 3

    Description

      An OSS panicked because a watchdog detected a hard lockup.

      [1643703.194240] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      [1643703.194243] Pid: 42640, comm: libcfs_debug_du Tainted: G           --L------------    2.6.32-573.12.1.el6.atlas.x86_64 #1
      [1643703.194245] Call Trace:
      [1643703.194246]  <NMI>  [<ffffffff81538ae1>] ? panic+0xa7/0x16f
      [1643703.194257]  [<ffffffff810149c9>] ? sched_clock+0x9/0x10
      [1643703.194261]  [<ffffffff810ed8ad>] ? watchdog_overflow_callback+0xcd/0xd0
      [1643703.194264]  [<ffffffff81124427>] ? __perf_event_overflow+0xa7/0x240
      [1643703.194267]  [<ffffffff8101dc54>] ? x86_perf_event_set_period+0xf4/0x180
      [1643703.194270]  [<ffffffff81124a74>] ? perf_event_overflow+0x14/0x20
      [1643703.194272]  [<ffffffff81024a02>] ? intel_pmu_handle_irq+0x202/0x3f0
      [1643703.194276]  [<ffffffff8153dde9>] ? perf_event_nmi_handler+0x39/0xb0
      [1643703.194278]  [<ffffffff8153f8a5>] ? notifier_call_chain+0x55/0x80
      [1643703.194284]  [<ffffffff8153f90a>] ? atomic_notifier_call_chain+0x1a/0x20
      [1643703.194287]  [<ffffffff810a783e>] ? notify_die+0x2e/0x30
      [1643703.194289]  [<ffffffff8153d563>] ? do_nmi+0x1c3/0x350
      [1643703.194292]  [<ffffffff8153ce20>] ? nmi+0x20/0x30
      [1643703.194295]  [<ffffffff8153c5b8>] ? _spin_lock_irq+0x28/0x40
      [1643703.194296]  <<EOE>>  [<ffffffffa0461445>] ? cfs_trace_lock_tcd+0x95/0xa0 [libcfs]
      [1643703.194318]  [<ffffffffa047202b>] ? collect_pages+0x25b/0x290 [libcfs]
      [1643703.194327]  [<ffffffffa047270c>] ? cfs_tracefile_dump_all_pages+0x5c/0x2e0 [libcfs]
      [1643703.194330]  [<ffffffff81538bea>] ? printk+0x41/0x47
      [1643703.194338]  [<ffffffffa046c720>] ? libcfs_debug_dumplog_thread+0x0/0x30 [libcfs]
      [1643703.194347]  [<ffffffffa046c705>] ? libcfs_debug_dumplog_internal+0xb5/0xd0 [libcfs]
      [1643703.194356]  [<ffffffffa046c72e>] ? libcfs_debug_dumplog_thread+0xe/0x30 [libcfs]
      [1643703.194360]  [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
      [1643703.194363]  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      [1643703.194365]  [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      [1643703.194367]  [<ffffffff8100c280>] ? child_rip+0x0/0x20
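
      For context, cfs_trace_lock_tcd picks its locking primitive based on the TCD type and on whether the caller is walking every CPU's buffer, as collect_pages does. Roughly, paraphrasing the 2.5-era libcfs tracefile code (a sketch, not verbatim source):

      /* Paraphrase of libcfs's cfs_trace_lock_tcd; not verbatim. */
      int cfs_trace_lock_tcd(struct cfs_trace_cpu_data *tcd, int walking)
      {
              if (tcd->tcd_type == CFS_TCD_TYPE_IRQ)
                      spin_lock_irqsave(&tcd->tcd_lock, tcd->tcd_lock_flags);
              else if (tcd->tcd_type == CFS_TCD_TYPE_SOFTIRQ)
                      spin_lock_bh(&tcd->tcd_lock);
              else if (walking)       /* dumping other CPUs' buffers */
                      spin_lock_irq(&tcd->tcd_lock);
              else                    /* ordinary logging on this CPU */
                      spin_lock(&tcd->tcd_lock);
              return 1;
      }

      So a dump thread stuck in _spin_lock_irq must be in the walking branch, on a CFS_TCD_TYPE_PROC (type 0) buffer.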
      

      CPU2 was spinning on a tcd spinlock. We captured a crash dump; here's the register dump:

      #12 [ffff880044646f50] nmi at ffffffff8153ce20
          [exception RIP: _spin_lock_irq+40]
          RIP: ffffffff8153c5b8  RSP: ffff8803e8187da0  RFLAGS: 00000097
          RAX: 00000000000003b1  RBX: ffff88080ce90600  RCX: 0000000000000600
          RDX: 00000000000003b0  RSI: 0000000000000001  RDI: ffff88080ce90600
          RBP: ffff8803e8187da0   R8: ffffffff81c17c00   R9: 0000000000000000
          R10: 0000000000000010  R11: 0000000000000001  R12: 0000000000000001
          R13: ffff88080ce90000  R14: ffffffffa0497e88  R15: ffff88080ce90600
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
      #13 [ffff8803e8187da0] _spin_lock_irq at ffffffff8153c5b8
      

      Which tcd lock was it spinning on? It's in spin_lock_irq, so per the cfs_trace_lock_tcd sketch above, the TCD type was 0 (CFS_TCD_TYPE_PROC). Let's grab our 16 lock addresses:

      crash> p &cfs_trace_data[0][0][0]->tcd.tcd_lock
      $21 = (spinlock_t *) 0xffff88080ce90000
      crash> p &cfs_trace_data[0][0][1]->tcd.tcd_lock
      $22 = (spinlock_t *) 0xffff88080ce90080
      crash> p &cfs_trace_data[0][0][2]->tcd.tcd_lock
      $23 = (spinlock_t *) 0xffff88080ce90100
      crash> p &cfs_trace_data[0][0][3]->tcd.tcd_lock
      $24 = (spinlock_t *) 0xffff88080ce90180
      crash> p &cfs_trace_data[0][0][4]->tcd.tcd_lock
      $25 = (spinlock_t *) 0xffff88080ce90200
      crash> p &cfs_trace_data[0][0][5]->tcd.tcd_lock
      $26 = (spinlock_t *) 0xffff88080ce90280
      crash> p &cfs_trace_data[0][0][6]->tcd.tcd_lock
      $27 = (spinlock_t *) 0xffff88080ce90300
      crash> p &cfs_trace_data[0][0][7]->tcd.tcd_lock
      $28 = (spinlock_t *) 0xffff88080ce90380
      crash> p &cfs_trace_data[0][0][8]->tcd.tcd_lock
      $29 = (spinlock_t *) 0xffff88080ce90400
      crash> p &cfs_trace_data[0][0][9]->tcd.tcd_lock
      $30 = (spinlock_t *) 0xffff88080ce90480
      crash> p &cfs_trace_data[0][0][10]->tcd.tcd_lock
      $31 = (spinlock_t *) 0xffff88080ce90500
      crash> p &cfs_trace_data[0][0][11]->tcd.tcd_lock
      $32 = (spinlock_t *) 0xffff88080ce90580
      crash> p &cfs_trace_data[0][0][12]->tcd.tcd_lock
      $33 = (spinlock_t *) 0xffff88080ce90600
      crash> p &cfs_trace_data[0][0][13]->tcd.tcd_lock
      $34 = (spinlock_t *) 0xffff88080ce90680
      crash> p &cfs_trace_data[0][0][14]->tcd.tcd_lock
      $35 = (spinlock_t *) 0xffff88080ce90700
      crash> p &cfs_trace_data[0][0][15]->tcd.tcd_lock
      $36 = (spinlock_t *) 0xffff88080ce90780
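
      As a quick sanity check on the offset arithmetic (a standalone snippet; the two addresses are copied from the dump above, and the 0x80 stride is just the observed spacing of the per-CPU unions):

      #include <assert.h>
      #include <stdint.h>

      int main(void)
      {
              uintptr_t base = 0xffff88080ce90000ULL;  /* cfs_trace_data[0][0][0] */
              uintptr_t hot  = 0xffff88080ce90600ULL;  /* address in the registers */

              assert((hot - base) % 0x80 == 0);   /* lands on a tcd_lock slot */
              assert((hot - base) / 0x80 == 12);  /* ...and it's slot 12: CPU12 */
              return 0;
      }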
      

      The lock address for CPU12 (0xffff88080ce90600) shows up in several registers (RBX, RDI, and R15), so it seems a likely culprit. What's he doing?

      PID: 13581  TASK: ffff880e348baab0  CPU: 12  COMMAND: "ll_ost_io03_058"
       #0 [ffff88089c486e90] crash_nmi_callback at ffffffff81033cf6
       #1 [ffff88089c486ea0] notifier_call_chain at ffffffff8153f8a5
       #2 [ffff88089c486ee0] atomic_notifier_call_chain at ffffffff8153f90a
       #3 [ffff88089c486ef0] notify_die at ffffffff810a783e
       #4 [ffff88089c486f20] do_nmi at ffffffff8153d563
       #5 [ffff88089c486f50] nmi at ffffffff8153ce20
          [exception RIP: native_read_tsc+6]
          RIP: ffffffff810156e6  RSP: ffff880e332b3720  RFLAGS: 00000046
          RAX: 00000000433eaf0e  RBX: 000000000000000c  RCX: 00000000433eaed6
          RDX: 000000000020692a  RSI: 000000000027aacd  RDI: 0000000000000a28
          RBP: ffff880e332b3720   R8: 0000000000eb986a   R9: ffff8800000bd000
          R10: 0000000000000000  R11: 0000000000000198  R12: 00000000433ea79a
          R13: 000000000000000c  R14: 0000000000000a28  R15: ffffffff8136aec0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff880e332b3720] native_read_tsc at ffffffff810156e6
       #7 [ffff880e332b3728] delay_tsc at ffffffff8129d2ba
       #8 [ffff880e332b3758] __const_udelay at ffffffff8129d266
       #9 [ffff880e332b3768] wait_for_xmitr at ffffffff8136ae42
      #10 [ffff880e332b3798] serial8250_console_putchar at ffffffff8136aee6
      #11 [ffff880e332b37b8] uart_console_write at ffffffff8136688e
      #12 [ffff880e332b37f8] serial8250_console_write at ffffffff8136b24d
      #13 [ffff880e332b3848] __call_console_drivers at ffffffff810775f5
      #14 [ffff880e332b3878] _call_console_drivers at ffffffff8107765a
      #15 [ffff880e332b3898] release_console_sem at ffffffff81077ca8
      #16 [ffff880e332b38d8] vprintk at ffffffff810783a8
      #17 [ffff880e332b3978] printk at ffffffff81538bea
      #18 [ffff880e332b39d8] cfs_print_to_console at ffffffffa0461114 [libcfs]
      #19 [ffff880e332b3a08] libcfs_debug_vmsg2 at ffffffffa04716ae [libcfs]
      #20 [ffff880e332b3b78] _debug_req at ffffffffa0753823 [ptlrpc]
      #21 [ffff880e332b3da8] ptlrpc_main at ffffffffa0764b42 [ptlrpc]
      #22 [ffff880e332b3ee8] kthread at ffffffff810a0fce
      #23 [ffff880e332b3f48] kernel_thread at ffffffff8100c28a
      

      CPU12 is running libcfs_debug_vmsg2, so he would be holding a local tcd lock. He's sending data out on the serial console (it's serial-over-LAN through the BMC). Running 'dmesg' inside crash shows the panic happened at 1643703, but 1611 messages were logged at time 1643640, and that's the last entry we see on our central logging server. Was the log volume so high, or the serial console so slow, that it denied access to the spinlock for the full 10-second watchdog threshold?
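
      A back-of-the-envelope drain time makes that plausible. Assuming the SOL console runs at 115200 baud with 8N1 framing and an average console line of ~150 bytes (both assumptions, not measured values):

      #include <stdio.h>

      int main(void)
      {
              const double baud     = 115200.0;    /* assumed SOL line rate  */
              const double byte_sec = baud / 10.0; /* 8N1: ~10 bits per byte */
              const long   nmsgs    = 1611;        /* from dmesg in the dump */
              const long   avg_len  = 150;         /* assumed bytes per line */

              /* ~21 seconds, comfortably past the 10s hard-lockup threshold */
              printf("drain time: ~%.0f s\n", nmsgs * avg_len / byte_sec);
              return 0;
      }

      At slower SOL rates (9600 baud is common), the drain time would be measured in minutes.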

      Note that we had console_ratelimit disabled due to LU-7867, to make sure we see all the problematic objects. If the above hypothesis is true, then enabling console_ratelimit would likely have prevented the crash. Is disabling it just unsafe, or is there something we can do to improve it?
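
      For reference on what enabling it buys us: libcfs console ratelimiting is not a blanket drop; per call site it widens a quiet window while messages keep arriving and resets once the site goes quiet. A minimal sketch of that idea (a hypothetical simplification, not the actual libcfs cdls code):

      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      /* Hypothetical per-callsite limiter in the spirit of libcfs's cdls:
       * print at most one message per window, doubling the window while
       * the site stays noisy and resetting it once the site goes quiet. */
      struct ratelimit {
              time_t next_ok; /* earliest time the next message may print */
              int    delay;   /* current window, seconds */
              int    skipped; /* messages suppressed since the last print */
      };

      static int console_allowed(struct ratelimit *rl)
      {
              time_t now = time(NULL);

              if (now < rl->next_ok) {
                      rl->skipped++;          /* suppress inside the window */
                      return 0;
              }
              if (rl->skipped) {
                      printf("(skipped %d similar messages)\n", rl->skipped);
                      rl->delay = rl->delay * 2 > 600 ? 600 : rl->delay * 2;
              } else {
                      rl->delay = 1;          /* quiet site: reset backoff */
              }
              rl->skipped = 0;
              rl->next_ok = now + rl->delay;
              return 1;
      }

      int main(void)
      {
              struct ratelimit rl = { .next_ok = 0, .delay = 1, .skipped = 0 };

              for (int i = 0; i < 30; i++) {
                      if (console_allowed(&rl))
                              printf("message %d\n", i);
                      usleep(100000);         /* offer 10 messages/sec */
              }
              return 0;
      }

      A limiter like this caps how long a noisy site can monopolize a slow console, at the cost of dropping detail, which is exactly the trade-off LU-7867 ran into.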

          People

            Assignee: Jian Yu (yujian)
            Reporter: Matt Ezell (ezell)
