[LU-7886] Hard lockup from debug logging Created: 18/Mar/16  Updated: 18/Apr/16  Resolved: 18/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Matt Ezell Assignee: Jian Yu
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

2.5.5-g1241c21-CHANGED-2.6.32-573.12.1.el6.atlas.x86_64


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An OSS panicked because the watchdog detected a hard lockup.

[1643703.194240] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[1643703.194243] Pid: 42640, comm: libcfs_debug_du Tainted: G           --L------------    2.6.32-573.12.1.el6.atlas.x86_64 #1
[1643703.194245] Call Trace:
[1643703.194246]  <NMI>  [<ffffffff81538ae1>] ? panic+0xa7/0x16f
[1643703.194257]  [<ffffffff810149c9>] ? sched_clock+0x9/0x10
[1643703.194261]  [<ffffffff810ed8ad>] ? watchdog_overflow_callback+0xcd/0xd0
[1643703.194264]  [<ffffffff81124427>] ? __perf_event_overflow+0xa7/0x240
[1643703.194267]  [<ffffffff8101dc54>] ? x86_perf_event_set_period+0xf4/0x180
[1643703.194270]  [<ffffffff81124a74>] ? perf_event_overflow+0x14/0x20
[1643703.194272]  [<ffffffff81024a02>] ? intel_pmu_handle_irq+0x202/0x3f0
[1643703.194276]  [<ffffffff8153dde9>] ? perf_event_nmi_handler+0x39/0xb0
[1643703.194278]  [<ffffffff8153f8a5>] ? notifier_call_chain+0x55/0x80
[1643703.194284]  [<ffffffff8153f90a>] ? atomic_notifier_call_chain+0x1a/0x20
[1643703.194287]  [<ffffffff810a783e>] ? notify_die+0x2e/0x30
[1643703.194289]  [<ffffffff8153d563>] ? do_nmi+0x1c3/0x350
[1643703.194292]  [<ffffffff8153ce20>] ? nmi+0x20/0x30
[1643703.194295]  [<ffffffff8153c5b8>] ? _spin_lock_irq+0x28/0x40
[1643703.194296]  <<EOE>>  [<ffffffffa0461445>] ? cfs_trace_lock_tcd+0x95/0xa0 [libcfs]
[1643703.194318]  [<ffffffffa047202b>] ? collect_pages+0x25b/0x290 [libcfs]
[1643703.194327]  [<ffffffffa047270c>] ? cfs_tracefile_dump_all_pages+0x5c/0x2e0 [libcfs]
[1643703.194330]  [<ffffffff81538bea>] ? printk+0x41/0x47
[1643703.194338]  [<ffffffffa046c720>] ? libcfs_debug_dumplog_thread+0x0/0x30 [libcfs]
[1643703.194347]  [<ffffffffa046c705>] ? libcfs_debug_dumplog_internal+0xb5/0xd0 [libcfs]
[1643703.194356]  [<ffffffffa046c72e>] ? libcfs_debug_dumplog_thread+0xe/0x30 [libcfs]
[1643703.194360]  [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
[1643703.194363]  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
[1643703.194365]  [<ffffffff810a0f30>] ? kthread+0x0/0xc0
[1643703.194367]  [<ffffffff8100c280>] ? child_rip+0x0/0x20

CPU2 was spinning on a tcd spinlock. We captured a crash dump; here is the register dump:

#12 [ffff880044646f50] nmi at ffffffff8153ce20
    [exception RIP: _spin_lock_irq+40]
    RIP: ffffffff8153c5b8  RSP: ffff8803e8187da0  RFLAGS: 00000097
    RAX: 00000000000003b1  RBX: ffff88080ce90600  RCX: 0000000000000600
    RDX: 00000000000003b0  RSI: 0000000000000001  RDI: ffff88080ce90600
    RBP: ffff8803e8187da0   R8: ffffffff81c17c00   R9: 0000000000000000
    R10: 0000000000000010  R11: 0000000000000001  R12: 0000000000000001
    R13: ffff88080ce90000  R14: ffffffffa0497e88  R15: ffff88080ce90600
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#13 [ffff8803e8187da0] _spin_lock_irq at ffffffff8153c5b8

Which tcd lock was it spinning on? It's in spin_lock_irq, so we know the TCD type was 0 (see the cfs_trace_lock_tcd sketch after the address list below). Let's grab our 16 lock addresses:

crash> p &cfs_trace_data[0][0][0]->tcd.tcd_lock
$21 = (spinlock_t *) 0xffff88080ce90000
crash> p &cfs_trace_data[0][0][1]->tcd.tcd_lock
$22 = (spinlock_t *) 0xffff88080ce90080
crash> p &cfs_trace_data[0][0][2]->tcd.tcd_lock
$23 = (spinlock_t *) 0xffff88080ce90100
crash> p &cfs_trace_data[0][0][3]->tcd.tcd_lock
$24 = (spinlock_t *) 0xffff88080ce90180
crash> p &cfs_trace_data[0][0][4]->tcd.tcd_lock
$25 = (spinlock_t *) 0xffff88080ce90200
crash> p &cfs_trace_data[0][0][5]->tcd.tcd_lock
$26 = (spinlock_t *) 0xffff88080ce90280
crash> p &cfs_trace_data[0][0][6]->tcd.tcd_lock
$27 = (spinlock_t *) 0xffff88080ce90300
crash> p &cfs_trace_data[0][0][7]->tcd.tcd_lock
$28 = (spinlock_t *) 0xffff88080ce90380
crash> p &cfs_trace_data[0][0][8]->tcd.tcd_lock
$29 = (spinlock_t *) 0xffff88080ce90400
crash> p &cfs_trace_data[0][0][9]->tcd.tcd_lock
$30 = (spinlock_t *) 0xffff88080ce90480
crash> p &cfs_trace_data[0][0][10]->tcd.tcd_lock
$31 = (spinlock_t *) 0xffff88080ce90500
crash> p &cfs_trace_data[0][0][11]->tcd.tcd_lock
$32 = (spinlock_t *) 0xffff88080ce90580
crash> p &cfs_trace_data[0][0][12]->tcd.tcd_lock
$33 = (spinlock_t *) 0xffff88080ce90600
crash> p &cfs_trace_data[0][0][13]->tcd.tcd_lock
$34 = (spinlock_t *) 0xffff88080ce90680
crash> p &cfs_trace_data[0][0][14]->tcd.tcd_lock
$35 = (spinlock_t *) 0xffff88080ce90700
crash> p &cfs_trace_data[0][0][15]->tcd.tcd_lock
$36 = (spinlock_t *) 0xffff88080ce90780
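
To see why spin_lock_irq pins the TCD type down to 0: cfs_trace_lock_tcd chooses its locking primitive based on tcd_type, and the spin_lock_irq branch is only reached for process-context (type 0) TCDs when the caller is walking other CPUs' buffers, which is exactly what collect_pages does during a dump. The sketch below is reconstructed from memory of the 2.5-era libcfs source rather than copied from the tree, so treat the exact names and details as approximate:

int cfs_trace_lock_tcd(struct cfs_trace_cpu_data *tcd, int walking)
{
        if (tcd->tcd_type == CFS_TCD_TYPE_IRQ)
                spin_lock_irqsave(&tcd->tcd_lock, tcd->tcd_lock_flags);
        else if (tcd->tcd_type == CFS_TCD_TYPE_SOFTIRQ)
                spin_lock_bh(&tcd->tcd_lock);
        else if (walking)
                /* process-context (type 0) tcd taken while walking another
                 * CPU's buffers: interrupts are blocked, hence spin_lock_irq */
                spin_lock_irq(&tcd->tcd_lock);
        else
                spin_lock(&tcd->tcd_lock);
        return 1;
}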

The tcd lock address for CPU12 (0xffff88080ce90600) appears in several of CPU2's registers (RBX, RDI, R15), so it is the likely culprit. What is CPU12 doing?

PID: 13581  TASK: ffff880e348baab0  CPU: 12  COMMAND: "ll_ost_io03_058"
 #0 [ffff88089c486e90] crash_nmi_callback at ffffffff81033cf6
 #1 [ffff88089c486ea0] notifier_call_chain at ffffffff8153f8a5
 #2 [ffff88089c486ee0] atomic_notifier_call_chain at ffffffff8153f90a
 #3 [ffff88089c486ef0] notify_die at ffffffff810a783e
 #4 [ffff88089c486f20] do_nmi at ffffffff8153d563
 #5 [ffff88089c486f50] nmi at ffffffff8153ce20
    [exception RIP: native_read_tsc+6]
    RIP: ffffffff810156e6  RSP: ffff880e332b3720  RFLAGS: 00000046
    RAX: 00000000433eaf0e  RBX: 000000000000000c  RCX: 00000000433eaed6
    RDX: 000000000020692a  RSI: 000000000027aacd  RDI: 0000000000000a28
    RBP: ffff880e332b3720   R8: 0000000000eb986a   R9: ffff8800000bd000
    R10: 0000000000000000  R11: 0000000000000198  R12: 00000000433ea79a
    R13: 000000000000000c  R14: 0000000000000a28  R15: ffffffff8136aec0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880e332b3720] native_read_tsc at ffffffff810156e6
 #7 [ffff880e332b3728] delay_tsc at ffffffff8129d2ba
 #8 [ffff880e332b3758] __const_udelay at ffffffff8129d266
 #9 [ffff880e332b3768] wait_for_xmitr at ffffffff8136ae42
#10 [ffff880e332b3798] serial8250_console_putchar at ffffffff8136aee6
#11 [ffff880e332b37b8] uart_console_write at ffffffff8136688e
#12 [ffff880e332b37f8] serial8250_console_write at ffffffff8136b24d
#13 [ffff880e332b3848] __call_console_drivers at ffffffff810775f5
#14 [ffff880e332b3878] _call_console_drivers at ffffffff8107765a
#15 [ffff880e332b3898] release_console_sem at ffffffff81077ca8
#16 [ffff880e332b38d8] vprintk at ffffffff810783a8
#17 [ffff880e332b3978] printk at ffffffff81538bea
#18 [ffff880e332b39d8] cfs_print_to_console at ffffffffa0461114 [libcfs]
#19 [ffff880e332b3a08] libcfs_debug_vmsg2 at ffffffffa04716ae [libcfs]
#20 [ffff880e332b3b78] _debug_req at ffffffffa0753823 [ptlrpc]
#21 [ffff880e332b3da8] ptlrpc_main at ffffffffa0764b42 [ptlrpc]
#22 [ffff880e332b3ee8] kthread at ffffffff810a0fce
#23 [ffff880e332b3f48] kernel_thread at ffffffff8100c28a

CPU12 is running libcfs_debug_vmsg2, so it would be holding its local tcd lock. It is pushing data out on the serial console (serial-over-LAN through the BMC). Running 'dmesg' inside crash shows the panic happened at timestamp 1643703, but 1611 messages were logged with timestamp 1643640, which is also the last entry we see on our central logging server. Was the log volume so high, or the serial console so slow, that console output was able to deny access to the spinlock for 10 seconds?
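
Some back-of-the-envelope numbers for that hypothesis (the baud rate and message length here are assumptions, not measurements): a serial-over-LAN console is typically 115200 baud at best, which is roughly 11.5 KB/s of text, and is often configured at 9600 baud (under 1 KB/s). If those 1611 messages averaged around 120 bytes each, that is roughly 190 KB of console output: about 17 seconds at 115200 baud and several minutes at 9600 baud. Either figure comfortably exceeds the ~10 second window the hard-lockup watchdog allows a CPU to spin with interrupts disabled, so the hypothesis looks plausible.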

Note that we had console_ratelimit disabled due to LU-7867, to make sure we see all of the problematic objects. If my hypothesis above is correct, then having console_ratelimit enabled would likely have prevented the crash. Is disabling it simply an unsafe option, or is there something we can do to improve the situation?
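
If tuning rather than disabling turns out to be the right answer, the switch itself is the lnet tunable of the same name; the path below is what I would expect on this kernel/Lustre combination and should be verified locally before relying on it:

echo 1 > /proc/sys/lnet/console_ratelimit

There are also companion console delay/backoff tunables alongside it that throttle repeated messages without turning the console into an unlimited firehose, which may be a safer way to capture "everything" than switching rate limiting off entirely.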



 Comments   
Comment by Andreas Dilger [ 18/Mar/16 ]

When logging to a slow serial console, this can disable interrupts on the whole system while the output is in progress, and it can definitely cause major system hangs. Disabling the console message rate limiting makes this problem worse and is not recommended. If the OSS is not crashing, then it would be much better to run lctl debug_daemon (with appropriate options for max file size) to log all of the debug messages to disk instead of trying to pipe them via the serial console.
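
For reference, a minimal debug_daemon sequence looks roughly like the following; the file path and the 1024 MB size limit are only examples:

lctl debug_daemon start /var/tmp/lustre-debug.bin 1024
 ... reproduce the problem ...
lctl debug_daemon stop
lctl debug_file /var/tmp/lustre-debug.bin /var/tmp/lustre-debug.txt

The final lctl debug_file step converts the binary trace file into readable text.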

Comment by Bruno Faccini (Inactive) [ 22/Mar/16 ]

I fully agree with Andreas' comment that disabling console message rate limiting is a bad idea, but based on my previous experience with similar problems, I also want to add that this is much more dangerous when using a slow or buggy console data path (I2C/GPIO HW, BMC/IPMI FW, ...), so you may also want to raise this issue with your HW provider and verify there is no FW update that fixes related problems.

Comment by Jian Yu [ 31/Mar/16 ]

Hi Matt,

I saw the following meeting note about this ticket:

It would be better if we use the debug daemon, but it doesn't work.

Could you please describe the problem you hit while using lctl debug_daemon? Thank you.

FYI, here are the instructions for using lctl debug_daemon: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#idm140466054410624

Comment by Jian Yu [ 18/Apr/16 ]

Let's close this ticket, since the issue is expected behavior. New tickets can be created if new problems occur.
