[LU-7886] Hard lockup from debug logging Created: 18/Mar/16 Updated: 18/Apr/16 Resolved: 18/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Matt Ezell | Assignee: | Jian Yu |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: | 2.5.5-g1241c21-CHANGED-2.6.32-573.12.1.el6.atlas.x86_64 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
An OSS panicked because a watchdog detected a hard lockup:

[1643703.194240] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[1643703.194243] Pid: 42640, comm: libcfs_debug_du Tainted: G --L------------ 2.6.32-573.12.1.el6.atlas.x86_64 #1
[1643703.194245] Call Trace:
[1643703.194246] <NMI> [<ffffffff81538ae1>] ? panic+0xa7/0x16f
[1643703.194257] [<ffffffff810149c9>] ? sched_clock+0x9/0x10
[1643703.194261] [<ffffffff810ed8ad>] ? watchdog_overflow_callback+0xcd/0xd0
[1643703.194264] [<ffffffff81124427>] ? __perf_event_overflow+0xa7/0x240
[1643703.194267] [<ffffffff8101dc54>] ? x86_perf_event_set_period+0xf4/0x180
[1643703.194270] [<ffffffff81124a74>] ? perf_event_overflow+0x14/0x20
[1643703.194272] [<ffffffff81024a02>] ? intel_pmu_handle_irq+0x202/0x3f0
[1643703.194276] [<ffffffff8153dde9>] ? perf_event_nmi_handler+0x39/0xb0
[1643703.194278] [<ffffffff8153f8a5>] ? notifier_call_chain+0x55/0x80
[1643703.194284] [<ffffffff8153f90a>] ? atomic_notifier_call_chain+0x1a/0x20
[1643703.194287] [<ffffffff810a783e>] ? notify_die+0x2e/0x30
[1643703.194289] [<ffffffff8153d563>] ? do_nmi+0x1c3/0x350
[1643703.194292] [<ffffffff8153ce20>] ? nmi+0x20/0x30
[1643703.194295] [<ffffffff8153c5b8>] ? _spin_lock_irq+0x28/0x40
[1643703.194296] <<EOE>> [<ffffffffa0461445>] ? cfs_trace_lock_tcd+0x95/0xa0 [libcfs]
[1643703.194318] [<ffffffffa047202b>] ? collect_pages+0x25b/0x290 [libcfs]
[1643703.194327] [<ffffffffa047270c>] ? cfs_tracefile_dump_all_pages+0x5c/0x2e0 [libcfs]
[1643703.194330] [<ffffffff81538bea>] ? printk+0x41/0x47
[1643703.194338] [<ffffffffa046c720>] ? libcfs_debug_dumplog_thread+0x0/0x30 [libcfs]
[1643703.194347] [<ffffffffa046c705>] ? libcfs_debug_dumplog_internal+0xb5/0xd0 [libcfs]
[1643703.194356] [<ffffffffa046c72e>] ? libcfs_debug_dumplog_thread+0xe/0x30 [libcfs]
[1643703.194360] [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
[1643703.194363] [<ffffffff8100c28a>] ? child_rip+0xa/0x20
[1643703.194365] [<ffffffff810a0f30>] ? kthread+0x0/0xc0
[1643703.194367] [<ffffffff8100c280>] ? child_rip+0x0/0x20

CPU2 was spinning on a tcd spinlock. We got a crashdump; here's the register dump:

#12 [ffff880044646f50] nmi at ffffffff8153ce20
[exception RIP: _spin_lock_irq+40]
RIP: ffffffff8153c5b8 RSP: ffff8803e8187da0 RFLAGS: 00000097
RAX: 00000000000003b1 RBX: ffff88080ce90600 RCX: 0000000000000600
RDX: 00000000000003b0 RSI: 0000000000000001 RDI: ffff88080ce90600
RBP: ffff8803e8187da0 R8: ffffffff81c17c00 R9: 0000000000000000
R10: 0000000000000010 R11: 0000000000000001 R12: 0000000000000001
R13: ffff88080ce90000 R14: ffffffffa0497e88 R15: ffff88080ce90600
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff8803e8187da0] _spin_lock_irq at ffffffff8153c5b8
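
On x86_64 the first function argument is passed in RDI, so the lock that _spin_lock_irq in frame #13 is spinning on should be the spinlock_t at RDI = 0xffff88080ce90600 (the same value also sits in RBX and R15). As a sketch, the raw lock word can be printed straight from the crash session with a standard gdb-style cast; the address below is the one taken from the register dump above:

crash> p *(spinlock_t *)0xffff88080ce90600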
Which tcd lock was it spinning on? It's in spin_lock_irq, so we know the TCD type was 0 (cfs_trace_lock_tcd uses spin_lock_irq only for the process-context TCDs, type 0, when walking the pages). Let's grab our 16 lock addresses:

crash> p &cfs_trace_data[0][0][0]->tcd.tcd_lock
$21 = (spinlock_t *) 0xffff88080ce90000
crash> p &cfs_trace_data[0][0][1]->tcd.tcd_lock
$22 = (spinlock_t *) 0xffff88080ce90080
crash> p &cfs_trace_data[0][0][2]->tcd.tcd_lock
$23 = (spinlock_t *) 0xffff88080ce90100
crash> p &cfs_trace_data[0][0][3]->tcd.tcd_lock
$24 = (spinlock_t *) 0xffff88080ce90180
crash> p &cfs_trace_data[0][0][4]->tcd.tcd_lock
$25 = (spinlock_t *) 0xffff88080ce90200
crash> p &cfs_trace_data[0][0][5]->tcd.tcd_lock
$26 = (spinlock_t *) 0xffff88080ce90280
crash> p &cfs_trace_data[0][0][6]->tcd.tcd_lock
$27 = (spinlock_t *) 0xffff88080ce90300
crash> p &cfs_trace_data[0][0][7]->tcd.tcd_lock
$28 = (spinlock_t *) 0xffff88080ce90380
crash> p &cfs_trace_data[0][0][8]->tcd.tcd_lock
$29 = (spinlock_t *) 0xffff88080ce90400
crash> p &cfs_trace_data[0][0][9]->tcd.tcd_lock
$30 = (spinlock_t *) 0xffff88080ce90480
crash> p &cfs_trace_data[0][0][10]->tcd.tcd_lock
$31 = (spinlock_t *) 0xffff88080ce90500
crash> p &cfs_trace_data[0][0][11]->tcd.tcd_lock
$32 = (spinlock_t *) 0xffff88080ce90580
crash> p &cfs_trace_data[0][0][12]->tcd.tcd_lock
$33 = (spinlock_t *) 0xffff88080ce90600
crash> p &cfs_trace_data[0][0][13]->tcd.tcd_lock
$34 = (spinlock_t *) 0xffff88080ce90680
crash> p &cfs_trace_data[0][0][14]->tcd.tcd_lock
$35 = (spinlock_t *) 0xffff88080ce90700
crash> p &cfs_trace_data[0][0][15]->tcd.tcd_lock
$36 = (spinlock_t *) 0xffff88080ce90780

The lock address for CPU12 (0xffff88080ce90600) is in several registers, so it seems a likely culprit. What's he doing?

PID: 13581  TASK: ffff880e348baab0  CPU: 12  COMMAND: "ll_ost_io03_058"
#0 [ffff88089c486e90] crash_nmi_callback at ffffffff81033cf6
#1 [ffff88089c486ea0] notifier_call_chain at ffffffff8153f8a5
#2 [ffff88089c486ee0] atomic_notifier_call_chain at ffffffff8153f90a
#3 [ffff88089c486ef0] notify_die at ffffffff810a783e
#4 [ffff88089c486f20] do_nmi at ffffffff8153d563
#5 [ffff88089c486f50] nmi at ffffffff8153ce20
[exception RIP: native_read_tsc+6]
RIP: ffffffff810156e6 RSP: ffff880e332b3720 RFLAGS: 00000046
RAX: 00000000433eaf0e RBX: 000000000000000c RCX: 00000000433eaed6
RDX: 000000000020692a RSI: 000000000027aacd RDI: 0000000000000a28
RBP: ffff880e332b3720 R8: 0000000000eb986a R9: ffff8800000bd000
R10: 0000000000000000 R11: 0000000000000198 R12: 00000000433ea79a
R13: 000000000000000c R14: 0000000000000a28 R15: ffffffff8136aec0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff880e332b3720] native_read_tsc at ffffffff810156e6
#7 [ffff880e332b3728] delay_tsc at ffffffff8129d2ba
#8 [ffff880e332b3758] __const_udelay at ffffffff8129d266
#9 [ffff880e332b3768] wait_for_xmitr at ffffffff8136ae42
#10 [ffff880e332b3798] serial8250_console_putchar at ffffffff8136aee6
#11 [ffff880e332b37b8] uart_console_write at ffffffff8136688e
#12 [ffff880e332b37f8] serial8250_console_write at ffffffff8136b24d
#13 [ffff880e332b3848] __call_console_drivers at ffffffff810775f5
#14 [ffff880e332b3878] _call_console_drivers at ffffffff8107765a
#15 [ffff880e332b3898] release_console_sem at ffffffff81077ca8
#16 [ffff880e332b38d8] vprintk at ffffffff810783a8
#17 [ffff880e332b3978] printk at ffffffff81538bea
#18 [ffff880e332b39d8] cfs_print_to_console at ffffffffa0461114 [libcfs]
#19 [ffff880e332b3a08] libcfs_debug_vmsg2 at ffffffffa04716ae [libcfs]
#20 [ffff880e332b3b78] _debug_req at ffffffffa0753823 [ptlrpc]
#21 [ffff880e332b3da8] ptlrpc_main at ffffffffa0764b42 [ptlrpc]
#22 [ffff880e332b3ee8] kthread at ffffffff810a0fce
#23 [ffff880e332b3f48] kernel_thread at ffffffff8100c28a
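
As a quick cross-check of the lock table above: the per-CPU trace data slots are 0x80 bytes apart, so the CPU owning the contended lock falls out of simple pointer arithmetic (a shell sketch; the base address is the CPU0 lock printed in the crash session):

$ echo $(( (0xffff88080ce90600 - 0xffff88080ce90000) / 0x80 ))
12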
CPU12 is running libcfs_debug_vmsg2, so he would be holding a local tcd lock. He's sending data out on the serial console (it's serial-over-LAN through the BMC).

Running 'dmesg' inside crash shows the panic happened at 1643703, but there were 1611 messages logged at time 1643640, and that's the last entry we see on our central logging server. Was the log volume so high, or the serial console so slow, that it was able to deny access to the spinlock for 10s? Note, we had console_ratelimit disabled due to
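
Console rate limiting can be turned back on at runtime. A minimal sketch, with the caveat that the exact knob location varies by libcfs version (2.5-era trees typically expose the proc path; newer ones expose a module parameter):

# re-enable rate limiting of Lustre console messages
echo 1 > /proc/sys/lnet/console_ratelimit
# or, depending on the libcfs version:
echo 1 > /sys/module/libcfs/parameters/libcfs_console_ratelimit
|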
| Comments |
| Comment by Andreas Dilger [ 18/Mar/16 ] |
|
Logging to a slow serial console can disable interrupts on the whole system while the output is ongoing, and can definitely cause major system hangs. Disabling the console message rate limiting makes this problem worse and is not recommended. If the OSS is not crashing, it would be much better to run lctl debug_daemon (with appropriate options for max file size) to log all of the debug messages to disk instead of trying to pipe them via the serial console.
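
For reference, a typical debug_daemon session looks roughly like this (the file name and the 1024 MB size cap are illustrative, not prescriptive):

# start the debug daemon, dumping the binary debug log to local disk
lctl debug_daemon start /var/log/lustre-debug.dmp 1024
# ... reproduce the problem ...
lctl debug_daemon stop
# convert the binary dump to readable text
lctl debug_file /var/log/lustre-debug.dmp /var/log/lustre-debug.txt
|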
| Comment by Bruno Faccini (Inactive) [ 22/Mar/16 ] |
|
I fully agree with Andreas' comment that disabling console message rate limiting is a bad idea, but based on my previous experience with similar problems, I also want to add that this is much more dangerous when using a slow/buggy console data path (I2C/GPIO HW, BMC/IPMI FW, ...), so you may also want to raise this issue with your HW provider and verify whether a FW update exists that fixes related problems. |
| Comment by Jian Yu [ 31/Mar/16 ] |
|
Hi Matt, I saw the following meeting note about this ticket: "It would be better if we use the debug daemon, but it doesn't work." Could you please describe the problem you hit while using lctl debug_daemon? Thank you. FYI, here are the instructions for using lctl debug_daemon: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#idm140466054410624 |
| Comment by Jian Yu [ 18/Apr/16 ] |
|
Let's close this ticket, since the issue is expected behavior. New tickets can be created if new problems occur. |