[LU-6173] CPU stalled with obd_zombid running - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3
Labels:
None
Environment:
Git repo can be found at https://github.com/jlan/lustre-nas
Server: centos 6.4 2.6.32-358.23.2.el6, lustre 2.4.3-12nasS
Client: sles11sp3 3.0.101-0.31.1, lustre 2.4.3-11nasC

Severity:
3
Rank (Obsolete):
17274

Description

Yesterday we experienced a network problem, and as a result a number of clients stalled. At least four of them hung in this state. We captured a vmcore on one of the systems.

Console logs showed that one of the CPUs was detected as stalled:
"INFO: rcu_sched_state detected stall on CPU 9."

All CPUs on r305i7n2 except CPU 9 were running the migration process, and
the CPU flagged by rcu_sched_state (CPU 9) was running obd_zombid.
The console logs of the other three systems confirmed that their stalled CPUs
were also running obd_zombid, but without a vmcore I cannot say for sure that
their other CPUs were running 'migration' as on r305i7n2.

The stack trace is:

PID: 5070 TASK: ffff88046f086300 CPU: 9 COMMAND: "obd_zombid"
#0 [ffff88087fc27e40] crash_nmi_callback at ffffffff810245af
#1 [ffff88087fc27e50] notifier_call_chain at ffffffff81475847
#2 [ffff88087fc27e80] __atomic_notifier_call_chain at ffffffff8147588d
#3 [ffff88087fc27e90] notify_die at ffffffff814758dd
#4 [ffff88087fc27ec0] default_do_nmi at ffffffff81472d37
#5 [ffff88087fc27ee0] do_nmi at ffffffff81472f68
#6 [ffff88087fc27ef0] restart_nmi at ffffffff814724b1
[exception RIP: native_halt+1]
RIP: ffffffff810300b1 RSP: ffff88087fc23de0 RFLAGS: 00000082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000080f
RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
RBP: ffff88046d96fd78 R8: 0000000000000150 R9: ffffe8ffffc20738
R10: 0000000000000006 R11: ffffffff8102b430 R12: 0000000000000000
R13: 0000000000000006 R14: 0000000000000006 R15: 00000000fffffffb
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#7 [ffff88087fc23de0] native_halt at ffffffff810300b1
#8 [ffff88087fc23de0] halt_current_cpu at ffffffff81024959
#9 [ffff88087fc23df0] lkdb_main_loop at ffffffff812548ec
#10 [ffff88087fc23ef0] kdba_main_loop at ffffffff8139bef2
#11 [ffff88087fc23f20] kdb at ffffffff8125199f
#12 [ffff88087fc23f80] kdb_ipi at ffffffff8124ea07
#13 [ffff88087fc23f90] smp_kdb_interrupt at ffffffff8139b656
#14 [ffff88087fc23fb0] kdb_interrupt at ffffffff8147aca3
--- <IRQ stack> ---
#15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
[exception RIP: _raw_spin_lock+24]
RIP: ffffffff81471a88 RSP: ffff88046d96fe28 RFLAGS: 00000206
RAX: 0000000000001700 RBX: ffff880867d28810 RCX: ffff880856c3be00
RDX: 0000000000008000 RSI: ffff880856c3be00 RDI: ffff880430b100f8
RBP: ffff880864634078 R8: 0000000000000002 R9: 0000000000000000
R10: 0000000010000008 R11: 0000000000000000 R12: ffffffff8147ac9e
R13: ffffffff811458be R14: ffff880867d28810 R15: 0000000000000206
ORIG_RAX: ffffffffffffff01 CS: 0010 SS: 0018
#16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
#17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
#18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
#19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
#20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
#21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4
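
When reading a backtrace like the one above, it can help to separate the frames that belong to loadable modules (here osc and obdclass) from the core-kernel frames. The following is a minimal sketch, not part of the original report, that parses frame lines in this format. The names `FRAME_RE`, `parse_frames`, and `module_frames` are hypothetical helpers, and the regular expression assumes the frame layout shown in this ticket (`#N [stack-address] symbol at address [module]`).

```python
import re

# Assumed frame layout, taken from the backtrace in this ticket:
#   #16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
# The trailing "[module]" is present only for frames inside a loadable module.
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<symbol>\S+)\s+at\s+"
    r"(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_frames(bt_text):
    """Return one dict per frame line; register dumps and markers are skipped."""
    frames = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append({
                "index": int(match.group("index")),
                "symbol": match.group("symbol"),
                "module": match.group("module"),  # None for core-kernel frames
            })
    return frames


def module_frames(frames):
    """Keep only the frames that carry a module name."""
    return [f for f in frames if f["module"] is not None]


# Frames 15-21 copied from the backtrace in this ticket (registers omitted).
SAMPLE = """\
#15 [ffff88046d96fd78] kdb_interrupt at ffffffff8147aca3
[exception RIP: _raw_spin_lock+24]
#16 [ffff88046d96fe28] osc_cleanup at ffffffffa0a48829 [osc]
#17 [ffff88046d96fe38] class_decref at ffffffffa076eed4 [obdclass]
#18 [ffff88046d96fea8] class_export_destroy at ffffffffa074c1de [obdclass]
#19 [ffff88046d96fec8] obd_zombie_impexp_cull at ffffffffa074c61d [obdclass]
#20 [ffff88046d96fee8] obd_zombie_impexp_thread at ffffffffa074c7bd [obdclass]
#21 [ffff88046d96ff48] kernel_thread_helper at ffffffff8147aae4
"""

if __name__ == "__main__":
    for frame in module_frames(parse_frames(SAMPLE)):
        print(f'#{frame["index"]} {frame["symbol"]} [{frame["module"]}]')
```

Applied to the sample, the module frames are osc_cleanup in osc followed by four obdclass frames, which matches the call path from obd_zombie_impexp_thread down to the `_raw_spin_lock` exception RIP recorded just above frame #16.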

Attachments

LU6173.crash-analysis.tgz (14 kB, 05/Feb/15 2:48 AM)
r305i7n2-20150128.bz2 (313 kB, 03/Feb/15 7:10 PM)

People

Assignee: Emoly Liu

Reporter: Jay Lan (Inactive)

Votes: 0

Watchers: 7

Dates

Created: 28/Jan/15 11:52 PM

Updated: 14/Jun/18 9:41 PM

Resolved: 25/May/15 10:41 PM