False oom_killer invoked:
0]@[0x200002ce9:0xd247:0x0]]
> 2012-03-15T00:43:50.560796-05:00 c0-0c1s0n2 LustreError: 25992:0:(osc_lock.c:1101:osc_lock_enqueue_wait()) osc@ffff88020e1faa98: (null) 00000000 0x0 0 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
> 2012-03-15T00:43:50.590992-05:00 c0-0c1s0n2 LustreError: 25992:0:(osc_lock.c:1101:osc_lock_enqueue_wait()) } lock@ffff8802044d5b38
> 2012-03-15T00:43:50.591012-05:00 c0-0c1s0n2 LustreError: 25992:0:(osc_lock.c:1101:osc_lock_enqueue_wait()) queuing.
> 2012-03-15T00:43:50.591038-05:00 c0-0c1s0n2 Pid: 25992, comm: cancer
> 2012-03-15T00:43:50.591051-05:00 c0-0c1s0n2 Call Trace:
> 2012-03-15T00:43:50.591062-05:00 c0-0c1s0n2 [<ffffffff81006141>] try_stack_unwind+0x151/0x190
> 2012-03-15T00:43:50.591074-05:00 c0-0c1s0n2 [<ffffffff81004b04>] dump_trace+0x84/0x440
> 2012-03-15T00:43:50.591095-05:00 c0-0c1s0n2 [<ffffffffa012a862>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
> 2012-03-15T00:43:50.621270-05:00 c0-0c1s0n2 [<ffffffffa04dcf43>] osc_lock_enqueue+0x7b3/0x8f0 [osc]
> 2012-03-15T00:43:50.621292-05:00 c0-0c1s0n2 [<ffffffffa022351b>] cl_enqueue_try+0xfb/0x370 [obdclass]
> 2012-03-15T00:43:50.621340-05:00 c0-0c1s0n2 [<ffffffffa0539285>] lov_lock_enqueue+0x195/0x800 [lov]
> 2012-03-15T00:43:50.621353-05:00 c0-0c1s0n2 [<ffffffffa022351b>] cl_enqueue_try+0xfb/0x370 [obdclass]
> 2012-03-15T00:43:50.621374-05:00 c0-0c1s0n2 [<ffffffffa0224fc7>] cl_enqueue_locked+0x77/0x1e0 [obdclass]
> 2012-03-15T00:43:50.621386-05:00 c0-0c1s0n2 [<ffffffffa0225339>] cl_lock_request+0x99/0x1d0 [obdclass]
> 2012-03-15T00:43:50.650983-05:00 c0-0c1s0n2 [<ffffffffa022a063>] cl_io_lock+0x373/0x610 [obdclass]
> 2012-03-15T00:43:50.651012-05:00 c0-0c1s0n2 [<ffffffffa022a3f3>] cl_io_loop+0xf3/0x1e0 [obdclass]
> 2012-03-15T00:43:50.676628-05:00 c0-0c1s0n2 [<ffffffffa05bb0bf>] ll_fault0+0x18f/0x280 [lustre]
> 2012-03-15T00:43:50.676651-05:00 c0-0c1s0n2 [<ffffffffa05bb1f6>] ll_fault+0x46/0x140 [lustre]
> 2012-03-15T00:43:50.676701-05:00 c0-0c1s0n2 [<ffffffff810f8586>] __do_fault+0x76/0x550
> 2012-03-15T00:43:50.676716-05:00 c0-0c1s0n2 [<ffffffff810f8aff>] handle_pte_fault+0x9f/0xcc0
> 2012-03-15T00:43:50.676729-05:00 c0-0c1s0n2 [<ffffffff810f98ce>] handle_mm_fault+0x1ae/0x240
> 2012-03-15T00:43:50.702322-05:00 c0-0c1s0n2 [<ffffffff810241d9>] do_page_fault+0x189/0x400
> 2012-03-15T00:43:50.702346-05:00 c0-0c1s0n2 [<ffffffff812e5adf>] page_fault+0x1f/0x30
> 2012-03-15T00:43:50.702359-05:00 c0-0c1s0n2 [<000000000040f51a>] 0x40f51a
> 2012-03-15T00:43:50.702406-05:00 c0-0c1s0n2 cancer invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
The kernel ended up panicking:
crash> bt
PID: 6747 TASK: ffff8801f42fd180 CPU: 0 COMMAND: "cancer"
#0 [ffff8801fd48d9e0] crash_kexec at ffffffff8107ed1d
#1 [ffff8801fd48dab0] panic at ffffffff812e2b33
#2 [ffff8801fd48db30] oom_kill_task at ffffffff810d5fff
#3 [ffff8801fd48db80] oom_kill_process at ffffffff810d6b35
#4 [ffff8801fd48dbe0] out_of_memory at ffffffff810d7268
#5 [ffff8801fd48dd70] pagefault_out_of_memory at ffffffff810d748d
#6 [ffff8801fd48dd80] mm_fault_error at ffffffff81023f56
#7 [ffff8801fd48de40] do_page_fault at ffffffff81024440
#8 [ffff8801fd48df50] page_fault at ffffffff812e5adf
RIP: 000000000041ac9c RSP: 00007fffffffaf00 RFLAGS: 00010202
RAX: 0000000000000000 RBX: 00002aaaab2b8000 RCX: 00007fffffffadb8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000b
RBP: 00007fffffffaf40 R8: 00007fffffffae50 R9: 00007fffffffadb0
R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000000003
R13: 0000000000869d70 R14: 0000000000000000 R15: 00000000007f8920
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
but with plenty of memory:
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 8270938 31.6 GB ----
FREE 8039531 30.7 GB 97% of TOTAL MEM <<<
USED 231407 903.9 MB 2% of TOTAL MEM
SHARED 1071 4.2 MB 0% of TOTAL MEM
BUFFERS 0 0 0% of TOTAL MEM
CACHED 5432 21.2 MB 0% of TOTAL MEM
SLAB 32866 128.4 MB 0% of TOTAL MEM
Lustre uses VM_FAULT_ERROR as a return value
in its fault handler. VM_FAULT_ERROR is defined as:
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
VM_FAULT_HWPOISON_LARGE)
Since this mask includes VM_FAULT_OOM, the kernel takes the OOM path even though
no actual out-of-memory condition exists (and likewise, no hwpoison activity is
occurring).
Lustre really shouldn't be using VM_FAULT_ERROR as a return value. That mask
exists so the kernel can share common code across a variety of fault failures,
but the expectation is that any given fault handler reports back a single error
so the kernel can react appropriately based on the error type.