[LU-1325] Loading a large enough binary from Lustre triggers the OOM killer during page fault while a large amount of memory is available Created: 16/Apr/12  Updated: 07/Jun/12  Resolved: 07/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexandre Louvet Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File trace.107493.txt.gz    
Severity: 3
Rank (Obsolete): 6414

 Description   

While loading a large enough binary, we hit an OOM during a page fault while the system still has a lot of free memory available (in our case, about 60 GB free on a node with 64 GB installed).

The problem doesn't show up if the binary is not big enough or if there isn't enough concurrency. A simple ls works, and so does a small program, but if the size grows to a few MB with some DSOs around and the binary is run with mpirun, the page fault appears to be interrupted by a signal in cl_lock_state_wait; the error code is then returned up to ll_fault0, where it is replaced by VM_FAULT_ERROR, which triggers the OOM.

Here is an extract from the collected trace (also attached):
(cl_lock.c:986:cl_lock_state_wait()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_lock.c:1310:cl_enqueue_locked()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_lock.c:2175:cl_lock_request()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:393:cl_lockset_lock_one()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:444:cl_lockset_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:479:cl_io_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:1033:cl_io_loop()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(llite_mmap.c:298:ll_fault0()) Process leaving (rc=51 : 51 : 33)
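
For reference, here is a small, self-contained decoder for the two rc values in the trace above. The VM_FAULT_* bit values are assumptions based on kernels of that era (RHEL6 / 2.6.32-ish) and are not taken from the Lustre tree; under those assumptions, -4 is -EINTR and 51 (0x33) corresponds to the VM_FAULT_ERROR mask.

#include <stdio.h>
#include <string.h>

/* Assumed VM_FAULT_* bit values for kernels of that era (not from the Lustre tree). */
#define VM_FAULT_OOM            0x0001
#define VM_FAULT_SIGBUS         0x0002
#define VM_FAULT_HWPOISON       0x0010
#define VM_FAULT_HWPOISON_LARGE 0x0020

int main(void)
{
    /* rc printed by cl_lock_state_wait() and its callers, reinterpreted as signed */
    long long rc = (long long)18446744073709551612ULL;
    printf("cl_* rc = %lld (%s)\n", rc, strerror((int)-rc));  /* -4 -> Interrupted system call */

    /* rc printed by ll_fault0() */
    unsigned int fault = 51;  /* 0x33 */
    printf("ll_fault0 rc = 0x%x =%s%s%s%s\n", fault,
           (fault & VM_FAULT_OOM)            ? " VM_FAULT_OOM"            : "",
           (fault & VM_FAULT_SIGBUS)         ? " VM_FAULT_SIGBUS"         : "",
           (fault & VM_FAULT_HWPOISON)       ? " VM_FAULT_HWPOISON"       : "",
           (fault & VM_FAULT_HWPOISON_LARGE) ? " VM_FAULT_HWPOISON_LARGE" : "");
    return 0;
}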

We are able to reproduce the problem at will by scheduling, through the batch scheduler, an MPI job of 32 cores across 2 nodes (16 cores per node) on the customer system. I haven't been able to reproduce it on another system.

I also tried to retrieve the culprit signal by setting panic_on_oom, but unfortunately it seems to have been cleared during the OOM handling. Stracing is too complicated with the MPI layer.

Alex.



 Comments   
Comment by Peter Jones [ 16/Apr/12 ]

Jinshan will look into this one

Comment by Jinshan Xiong (Inactive) [ 18/Apr/12 ]

Please try patch http://review.whamcloud.com/2574.

Comment by Bruno Faccini (Inactive) [ 15/May/12 ]

A nasty side-effect/consequence of this problem is that it often (always?) leaves processes stuck on at least one mm_struct->mmap_sem whose owner is impossible to find.

This may come from a hole/bug in the OOM algorithm that allows a process either to take the semaphore and exit, or to self-deadlock on it ...

The bad thing is that an affected node ultimately has to be rebooted, since commands like "ps/pidof/swapoff/..." also block forever on these semaphores.
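
To illustrate the effect from userspace, here is a minimal analogy using plain pthreads (purely illustrative, not kernel or Lustre code): if the holder of a reader/writer lock disappears without ever releasing it, every later acquirer blocks forever, which is essentially what ps/pidof hit on the leaked mmap_sem.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t sem = PTHREAD_RWLOCK_INITIALIZER;  /* stands in for mm->mmap_sem */

/* Stand-in for whatever path left mmap_sem held: the thread grabs the
 * lock and terminates without ever releasing it. */
static void *leaky_owner(void *arg)
{
    (void)arg;
    pthread_rwlock_wrlock(&sem);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, leaky_owner, NULL);
    pthread_join(t, NULL);

    printf("reading the 'address space', as ps reading /proc/<pid>/cmdline would...\n");
    pthread_rwlock_rdlock(&sem);   /* blocks forever: the owner is gone */
    printf("never reached\n");
    return 0;
}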

Comment by Jinshan Xiong (Inactive) [ 15/May/12 ]

Hi Bruno, which version of the patch are you running? I saw this problem in earlier versions, but it should have been fixed in patch set 7.

Comment by Bruno Faccini (Inactive) [ 23/May/12 ]

I will ask our Bull integration team and let you know.

Comment by Peter Jones [ 04/Jun/12 ]

Bruno

Any answer on this yet? Can we mark this as a duplicate of LU-1299?

Peter

Comment by Alexandre Louvet [ 07/Jun/12 ]

To answer Jinshan's question, we never got any patch from this Jira (nor from LU-1299).
On our side, this LU has been considered a duplicate of LU-1299 for a while, so yes, you can mark it as a duplicate of LU-1299.

Alex.

Comment by Peter Jones [ 07/Jun/12 ]

ok thanks Alexandre.
