Details
Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.1.0
Component/s: None
Severity: 3
Rank: 6414
Description
While loading a sufficiently large binary, we hit an OOM during a page fault even though the system still has a lot of free memory available (in our case, 60 GB free on a node with 64 GB installed).
The problem does not show up if the binary is not big enough or if there is not enough concurrency. A simple ls works, and so does a small program, but once the binary grows to a few MB, with some DSOs around, and is launched with mpirun, the page fault appears to be interrupted by a signal inside cl_lock_state_wait(). The resulting error code (-4, i.e. -EINTR) is then returned up to ll_fault0(), where it is replaced by VM_FAULT_ERROR, which triggers the OOM (see the sketch after the trace).
Here is an extract from the collected trace (attached):
(cl_lock.c:986:cl_lock_state_wait()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_lock.c:1310:cl_enqueue_locked()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_lock.c:2175:cl_lock_request()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:393:cl_lockset_lock_one()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:444:cl_lockset_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:479:cl_io_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(cl_io.c:1033:cl_io_loop()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
(llite_mmap.c:298:ll_fault0()) Process leaving (rc=51 : 51 : 33)
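In other words, the -EINTR (-4) from the interrupted cl_lock_state_wait() is collapsed into VM_FAULT_ERROR (0x33 = 51) on the way out of the fault handler. Here is a minimal sketch of that suspected conversion, reconstructed from the trace; it is an illustration, not the actual Lustre source, and ll_fault0_sketch() is a hypothetical name:

/* Sketch only: a simplified stand-in for ll_fault0(), based on the
 * trace above, not copied from the Lustre source. */
#include <linux/errno.h>
#include <linux/mm.h>

static int ll_fault0_sketch(int rc)
{
	/* rc == -EINTR (-4) here: cl_lock_state_wait() was interrupted
	 * by a signal and the error propagated up through
	 * cl_enqueue_locked(), cl_lock_request(), cl_io_lock() and
	 * cl_io_loop(), as seen in the trace. */
	if (rc != 0)
		/* Any failure, including a transient -EINTR, becomes
		 * VM_FAULT_ERROR (0x33 = 51 on this kernel); its
		 * VM_FAULT_OOM bit makes the core mm invoke the OOM
		 * killer even with 60 GB of free memory. */
		return VM_FAULT_ERROR;
	return 0;
}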
We are able to reproduce the problem at will by scheduling, through the batch scheduler, an MPI job of 32 cores on 2 nodes (16 cores per node) on the customer system. I haven't been able to reproduce it on another system.
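The access pattern of the job boils down to many processes concurrently faulting in pages of the same large mapped binary. For illustration, a standalone approximation of that pattern (a sketch only: the path and process count are placeholders, and it is not claimed to reproduce the bug outside the MPI job):

/* Sketch: N processes concurrently touch every page of a large
 * binary mapped from Lustre; path and count are placeholders. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/lustre/big_binary";
	int i, nproc = argc > 2 ? atoi(argv[2]) : 16;

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			struct stat st;
			char *p;
			off_t off;
			volatile char sum = 0;
			int fd = open(path, O_RDONLY);

			if (fd < 0 || fstat(fd, &st) < 0)
				_exit(1);
			p = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
				 MAP_PRIVATE, fd, 0);
			if (p == MAP_FAILED)
				_exit(1);
			/* Touch every page: each access goes through the
			 * client fault path on a Lustre mount. */
			for (off = 0; off < st.st_size; off += 4096)
				sum += p[off];
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}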
I also tried to retrieve the culprit signal by setting panic_on_oom, but unfortunately it seems to have been cleared during the OOM handling. Stracing is too complicated through the MPI layer.
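For what it's worth, instrumentation along these lines (hypothetical and untested; placing it where cl_lock_state_wait() returns -EINTR is an assumption) could dump the pending set before the OOM handling consumes it:

/* Hypothetical debugging aid, not an existing Lustre patch: log the
 * pending signal masks at the point the lock wait is interrupted,
 * before the OOM path clears them. */
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/signal.h>

static void dump_pending_signals(void)
{
	if (signal_pending(current))
		printk(KERN_INFO "%s[%d]: pending %lx shared %lx\n",
		       current->comm, current->pid,
		       current->pending.signal.sig[0],
		       current->signal->shared_pending.signal.sig[0]);
}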
Alex.