LU-1325

Loading a large enough binary from Lustre triggers the OOM killer during page_fault while a large amount of memory is available


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Lustre 2.1.0
    • None
    • 3
    • 6414

      While loading a large enough binary, we hit OOM during page_fault while the system still has plenty of free memory (in our case, 60 GB free on a node with 64 GB installed).

      The problem does not appear if the binary is not big enough or if there is not enough concurrency. A simple ls works, and so does a small program. But if the size increases to a few MB with some DSOs around and the binary is run with mpirun, the page fault appears to be interrupted by a signal in cl_lock_state_wait. The error code is then returned up to ll_fault0, where it is replaced by VM_FAULT_ERROR, which triggers the OOM killer.

      Here is an extract from the collected trace (attached):
      (cl_lock.c:986:cl_lock_state_wait()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (cl_lock.c:1310:cl_enqueue_locked()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (cl_lock.c:2175:cl_lock_request()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (cl_io.c:393:cl_lockset_lock_one()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (cl_io.c:444:cl_lockset_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (cl_io.c:479:cl_io_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (cl_io.c:1033:cl_io_loop()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
      (llite_mmap.c:298:ll_fault0()) Process leaving (rc=51 : 51 : 33)
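
      For readers decoding the rc values in the trace above, here is a minimal Python sketch. The helper names (to_signed64, errno_name, fault_flags) are hypothetical, and the VM_FAULT_* bit values are an assumption based on Linux mm.h headers from kernels of that era; they should be checked against the headers of the kernel actually running on the node.

```python
import errno

# ASSUMPTION: VM_FAULT_* bit values as found in include/linux/mm.h of
# kernels from that era. Verify against the kernel actually in use.
VM_FAULT_BITS = {
    0x0001: "VM_FAULT_OOM",
    0x0002: "VM_FAULT_SIGBUS",
    0x0004: "VM_FAULT_MAJOR",
    0x0008: "VM_FAULT_WRITE",
    0x0010: "VM_FAULT_HWPOISON",
    0x0020: "VM_FAULT_HWPOISON_LARGE",
}


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as signed (two's complement)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def errno_name(rc):
    """Return the errno symbol for a negative return code, or None."""
    signed = to_signed64(rc)
    if signed >= 0:
        return None
    return errno.errorcode.get(-signed)


def fault_flags(rc):
    """List the names of the fault bits set in a fault handler return value."""
    return [name for bit, name in sorted(VM_FAULT_BITS.items()) if rc & bit]


if __name__ == "__main__":
    # rc reported by cl_lock_state_wait() up to cl_io_loop()
    print(to_signed64(18446744073709551612), errno_name(18446744073709551612))
    # rc reported by ll_fault0()
    print(hex(51), fault_flags(51))
```

      Under these assumed bit values, the first rc decodes to -4 (EINTR) and the second, 51 (0x33), has the VM_FAULT_OOM bit set along with the SIGBUS and HWPOISON bits. That is consistent with an interrupted lock wait being reported to the VM as a generic fault error.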

      We are able to reproduce the problem at will on the customer system by submitting, through the batch scheduler, an MPI job of 32 cores on 2 nodes (16 cores per node). I haven't been able to reproduce it on another system.

      I also tried to retrieve the culprit signal by setting panic_on_oom, but unfortunately the signal seems to have been cleared during the OOM handling. Using strace is too complicated with the MPI layer.

      Alex.

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: louveta Alexandre Louvet (Inactive)