[LU-2860] Mmap() causing OOM Created: 25/Feb/13  Updated: 25/Feb/13  Resolved: 25/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Peter Behroozi Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

RHEL Server 6.4, kernel 2.6.32-279.9.1.el6.x86_64, 32GB ECC RAM, 16GB swap, infiniband


Attachments: Text File oom.txt     File test_mmap.c    
Severity: 2
Rank (Obsolete): 6926

 Description   

When using mmap() on files on a Lustre FS (v2.1.2), the Linux kernel sometimes invokes the out-of-memory killer, even when the system has most of its memory free. The attached kernel OOM log shows an example of a system that crashed with >20GB memory free and 99% of the 16GB of swap unused.

It can take several hours before the OOM killer is triggered under normal mmap() usage. To help with debugging, I've found that the following source code causes an immediate OOM condition when run from a directory on a Lustre FS, even though it does not use mmap() correctly. (The appropriate behavior would be for the kernel to terminate the program, not trigger an OOM condition.) This may help identify the problem in the code path that causes the main OOM issue in production mmap() usage.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <inttypes.h>
#include <unistd.h>  //ftruncate(), needed only if the line below is uncommented

int main(void) {
  int64_t i, size=0, memsize=0;
  FILE *o;
  int fd;
  char *mem = NULL, *mmem = NULL;
  o = fopen("test.dat", "w+");
  if (!o) { perror("fopen"); return 1; }
  fd = fileno(o);
  for (i=0; i<1000000; i++) {
    size += 100000; //Maps up to 100GB of the mmap()ed file
    memsize += 3000; //Allocates up to 3GB for in-memory usage
    //ftruncate(fd, size);  //<--- this would be the correct usage
    mmem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (mmem == MAP_FAILED) { perror("mmap"); return 1; }
    mem = realloc(mem, memsize);
    if (!mem) { perror("realloc"); return 1; }
    memset(mem, i%256, memsize);
    //Intentional misuse: the file is never extended, so this write lands beyond EOF
    memset(mmem+size-100000, i%256, 100000);
    munmap(mmem, size);
  }
  free(mem);
  return 0;
}
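
For comparison, a minimal sketch of the correct usage hinted at by the commented-out ftruncate() line: the file is extended before each mmap() call, so every mapped page is backed by file data. The helper name grow_and_write, the CHUNK macro, and the use of open() instead of fopen()/fileno() are assumptions made for this illustration and are not part of the attached test_mmap.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 100000  /* bytes added per iteration, as in the reproducer */

/* Hypothetical helper: grow `path` by CHUNK bytes per step, map the whole
 * file, and write the newest chunk. Returns 0 on success, -1 on failure. */
int grow_and_write(const char *path, int steps) {
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return -1;
  off_t size = 0;
  for (int i = 0; i < steps; i++) {
    size += CHUNK;
    /* Extend the file first so the mapping never reaches past end-of-file. */
    if (ftruncate(fd, size) != 0) { close(fd); return -1; }
    char *mmem = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (mmem == MAP_FAILED) { close(fd); return -1; }
    /* Write only the chunk that was just added. */
    memset(mmem + size - CHUNK, i % 256, CHUNK);
    if (munmap(mmem, (size_t)size) != 0) { close(fd); return -1; }
  }
  close(fd);
  return 0;
}
```

Calling grow_and_write("test.dat", 10) from main() in a directory on the file system under test should leave a 1,000,000-byte file behind. Whether this variant avoids the OOM condition on Lustre 2.1.2 has not been verified here.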


 Comments   
Comment by Peter Behroozi [ 25/Feb/13 ]

Source code which will cause an OOM condition.

Comment by Jinshan Xiong (Inactive) [ 25/Feb/13 ]

This program should fail with SIGBUS. The fault handling in 2.1.2 was wrong; please upgrade to 2.1.4.
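
As an illustration of the SIGBUS behavior described above, the following sketch maps one page of an empty file and writes to it while a SIGBUS handler is installed. Because the file is never extended, the page lies entirely beyond end-of-file, which is the same condition the reproducer creates. The function name write_past_eof_raises_sigbus and the handler are hypothetical, and the result of 1 is what a local Linux file system is expected to give, not something measured on Lustre.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_jmp;

/* Jump back to the sigsetjmp() point instead of letting SIGBUS kill us. */
static void on_sigbus(int sig) {
  (void)sig;
  siglongjmp(sigbus_jmp, 1);
}

/* Hypothetical helper: returns 1 if writing to a page mapped beyond EOF
 * raised SIGBUS, 0 if the write went through, -1 on setup failure. */
int write_past_eof_raises_sigbus(const char *path) {
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return -1;
  long page = sysconf(_SC_PAGESIZE);
  /* No ftruncate(): the file length stays 0. */
  volatile char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { close(fd); unlink(path); return -1; }

  struct sigaction sa, old;
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = on_sigbus;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGBUS, &sa, &old);

  int raised;
  if (sigsetjmp(sigbus_jmp, 1) == 0) {
    p[0] = 1;      /* no file data backs this page */
    raised = 0;
  } else {
    raised = 1;    /* reached via siglongjmp() from the handler */
  }

  sigaction(SIGBUS, &old, NULL);  /* restore the previous handler */
  munmap((void *)p, (size_t)page);
  close(fd);
  unlink(path);
  return raised;
}
```

A return value of 1 means the kernel delivered SIGBUS for the out-of-range access, which is the outcome the unmodified reproducer should also hit on its first memset() into the mapping.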

Generated at Sat Feb 10 01:28:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.