Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.0
- Environment: Lustre 2.1.0-21chaos (github.com/chaos/lustre)
- Severity: 3
- Rank (Obsolete): 10139
Description
Today we had an IB-connected 2.1 server OOM out of the blue. After rebooting the node, the OSS OOMs again after it has been in recovery for a little while. This OSS serves 15 OSTs.
With one of the reboots, we added the "malloc" debug level and used the debug daemon to collect a log. Note that we saw messages about dropped log lines, so be aware that lines are missing from it. I will upload it to the ftp site, as it is too large for Jira. The filename will be sumom31-lustre.log.txt.bz2.
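For reference, a minimal sketch of how a log like this is collected with lctl; the dump path and size limit below are illustrative, not the ones we used:

  # add "malloc" to the existing debug mask
  lctl set_param debug=+malloc
  # start the debug daemon; the trailing number is the file size limit in MB
  lctl debug_daemon start /tmp/sumom31-lustre.dmp 1024
  # ... reproduce the problem ...
  lctl debug_daemon stop
  # convert the binary dump to text
  lctl debug_file /tmp/sumom31-lustre.dmp /tmp/sumom31-lustre.log.txt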
We also extracted a Lustre log at our default logging level from the crash dump taken after the first OOM. I will attach that here.
Note also that we have the obdfilter writethrough and read caches disabled at this time.
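For context, both caches are tunable per OST through lctl; a sketch of how they are typically disabled on the OSS (the wildcard covers all OSTs on the node):

  lctl set_param obdfilter.*.writethrough_cache_enable=0
  lctl set_param obdfilter.*.read_cache_enable=0
  # confirm the settings
  lctl get_param obdfilter.*.writethrough_cache_enable obdfilter.*.read_cache_enable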
Using the crash "kmem" command, it is clear that most of the memory is in slab, but not attributed to any of the Lustre named slabs. Here is the short "kmem -i" summary:
crash> kmem -i
                 PAGES        TOTAL  PERCENTAGE
    TOTAL MEM  6117058      23.3 GB  ----
         FREE    37513     146.5 MB  0% of TOTAL MEM
         USED  6079545      23.2 GB  99% of TOTAL MEM
       SHARED     3386      13.2 MB  0% of TOTAL MEM
      BUFFERS     3297      12.9 MB  0% of TOTAL MEM
       CACHED    26240     102.5 MB  0% of TOTAL MEM
         SLAB  5908658      22.5 GB  96% of TOTAL MEM
The OSS has 24G of RAM total.
The biggest consumers by far are the generic size-8192, size-1024, and size-2048 caches. I will attach the full "kmem -s" output as well.
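A sketch of how the top consumers can be ranked from the saved listing; this assumes the usual crash "kmem -s" column layout (CACHE, NAME, OBJSIZE, ALLOCATED, TOTAL, SLABS, SSIZE) and a kmem-s.txt file produced by redirecting the command inside crash:

  # inside crash: save the listing to a file
  #   crash> kmem -s > kmem-s.txt
  # then rank caches by approximate footprint (OBJSIZE * ALLOCATED):
  awk 'NR > 1 && NF >= 7 { printf "%10.1f MB  %s\n", $3 * $4 / 1048576, $2 }' kmem-s.txt | sort -rn | head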
We attempted to work around the problem by starting one OST at a time, allowing each to fully recover before starting the next. Through the end of the third OST's recovery, memory usage remained normal. During the fourth OST's recovery, memory usage spiked and the node OOMed.
We finally gave up and mounted with the abort_recovery option, and things seem to be running fine at the moment.
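For reference, the mount option in question is abort_recov; a sketch of the two approaches, with device and mount-point names purely illustrative:

  # start one OST and wait for "status: COMPLETE" before starting the next
  mount -t lustre /dev/mapper/ost0000 /mnt/lustre/ost0000
  lctl get_param obdfilter.*.recovery_status

  # last resort: mount with recovery aborted
  mount -t lustre -o abort_recov /dev/mapper/ost0003 /mnt/lustre/ost0003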
Attachments
Issue Links
- duplicates LU-9372: OOM happens on OSS during Lustre recovery for more than 5000 clients (Resolved)