[LU-1099] Lustre OSS OOMs repeatedly Created: 13/Feb/12 Updated: 28/Feb/18 Resolved: 28/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Oleg Drokin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | Lustre 2.1.0-21chaos (github.com/chaos/lustre) |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10139 |
| Description |
|
Today we had an IB-connected 2.1 server OOM out of the blue. After rebooting the node, the OSS OOMs again after it has been in recovery for a little while. This OSS serves 15 OSTs.

With one of the reboots, we added the "malloc" debug level and used the debug daemon to collect a log. Note that we saw messages about dropped log messages, so be aware that lines are missing in there. I will upload that to the ftp site, as it is too large for Jira. The filename will be sumom31-lustre.log.txt.bz2. We also extracted a Lustre log at our default logging level from the original crash dump after the first OOM; I will attach that here. Note also that we have the obdfilter writethrough and read caches disabled at this time.

Using the crash "kmem" command, it is clear that most of the memory is used in slab, but not attributed to any of the Lustre named slabs. Here is the short "kmem -i":

crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 6117058 23.3 GB ----
FREE 37513 146.5 MB 0% of TOTAL MEM
USED 6079545 23.2 GB 99% of TOTAL MEM
SHARED 3386 13.2 MB 0% of TOTAL MEM
BUFFERS 3297 12.9 MB 0% of TOTAL MEM
CACHED 26240 102.5 MB 0% of TOTAL MEM
SLAB 5908658 22.5 GB 96% of TOTAL MEM
The OSS has 24G of RAM total. The biggest consumers by far are size-8192, size-1024, and size-2048. I will attach the full "kmem -s" output as well. We attempted to work around the problem by starting one OST at a time, and allowing it to fully recover before starting the next OST. By the end of the third OST's recovery, memory usage was normal. During the fourth OST's recovery the memory usage spiked and the node OOMed. We finally gave up and mounted with the abort_recovery option, and things seem to be running fine at the moment. |
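For anyone retracing this kind of analysis, a minimal sketch of ranking the generic slab caches by usage; the crash pipeline assumes the usual "kmem -s" column layout (CACHE, NAME, OBJSIZE, ALLOCATED, ...), and the /proc/slabinfo one-liner assumes the slabinfo 2.x layout on a live node, so column positions may need adjusting:

crash> kmem -s | sort -k4 -nr | head    # rank caches by ALLOCATED objects; multiply by OBJSIZE for bytes

# on a live node: bytes in use per cache = active_objs (col 2) * objsize (col 4)
awk 'NR > 2 { printf "%-22s %14d bytes\n", $1, $2 * $4 }' /proc/slabinfo | sort -k2 -nr | head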
| Comments |
| Comment by Oleg Drokin [ 14/Feb/12 ] |
|
Is this one with the server-side read and write-through caches disabled? |
| Comment by Christopher Morrone [ 14/Feb/12 ] |
|
That is what I reported, yes. But now that I think about it, the caches were probably still enabled on this server cluster when we saw the first occurrence of an OOM. The admins are using lctl set_param to disable the caches from a script after the mount occurs. That should be fairly early in the recovery process. |
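(For reference, a cache-disabling script of that sort presumably boils down to something like the following; the parameter names are the standard obdfilter cache knobs, but the exact globs and timing are an assumption, not a copy of the actual script:)

# run on the OSS once the OSTs are mounted
lctl set_param obdfilter.*.writethrough_cache_enable=0
lctl set_param obdfilter.*.read_cache_enable=0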
| Comment by Oleg Drokin [ 15/Feb/12 ] |
|
Ah, indeed, I missed the caches-disabled line somehow. Anyway, the default log unfortunately does not have allocation information collected, so it's hard to know what was actually allocating. I wonder what sort of allocations we might have tied to the connected clients other than ldlm locks; nothing obvious comes to mind, especially at these sizes. You say you'll upload the log with malloc debug under the name sumom31-lustre.log.txt.bz2, but it's nowhere to be found on our ftp. Any chance I can take a look at it? |
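(For the record, collecting a log like that typically looks something like the following; the file paths and buffer size here are placeholders, not what was actually used on the OSS:)

lctl set_param debug=+malloc                          # add D_MALLOC to the debug mask
lctl debug_daemon start /tmp/lustre-debug.bin 1024    # spool binary debug to a file (size in MB)
# ... reproduce the problem ...
lctl debug_daemon stop
lctl debug_file /tmp/lustre-debug.bin /tmp/lustre-debug.txt    # convert the binary log to text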
| Comment by Christopher Morrone [ 15/Feb/12 ] |
|
Hmmm, I could have sworn that I uploaded the file. Maybe I got distracted. Sorry about that. I just pushed sumom31-lustre.log.txt.bz2 to the ftp uploads directory. |
| Comment by Oleg Drokin [ 15/Feb/12 ] |
|
Ok, so the huge number of 8k allocations comes from rqbd allocs, and they are never freed (over the span of the log, anyway). |
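(A rough way to pull that per-tag alloc/free balance out of the D_MALLOC log, assuming the usual "kmalloced '<name>': <size> at <addr>" / "kfreed ..." line format; the dropped log lines mentioned above will skew the counts somewhat:)

awk -F"'" '/kmalloced/ { a[$2]++ } /kfreed/ { f[$2]++ }
    END { for (n in a) printf "%-28s alloc %10d  free %10d\n", n, a[n], f[n] }' \
    sumom31-lustre.log.txt | sort -k3 -nr | head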
| Comment by Liang Zhen (Inactive) [ 05/Jun/12 ] |
|
Although the ptlrpc service sets "lazy" on its request portal, which means requests can be blocked inside the LNet layer (instead of being dropped), the ptlrpc service will always try to allocate enough buffers (rqbds) and ptlrpc_requests for incoming requests, and these rqbds are not freed until the service shuts down. This is the kind of thing we might improve, but my question is: is it expected that we have so many pending requests on this OSS?

ffff88033fd304c0 size-8192 8192 1413671 1413672 1413672 8k
ffff88033fcd0340 size-1024 1024 8298628 8298832 2074708 4k

I remember each ptlrpc_request should take about 900 bytes, so I think most of these 1K objects are ptlrpc_requests, and there are over 8 million requests here... |
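(Back-of-the-envelope, taking ALLOCATED x OBJSIZE from those two kmem -s lines:

1,413,671 x 8192 bytes ≈ 10.8 GiB   (rqbd buffers)
8,298,628 x 1024 bytes ≈  7.9 GiB   (mostly ptlrpc_requests, if the ~900-byte estimate is right)

so these two caches alone account for roughly 18.7 GiB of the ~22.5 GB reported in slab.)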
| Comment by Niu Yawei (Inactive) [ 05/Jun/12 ] |
|
Was the OOM triggered during recovery or after recovery? From the log (sumom31_oss_after_reboot_lustre.log.txt.bz2), I see that all 15 OSTs have finished recovery. The huge rqbd buffer (it only ever grows) could be caused by request bursting during recovery; only lock replay is not throttled during recovery (resend after recovery isn't throttled either, but I don't think there are many requests to be resent). Were there huge numbers of cached write locks on the clients before the OSS reboot? |
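(One way to answer the cached-lock question on a live system, assuming the standard ldlm lock_count parameters; the namespace globs may need adjusting for this filesystem's names:)

# on a client: locks cached per OST namespace
lctl get_param ldlm.namespaces.*-OST*.lock_count
# on the OSS: locks granted per filter namespace
lctl get_param ldlm.namespaces.filter-*.lock_count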