[LU-7432] oom-killer started on MDSes Created: 16/Nov/15 Updated: 24/Nov/15 Resolved: 24/Nov/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error occurred during soak testing of build '20151113' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151113) and had already occurred earlier when testing build '20151109'. At three moments in time:
the oom-killer had been invoked on the nodes specified. (All events happened at times when no fault was being injected.) Attached files: console and syslog of the affected nodes. Unfortunately, collectl wasn't running to gather performance counters. |
| Comments |
| Comment by Oleg Drokin [ 16/Nov/15 ] |
|
We need /proc/slabinfo output here to see what is using the RAM. |
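A minimal sketch of how the requested /proc/slabinfo output could be summarized to see which slab caches hold the most memory. It assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...) and estimates each cache as num_objs * objsize, which ignores per-slab overhead. The function name `parse_slabinfo` and the sample numbers are hypothetical and not taken from the affected nodes.

```python
import sys


def parse_slabinfo(text):
    """Return (cache_name, approx_bytes) pairs, largest consumer first.

    Assumes slabinfo 2.x layout: name active_objs num_objs objsize ...
    The size is an estimate (num_objs * objsize), not an exact figure.
    """
    usage = []
    for line in text.splitlines():
        # Skip the version line, the column header, and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            num_objs = int(fields[2])
            objsize = int(fields[3])
        except ValueError:
            continue
        usage.append((fields[0], num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage


if __name__ == "__main__":
    # Reading /proc/slabinfo usually requires root.
    path = sys.argv[1] if len(sys.argv) > 1 else "/proc/slabinfo"
    with open(path) as handle:
        for name, size in parse_slabinfo(handle.read())[:10]:
            print(f"{name:<30} {size / (1024 * 1024):10.1f} MiB")
```

Running this periodically on an MDS (or against saved copies of the file) would show which cache grows before the oom-killer fires.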
| Comment by Peter Jones [ 20/Nov/15 ] |
|
Frank, when do you think you will be able to provide the info that Oleg has requested? Thanks, Peter |
| Comment by Frank Heckes (Inactive) [ 23/Nov/15 ] |
|
Sorry, I overlooked Oleg's request. |
| Comment by Di Wang [ 24/Nov/15 ] |
|
I checked the log and was also monitoring the MDS when the OOM was about to happen. It seems to be caused by endless recovery on some MDTs, i.e. once the recovery abort problem is fixed, this problem should go away. Since the endless recovery will be fixed by the patch in Frank, if you see something different, please re-open this one. Thanks. |
| Comment by Di Wang [ 24/Nov/15 ] |
|
duplicate with |