[LU-7432] oom-killer started on MDSes Created: 16/Nov/15  Updated: 24/Nov/15  Resolved: 24/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

lola
build: tip of master(df6cf859bbb29392064e6ddb701f3357e01b3a13) + patches


Attachments: File console-lola-10.log.gz     File console-lola-11.log.gz     File console-lola-9.log.gz     File messages-lola-10.log.bz2     File messages-lola-11.log.bz2     File messages-lola-9.log.bz2    
Issue Links:
Related
is related to LU-7455 Tracking tickets to make DNE pass soa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20151113' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151113) and had already occurred earlier when testing build '20151109'.
DNE is enabled. OSTs were formatted with ZFS, MDTs with ldiskfs. MDS nodes are configured in an active-active HA failover configuration.

The oom-killer was invoked on the following nodes at the times listed:

date             node     build ID  soak event
Nov 9 18:10:01   lola-9   20151109  no fault; only job execution
Nov 13 14:30:02  lola-10  20151113  during stopping of soak
Nov 14 05:35:01  lola-11  20151113  no fault; only job execution
Nov 14 05:45:01  lola-9   20151113  no fault; only job execution

(All events happened at times when no fault was injected.)

Attached files: console and syslog of nodes affected.
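
For anyone going through the attached logs, the following is a minimal sketch of how the oom-killer invocations could be located. It assumes the kernel logs a line containing "<process> invoked oom-killer", which is the usual kernel wording. The sample line and the process name `some_proc` are made up for illustration and are not taken from the attachments.

```python
import re

# Kernel OOM reports normally start with "<process> invoked oom-killer: ...".
OOM_PATTERN = re.compile(r"(?P<proc>\S+) invoked oom-killer")


def find_oom_events(lines):
    """Return (process_name, full_line) for every oom-killer invocation."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group("proc"), line.rstrip("\n")))
    return events


# Hypothetical sample lines, NOT copied from the attached logs.
SAMPLE_LOG = [
    "Nov 14 05:35:01 lola-11 kernel: some_proc invoked oom-killer: gfp_mask=0xd0, order=0\n",
    "Nov 14 05:35:02 lola-11 kernel: unrelated message\n",
]

if __name__ == "__main__":
    for proc, line in find_oom_events(SAMPLE_LOG):
        print(proc, "|", line)
```

To run this against a real log, pass an open file object (for example from `bz2.open(path, "rt")` for the .bz2 attachments) instead of `SAMPLE_LOG`.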

Unfortunately, collectl wasn't running to gather performance counters.
The tool has now been enabled on all soak nodes so that memory statistics, especially slab stats, can be captured during one of the next sessions.



 Comments   
Comment by Oleg Drokin [ 16/Nov/15 ]

We need /proc/slabinfo output here to see what is using the RAM.
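
As an illustration of what such output can show, here is a rough sketch that ranks slab caches by approximate memory use. It assumes the slabinfo 2.1 column layout (name, active_objs, num_objs, objsize, ...) and estimates usage as num_objs * objsize, which ignores per-slab overhead. The cache names and numbers in `SAMPLE` are invented, not measurements from lola.

```python
def parse_slabinfo(text):
    """Return [(cache_name, approx_bytes)] sorted largest first.

    approx_bytes is num_objs * objsize, an estimate that ignores
    per-slab padding and metadata.
    """
    usage = []
    for line in text.splitlines():
        # Skip blank lines, the version banner and the column header.
        if not line.strip() or line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        num_objs = int(fields[2])
        objsize = int(fields[3])
        usage.append((name, num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage


# Invented sample in slabinfo 2.1 layout; cache names are placeholders.
SAMPLE = """slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
cache_a 900 1000 1024 4 1 : tunables 54 27 8 : slabdata 250 250 0
cache_b 50 100 256 15 1 : tunables 120 60 8 : slabdata 7 7 0
cache_c 4000 5000 512 8 1 : tunables 54 27 8 : slabdata 625 625 0
"""

if __name__ == "__main__":
    for name, size in parse_slabinfo(SAMPLE)[:10]:
        print(f"{name:20s} {size / 1024:10.1f} KiB")
```

On a live MDS the input would be the contents of /proc/slabinfo (reading it normally requires root), captured at intervals so that growth of individual caches can be compared over time.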

Comment by Peter Jones [ 20/Nov/15 ]

Frank

When do you think that you will be able to provide the info that Oleg has requested?

Thanks

Peter

Comment by Frank Heckes (Inactive) [ 23/Nov/15 ]

Sorry, I overlooked Oleg's request.
For the incident described above we don't have these counters. I enabled collectl on all soak nodes
last week to gather performance counters, especially the slab details. We should be prepared to
replay the stats in case the incident happens again.

Comment by Di Wang [ 24/Nov/15 ]

I checked the log and was also monitoring the MDS when the OOM was about to happen. It seems to be caused by endless recovery on some MDTs, i.e. once the recovery abort problem is fixed, this problem should go away. Since the endless recovery will be fixed by the patch in LU-7039 and other related patches under LU-7455, I will close this ticket.

Frank, if you see something different, please re-open this one. Thanks.

Comment by Di Wang [ 24/Nov/15 ]

Duplicate of LU-7039.
