Details
-
Bug
-
Resolution: Won't Fix
-
Blocker
-
None
-
None
-
lola:
build: tip of master + #31 of change 16383
-
3
-
9223372036854775807
Description
The error below happened during soak testing of change 16838 patch set #31 (no Wiki entry for the build exists yet) on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration.
Primary resources of MDT lola-11 were failed back at Dec 3, 20:18.
Slab allocation increased continuously after that, reaching ~31 GB by the time of the crash.
MDS node lola-11 crashed with the oom-killer active at Dec 4, 00:21 (local time); see also LU-7432.
ptlrpc_cache appears to be the biggest slab consumer.
Attached are lola-11's messages, console log, vmcore-dmesg file, and collectl (version V4.0.2-1) files for the time interval specified above. Also attached are files containing the extracted counters for memory, slab totals, and per-slab allocation.
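For anyone who wants to re-check which slab cache dominates on a live node, the following is a minimal sketch that ranks caches from text in the /proc/slabinfo 2.1 layout. It is not the procedure used to produce the attached counter files. The function name rank_slab_caches and the sample numbers are hypothetical, and the size estimate (num_objs * objsize) is only an approximation that ignores per-slab overhead.

```python
def rank_slab_caches(slabinfo_text, top=5):
    """Return up to `top` (cache_name, approx_bytes) pairs, largest first.

    Assumes the /proc/slabinfo 2.1 column order:
    name active_objs num_objs objsize objperslab pagesperslab : ...
    """
    usage = []
    for line in slabinfo_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, the version banner and the column-header comment.
        if not stripped or stripped.startswith(("slabinfo", "#")):
            continue
        fields = stripped.split()
        if len(fields) < 4:
            continue
        try:
            num_objs = int(fields[2])
            objsize = int(fields[3])
        except ValueError:
            # Not a data row in the expected layout; ignore it.
            continue
        # Approximation: allocated objects times object size, in bytes.
        usage.append((fields[0], num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:top]


# Made-up sample for illustration only; these are NOT values from lola-11.
SAMPLE = """slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ptlrpc_cache 1000 1000 1024 4 1 : tunables 54 27 8 : slabdata 250 250 0
dentry 2000 2000 192 20 1 : tunables 120 60 8 : slabdata 100 100 0
"""

if __name__ == "__main__":
    for name, approx_bytes in rank_slab_caches(SAMPLE):
        print(f"{name:20s} {approx_bytes / 2**20:10.2f} MiB")
```

On a real node the input would come from reading /proc/slabinfo (root is usually required). Running the function against snapshots taken at intervals would show whether one cache keeps growing after the failback.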
Attachments
Activity
Resolution | New: Won't Fix [ 2 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Attachment | New: console-lola-9.log-20151213.gz [ 19937 ] |
Summary | Original: oom killer active after failback MDS resources | New: oom killer active after failback of MDS resources |
Attachment | New: console.log.bz2 [ 19809 ] | |
Attachment | New: memory-counter-lola-11.dat.bz2 [ 19810 ] | |
Attachment | New: messages-lola-11.log.bz2 [ 19811 ] | |
Attachment | New: slab-details-lola-11.dat.bz2 [ 19812 ] | |
Attachment | New: slab-details-one-file-per-slab.tar.bz2 [ 19813 ] | |
Attachment | New: slab-total-lola-11.dat.bz2 [ 19814 ] | |
Attachment | New: vmcore-dmesg.txt.bz2 [ 19815 ] |