[LU-7517] oom killer active after failback of MDS resources Created: 04/Dec/15  Updated: 24/Jan/17  Resolved: 24/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: soak
Environment:

lola:
build: tip of master + #31 of change 16383


Attachments: File console-lola-10.log-20151213.gz     File console-lola-11.log.bz2     File console-lola-9.log-20151213.gz     File console.log.bz2     File lola-10-memory-counter-20151213.dat.bz2     File lola-10-one-file-per-slab.tar.bz2     File lola-10-slab-detail-counter-20151213.dat.bz2     File lola-10-slab-global-counter-20151213.dat.bz2     File lola-11-memory-counter-20151213.dat.bz2     File lola-11-one-file-per-slab.tar.bz2     File lola-11-slab-detail-counter-20151213.dat.bz2     File lola-11-slab-global-counter-20151213.dat.bz2     File lola-9-memory-counter-20151213.dat.bz2     File lola-9-one-file-per-slab.tar.bz2     File lola-9-slab-details-counter-20151213.dat.bz2     File lola-9-slab-global-counter-20151213.dat.bz2     File memory-counter-lola-11.dat.bz2     File messages-lola-10.log-20151213.bz2     File messages-lola-11.log.bz2     File messages-lola-11.log.bz2     File messages-lola-9.log-20151213.bz2     File slab-details-lola-11.dat.bz2     File slab-details-one-file-per-slab.tar.bz2     File slab-total-lola-11.dat.bz2     File vmcore-dmesg.txt.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error below occurred during soak testing of change 16838 patch set #31 (no wiki entry for the build exists yet) on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration.

Primary resources of MDT lola-11 were failed back on Dec 3 at 20:18.
Slab allocation then increased continuously, reaching ~31 GB by the time of the crash.
MDS node lola-11 crashed with oom-killer on Dec 4 at 00:21 (local time). (See also LU-7432.)
ptlrpc_cache seems to be the biggest consumer.
Attached are lola-11's messages, console log, vmcore-dmesg file, and collectl (version V4.0.2-1) files for the time interval specified above. Also
attached are files containing extracted counters for memory, slab totals and per-slab allocation.



 Comments   
Comment by Frank Heckes (Inactive) [ 04/Dec/15 ]

The crash dump has been saved to lola-1:/scratch/crashdumps/lu-7517/127.0.0.1-2015-12-04-00\:22\:36.
It turned out that the collectl raw files are too big to be uploaded to Jira. I saved them to lola-1:/scratch/crashdumps/lu-7517.

Comment by Frank Heckes (Inactive) [ 04/Dec/15 ]

No debug log files have been written.

Comment by Di Wang [ 04/Dec/15 ]

looks like lola-8 and lola-9 got OOM as well.

Comment by Di Wang [ 04/Dec/15 ]

It looks like most of the memory is held by the 1M size slab

20151204 00:21:00 size-1048576 29758 31203524608 29758 31203524608 29758 31203524608 29758 31203524608 0 0

Comment by Frank Heckes (Inactive) [ 15/Dec/15 ]

The error showed up on all soak MDSes (lola-8 not reported in detail) running soak for build https://build.hpdd.intel.com/job/lustre-reviews/36192/

  • lola-9
    • 20151213 09:52:40 failback MDTs to lola finished
    • Dec 13 16:05:01 lola-9 oom-killer started
  • lola-10
    • 20151213 08:01:40 failback MDTs to lola-10 finished
    • Dec 13 11:40:02 lola-10 oom-killer started
  • lola-11
    • 20151214 02:15:00 failback of MDTs to lola-11 finished
    • Dec 14 12:05:01 lola-11 oom-killer started
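
The interval between failback and the first oom-killer message can be worked out from the timestamps above. The following is only a sketch: it assumes both timestamps on each node are in the same local time zone, and it takes the year 2015 from the collectl dates because the syslog-style lines omit it. The names `EVENTS` and `time_to_oom` are illustrative, not part of any tool used in this ticket.

```python
from datetime import datetime

# (failback finished, oom-killer started), copied from the list above.
# Assumption: same local time zone for both; year taken from the collectl dates.
EVENTS = {
    "lola-9": ("2015-12-13 09:52:40", "2015-12-13 16:05:01"),
    "lola-10": ("2015-12-13 08:01:40", "2015-12-13 11:40:02"),
    "lola-11": ("2015-12-14 02:15:00", "2015-12-14 12:05:01"),
}


def time_to_oom(failback, oom):
    """Return the elapsed time between failback and oom-killer start."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(oom, fmt) - datetime.strptime(failback, fmt)


for node, (failback, oom) in EVENTS.items():
    print(node, time_to_oom(failback, oom))
```

With these inputs the intervals are 6:12:21 for lola-9, 3:38:22 for lola-10 and 9:50:01 for lola-11.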

From failback until the oom-killer started, the size-1048576 slab grew continuously; it is by far the biggest memory consumer on each node:
lola-9

#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151213 16:05:40 size-1048576 23122 24245174272 23122 24245174272 23122 24245174272 23122 24245174272 0 0
slab-details/size-512.dat:20151213 16:05:40 size-512 11151621 5709629952 11152824 5710245888 1394094 5710209024 1394103 5710245888 7835648 0
slab-details/size-128.dat:20151213 16:05:40 size-128 7076064 905736192 7077360 905902080 235911 966291456 235912 966295552 1376256 0
slab-details/size-262144.dat:20151213 16:05:40 size-262144 1673 438566912 1673 438566912 1673 438566912 1673 438566912 0 0
slab-details/ptlrpc_cache.dat:20151213 16:05:40 ptlrpc_cache 376940 289489920 376940 289489920 75388 308789248 75388 308789248 204800 0
...

lola-10

#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151213 11:40:40 size-1048576 29494 30926700544 29494 30926700544 29494 30926700544 29494 30926700544 0 0
slab-details/size-262144.dat:20151213 11:40:40 size-262144 1015 266076160 1015 266076160 1015 266076160 1015 266076160 0 0
slab-details/ptlrpc_cache.dat:20151213 11:40:40 ptlrpc_cache 195920 150466560 195920 150466560 39184 160497664 39184 160497664 167936 0
slab-details/size-1024.dat:20151213 11:40:40 size-1024 133537 136741888 133540 136744960 33385 136744960 33385 136744960 73728 0
slab-details/size-512.dat:20151213 11:40:40 size-512 150577 77095424 153896 78794752 19237 78794752 19237 78794752 0 0
...

lola-11

#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151214 12:05:40 size-1048576 29392 30819745792 29392 30819745792 29392 30819745792 29392 30819745792 0 0
slab-details/size-262144.dat:20151214 12:05:40 size-262144 1345 352583680 1345 352583680 1345 352583680 1345 352583680 0 0
slab-details/ptlrpc_cache.dat:20151214 12:05:40 ptlrpc_cache 224232 172210176 224290 172254720 44858 183738368 44858 183738368 57344 0
slab-details/size-1024.dat:20151214 12:05:40 size-1024 150612 154226688 150632 154247168 37655 154234880 37658 154247168 -20480 0
slab-details/size-8192.dat:20151214 12:05:40 size-8192 8859 72572928 8859 72572928 8859 72572928 8859 72572928 0 0
...
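
The per-slab lines above can be ranked mechanically. The sketch below parses the five lola-11 lines using the column names from the header row and sorts them by SlabAllB. It assumes the leading `slab-details/<name>.dat:` part is a file-name prefix (as grep would add) and can be dropped at the first colon. The helper names `parse_line` and `rank_by_allocated_bytes` are hypothetical.

```python
# Column names from the "#Date Time SlabName ..." header row above.
HEADER = ("Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB "
          "SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct").split()

# The five lola-11 lines quoted above, unchanged.
SAMPLE = """\
slab-details/size-1048576.dat:20151214 12:05:40 size-1048576 29392 30819745792 29392 30819745792 29392 30819745792 29392 30819745792 0 0
slab-details/size-262144.dat:20151214 12:05:40 size-262144 1345 352583680 1345 352583680 1345 352583680 1345 352583680 0 0
slab-details/ptlrpc_cache.dat:20151214 12:05:40 ptlrpc_cache 224232 172210176 224290 172254720 44858 183738368 44858 183738368 57344 0
slab-details/size-1024.dat:20151214 12:05:40 size-1024 150612 154226688 150632 154247168 37655 154234880 37658 154247168 -20480 0
slab-details/size-8192.dat:20151214 12:05:40 size-8192 8859 72572928 8859 72572928 8859 72572928 8859 72572928 0 0
"""


def parse_line(line):
    """Parse one slab-detail line into a dict keyed by HEADER names."""
    # Assumption: everything before the first ':' is a file-name prefix.
    _, _, rest = line.partition(":")
    row = dict(zip(HEADER, rest.split()))
    for key in HEADER[3:]:  # all columns after SlabName are integers
        row[key] = int(row[key])
    return row


def rank_by_allocated_bytes(lines):
    """Return parsed rows sorted by SlabAllB, largest first."""
    rows = [parse_line(line) for line in lines if line.strip()]
    return sorted(rows, key=lambda r: r["SlabAllB"], reverse=True)


rows = rank_by_allocated_bytes(SAMPLE.splitlines())
for r in rows:
    print(f'{r["SlabName"]:<14} {r["SlabAllB"] / 2**30:8.2f} GiB')

# Sanity check: 1 MiB objects times object count equals the byte column.
top = rows[0]
assert top["ObjAll"] * 1048576 == top["ObjAllB"]
```

On these five lines size-1048576 comes out on top at about 28.7 GiB (29392 objects of 1 MiB each), while ptlrpc_cache holds well under 1 GiB.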

Attached are messages, console log files and extracted collectl counters for memory, slab-global and slab-details for each node.

Comment by Frank Heckes (Inactive) [ 21/Dec/15 ]

Error appeared again on an MDT for build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214)

Comment by Cliff White (Inactive) [ 24/Jan/17 ]

Old issue not reproduced on recent builds

Generated at Sat Feb 10 02:09:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.