[LU-7517] oom killer active after failback of MDS resources Created: 04/Dec/15 Updated: 24/Jan/17 Resolved: 24/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola: |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error below happens during soak testing of change 16838 patch set #31 (no Wiki entry for the build exists yet) on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration. Primary resources of MDT lola-11 were failed back on Dec 3 at 20:18. |
| Comments |
| Comment by Frank Heckes (Inactive) [ 04/Dec/15 ] |
|
The crash dump has been saved to lola-1:/scratch/crashdumps/lu-7517/127.0.0.1-2015-12-04-00\:22\:36. |
| Comment by Frank Heckes (Inactive) [ 04/Dec/15 ] |
|
No debug log files have been written. |
| Comment by Di Wang [ 04/Dec/15 ] |
|
It looks like lola-8 and lola-9 hit OOM as well. |
| Comment by Di Wang [ 04/Dec/15 ] |
|
It looks like most of the memory is held by the 1 MB size slab (size-1048576):
20151204 00:21:00 size-1048576 29758 31203524608 29758 31203524608 29758 31203524608 29758 31203524608 0 0 |
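As a quick sanity check on the sample above, the byte count should equal the object count times the 1 MiB object size. The sketch below assumes the columns follow the collectl slab-detail header quoted elsewhere in this ticket (Date, Time, SlabName, ObjInUse, ObjInUseB, ...); the helper name `slab_usage` is hypothetical.

```python
# Assumed column order: Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB ...
ROW = ("20151204 00:21:00 size-1048576 29758 31203524608 29758 31203524608 "
       "29758 31203524608 29758 31203524608 0 0")


def slab_usage(row):
    """Return (slab name, objects in use, bytes in use) for one slab row."""
    fields = row.split()
    return fields[2], int(fields[3]), int(fields[4])


name, objs, used_bytes = slab_usage(ROW)
obj_size = int(name.rsplit("-", 1)[1])  # "size-1048576" -> 1048576

# Objects in use times object size should match the reported byte count.
assert objs * obj_size == used_bytes
print(f"{name}: {objs} objects, {used_bytes / 2**30:.2f} GiB")
```

For this row the check passes and the slab accounts for roughly 29 GiB.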
| Comment by Frank Heckes (Inactive) [ 15/Dec/15 ] |
|
The error showed up on all soak MDSes (lola-8 not reported in detail) running soak for build https://build.hpdd.intel.com/job/lustre-reviews/36192/
From failback until the oom-killer started, the size-1048576 slabs increased continuously and are the biggest memory consumers:
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151213 16:05:40 size-1048576 23122 24245174272 23122 24245174272 23122 24245174272 23122 24245174272 0 0
slab-details/size-512.dat:20151213 16:05:40 size-512 11151621 5709629952 11152824 5710245888 1394094 5710209024 1394103 5710245888 7835648 0
slab-details/size-128.dat:20151213 16:05:40 size-128 7076064 905736192 7077360 905902080 235911 966291456 235912 966295552 1376256 0
slab-details/size-262144.dat:20151213 16:05:40 size-262144 1673 438566912 1673 438566912 1673 438566912 1673 438566912 0 0
slab-details/ptlrpc_cache.dat:20151213 16:05:40 ptlrpc_cache 376940 289489920 376940 289489920 75388 308789248 75388 308789248 204800 0
...
lola-10
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151213 11:40:40 size-1048576 29494 30926700544 29494 30926700544 29494 30926700544 29494 30926700544 0 0
slab-details/size-262144.dat:20151213 11:40:40 size-262144 1015 266076160 1015 266076160 1015 266076160 1015 266076160 0 0
slab-details/ptlrpc_cache.dat:20151213 11:40:40 ptlrpc_cache 195920 150466560 195920 150466560 39184 160497664 39184 160497664 167936 0
slab-details/size-1024.dat:20151213 11:40:40 size-1024 133537 136741888 133540 136744960 33385 136744960 33385 136744960 73728 0
slab-details/size-512.dat:20151213 11:40:40 size-512 150577 77095424 153896 78794752 19237 78794752 19237 78794752 0 0
...
lola-11
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151214 12:05:40 size-1048576 29392 30819745792 29392 30819745792 29392 30819745792 29392 30819745792 0 0
slab-details/size-262144.dat:20151214 12:05:40 size-262144 1345 352583680 1345 352583680 1345 352583680 1345 352583680 0 0
slab-details/ptlrpc_cache.dat:20151214 12:05:40 ptlrpc_cache 224232 172210176 224290 172254720 44858 183738368 44858 183738368 57344 0
slab-details/size-1024.dat:20151214 12:05:40 size-1024 150612 154226688 150632 154247168 37655 154234880 37658 154247168 -20480 0
slab-details/size-8192.dat:20151214 12:05:40 size-8192 8859 72572928 8859 72572928 8859 72572928 8859 72572928 0 0
...
Attached are the messages, console log files and extracted collectl counters for memory, slab-global and slab-details for each node. |
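The per-node listings above can be ranked mechanically. The following is a minimal sketch, assuming each data row is a grep-style `file:date time name ...` line with the 13 columns named in the `#Date` header; the names `Slab`, `parse_slab_line` and `rank_slabs` are hypothetical, and the sample rows are the lola-11 rows quoted in this comment.

```python
from collections import namedtuple

Slab = namedtuple("Slab", "name objs_in_use slab_all_bytes")

# lola-11 rows as quoted in this ticket (header line included on purpose).
LOLA_11 = """\
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
slab-details/size-1048576.dat:20151214 12:05:40 size-1048576 29392 30819745792 29392 30819745792 29392 30819745792 29392 30819745792 0 0
slab-details/size-262144.dat:20151214 12:05:40 size-262144 1345 352583680 1345 352583680 1345 352583680 1345 352583680 0 0
slab-details/ptlrpc_cache.dat:20151214 12:05:40 ptlrpc_cache 224232 172210176 224290 172254720 44858 183738368 44858 183738368 57344 0
slab-details/size-1024.dat:20151214 12:05:40 size-1024 150612 154226688 150632 154247168 37655 154234880 37658 154247168 -20480 0
slab-details/size-8192.dat:20151214 12:05:40 size-8192 8859 72572928 8859 72572928 8859 72572928 8859 72572928 0 0
"""


def parse_slab_line(line):
    """Parse one data row; return None for headers and malformed lines."""
    fields = line.split()
    if line.startswith("#") or len(fields) != 13:
        return None
    # fields[2] = SlabName, fields[3] = ObjInUse, fields[10] = SlabAllB
    return Slab(fields[2], int(fields[3]), int(fields[10]))


def rank_slabs(text):
    """Return the parsed slabs sorted by allocated bytes, largest first."""
    slabs = [s for s in map(parse_slab_line, text.splitlines()) if s]
    return sorted(slabs, key=lambda s: s.slab_all_bytes, reverse=True)


ranked = rank_slabs(LOLA_11)
for slab in ranked:
    print(f"{slab.name:<14} {slab.slab_all_bytes / 2**30:8.2f} GiB")
```

Sorting on SlabAllB rather than ObjInUseB is a design choice here: it reflects memory the allocator actually holds, including partially used slabs.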
| Comment by Frank Heckes (Inactive) [ 21/Dec/15 ] |
|
The error appeared again on an MDT for build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214) |
| Comment by Cliff White (Inactive) [ 24/Jan/17 ] |
|
Old issue not reproduced on recent builds |