[LU-7836] MDSes crashed with oom-killer Created: 02/Mar/16 Updated: 05/Aug/16 Resolved: 05/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error occurred during soak testing of build '20160302' (b2_8 RC4); see also https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302. DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. The MDS nodes are configured in an active-active HA failover configuration (for the test set-up see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration). Note: this might be a duplicate of
Sequence of events:
Attached are the messages, console and debug logs of nodes lola-8, 10 and 11. |
| Comments |
| Comment by Frank Heckes (Inactive) [ 02/Mar/16 ] |
|
slab counters of lola-8 can be uploaded on demand. |
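For reference, a minimal sketch of how such per-slab consumption could be sampled on a node, assuming the standard /proc/slabinfo layout (column order as in the slabinfo 2.x format; the output formatting is only an example):

```python
#!/usr/bin/env python3
# Sketch: report the biggest slab consumers from /proc/slabinfo.
# Reading /proc/slabinfo typically requires root.

def top_slabs(n=10):
    rows = []
    with open("/proc/slabinfo") as f:
        for line in f:
            if line.startswith("slabinfo") or line.startswith("#"):
                continue  # skip the version and column-header lines
            fields = line.split()
            name = fields[0]
            num_objs = int(fields[2])   # total allocated objects
            objsize = int(fields[3])    # size of each object in bytes
            rows.append((name, num_objs, num_objs * objsize))
    # largest total footprint first
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]

if __name__ == "__main__":
    for name, objs, total in top_slabs():
        print("%-20s %10d objs %14d bytes" % (name, objs, total))
```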
| Comment by Oleg Drokin [ 04/Mar/16 ] |
|
Just as I asked in the linked issue: do you think you can collect something like that? |
| Comment by Oleg Drokin [ 04/Mar/16 ] |
|
You do this by adding the "malloc" debug mask to the run before the problem starts. |
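For what it's worth, a minimal sketch of how the mask could be added before the run and the buffer dumped afterwards (assuming lctl is available on the node; the file path and the debug_mb value are only examples):

```python
#!/usr/bin/env python3
# Sketch: enable the "malloc" debug mask before the run, then dump the
# kernel debug buffer once the problem has reproduced. Paths are examples.
import subprocess

def lctl(*args):
    cmd = ["lctl"] + list(args)
    print("running:", " ".join(cmd))
    subprocess.check_call(cmd)

# add +malloc to the currently configured debug mask
lctl("set_param", "debug=+malloc")
# optionally enlarge the debug buffer so allocation records are not lost
lctl("set_param", "debug_mb=1024")

# ... run the workload until the problem starts, then dump the buffer ...
lctl("dk", "/tmp/lustre-debug-malloc.log")
```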
| Comment by Frank Heckes (Inactive) [ 07/Mar/16 ] |
|
The debug mask has been extended with '+malloc'. |
| Comment by Frank Heckes (Inactive) [ 07/Mar/16 ] |
|
The error hasn't happened again so far. |
| Comment by Frank Heckes (Inactive) [ 14/Mar/16 ] |
|
The issue happened again during soak testing of b2_8 RC5 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309). The MDS nodes, especially lola-11, had been restarted (randomly) at the following times:
The oom-killer on lola-11 ran at 2016-03-12 01:05, after the last restart of lola-11 finished at 2016-03-11 20:32:39 (line 6 in the list above). |
| Comment by Frank Heckes (Inactive) [ 14/Mar/16 ] |
|
Uploaded files are
|
| Comment by Frank Heckes (Inactive) [ 14/Mar/16 ] |
|
The upload of file lola-11-lustre-loglog.20160311-183459.bz2 stalled every time about halfway through. The continuous increase in the number of slabs started immediately after the clean-up (remount of MDTs on lola-10) and restart
|
| Comment by Peter Jones [ 14/Mar/16 ] |
|
Di, could you please look into this? Thanks, Peter |
| Comment by Di Wang [ 15/Mar/16 ] |
|
After some investigation, it looks like the MDT is blocked on update recovery and then queues too many final ping requests there. I will try to make a patch. |
| Comment by Gerrit Updater [ 15/Mar/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18915 |
| Comment by Peter Jones [ 16/Mar/16 ] |
|
Moving to 2.9 because it seems that this issue only occurs with multiple MDTs per MDS and does not happen with the more common configuration of a single MDT per MDS. Is this a duplicate of |
| Comment by Frank Heckes (Inactive) [ 17/Mar/16 ] |
|
Soak testing has continued with the b2_8 RC5 build on a reformatted Lustre file system. |
| Comment by Frank Heckes (Inactive) [ 18/Mar/16 ] |
|
The error happened after approximately 73 hours of soak testing. Sequence of events:
The distribution of the biggest consumers is similar to the 2-MDTs-per-MDS configuration listed above:

| Date | Time | SlabName | ObjInUse | ObjInUseB | ObjAll | ObjAllB | SlabInUse | SlabInUseB | SlabAll | SlabAllB | SlabChg | SlabPct |
| size-1048576.dat:20160318 | 03:27:00 | size-1048576 | 10115 | 10606346240 | 10115 | 10606346240 | 10115 | 10606346240 | 10115 | 10606346240 | 0 | 0 |
| size-262144.dat:20160318 | 03:27:00 | size-262144 | 449 | 117702656 | 449 | 117702656 | 449 | 117702656 | 449 | 117702656 | 0 | 0 |
| size-8192.dat:20160318 | 03:27:00 | size-8192 | 4663 | 38199296 | 4663 | 38199296 | 4663 | 38199296 | 4663 | 38199296 | 0 | 0 |
| size-1024.dat:20160318 | 03:27:00 | size-1024 | 35610 | 36464640 | 35628 | 36483072 | 8903 | 36466688 | 8907 | 36483072 | 106496 | 0 |
| ptlrpc_cache.dat:20160318 | 03:27:00 | ptlrpc_cache | 41048 | 31524864 | 41080 | 31549440 | 8216 | 33652736 | 8216 | 33652736 | 53248 | 0 |
| size-65536.dat:20160318 | 03:27:00 | size-65536 | 360 | 23592960 | 360 | 23592960 | 360 | 23592960 | 360 | 23592960 | 0 | 0 |
| size-512.dat:20160318 | 03:27:00 | size-512 | 45384 | 23236608 | 45472 | 23281664 | 5684 | 23281664 | 5684 | 23281664 | 8192 | 0 |
| kmem_cache.dat:20160318 | 03:27:00 | kmem_cache | 289 | 9506944 | 289 | 9506944 | 289 | 18939904 | 289 | 18939904 | 0 | 0 |
| inode_cache.dat:20160318 | 03:27:00 | inode_cache | 15638 | 9257696 | 15684 | 9284928 | 2614 | 10706944 | 2614 | 10706944 | 0 | 0 |
| Acpi-Operand.dat:20160318 | 03:27:00 | Acpi-Operand | 133270 | 9595440 | 135468 | 9753696 | 2556 | 10469376 | 2556 | 10469376 | 0 | 0 |

After the occurrence of the error, debug logs with the filter '+malloc +trace' have been taken for ~30 minutes at 4-minute intervals. The buffer size was increased from the initial 128 MB to 1024 MB.
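A minimal sketch of how such periodic post-error dumps could be scripted (the interval, duration, buffer size and debug flags are those mentioned above; the output directory and file naming are hypothetical):

```python
#!/usr/bin/env python3
# Sketch: dump the Lustre debug buffer every 4 minutes for ~30 minutes
# with the '+malloc +trace' filter enabled. Output paths are examples.
import os
import subprocess
import time

OUTDIR = "/tmp/lu-7836-debug"   # hypothetical destination
INTERVAL = 4 * 60               # 4 minutes between dumps
DURATION = 30 * 60              # ~30 minute collection window

os.makedirs(OUTDIR, exist_ok=True)
subprocess.check_call(["lctl", "set_param", "debug=+malloc +trace"])
subprocess.check_call(["lctl", "set_param", "debug_mb=1024"])

deadline = time.time() + DURATION
while time.time() < deadline:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    subprocess.check_call(["lctl", "dk", os.path.join(OUTDIR, "debug." + stamp)])
    time.sleep(INTERVAL)
```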
|
| Comment by Frank Heckes (Inactive) [ 18/Mar/16 ] |
|
Also, it is impossible to abort the recovery process (see https://jira.hpdd.intel.com/browse/LU-7848?focusedCommentId=146077&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-146077) |
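For context, aborting recovery is normally attempted per target with lctl abort_recovery; a minimal sketch is below (the target name is only an example), although in this case the abort did not take effect:

```python
#!/usr/bin/env python3
# Sketch: the usual way to abort recovery on a single MDT from the MDS.
# The device name is an example; here the abort had no effect.
import subprocess

subprocess.check_call(["lctl", "--device", "lustre-MDT0002", "abort_recovery"])
```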
| Comment by Frank Heckes (Inactive) [ 18/Mar/16 ] |
|
The debug files have been uploaded. Oleg: I kept them in binary form as I was unsure what should be extracted. |
| Comment by Gerrit Updater [ 21/Apr/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/19693 |
| Comment by Frank Heckes (Inactive) [ 09/May/16 ] |
|
The oom-killer was also active for build '20160427' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160427). The patch above wasn't applied. Crash dump files are saved in: lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-05-07-10:48:57 and lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-05-07-17:34:05 |
| Comment by Frank Heckes (Inactive) [ 08/Jun/16 ] |
|
The error also occurred while soak testing build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601) after the recovery process stalled for an MDT that failed over to the secondary node. 1st Event:
2nd Event:
|
| Comment by Frank Heckes (Inactive) [ 08/Jun/16 ] |
|
I double-checked the server node lola-11 and found no HW-related errors. |
| Comment by Frank Heckes (Inactive) [ 20/Jul/16 ] |
|
The error hasn't occurred during the soak test of build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713) in a test session that is still ongoing and has already lasted 7 days. |
| Comment by Gerrit Updater [ 20/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19693/ |
| Comment by Joseph Gmitter (Inactive) [ 26/Jul/16 ] |
|
Is this issue resolved with the landing of the above patch? |
| Comment by Frank Heckes (Inactive) [ 27/Jul/16 ] |
|
A new build (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727) including the fix above was started 1 hour ago. The previous test session for '20160713' ran until yesterday (Jul 26th) without an occurrence of this bug. |
| Comment by Joseph Gmitter (Inactive) [ 27/Jul/16 ] |
|
Thanks Frank. We can wait until Friday to validate that the issue is resolved. |
| Comment by Joseph Gmitter (Inactive) [ 01/Aug/16 ] |
|
Hi Frank, |
| Comment by Peter Jones [ 05/Aug/16 ] |
|
As this fix has landed and is intended to address this issue, let's mark this as resolved and reopen it if it reoccurs. |