Details
Bug
Resolution: Fixed
Blocker
Lustre 2.9.0
lola
build: commit aa84fbc8165f526dae4bd824a48c186c3ac2f639 + patches
3
Description
The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. The MDTs were formatted using ldiskfs, the OSTs using zfs. Each MDS node hosts one MDT. The MDSes are configured in an active-active failover configuration.
The issue might be related to LU-7848, although that change is already part of the build under test.
The error ends with the oom-killer being invoked, which is documented in LU-7836. This ticket might be a duplicate of LU-7836.
Events:
1st Event:
- 2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
- 2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
- until 2016-06-04 00:44 - soaked-MDT0002 remained in status 'RECOVERING'.
- 2016-06-04 00:44:52 - lola-11 crashed with oom-killer
2nd Event:
- 2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
- 2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
- since 2016-06-07 08:39:32,155 Wait for recovery to complete
- memory resources are nearly exhausted:
[root@lola-11 ~]# date
Wed Jun 8 07:59:49 PDT 2016
[root@lola-11 ~]# collectl -sm --verbose
waiting for 1 second sample...

# MEMORY SUMMARY
#<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
#  Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked   Inact Total  Used  Free   In  Out Fault MajFt   In  Out
  32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K    509M   15G     0   15G    0    0    28     0    0    8
  32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K    509M   15G     0   15G    0    0    63     0    0    4
  32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K    509M   15G     0   15G    0    0     1     0    0    0
  32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K    509M   15G     0   15G    0    0    17     0    0    0
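To put the sample above in proportion, here is a minimal Python sketch (not part of the original report) that parses one collectl data row and computes how much of physical memory is held by slab. The column order is taken from the header printed by collectl above; the names `COLUMNS`, `to_mib`, `parse_row` and `slab_fraction` are illustrative assumptions, and the `Swap`/`Page` prefixes were added only to keep the dictionary keys unique.

```python
# Column names follow the collectl -sm --verbose header shown above.
# The Swap*/Page* prefixes are an assumption made to disambiguate
# the repeated "Total", "Used", "Free", "In" and "Out" headings.
COLUMNS = [
    "Total", "Used", "Free", "Buff", "Cached", "Slab", "Mapped",
    "Anon", "Commit", "Locked", "Inact",
    "SwapTotal", "SwapUsed", "SwapFree", "SwapIn", "SwapOut",
    "Fault", "MajFt", "PageIn", "PageOut",
]

# Conversion factors from the unit suffixes to MiB.
UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def to_mib(field):
    """Convert a field such as '28701M' or '127144K' to MiB.

    Fields without a unit suffix (counters, or a plain 0) are
    returned unchanged as floats.
    """
    suffix = field[-1]
    if suffix in UNIT_TO_MIB:
        return float(field[:-1]) * UNIT_TO_MIB[suffix]
    return float(field)


def parse_row(line):
    """Split one whitespace-separated data row into a dict keyed by column."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, (to_mib(f) for f in fields)))


def slab_fraction(row):
    """Share of total physical memory accounted to slab."""
    return row["Slab"] / row["Total"]


# First data row of the sample taken on lola-11.
row = parse_row(
    "32006M 30564M 1441M 127144K 676256K 28701M 16196K 69072K "
    "201740K 5008K 509M 15G 0 15G 0 0 28 0 0 8"
)
print(f"Slab share of physical memory: {slab_fraction(row):.1%}")
# prints: Slab share of physical memory: 89.7%
```

For this sample, roughly 90% of the 32006M of physical memory is slab, while page cache, buffers and anonymous memory together account for well under 1G.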
Attached files:
- 1st event only: crash dump file saved to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52
- 2nd event only: kernel debug log of lola-11, dmesg
- Both events: messages, console logs