Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.9.0
Affects Version/s: Lustre 2.9.0
Labels:
- soak
Environment:
lola
build: commit aa84fbc8165f526dae4bd824a48c186c3ac2f639 + patches

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. MDT have been formatted using ldiskfs, OSTs using zfs. MDSes host one MDT per node. MDSes are configured in active-active failover configuration.

The issue might be related to ~~LU-7848~~ although the change is part of build under test.
The error results in the start of the oom-killer, which is documented in ~~LU-7836~~. This ticket might be a duplicate of ~~LU-7836~~.

Events:
1st Event:

2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
till 2016-06-04-00:44 - soaked-MDT0002 in status 'RECOVERING'.
2016-06-04-00:44:52 - lola-11 crash with oom-killer

2nd Event:

2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
since 2016-06-07 08:39:32,155 Wait for recovery to complete

memory resources are nearly exhausted:

[root@lola-11 ~]# date
Wed Jun  8 07:59:49 PDT 2016
[root@lola-11 ~]# collectl -sm --verbose
waiting for 1 second sample...

# MEMORY SUMMARY
#<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
#   Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked Inact Total  Used  Free   In  Out Fault MajFt   In  Out
   32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    28     0    0    8
   32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    63     0    0    4
   32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0     1     0    0    0
   32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    17     0    0    0

Attached files:
*1st event only: Saved crash dump file to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52

2nd event only: kernel debug log of lola-11, dmesg
Both event: messages, console logs

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

console-lola-11.log-20160608.bz2
63 kB
08/Jun/16 4:04 PM
dmesg-lola-11-20160609-0811.bz2
26 kB
08/Jun/16 4:04 PM
lola-11-lustre-log.20160608-0656.bz2
0.3 kB
08/Jun/16 4:04 PM
messages-lola-11.log-20160608.bz2
311 kB
08/Jun/16 4:04 PM
soak.log.2016-06-11-1417.txt.bz2
366 kB
21/Sep/16 11:51 AM

Issue Links

is duplicated by

LU-8428 Very long recovery times for MDTs after MDS restart

Resolved

is related to

LU-8704 RPC sent inside mdd_create transaction.

Open

LU-8714 too many update logs during soak-test.

Open

Activity

People

Assignee:: Mikhail Pershin

Reporter:: Frank Heckes (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 13 Start watching this issue

Dates

Created:: 08/Jun/16 3:54 PM

Updated:: 26/Oct/16 11:24 PM

Resolved:: 26/Oct/16 11:24 PM