Details


    Description

      The error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test setup configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)

      Note: This might be a duplicate of LU-7780

      Sequence of events:

      • 2016-03-01 20:38:40 triggering fault mds_failover (lola-10 --> lola-11)
      • 2016-03-01 20:41:58 LNet service thread hang on lola-8
        LNet: Service thread pid 5074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes to /tmp/lustre-log.1456893718.5074
        

        (see attached file lustre-log.1456893718.5074)

      • Lustre recovery never completed (until the next failover)
      • All slurm jobs stalled and began to time out after ~20:40.
        Slabs continuously allocated memory. The top 10 consumers before the
        next failover were:
        slab-details/size-1048576.dat:20160301 22:09:20 size-1048576 13215 13856931840 13215 13856931840 13215 13856931840 13215 13856931840 67108864 0
        slab-details/size-262144.dat:20160301 22:09:20 size-262144 385 100925440 385 100925440 385 100925440 385 100925440 0 0
        slab-details/size-192.dat:20160301 22:09:20 size-192 268155 51485760 268320 51517440 13416 54951936 13416 54951936 24576 0
        slab-details/size-8192.dat:20160301 22:09:20 size-8192 5508 45121536 5508 45121536 5508 45121536 5508 45121536 0 0
        slab-details/size-1024.dat:20160301 22:09:20 size-1024 41857 42861568 41864 42868736 10466 42868736 10466 42868736 110592 0
        slab-details/ptlrpc_cache.dat:20160301 22:09:20 ptlrpc_cache 35590 27333120 35605 27344640 7121 29167616 7121 29167616 122880 0
        slab-details/size-65536.dat:20160301 22:09:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        slab-details/size-512.dat:20160301 22:09:20 size-512 37639 19271168 38920 19927040 4865 19927040 4865 19927040 45056 0
        slab-details/kmem_cache.dat:20160301 22:09:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        slab-details/size-4096.dat:20160301 22:09:20 size-4096 2970 12165120 2971 12169216 2970 12165120 2971 12169216 4096 0
        slab-details/inode_cache.dat:20160301 22:09:20 inode_cache 15612 9242304 15678 9281376 2613 10702848 2613 10702848 0 0
        
      • 2016-03-01 22:09:46,199:fsmgmt.fsmgmt:INFO triggering fault mds_failover (lola-10 --> lola-11)
      • 2016-03-01 22:16:19,483:fsmgmt.fsmgmt:INFO mds_failover just completed
      • Recovery of MDTs (mdt-4,5) never completed
      • 2016-03-02 00:15-30 Created stack traces and debug log files on lola-[8,10,11]
      • 2016-03-02 01:30:01 (approximately) oom-killer started on MDS lola-10.
        Top 10 consumers are:
        slab-details/size-1048576.dat:20160302 01:32:00 size-1048576 29882 31333548032 29882 31333548032 29882 31333548032 29882 31333548032 1048576 0
        slab-details/size-262144.dat:20160302 01:32:00 size-262144 641 168034304 641 168034304 641 168034304 641 168034304 0 0
        slab-details/size-1024.dat:20160302 01:32:00 size-1024 84449 86475776 84484 86511616 21121 86511616 21121 86511616 106496 0
        slab-details/ptlrpc_cache.dat:20160302 01:32:00 ptlrpc_cache 95492 73337856 95510 73351680 19102 78241792 19102 78241792 114688 0
        slab-details/size-192.dat:20160302 01:32:00 size-192 298834 57376128 299260 57457920 14963 61288448 14963 61288448 0 0
        slab-details/size-8192.dat:20160302 01:32:00 size-8192 5862 48021504 5862 48021504 5862 48021504 5862 48021504 -8192 0
        slab-details/size-512.dat:20160302 01:32:00 size-512 77790 39828480 79096 40497152 9879 40464384 9887 40497152 28672 0
        slab-details/size-65536.dat:20160302 01:32:00 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        slab-details/kmem_cache.dat:20160302 01:32:00 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        slab-details/size-128.dat:20160302 01:32:00 size-128 78707 10074496 100920 12917760 3364 13778944 3364 13778944 0 0
        

      • 2016-03-02 03:00 MDS lola-8 also crashed with oom-killer

      Attached: messages, console and debug logs of nodes lola-8, lola-10 and lola-11;
      file recovery-status-20160302, showing recovery and process status around 2016-03-02 00:15; memory and detailed slab counters of lola-10.
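The slab-details records above (one line per sample: date, time, slab name, then counter columns) can be post-processed to rank consumers and to measure growth between snapshots. A minimal sketch in Python, assuming the fifth field is the allocated byte count (an inference from the samples in this ticket, not a documented format):

```python
# Sketch: rank slab consumers from slab-details .dat records and compare
# two snapshots. The column layout (date, time, slab name, objects, bytes,
# ...) is assumed from the samples in this ticket, not from any spec.

def parse_line(line):
    """Return (slab_name, bytes_used) from one slab-details record."""
    fields = line.split()
    # fields: [0] date, [1] time, [2] slab name, [3] objects, [4] bytes, ...
    return fields[2], int(fields[4])

# Two records each from the 22:09 and 01:32 snapshots quoted above.
SNAP_2209 = [
    "20160301 22:09:20 size-1048576 13215 13856931840 13215 13856931840 13215 13856931840 13215 13856931840 67108864 0",
    "20160301 22:09:20 size-262144 385 100925440 385 100925440 385 100925440 385 100925440 0 0",
]
SNAP_0132 = [
    "20160302 01:32:00 size-1048576 29882 31333548032 29882 31333548032 29882 31333548032 29882 31333548032 1048576 0",
    "20160302 01:32:00 size-262144 641 168034304 641 168034304 641 168034304 641 168034304 0 0",
]

def top_consumers(lines, n=10):
    """Slabs sorted by bytes used, largest first."""
    return sorted((parse_line(l) for l in lines),
                  key=lambda kv: kv[1], reverse=True)[:n]

def growth(before, after):
    """Per-slab byte growth between two snapshots."""
    b = dict(parse_line(l) for l in before)
    a = dict(parse_line(l) for l in after)
    return {name: a[name] - b.get(name, 0) for name in a}

if __name__ == "__main__":
    print(top_consumers(SNAP_0132, 2))
    # size-1048576 grew by ~16.3 GiB between the two snapshots
    print(growth(SNAP_2209, SNAP_0132)["size-1048576"])
```

Run against the full per-slab .dat files, this reproduces the top-10 rankings quoted in the description and makes the runaway size-1048576 growth obvious.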

      Attachments

        1. console-lola-10.bz2
          363 kB
        2. console-lola-11.bz2
          344 kB
        3. console-lola-11.log-20160608.bz2
          63 kB
        4. console-lola-8.bz2
          523 kB
        5. console-lola-8.log-20160318.bz2
          494 kB
        6. dmesg-lola-11-20160609-0811.bz2
          26 kB
        7. lola-10-lustre-log.20160302-0008.bz2
          4.51 MB
        8. lola-10-memory-counter-2214-0130.dat.bz2
          18 kB
        9. lola-10-slab-detailed-2214-0130.dat.bz2
          595 kB
        10. lola-11-lustre-log.20160302-0008.bz2
          1.33 MB
        11. lola-11-lustre-log.20160608-0656.bz2
          0.3 kB
        12. lola-11-mem-counter-20160311_1729-1910.dat.bz2
          11 kB
        13. lola-11-memory-debug-messages.20160311-183459.bz2
          0.3 kB
        14. lola-8-lustre-log.20160302-0155.bz2
          4.42 MB
        15. lola-8-lustre-log-20160318-0134.bz2
          1.04 MB
        16. lola-8-lustre-log-20160318-0145.bz2
          1.35 MB
        17. lola-8-lustre-log-20160318-0147.bz2
          885 kB
        18. lola-8-lustre-log-20160318-0150.bz2
          938 kB
        19. lola-8-lustre-log-20160318-0153.bz2
          933 kB
        20. lola-8-lustre-log-20160318-0156.bz2
          1.02 MB
        21. lola-8-lustre-log-20160318-0200.bz2
          1012 kB
        22. lola-8-lustre-log-20160318-0204.bz2
          1.00 MB
        23. lola-8-slab-detailed-counter-20160318.tar.bz2
          895 kB
        24. lustre-log.1456893718.5074.bz2
          0.2 kB
        25. messages-lola-10.bz2
          237 kB
        26. messages-lola-11.bz2
          171 kB
        27. messages-lola-11.log-20160608.bz2
          311 kB
        28. messages-lola-8.bz2
          232 kB
        29. messages-lola-8.log-20160318.bz2
          729 kB
        30. recovery-status-20160302
          4 kB
        31. slab-details-2041-2208-one-file-per-slab.tar.bz2
          211 kB
        32. slab-details-2214-0130-one-file-per-slab.tar.bz2
          433 kB
        33. slab-sorted-alloaction-2041-2208.dat.bz2
          4 kB
        34. slab-sorted-alloaction-2214-0130.dat.bz2
          4 kB

        Issue Links

          Activity

            [LU-7836] MDSes crashed with oom-killer
            pjones Peter Jones added a comment -

            As this fix has landed and is intended to fix this issue then let's mark this as resolved and then reopen if it reoccurs


            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Frank,
            Are you comfortable that the issue is resolved in the soak run from the end of last week?
            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Thanks Frank. We can wait until Friday to validate that the issue is resolved.

            heckes Frank Heckes (Inactive) added a comment -

            A new build (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727) including the fix above was started 1 hour ago. The previous test session for '20160713' ran until yesterday (Aug 26th) without this bug occurring.
            If possible I'd like to run build '20160727' until Friday before closing the ticket. Anyway, feel free to close the ticket.

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Is this issue resolved with the landing of the above patch?

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19693/
            Subject: LU-7836 ptlrpc: remove duplicate final ping req
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1f2bb415543f8801f3b4a7c55e29b715b09f327b

            heckes Frank Heckes (Inactive) added a comment -

            The error did not occur during soak testing of build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713) in a test session that is still ongoing and has already lasted 7 days.

            heckes Frank Heckes (Inactive) added a comment -

            I double-checked the server node lola-11 and found no HW-related errors.
            heckes Frank Heckes (Inactive) added a comment - - edited

            The error also occurred while soak testing build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601) after the recovery process stalled for an MDT that had failed over to the secondary node. Although LU-7848 is part of the build, the recovery process still stalls.
            Configuration: DNE enabled; 1 MDT per MDS; nodes lola-10 and lola-11 form an active-active failover cluster.

            1st Event:

            • 2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
            • 2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
            • until 2016-06-04 00:44 - soaked-MDT0002 in status 'RECOVERING'
            • 2016-06-04 00:44:52 - lola-11 crashed with oom-killer

            2nd Event:

            • 2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
            • 2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
            • since 2016-06-07 08:39:32,155 - waiting for recovery to complete
            • memory resources are nearly exhausted:
              [root@lola-11 ~]# date
              Wed Jun  8 07:59:49 PDT 2016
              [root@lola-11 ~]# collectl -sm --verbose
              waiting for 1 second sample...
              
              # MEMORY SUMMARY
              #<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
              #   Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked Inact Total  Used  Free   In  Out Fault MajFt   In  Out
                 32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    28     0    0    8
                 32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    63     0    0    4
                 32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0     1     0    0    0
                 32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    17     0    0    0
              

              Attached files:
            • 1st event only: saved crash dump file to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52
            • 2nd event only: kernel debug log of lola-11
            • Both events: messages, console logs
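The collectl sample above shows the slab cache (~28701M) consuming nearly all of the 32006M of physical memory on lola-11. As an illustration only, the same condition can be checked from /proc/meminfo-style input; the 80% threshold below is an arbitrary example value, not a tunable from this ticket:

```python
# Sketch: warn when the kernel slab cache consumes most of physical memory,
# as observed on lola-11 (Slab ~28.7 GB of 32 GB). The 0.80 threshold is an
# illustrative choice, not a value taken from this ticket.

def slab_ratio(meminfo_text):
    """Return Slab/MemTotal from /proc/meminfo-style text (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])
    return fields["Slab"] / fields["MemTotal"]

# Values approximating the collectl sample above (32006M total, 28701M slab).
SAMPLE = """\
MemTotal:       32774144 kB
MemFree:         1475584 kB
Slab:           29389824 kB
"""

if __name__ == "__main__":
    ratio = slab_ratio(SAMPLE)
    if ratio > 0.80:
        print("WARNING: slab cache at %.0f%% of RAM" % (ratio * 100))
```

On a live node this would read open("/proc/meminfo").read() instead of the embedded sample.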
            heckes Frank Heckes (Inactive) added a comment - - edited

            oom-killer was also active for build '20160427' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160427). The patch above wasn't applied. Crash dump files are saved in:

            lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-05-07-10:48:57
            lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-05-07-17:34:05
            

            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 8
