Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Environment: lola
      build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
    • Severity: 3
    • 9223372036854775807

    Description

      The error occurred during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727).
      OSTs formatted with zfs, MDSs formatted with ldiskfs
      DNE is enabled, HSM/robinhood enabled and integrated
      4 MDSs with 1 MDT / MDS
      6 OSSs with 4 OSTs / OSS
      Server nodes configured in active-active HA configuration

      Sequence of events:

      • 2016-07-28 08:48:37 - Soak session started
      • 2016-07-28 08:50:34 - First LNet time out:
         
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
        Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
        

        (see also attached file abrt-kernel-oops.tar.bz2; in total 1545 event records of this form
        were written before the node crashed)

      • 2016-07-28 08:51:03 - First occurrence of the error below. These errors flooded the console after some time (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console').
        Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>]  [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
        Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918  EFLAGS: 00000287
        Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
        Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
        Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
        Jul 28 08:51:03 lola-5 kernel: FS:  0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
        Jul 28 08:51:03 lola-5 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
        
      • 2016-07-28 16:02 - oom-killer started (see entry in console); last mtime update
        of the collectl data file:
        -rw-r--r-- 1 root root  738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
        
      • Node was accessible neither via ssh nor via the console. Node was rebooted. No crash dump file was written (the parameter panic_on_lbug=1 was set via set_param).
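
        Since no vmcore was captured even though panic_on_lbug=1 was set, here is a minimal sketch for verifying that crash dumps are actually armed on the OSS nodes. These are generic RHEL 6 kdump/sysctl checks, not commands taken from this ticket:

        # Confirm kdump is running and a crash kernel is reserved (illustrative checks):
        service kdump status
        grep crashkernel /proc/cmdline
        # Ensure an LBUG/oops panics the node so kdump can capture a vmcore:
        lctl set_param panic_on_lbug=1
        sysctl -w kernel.panic_on_oops=1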

      Attached files:
      messages, console, and debug logs (written between Jul 28 08:48 and 16:02), abrt-kernel-oops.tar.bz2 (content for a single event),
      collectl memory and slab counters.

      We'll try to trigger a crash dump on the next affected node and increase the debug mask. The current debug files don't contain slab information, as far as I could see.
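
      For reference, a minimal sketch of increasing the Lustre debug mask and buffer on a server node. The specific values are assumptions; a later comment mentions a "(512M, -1)" debug log, which corresponds roughly to these settings:

      # Illustrative only: enable all debug flags and a 512 MB debug buffer
      lctl set_param debug=-1
      lctl set_param debug_mb=512
      # After the problem reproduces, dump the kernel debug buffer to a file
      lctl dk /tmp/lustre-debug.log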

      Attachments

        1. abrt-kernel-oops.tar.bz2
          3 kB
        2. all-lustre-log.tar.bz2
          759 kB
        3. allocation-per-slab.tar.bz2
          1.81 MB
        4. console-lola-5.log.bz2
          698 kB
        5. lola-2-leak_finder.output.bz2
          221 kB
        6. lola-2-lustre-log.1470213950.128013.bz2
          0.3 kB
        7. lola-6.timeouts.txt
          0.2 kB
        8. lola-7.errors.txt
          619 kB
        9. memory-counter-lola-5-20160728-021116.dat.bz2
          80 kB
        10. messages-lola-5.log.bz2
          582 kB
        11. slab-details-counter-lola-5-20160728-021116.dat.bz2
          2.75 MB
        12. slab-sorted-alloaction.dat.bz2
          5 kB

        Issue Links

          Activity

            [LU-8449] OSS crash with oom-killer started
            jhammond John Hammond added a comment -

            What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console.
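
            A minimal sketch of the suggested check (the commands are standard sysctl usage; any output is hypothetical):

            # Inspect the current panic behaviour on the OSS; 0 means the node hangs after a panic instead of rebooting:
            sysctl kernel.panic
            # Suggested setting: reboot 60 seconds after a panic so the console can capture the full trace first:
            sysctl -w kernel.panic=60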

            jhammond John Hammond added a comment -

            Do we have stack traces for the soft lockups?
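
            If they weren't captured automatically, a minimal sketch for collecting them manually the next time the lockups appear (standard sysrq usage, not taken from this ticket):

            # Dump backtraces of all active CPUs to the console/dmesg:
            echo 1 > /proc/sys/kernel/sysrq
            echo l > /proc/sysrq-trigger
            # 't' dumps all tasks, which is more verbose but includes the blocked ll_ost_io threads:
            echo t > /proc/sysrq-trigger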

            Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce?
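
            A hedged sketch of how the commit range between the two build hashes can be enumerated, assuming a local clone of the lustre-release master branch:

            # Count and list the commits between the two builds:
            git rev-list --count 4af430de697249d07bcb7fe3401a6caa4415fe31..0f37c051158a399f7b00536eeec27f5dbdd54168
            git log --oneline 4af430de697249d07bcb7fe3401a6caa4415fe31..0f37c051158a399f7b00536eeec27f5dbdd54168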

            cliffw Cliff White (Inactive) added a comment - - edited

            Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug)
            Upgrading OSS nodes to build 3409 Revision: 9d8469694738288fcd246a1e9f22c47918a32c58
            Soak restart Thu Aug 4 19:38:27 PDT 2016
            First soft lockup: Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]


            heckes Frank Heckes (Inactive) added a comment -

            I attached debug logs (and the output of leak_finder.pl, most likely not so important) from a debug log (512M, -1) of an OSS node (lola-2) that crashed with the same error. The test session was started at 2016-08-03 01:39 and the node crashed very soon with:

            Aug  3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
            Aug  3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.114@o2ib10 (51): c: 0, oc: 0, rc: 7
            Aug  3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8803fd339400
            Aug  3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff88032a550e00
            Aug  3 01:43:59 lola-2 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost_io00_028:22505]
            Aug  3 01:43:59 lola-2 kernel: Modules linked in:
            

            (Btw. 192.168.1.114 is the LNet router. There's no message in the router's syslog or console log that can be correlated with this.)


            doug Doug Oucharek (Inactive) added a comment -

            If CON RACEs were involved, we should be seeing some rejection logs from o2iblnd with the reason CON RACE.
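
            A quick, illustrative way to check the attached logs for such rejections; the exact message text depends on the o2iblnd version, so treat the search pattern as an assumption:

            # Search the attached (bzip2-compressed) logs for o2iblnd rejection/race messages:
            bzcat console-lola-5.log.bz2 messages-lola-5.log.bz2 | grep -iE "reject|race"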


            adilger Andreas Dilger added a comment -

            Could this be related to "LU-7646 lnet: Stop Infinite CON RACE Condition", where the connection fails and reconnecting consumes all the memory?


            heckes Frank Heckes (Inactive) added a comment -

            The top 10 slab consumers read as follows:

            #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
            slab-details/filp.dat:20160728 16:01:40 filp 86528715 22151351040 86528715 22151351040 5768581 23628107776 5768581 23628107776 19533824 0
            slab-details/size-131072.dat:20160728 16:01:40 size-131072 8696 1139802112 8696 1139802112 8696 1139802112 8696 1139802112 0 0
            slab-details/dentry.dat:20160728 16:01:40 dentry 2671440 512916480 2671440 512916480 133572 547110912 133572 547110912 26107904 5
            slab-details/zio_buf_16384.dat:20160728 16:01:40 zio_buf_16384 18261 299188224 18261 299188224 18261 299188224 18261 299188224 0 0
            slab-details/size-1048576.dat:20160728 16:01:40 size-1048576 256 268435456 256 268435456 256 268435456 256 268435456 0 0
            slab-details/task_struct.dat:20160728 16:01:40 task_struct 43410 115991520 43410 115991520 14470 118538240 14470 118538240 57344 0
            slab-details/zio_cache.dat:20160728 16:01:40 zio_cache 31401 34917912 94507 105091784 13501 110600192 13501 110600192 0 0
            slab-details/size-262144.dat:20160728 16:01:40 size-262144 395 103546880 395 103546880 395 103546880 395 103546880 0 0
            slab-details/biovec-256.dat:20160728 16:01:40 biovec-256 23042 94380032 23042 94380032 23042 94380032 23042 94380032 0 0
            slab-details/size-1024.dat:20160728 16:01:40 size-1024 61913 63398912 61956 63442944 15489 63442944 15489 63442944 -4096 0
            

            See also the attached files slab-sorted-alloaction.dat and allocation-per-slab.tar.bz2.
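
            The table above appears to come from the collectl slab-details files; a roughly equivalent live snapshot can be taken directly on the node, for example (sketch only, standard /proc/slabinfo layout assumed, run as root):

            # Top 10 slab caches by total allocated bytes (num_objs * objsize):
            awk 'NR > 2 { printf "%-24s %15d\n", $1, $3 * $4 }' /proc/slabinfo | sort -k2 -rn | head -10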


            People

              bzzz Alex Zhuravlev
              heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: