Details

Type: Bug
Resolution: Duplicate
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.9.0
Labels:
- soak
Environment:
lola
build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

The error occurred during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
OSTs are formatted with ZFS, MDSs are formatted with ldiskfs
DNE is enabled, HSM/robinhood is enabled and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes are configured in an active-active HA configuration

Sequence of events:

2016-07-28 08:48:37 - Soak session started

2016-07-28 08:50:34 - First LNet timeout:

 
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]

(See also the attached file abrt-kernel-oops.tar.bz2. In total, 1545 event records of this form
were written before the node crashed.)
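
To reproduce a count such as the 1545 records mentioned above from the attached logs, something along the following lines could be used. This is only a sketch: the regular expression is derived from the two soft-lockup lines quoted in this ticket, and the helper name `count_soft_lockups` is made up for illustration. The sample below uses only lines from this description, not the full console log.

```python
import re
from collections import Counter

# Pattern derived from the soft-lockup lines quoted in this ticket (assumption:
# all such records in the log follow the same layout).
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Return a Counter mapping thread name to its number of soft-lockup records."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


# Sample input: lines copied from the excerpt in this description.
sample = [
    "Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages",
    "Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]",
    "Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]",
]

counts = count_soft_lockups(sample)
print(sum(counts.values()), dict(counts))
```

For the real log, the list could be replaced by the lines of the decompressed console or messages file; grouping by thread name shows whether the lockups are confined to the ll_ost_io threads.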

2016-07-28 08:51:03 - First occurrence of the error below. These errors flooded the console after some time (see the console log after the entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console').

Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>]  [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918  EFLAGS: 00000287
Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
Jul 28 08:51:03 lola-5 kernel: FS:  0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
Jul 28 08:51:03 lola-5 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)

2016-07-28 16:02 - oom-killer started (see entry in the console log); this is also the last mtime update
of the collectl data file:
```
-rw-r--r-- 1 root root  738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
```
The node was accessible neither via ssh nor via the console and was rebooted. No crash dump file was written. (The parameter panic_on_lbug=1 had been set with set_param.)
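
For orientation, the offsets of the events listed above relative to the start of the soak session can be worked out as follows. This is a rough sketch using only the timestamps given in this description; the oom-killer time is only known to the minute, so its seconds are assumed to be :00, and the event date is assumed to be 2016-07-28.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps as stated in the sequence of events of this ticket.
events = [
    ("soak session started", "2016-07-28 08:48:37"),
    ("first LNet timeout", "2016-07-28 08:50:34"),
    ("first soft lockup", "2016-07-28 08:51:03"),
    ("oom-killer started", "2016-07-28 16:02:00"),  # seconds assumed to be :00
]

start = datetime.strptime(events[0][1], FMT)

# Offset of every event from the start of the soak session.
elapsed = {name: datetime.strptime(ts, FMT) - start for name, ts in events}

for name, delta in elapsed.items():
    print(f"{name}: +{delta}")
```

So the first soft lockup appeared less than three minutes into the session, while the node kept running for roughly seven more hours before the oom-killer started.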

Attached files:
messages, console, and debug logs (written between Jul 28, 08:48 and 16:02), abrt-kernel-oops.tar.bz2 (content for a single event),
collectl memory and slab counters.

We'll try to trigger a crash dump on the next node that is affected and increase the debug mask. As far as I can see, the current debug files don't contain slab information.

Attachments

- abrt-kernel-oops.tar.bz2, 29/Jul/16 12:41 PM, 3 kB, Frank Heckes
- all-lustre-log.tar.bz2, 29/Jul/16 12:41 PM, 759 kB, Frank Heckes
- allocation-per-slab.tar.bz2, 29/Jul/16 12:44 PM, 1.81 MB, Frank Heckes
- console-lola-5.log.bz2, 29/Jul/16 12:44 PM, 698 kB, Frank Heckes
- lola-2-leak_finder.output.bz2, 03/Aug/16 4:42 PM, 221 kB, Frank Heckes
- lola-2-lustre-log.1470213950.128013.bz2, 03/Aug/16 4:42 PM, 0.3 kB, Frank Heckes
- lola-6.timeouts.txt, 10/Aug/16 12:09 AM, 0.2 kB, Cliff White
- lola-7.errors.txt, 05/Aug/16 7:43 PM, 619 kB, Cliff White
- memory-counter-lola-5-20160728-021116.dat.bz2, 29/Jul/16 12:41 PM, 80 kB, Frank Heckes
- messages-lola-5.log.bz2, 29/Jul/16 12:44 PM, 582 kB, Frank Heckes
- slab-details-counter-lola-5-20160728-021116.dat.bz2, 29/Jul/16 12:41 PM, 2.75 MB, Frank Heckes
- slab-sorted-alloaction.dat.bz2, 29/Jul/16 12:44 PM, 5 kB, Frank Heckes

Issue Links

is related to
LU-7899 osd_xattr_set() to batch actual EA update (Resolved)

OSS crash with oom-killer started
