Details
Bug
Resolution: Duplicate
Blocker
None
Lustre 2.9.0
lola
build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
3
Description
The error happened during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled, HSM/robinhood enabled and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes configured in active-active HA configuration
Sequence of events:
- 2016-07-28 08:48:37 - Soak session started
- 2016-07-28 08:50:34 - First LNet time out:
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
(see also attached file abrt-kernel-oops.tar.bz2; in total, 1545 event records of this form had been written by the time the node crashed)
- 2016-07-28 08:51:03 - First occurrence of the error below. These errors flooded the console after some time (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console'):
Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>] [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918 EFLAGS: 00000287
Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
Jul 28 08:51:03 lola-5 kernel: FS: 0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
Jul 28 08:51:03 lola-5 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
- 2016-07-28 16:02 - oom-killer started (see entry in console log); last mtime update of the collectl data file:
-rw-r--r-- 1 root root 738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
- The node was accessible via neither ssh nor console. Node rebooted. No crash dump file was written. (The parameter panic_on_lbug=1 had been set with set_param.)
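Since the console was flooded with records of the form quoted above, a small script can help summarize which threads hit the soft lockup and where they were spinning. The following is a minimal sketch, not part of the original report: the function name `summarize` and both regular expressions are assumptions derived only from the two log line shapes shown in this ticket, and they may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Shape assumed from: "BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Shape assumed from: "RIP: 0010:[<ffffffff8129e8af>] [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20"
RIP = re.compile(
    r"RIP: [0-9a-f]{4}:\[<[0-9a-f]+>\]\s+\[<[0-9a-f]+>\]\s+(?P<sym>\w+)\+0x"
)


def summarize(lines):
    """Count soft lockups per thread name and RIP symbols over an iterable of log lines."""
    threads = Counter()
    symbols = Counter()
    for line in lines:
        lockup = SOFT_LOCKUP.search(line)
        if lockup:
            threads[lockup.group("comm")] += 1
        rip = RIP.search(line)
        if rip:
            symbols[rip.group("sym")] += 1
    return threads, symbols


if __name__ == "__main__":
    # Sample lines copied from the console excerpts in this ticket.
    sample = [
        "Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]",
        "Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]",
        "Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>] [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20",
    ]
    threads, symbols = summarize(sample)
    print(threads.most_common())
    print(symbols.most_common())
```

To run it against the attached console log, pass an open file object to `summarize` in place of the sample list.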
Attached files:
message, console, and debug message files (written between Jul 28, 08:48 and 16:02), abrt-kernel-oops.tar.bz2 (content for a single event)
collectl memory and slab counters.
We'll try to trigger a crash dump on the next node that is affected and increase the debug mask. The current debug files don't contain slab information, as far as I could see.
Issue Links
- is related to LU-7899 osd_xattr_set() to batch actual EA update (Resolved)