Details
-
Bug
-
Resolution: Duplicate
-
Blocker
-
None
-
Lustre 2.9.0
-
lola
build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
-
3
-
9223372036854775807
Description
error happened during soaktesting of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
OSTs formatted with zfs, MDSs formatted with ldiskfs
DNE is enabled, HSM/robinhood enable and integrated
4 MDSs with 1 MDT / MDS
6 OSSs with 4 OSTs / OSS
Server nodes configured in active-active HA confguration
Sequence of events:
- 2016-07-28 08:48:37 - Soak session started
- 2016-07-28 08:50:34 - First LNet time out:
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8 Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605] Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
(see also attached file abrt-kernel-oops.tar.bz2; In total 1545 event records of this form
had been written till the node crashed) - 2016-07-28 08:51:03 - First occurrenance of error below. These error flooded the console after some time. (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console')
Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>] [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20 Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918 EFLAGS: 00000287 Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000 Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8 Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000 Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28 Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be Jul 28 08:51:03 lola-5 kernel: FS: 0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000 Jul 28 08:51:03 lola-5 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0 Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
- 2016-07-28-16 16:02 - oom-killer started (see entry in console) and last mtime update
for collect data file:-rw-r--r-- 1 root root 738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
- Node neither accessible via ssh nor console. Node rebooted. No crash dump file was written. (Parameter set_param panic_on_lbug=1 was set).
Attached files:
message, console, and debug message (written inbetween Jul 28, 08:48 - 16:02), abrt-kernel-oops.tar.bz2 (content for single event)
collectl memory and slab counters.
We'll try to trigger a crashdump on the node which will be affected next and increase debug mask. Current debug files don't contain slab information, as far as I could see.
Attachments
- all-lustre-log.tar.bz2
- 759 kB
- allocation-per-slab.tar.bz2
- 1.81 MB
- console-lola-5.log.bz2
- 698 kB
- lola-6.timeouts.txt
- 0.2 kB
- lola-7.errors.txt
- 619 kB
- messages-lola-5.log.bz2
- 582 kB
Issue Links
- is related to
-
LU-7899 osd_xattr_set() to batch actual EA update
-
- Resolved
-
Activity
Set kernel.panic = 60 on all OSS nodes. Re-ran test. Errors are attached
This is likely introduced by:
commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43 Author: Alex Zhuravlev <alexey.zhuravlev@intel.com> Date: Fri Mar 25 12:21:16 2016 +0300 LU-7899 osd: batch EA updates during file creation we set number of EAs: LMA, VBR, LinkEA, LOVEA, ACLs. calling into SA to refill spill again and again is expensive. thus it makes sense to postpone this to osd_trans_stop() where all changed EAs has been already collected in a temporary buffer. Change-Id: I8f02a287b96615c3aa550d63ffd9dd3da51b39ee Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Reviewed-on: http://review.whamcloud.com/19143 Tested-by: Jenkins Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Lai Siyao <lai.siyao@intel.com> Reviewed-by: Bobi Jam <bobijam@hotmail.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Alex, can you comment here?
The soft lockups appear within minutes of starting soak. Once the first one occurs, we get a cascade.
Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580] Aug 4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Aug 4 19:58:26 lola-7 kernel: CPU 3 Aug 4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Aug 4 19:58:26 lola-7 kernel: Aug 4 19:58:26 lola-7 kernel: Pid: 20580, comm: ll_ost_io00_031 Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ Aug 4 19:58:26 lola-7 kernel: RIP: 0010:[<ffffffff8129e8a9>] [<ffffffff8129e8a9>] __write_lock_failed+0x9/0x20 Aug 4 19:58:26 lola-7 kernel: RSP: 0018:ffff88030bfb3778 EFLAGS: 00000287 Aug 4 19:58:26 lola-7 kernel: RAX: ffff8802a01ab780 RBX: ffff88030bfb3780 RCX: 0000000000026c0a Aug 4 19:58:26 lola-7 kernel: RDX: ffff8803c52a8000 RSI: ffff8803c52a8000 RDI: ffff8803a66afb88 Aug 4 19:58:26 lola-7 kernel: RBP: ffffffff8100bc0e R08: ffff8802d28a96c0 R09: 0000000000000008 Aug 4 19:58:26 lola-7 kernel: R10: 0000000a0007d83d R11: 0000000000000000 R12: ffff88030bfb36f0 Aug 4 19:58:26 lola-7 kernel: R13: 0000000000000000 R14: ffffffffa0943036 R15: ffff880360c1f660 Aug 4 19:58:26 lola-7 kernel: FS: 0000000000000000(0000) GS:ffff880038660000(0000) knlGS:0000000000000000 Aug 4 19:58:26 lola-7 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Aug 4 19:58:26 lola-7 kernel: CR2: 0000000001620874 CR3: 0000000001a8d000 CR4: 00000000000407e0 Aug 4 19:58:26 lola-7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 4 19:58:26 lola-7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Aug 4 19:58:26 lola-7 kernel: Process ll_ost_io00_031 (pid: 20580, threadinfo ffff88030bfb0000, task ffff88030bc72ab0) Aug 4 19:58:26 lola-7 kernel: Stack: Aug 4 19:58:26 lola-7 kernel: ffffffff8153d007 ffff88030bfb37b0 ffffffffa0fe2ab0 ffff8802d28a96d0 Aug 4 19:58:26 lola-7 kernel: <d> ffff8803c52a8000 ffff8803a66afad0 ffff88030bfb38d0 ffff88030bfb37e0 Aug 4 19:58:26 lola-7 kernel: <d> ffffffffa0fee79b 0000000000000000 ffff8803f6147180 ffff88030bfb37e0 Aug 4 19:58:26 lola-7 kernel: Call Trace: Aug 4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130 Aug 4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:58:26 lola-7 kernel: Code: 00 00 48 8b 5b 20 48 83 eb 07 48 39 d9 73 06 48 89 01 31 c0 c3 b8 f2 ff ff ff c3 90 90 90 90 90 90 90 f0 81 07 00 00 00 01 f3 90 <81> 3f 00 00 00 01 75 f6 f0 81 2f 00 00 00 01 0f 85 e2 ff ff ff Aug 4 19:58:26 lola-7 kernel: Call Trace: Aug 4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130 Aug 4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:59:41 lola-7 kernel: INFO: task txg_quiesce:12866 blocked for more than 120 seconds. Aug 4 19:59:41 lola-7 kernel: Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Aug 4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 4 19:59:41 lola-7 kernel: txg_quiesce D 0000000000000009 0 12866 2 0x00000080 Aug 4 19:59:41 lola-7 kernel: ffff88040f2afd20 0000000000000046 0000000000000000 ffff88040f2afce4 Aug 4 19:59:41 lola-7 kernel: 0000020a00000000 ffff88043fe83000 0000020be7124652 ffff880038675a00 Aug 4 19:59:41 lola-7 kernel: 000000000000af7e 00000001001dbf93 ffff88041a5f7ad8 ffff88040f2affd8 Aug 4 19:59:41 lola-7 kernel: Call Trace: Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0313e5d>] cv_wait_common+0x11d/0x130 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40 Aug 4 19:59:41 lola-7 kernel: [<ffffffff812a0d8c>] ? __bitmap_weight+0x8c/0xb0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0313ec5>] __cv_wait+0x15/0x20 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d19fb>] txg_quiesce_thread+0x32b/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa030f238>] thread_generic_wrapper+0x68/0x80 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa030f1d0>] ? thread_generic_wrapper+0x0/0x80 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:59:41 lola-7 kernel: INFO: task ll_ost_io01_001:13081 blocked for more than 120 seconds. Aug 4 19:59:41 lola-7 kernel: Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Aug 4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 4 19:59:41 lola-7 kernel: ll_ost_io01_0 D 0000000000000005 0 13081 2 0x00000080 Aug 4 19:59:41 lola-7 kernel: ffff88040abdf818 0000000000000046 ffff88040abdf7e0 ffff88040abdf7dc Aug 4 19:59:41 lola-7 kernel: 0000000000000000 ffff88043fe82800 0000020be1fe58b0 ffff8800386d5a00 Aug 4 19:59:41 lola-7 kernel: 0000000000000400 00000001001dbf28 ffff88040abc45f8 ffff88040abdffd8 Aug 4 19:59:41 lola-7 kernel: Call Trace: Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153c2e4>] ? down_read+0x24/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0fe88d7>] osd_attr_get+0x77/0x2e0 [osd_zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa1122861>] ofd_attr_handle_ugid+0x161/0x430 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112842d>] ofd_write_attr_set+0xad/0x8b0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa1128e5e>] ofd_commitrw_write+0x22e/0xfa0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112e23d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112a18f>] ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b612e4>] obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b6a080>] tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff8105ab9c>] ? check_preempt_curr+0x7c/0x90 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b688bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b153c1>] ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console.
Do we have stack traces for the soft lockups?
Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce?
Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug)
Upgrading OSS nodes to build 3409 Revision: 9d8469694738288fcd246a1e9f22c47918a32c58
Soak restart Thu Aug 4 19:38:27 PDT 2016
First soft lockup: Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]
I attached debug logs (and output of leak_finder.pl, most likely not so important) of a debug log (512M, -1) of an OSS node (lola-2) that crashed with the same error. Test session was started at 2016-08-03 01:39 and node crashed very soon with:
Aug 3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: tx_queu e, 0 seconds Aug 3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.16 8.1.114@o2ib10 (51): c: 0, oc: 0, rc: 7 Aug 3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8803fd339400 Aug 3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff88032a550e00 Aug 3 01:43:59 lola-2 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost_io00_028:22505] Aug 3 01:43:59 lola-2 kernel: Modules linked in:
(Btw. 192.168.1.114 is the LNet router. There's no message that can be correlated in the router's syslog nor console log.
If CON RACEs were involved, we should be seeing some rejection logs from o2iblnd with the reason CON RACE.
Could this be related to "LU-7646 lnet: Stop Infinite CON RACE Condition" where the connection fails and reconnecting consumes all the memory?
The top-10-consumers read as:
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct slab-details/filp.dat:20160728 16:01:40 filp 86528715 22151351040 86528715 22151351040 5768581 23628107776 5768581 23628107776 19533824 0 slab-details/size-131072.dat:20160728 16:01:40 size-131072 8696 1139802112 8696 1139802112 8696 1139802112 8696 1139802112 0 0 slab-details/dentry.dat:20160728 16:01:40 dentry 2671440 512916480 2671440 512916480 133572 547110912 133572 547110912 26107904 5 slab-details/zio_buf_16384.dat:20160728 16:01:40 zio_buf_16384 18261 299188224 18261 299188224 18261 299188224 18261 299188224 0 0 slab-details/size-1048576.dat:20160728 16:01:40 size-1048576 256 268435456 256 268435456 256 268435456 256 268435456 0 0 slab-details/task_struct.dat:20160728 16:01:40 task_struct 43410 115991520 43410 115991520 14470 118538240 14470 118538240 57344 0 slab-details/zio_cache.dat:20160728 16:01:40 zio_cache 31401 34917912 94507 105091784 13501 110600192 13501 110600192 0 0 slab-details/size-262144.dat:20160728 16:01:40 size-262144 395 103546880 395 103546880 395 103546880 395 103546880 0 0 slab-details/biovec-256.dat:20160728 16:01:40 biovec-256 23042 94380032 23042 94380032 23042 94380032 23042 94380032 0 0 slab-details/size-1024.dat:20160728 16:01:40 size-1024 61913 63398912 61956 63442944 15489 63442944 15489 63442944 -4096 0
see also attached files slab-sorted-alloaction.dat, allocation-per-slab.tar.bz2
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/21827
Subject:
LU-8449osd: retake locks as plannedProject: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7a6836c0e49a3065ef876b6fa41e233502adecd6