[LU-8449] OSS crash with oom-killer started Created: 29/Jul/16 Updated: 11/Aug/16 Resolved: 11/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
error happened during soaktesting of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727) Sequence of events:
Attached files: We'll try to trigger a crashdump on the node which will be affected next and increase debug mask. Current debug files don't contain slab information, as far as I could see. |
| Comments |
| Comment by Frank Heckes (Inactive) [ 29/Jul/16 ] |
|
The top-10-consumers read as: #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct slab-details/filp.dat:20160728 16:01:40 filp 86528715 22151351040 86528715 22151351040 5768581 23628107776 5768581 23628107776 19533824 0 slab-details/size-131072.dat:20160728 16:01:40 size-131072 8696 1139802112 8696 1139802112 8696 1139802112 8696 1139802112 0 0 slab-details/dentry.dat:20160728 16:01:40 dentry 2671440 512916480 2671440 512916480 133572 547110912 133572 547110912 26107904 5 slab-details/zio_buf_16384.dat:20160728 16:01:40 zio_buf_16384 18261 299188224 18261 299188224 18261 299188224 18261 299188224 0 0 slab-details/size-1048576.dat:20160728 16:01:40 size-1048576 256 268435456 256 268435456 256 268435456 256 268435456 0 0 slab-details/task_struct.dat:20160728 16:01:40 task_struct 43410 115991520 43410 115991520 14470 118538240 14470 118538240 57344 0 slab-details/zio_cache.dat:20160728 16:01:40 zio_cache 31401 34917912 94507 105091784 13501 110600192 13501 110600192 0 0 slab-details/size-262144.dat:20160728 16:01:40 size-262144 395 103546880 395 103546880 395 103546880 395 103546880 0 0 slab-details/biovec-256.dat:20160728 16:01:40 biovec-256 23042 94380032 23042 94380032 23042 94380032 23042 94380032 0 0 slab-details/size-1024.dat:20160728 16:01:40 size-1024 61913 63398912 61956 63442944 15489 63442944 15489 63442944 -4096 0 see also attached files slab-sorted-alloaction.dat, allocation-per-slab.tar.bz2 |
| Comment by Andreas Dilger [ 29/Jul/16 ] |
|
Could this be related to " |
| Comment by Doug Oucharek (Inactive) [ 02/Aug/16 ] |
|
If CON RACEs were involved, we should be seeing some rejection logs from o2iblnd with the reason CON RACE. |
| Comment by Frank Heckes (Inactive) [ 03/Aug/16 ] |
|
I attached debug logs (and output of leak_finder.pl, most likely not so important) of a debug log (512M, -1) of an OSS node (lola-2) that crashed with the same error. Test session was started at 2016-08-03 01:39 and node crashed very soon with: Aug 3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: tx_queu e, 0 seconds Aug 3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.16 8.1.114@o2ib10 (51): c: 0, oc: 0, rc: 7 Aug 3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8803fd339400 Aug 3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff88032a550e00 Aug 3 01:43:59 lola-2 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost_io00_028:22505] Aug 3 01:43:59 lola-2 kernel: Modules linked in: (Btw. 192.168.1.114 is the LNet router. There's no message that can be correlated in the router's syslog nor console log. |
| Comment by Cliff White (Inactive) [ 05/Aug/16 ] |
|
Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug) |
| Comment by John Hammond [ 05/Aug/16 ] |
|
Do we have stack traces for the soft lockups? Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce? |
| Comment by John Hammond [ 05/Aug/16 ] |
|
What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console. |
| Comment by Cliff White (Inactive) [ 05/Aug/16 ] |
|
The soft lockups appear within minutes of starting soak. Once the first one occurs, we get a cascade. Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580] Aug 4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Aug 4 19:58:26 lola-7 kernel: CPU 3 Aug 4 19:58:26 lola-7 kernel: Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm raid1 dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core joydev i2c_i801 ioatdma ses enclosure sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Aug 4 19:58:26 lola-7 kernel: Aug 4 19:58:26 lola-7 kernel: Pid: 20580, comm: ll_ost_io00_031 Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ Aug 4 19:58:26 lola-7 kernel: RIP: 0010:[<ffffffff8129e8a9>] [<ffffffff8129e8a9>] __write_lock_failed+0x9/0x20 Aug 4 19:58:26 lola-7 kernel: RSP: 0018:ffff88030bfb3778 EFLAGS: 00000287 Aug 4 19:58:26 lola-7 kernel: RAX: ffff8802a01ab780 RBX: ffff88030bfb3780 RCX: 0000000000026c0a Aug 4 19:58:26 lola-7 kernel: RDX: ffff8803c52a8000 RSI: ffff8803c52a8000 RDI: ffff8803a66afb88 Aug 4 19:58:26 lola-7 kernel: RBP: ffffffff8100bc0e R08: ffff8802d28a96c0 R09: 0000000000000008 Aug 4 19:58:26 lola-7 kernel: R10: 0000000a0007d83d R11: 0000000000000000 R12: ffff88030bfb36f0 Aug 4 19:58:26 lola-7 kernel: R13: 0000000000000000 R14: ffffffffa0943036 R15: ffff880360c1f660 Aug 4 19:58:26 lola-7 kernel: FS: 0000000000000000(0000) GS:ffff880038660000(0000) knlGS:0000000000000000 Aug 4 19:58:26 lola-7 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Aug 4 19:58:26 lola-7 kernel: CR2: 0000000001620874 CR3: 0000000001a8d000 CR4: 00000000000407e0 Aug 4 19:58:26 lola-7 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 4 19:58:26 lola-7 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Aug 4 19:58:26 lola-7 kernel: Process ll_ost_io00_031 (pid: 20580, threadinfo ffff88030bfb0000, task ffff88030bc72ab0) Aug 4 19:58:26 lola-7 kernel: Stack: Aug 4 19:58:26 lola-7 kernel: ffffffff8153d007 ffff88030bfb37b0 ffffffffa0fe2ab0 ffff8802d28a96d0 Aug 4 19:58:26 lola-7 kernel: <d> ffff8803c52a8000 ffff8803a66afad0 ffff88030bfb38d0 ffff88030bfb37e0 Aug 4 19:58:26 lola-7 kernel: <d> ffffffffa0fee79b 0000000000000000 ffff8803f6147180 ffff88030bfb37e0 Aug 4 19:58:26 lola-7 kernel: Call Trace: Aug 4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130 Aug 4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:58:26 lola-7 kernel: Code: 00 00 48 8b 5b 20 48 83 eb 07 48 39 d9 73 06 48 89 01 31 c0 c3 b8 f2 ff ff ff c3 90 90 90 90 90 90 90 f0 81 07 00 00 00 01 f3 90 <81> 3f 00 00 00 01 75 f6 f0 81 2f 00 00 00 01 0f 85 e2 ff ff ff Aug 4 19:58:26 lola-7 kernel: Call Trace: Aug 4 19:58:26 lola-7 kernel: [<ffffffff8153d007>] ? _write_lock+0x17/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fe2ab0>] ? osd_object_sa_dirty_add+0x40/0x80 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fee79b>] ? __osd_sa_xattr_schedule_update+0x4b/0x160 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff09af>] ? __osd_sa_xattr_set+0x25f/0x350 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0ff0d8c>] ? osd_xattr_set+0x2ec/0x4d0 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090cdb2>] ? dt_version_set+0x72/0x230 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b5a443>] ? tgt_txn_stop_cb+0x2c3/0x1160 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa090b96e>] ? dt_txn_hook_stop+0x5e/0x90 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0fddc3c>] ? osd_trans_stop+0x29c/0x520 [osd_zfs] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1121a0f>] ? ofd_trans_stop+0x1f/0x60 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa1129100>] ? ofd_commitrw_write+0x4d0/0xfa0 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa112a18f>] ? ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:58:26 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b612e4>] ? obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b6a080>] ? tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b688bc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b153c1>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130 Aug 4 19:58:26 lola-7 kernel: [<ffffffff81539896>] ? schedule+0x176/0x3a0 Aug 4 19:58:26 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a138e>] ? kthread+0x9e/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Aug 4 19:58:26 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:58:26 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:59:41 lola-7 kernel: INFO: task txg_quiesce:12866 blocked for more than 120 seconds. Aug 4 19:59:41 lola-7 kernel: Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Aug 4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 4 19:59:41 lola-7 kernel: txg_quiesce D 0000000000000009 0 12866 2 0x00000080 Aug 4 19:59:41 lola-7 kernel: ffff88040f2afd20 0000000000000046 0000000000000000 ffff88040f2afce4 Aug 4 19:59:41 lola-7 kernel: 0000020a00000000 ffff88043fe83000 0000020be7124652 ffff880038675a00 Aug 4 19:59:41 lola-7 kernel: 000000000000af7e 00000001001dbf93 ffff88041a5f7ad8 ffff88040f2affd8 Aug 4 19:59:41 lola-7 kernel: Call Trace: Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0313e5d>] cv_wait_common+0x11d/0x130 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40 Aug 4 19:59:41 lola-7 kernel: [<ffffffff812a0d8c>] ? __bitmap_weight+0x8c/0xb0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0313ec5>] __cv_wait+0x15/0x20 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d19fb>] txg_quiesce_thread+0x32b/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa03d16d0>] ? txg_quiesce_thread+0x0/0x470 [zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa030f238>] thread_generic_wrapper+0x68/0x80 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa030f1d0>] ? thread_generic_wrapper+0x0/0x80 [spl] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Aug 4 19:59:41 lola-7 kernel: INFO: task ll_ost_io01_001:13081 blocked for more than 120 seconds. Aug 4 19:59:41 lola-7 kernel: Tainted: P --L------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Aug 4 19:59:41 lola-7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 4 19:59:41 lola-7 kernel: ll_ost_io01_0 D 0000000000000005 0 13081 2 0x00000080 Aug 4 19:59:41 lola-7 kernel: ffff88040abdf818 0000000000000046 ffff88040abdf7e0 ffff88040abdf7dc Aug 4 19:59:41 lola-7 kernel: 0000000000000000 ffff88043fe82800 0000020be1fe58b0 ffff8800386d5a00 Aug 4 19:59:41 lola-7 kernel: 0000000000000400 00000001001dbf28 ffff88040abc45f8 ffff88040abdffd8 Aug 4 19:59:41 lola-7 kernel: Call Trace: Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8153c2e4>] ? down_read+0x24/0x30 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0fe88d7>] osd_attr_get+0x77/0x2e0 [osd_zfs] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa1122861>] ofd_attr_handle_ugid+0x161/0x430 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112842d>] ofd_write_attr_set+0xad/0x8b0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa1128e5e>] ofd_commitrw_write+0x22e/0xfa0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112e23d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa112a18f>] ofd_commitrw+0x5bf/0xb10 [ofd] Aug 4 19:59:41 lola-7 kernel: [<ffffffff81149461>] ? kzfree+0x31/0x40 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa08e8121>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b612e4>] obd_commitrw+0x114/0x380 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b6a080>] tgt_brw_write+0xc70/0x1530 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff8105ab9c>] ? check_preempt_curr+0x7c/0x90 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0abee20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b688bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b153c1>] ptlrpc_main+0xd31/0x1800 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Aug 4 19:59:41 lola-7 kernel: [<ffffffffa0b14690>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Aug 4 19:59:41 lola-7 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Aug 4 19:59:41 lola-7 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by John Hammond [ 05/Aug/16 ] |
|
This is likely introduced by: commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43
Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Date: Fri Mar 25 12:21:16 2016 +0300
LU-7899 osd: batch EA updates
during file creation we set number of EAs: LMA, VBR, LinkEA, LOVEA, ACLs.
calling into SA to refill spill again and again is expensive. thus it
makes sense to postpone this to osd_trans_stop() where all changed EAs
has been already collected in a temporary buffer.
Change-Id: I8f02a287b96615c3aa550d63ffd9dd3da51b39ee
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-on: http://review.whamcloud.com/19143
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Alex, can you comment here? |
| Comment by Cliff White (Inactive) [ 05/Aug/16 ] |
|
Set kernel.panic = 60 on all OSS nodes. Re-ran test. Errors are attached |
| Comment by Gerrit Updater [ 09/Aug/16 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/21827 |
| Comment by Cliff White (Inactive) [ 10/Aug/16 ] |
|
We have not seen an oom-killer, however the immediate soft lockups continue. |
| Comment by Cliff White (Inactive) [ 10/Aug/16 ] |
|
Will continue with run to see if it ooo-kills |
| Comment by Cliff White (Inactive) [ 10/Aug/16 ] |
|
System did not oom-kill however all OSS are unusable due to massive timeous. |
| Comment by Alex Zhuravlev [ 10/Aug/16 ] |
|
Cliff, with the patch above? |
| Comment by Cliff White (Inactive) [ 10/Aug/16 ] |
|
Yes, with the patch above |
| Comment by Alex Zhuravlev [ 10/Aug/16 ] |
|
hmm... would you mind to try refreshed version, please? please, attach the logs. |
| Comment by Peter Jones [ 11/Aug/16 ] |
|
Fixed by reverting |