[LU-7721] lfsck: slab 'size-1048576' exhaust memory (oom-killer) Created: 28/Jan/16 Updated: 29/Jan/16 Resolved: 29/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error happened during soak testing of build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126). DNE is enabled. Sequence of events:
|
| Comments |
| Comment by nasf (Inactive) [ 28/Jan/16 ] |
|
According to the log file messages-lola-5.log.bz2, four kinds of threads are hung there:

1) OUT handlers for attr set. These RPC handlers hit an ASSERT(), as shown below:

Jan 27 08:46:18 lola-5 kernel: LustreError: 7807:0:(osd_object.c:925:osd_attr_set()) ASSERTION( dt_object_exists(dt) ) failed:
Jan 27 08:46:18 lola-5 kernel: LustreError: 7807:0:(osd_object.c:925:osd_attr_set()) LBUG
Jan 27 08:46:18 lola-5 kernel: Pid: 7807, comm: ll_ost_out00_00
Jan 27 08:46:18 lola-5 kernel:
Jan 27 08:46:18 lola-5 kernel: Call Trace:
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa05c5875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa05c5e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa0b25af5>] osd_attr_set+0xdd5/0xe40 [osd_zfs]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa070e795>] ? keys_fill+0xd5/0x1b0 [obdclass]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa02da916>] ? spl_kmem_alloc+0x96/0x1a0 [spl]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa09b2033>] out_tx_attr_set_exec+0xa3/0x480 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa09a849a>] out_tx_end+0xda/0x5c0 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa09ae364>] out_handle+0x11c4/0x19a0 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa099d6fa>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa09a4f9c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa094c201>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
Jan 27 08:46:18 lola-5 kernel: [<ffffffffa094b3c0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Jan 27 08:46:18 lola-5 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Jan 27 08:46:18 lola-5 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 27 08:46:18 lola-5 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Jan 27 08:46:18 lola-5 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

2) IO server threads. They are blocked when starting a transaction on ZFS:

Jan 27 08:56:22 lola-5 kernel: LNet: Service thread pid 7776 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 27 08:56:22 lola-5 kernel: Pid: 7776, comm: ll_ost_io01_003
Jan 27 08:56:22 lola-5 kernel:
Jan 27 08:56:22 lola-5 kernel: Call Trace:
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8117591d>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8109ee6e>] ? prepare_to_wait_exclusive+0x4e/0x80
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa02e178d>] cv_wait_common+0x11d/0x130 [spl]
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa02e17f5>] __cv_wait+0x15/0x20 [spl]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa0356d4d>] dmu_tx_wait+0x21d/0x400 [zfs]
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8152b844>] ? mutex_lock+0x24/0x50
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa0357121>] dmu_tx_assign+0xa1/0x570 [zfs]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa0b18f5d>] osd_trans_start+0xed/0x430 [osd_zfs]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa0fccc4c>] ofd_trans_start+0x7c/0x100 [ofd]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa0fd4993>] ofd_commitrw_write+0x583/0x10f0 [ofd]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa0fd5abf>] ofd_commitrw+0x5bf/0xbf0 [ofd]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa06ec921>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa099cfd4>] obd_commitrw+0x114/0x380 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa09a6790>] tgt_brw_write+0xc70/0x1540 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8105c256>] ? enqueue_task+0x66/0x80
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8105872d>] ? check_preempt_curr+0x6d/0x90
Jan 27 08:56:22 lola-5 kernel: [<ffffffff81064a6e>] ? try_to_wake_up+0x24e/0x3e0
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa093e070>] ? lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa08f35d0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa09a4f9c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa094c201>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
Jan 27 08:56:22 lola-5 kernel: [<ffffffffa094b3c0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Jan 27 08:56:22 lola-5 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

3) OUT handlers for orphan cleanup. They are also blocked when starting a transaction on ZFS:

Jan 27 17:51:32 lola-5 kernel: Pid: 8262, comm: ll_ost02_051
Jan 27 17:51:32 lola-5 kernel:
Jan 27 17:51:32 lola-5 kernel: Call Trace:
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa02e178d>] cv_wait_common+0x11d/0x130 [spl]
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa02e17f5>] __cv_wait+0x15/0x20 [spl]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa039877b>] txg_wait_open+0x8b/0xd0 [zfs]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa0356f27>] dmu_tx_wait+0x3f7/0x400 [zfs]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa036b5da>] ? dsl_dir_tempreserve_space+0xca/0x190 [zfs]
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa0357121>] dmu_tx_assign+0xa1/0x570 [zfs]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa0b18f5d>] osd_trans_start+0xed/0x430 [osd_zfs]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa0fcef10>] ofd_object_destroy+0x280/0x8e0 [ofd]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa0fc8f0d>] ofd_destroy_by_fid+0x35d/0x620 [ofd]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa090abc0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa090c530>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa05d0d01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa0fc0d13>] ofd_create_hdl+0x1a23/0x2640 [ofd]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa093a66b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa09a4f9c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa094c201>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
Jan 27 17:51:32 lola-5 kernel: [<ffffffffa094b3c0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Jan 27 17:51:32 lola-5 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

4) OUT handlers for create. They are blocked at mutex_lock():

Jan 27 17:59:44 lola-5 kernel: [<ffffffff8152bd26>] __mutex_lock_slowpath+0x96/0x210
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa0fcbe7b>] ? ofd_seq_load+0xbb/0xa90 [ofd]
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8152b84b>] mutex_lock+0x2b/0x50
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa0fbff18>] ofd_create_hdl+0xc28/0x2640 [ofd]
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa093a66b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa093a7a6>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa093a8f1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa09a4f9c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa094c201>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8106c560>] ? pick_next_task_fair+0xd0/0x130
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8152a126>] ? schedule+0x176/0x3a0
Jan 27 17:59:44 lola-5 kernel: [<ffffffffa094b3c0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Jan 27 17:59:44 lola-5 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

As more and more RPC service threads become blocked, more and more RAM is pinned, until the OOM killer finally fires. Since the threads in cases 1), 2), and 3) may hold a mutex while blocked, it is not surprising that case 4) occurs. So the key question is not why the mutex is held, but why the ZFS transactions hang. According to the stack info, this looks similar to LU-6923.
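For reference, the threads in cases 2) and 3) are all stuck in the standard ZFS DMU transaction-assign path (osd_trans_start -> dmu_tx_assign -> dmu_tx_wait). Below is a minimal sketch of that pattern, not the actual osd-zfs code: the dmu_tx_*() calls and TXG_WAIT are the real DMU interfaces, while the function name, the particular hold call, and the error handling are illustrative assumptions.

#include <sys/dmu.h>

/*
 * Sketch (NOT the actual osd-zfs code) of the DMU transaction-start
 * pattern the blocked service threads are executing.
 */
static int osd_trans_start_sketch(objset_t *os, uint64_t object)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	int rc;

	/* Declare what the transaction will modify so ZFS can reserve space. */
	dmu_tx_hold_bonus(tx, object);

	/*
	 * With TXG_WAIT, dmu_tx_assign() sleeps inside dmu_tx_wait() until
	 * the transaction can join an open txg.  If txg sync is stalled,
	 * every thread entering here blocks indefinitely -- the state seen
	 * in stacks 2) and 3) above.
	 */
	rc = dmu_tx_assign(tx, TXG_WAIT);
	if (rc != 0) {
		dmu_tx_abort(tx);
		return rc;
	}

	/* ... perform the declared modifications under this tx ... */

	dmu_tx_commit(tx);
	return 0;
}

Because each thread blocked in dmu_tx_assign() keeps whatever request buffers and mutexes it acquired beforehand, a stalled txg translates directly into the thread pile-up and memory exhaustion described above. |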
| Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ] |
|
Hi Fan Yong, |
| Comment by nasf (Inactive) [ 29/Jan/16 ] |
|
As explained above, the key issue is that the ZFS transactions were blocked for some unknown reason. According to the kernel stack traces, this should be another failure instance of LU-6923, which has no solution yet. |
| Comment by nasf (Inactive) [ 29/Jan/16 ] |
|
It is another failure instance of LU-6923. |