[LU-10407] osc_cache.c:1141:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: Created: 19/Dec/17  Updated: 14/Jan/21  Resolved: 14/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Minor
Reporter: Alexey Lyashkov Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 7 + master 063a83ab1fe518e52dbc7fb5f6e9d092b20f44e9.


Issue Links:
Related
is related to LU-11463 DOM: osc_extent_make_ready()) ASSERTI... Reopened
is related to LU-14326 sanity-dom test_fsx: crash in osc_ext... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing a patch of my own, I hit this panic.
As far as I can see, it is the result of the unstable page count not being updated atomically.
osc_io_commit_async() adds pages to the cache and unlocks the osc object, and the size is only updated shortly afterwards. osc_io_unplug(), called from brw_queue_work, finds the pages in the cache, and the panic is hit when it tries to count the number of unstable pages.
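
For background on why a lagging size update produces this particular assertion: as far as I can tell, osc_extent_make_ready() refreshes the byte count of the last page in the extent from the object's known size (kms) and then asserts that the count is positive (the LASSERT quoted in the summary). Below is a small, self-contained sketch of that arithmetic; the function name and constants are made up for illustration and this is not the actual osc_cache.c code:

#include <stdio.h>

/* Illustrative sketch of how the byte count of the last queued page is
 * derived from the object's known size (kms).  Made-up names; not the real
 * osc code, only the arithmetic behind ASSERTION( last_oap_count > 0 ). */

#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_SIZE   (1L << TOY_PAGE_SHIFT)

static long refresh_count(long kms, unsigned long index)
{
        long start = (long)index << TOY_PAGE_SHIFT;

        if (start >= kms)
                return 0;               /* page lies beyond the known size */
        if (start + TOY_PAGE_SIZE > kms)
                return kms - start;     /* partial last page at EOF */
        return TOY_PAGE_SIZE;           /* fully covered page */
}

int main(void)
{
        unsigned long index = 10;       /* last page queued for write */

        /* kms not yet raised past the page: the refreshed count is 0,
         * which is exactly what ASSERTION( last_oap_count > 0 ) rejects. */
        printf("stale kms:   count = %ld\n",
               refresh_count(10 * TOY_PAGE_SIZE, index));
        /* kms updated (e.g. osc_page_touch_at() has run): count is sane. */
        printf("updated kms: count = %ld\n",
               refresh_count(10 * TOY_PAGE_SIZE + 100, index));
        return 0;
}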

This panic can easily be reproduced if we delay the size update with the following patch:

diff --git a/lustre/osc/osc_io.c b/lustre/osc/osc_io.c
index 3d353324f1..5471b96ec1 100644
--- a/lustre/osc/osc_io.c
+++ b/lustre/osc/osc_io.c
@@ -267,7 +267,7 @@ int osc_io_commit_async(const struct lu_env *env,
 	struct osc_object *osc = cl2osc(ios->cis_obj);
 	struct cl_page  *page;
 	struct cl_page  *last_page;
-	struct osc_page *opg;
+	struct osc_page *opg = NULL;
 	int result = 0;
 	ENTRY;
 
@@ -311,9 +311,6 @@ int osc_io_commit_async(const struct lu_env *env,
 				break;
 		}
 
-		osc_page_touch_at(env, osc2cl(osc), osc_index(opg),
-				  page == last_page ? to : PAGE_SIZE);
-
 		cl_page_list_del(env, qin, page);
 
 		(*cb)(env, io, page);
@@ -321,6 +318,9 @@ int osc_io_commit_async(const struct lu_env *env,
 		 * complete at any time. */
 	}
 
+	osc_page_touch_at(env, osc2cl(osc), osc_index(opg),
+			  page == last_page ? to : PAGE_SIZE);
+
 	/* for sync write, kernel will wait for this page to be flushed before
 	 * osc_io_end() is called, so release it earlier.
 	 * for mkwrite(), it's known there is no further pages. */

The panic hits consistently with

  1. ONLY=42 REFORMAT=yes sh sanity.sh

typically in test 42e.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 19/Dec/17 ]

The corresponding osc_extent should be in ACTIVE state, so it shouldn't be picked up by the RPC engine.

Comment by Peter Jones [ 19/Dec/17 ]

Hongchao

Can you please look into this?

Peter

Comment by Gerrit Updater [ 12/Jan/18 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/30848
Subject: LU-10407 osc: update size before queue page
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13031e6ee6eccedfb0b108115599d3d69e4577c9

Comment by Hongchao Zhang [ 29/Jan/18 ]

Hi Alex,

Which patch did you have applied when the issue was triggered?

Delaying the size update will cause the issue, because during the loop in "osc_io_commit_async" some extent could become full
and be released, ready to be written out (its state is changed to OES_CACHE). This won't happen if the size is updated
right after "osc_page_cache_add", because the current active extent is checked to be non-full before the page is added to it
(its state stays OES_ACTIVE).
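
To restate the ordering argument with a runnable toy model (every name below is made up for illustration and is not the Lustre API): when the size is raised before each page is queued, an extent that becomes full and is released still has a kms covering its last page; when the update is deferred, the extent can be released while kms still lags behind, which is exactly the window in which osc_extent_make_ready() would refresh a zero count.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the two orderings; names and structure are illustrative only
 * and do not match the real osc data structures. */

#define EXT_MAX_PAGES   2               /* pretend an extent holds two pages */
#define TOY_PAGE_SIZE   4096L

struct toy_ext {
        int     pages;
        long    kms;                    /* "known size" of the object */
        bool    released;               /* full extent left ACTIVE ("OES_CACHE") */
        long    kms_at_release;         /* what the RPC engine could observe */
};

/* Queue one page ending at offset 'end'; a full extent is released at once,
 * which is when the RPC engine may pick it up. */
static void add_page(struct toy_ext *e, long end, bool update_size_first)
{
        if (update_size_first)
                e->kms = end;                   /* size raised before queueing */
        if (++e->pages == EXT_MAX_PAGES) {
                e->released = true;
                e->kms_at_release = e->kms;     /* snapshot at release time */
        }
        if (!update_size_first)
                e->kms = end;                   /* deferred: after the release */
}

int main(void)
{
        struct toy_ext deferred = { 0 }, first = { 0 };

        for (int i = 1; i <= EXT_MAX_PAGES; i++) {
                add_page(&deferred, i * TOY_PAGE_SIZE, false);
                add_page(&first, i * TOY_PAGE_SIZE, true);
        }
        /* With the deferred ordering the extent is released while kms still
         * lags one page behind, so a count refreshed from kms for the last
         * page would be 0 - the condition the LASSERT rejects. */
        printf("deferred update: released with kms=%ld, data up to %ld\n",
               deferred.kms_at_release, (long)EXT_MAX_PAGES * TOY_PAGE_SIZE);
        printf("update first:    released with kms=%ld, data up to %ld\n",
               first.kms_at_release, (long)EXT_MAX_PAGES * TOY_PAGE_SIZE);
        return 0;
}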

Comment by Oleg Drokin [ 28/Aug/18 ]

I just hit this in master-next running racer

[35616.314978] LustreError: 11162:0:(osc_cache.c:1141:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: 
[35616.322765] LustreError: 11162:0:(osc_cache.c:1141:osc_extent_make_ready()) LBUG
[35616.326801] Pid: 11162, comm: ldlm_bl_05 3.10.0-7.5-debug #1 SMP Sun Jun 3 13:35:38 EDT 2018
[35616.329285] Call Trace:
[35616.330443]  [<ffffffffa01eb7dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[35616.331870]  [<ffffffffa01eb88c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[35616.334288]  [<ffffffffa0d3eda6>] osc_extent_make_ready+0x936/0xe70 [osc]
[35616.335640]  [<ffffffffa0d45ab3>] osc_cache_writeback_range+0x4f3/0x1260 [osc]
[35616.337641]  [<ffffffffa0e0bac7>] mdc_lock_flush+0x2e7/0x3f0 [mdc]
[35616.339070]  [<ffffffffa0e0bfb4>] mdc_ldlm_blocking_ast+0x2f4/0x3f0 [mdc]
[35616.343073]  [<ffffffffa0b01bd4>] ldlm_cancel_callback+0x84/0x320 [ptlrpc]
[35616.344357]  [<ffffffffa0b189b0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
[35616.345719]  [<ffffffffa0b1e7e7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[35616.347470]  [<ffffffffa0e0be4a>] mdc_ldlm_blocking_ast+0x18a/0x3f0 [mdc]
[35616.348765]  [<ffffffffa0b2a67f>] ldlm_handle_bl_callback+0xff/0x530 [ptlrpc]
[35616.351095]  [<ffffffffa0b2afb1>] ldlm_bl_thread_main+0x501/0x680 [ptlrpc]
[35616.352495]  [<ffffffff810ae864>] kthread+0xe4/0xf0
[35616.353757]  [<ffffffff81783777>] ret_from_fork_nospec_end+0x0/0x39

Comment by James A Simmons [ 10/Feb/20 ]

One of the patches that landed during 2.13 fixed this issue. Should we close this?

Comment by Alex Zhuravlev [ 11/Jan/21 ]

-https://testing.whamcloud.com/test_sessions/0f2a364a-716a-4eec-9786-3fa81b3143c2-

Moved to LU-14326, closing this one for 2.13 per previous comment.
