[LU-11463] DOM: osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed Created: 03/Oct/18  Updated: 23/Jan/21

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Upstream

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10407 osc_cache.c:1141:osc_extent_make_read... Resolved
is related to LU-12511 Prepare lustre for adoption into the ... Open
is related to LU-14326 sanity-dom test_fsx: crash in osc_ext... Resolved
is related to LU-13992 osc_extent_make_ready()) ASSERTION( ... Resolved
Severity: 3

 Description   

This ticket separates the DOM-related issue from the original one in LU-10407, because the two may have different causes.



 Comments   
Comment by Mikhail Pershin [ 03/Oct/18 ]

Copy of message from Oleg about this issue:

[35616.314978] LustreError: 11162:0:(osc_cache.c:1141:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: 
[35616.322765] LustreError: 11162:0:(osc_cache.c:1141:osc_extent_make_ready()) LBUG
[35616.326801] Pid: 11162, comm: ldlm_bl_05 3.10.0-7.5-debug #1 SMP Sun Jun 3 13:35:38 EDT 2018
[35616.329285] Call Trace:
[35616.330443]  [<ffffffffa01eb7dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[35616.331870]  [<ffffffffa01eb88c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[35616.334288]  [<ffffffffa0d3eda6>] osc_extent_make_ready+0x936/0xe70 [osc]
[35616.335640]  [<ffffffffa0d45ab3>] osc_cache_writeback_range+0x4f3/0x1260 [osc]
[35616.337641]  [<ffffffffa0e0bac7>] mdc_lock_flush+0x2e7/0x3f0 [mdc]
[35616.339070]  [<ffffffffa0e0bfb4>] mdc_ldlm_blocking_ast+0x2f4/0x3f0 [mdc]
[35616.343073]  [<ffffffffa0b01bd4>] ldlm_cancel_callback+0x84/0x320 [ptlrpc]
[35616.344357]  [<ffffffffa0b189b0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
[35616.345719]  [<ffffffffa0b1e7e7>] ldlm_cli_cancel+0x157/0x620 [ptlrpc]
[35616.347470]  [<ffffffffa0e0be4a>] mdc_ldlm_blocking_ast+0x18a/0x3f0 [mdc]
[35616.348765]  [<ffffffffa0b2a67f>] ldlm_handle_bl_callback+0xff/0x530 [ptlrpc]
[35616.351095]  [<ffffffffa0b2afb1>] ldlm_bl_thread_main+0x501/0x680 [ptlrpc]
[35616.352495]  [<ffffffff810ae864>] kthread+0xe4/0xf0
[35616.353757]  [<ffffffff81783777>] ret_from_fork_nospec_end+0x0/0x39
Comment by James A Simmons [ 07/Jan/20 ]

I can easily reproduce this with the Linux Lustre client.

Comment by Mikhail Pershin [ 08/Jan/20 ]

James, this issue hasn't been seen for a long time; it looks like the client is missing some related fixes. Where can I check its commit history?

Comment by James A Simmons [ 08/Jan/20 ]

Currently I have a local kernel tree that is up to Lustre 2.12.0. Do you know which patches fixed this issue?

Comment by James A Simmons [ 10/Feb/20 ]

One of the patches that landed during 2.13 fixed this issue. We can reopen if it shows up again.

Comment by James A Simmons [ 26/Feb/20 ]

Sorry, but adding direct I/O support to the Linux client exposes this problem again. I can easily reproduce it with sanity.sh test

Comment by Mikhail Pershin [ 26/Feb/20 ]

James, could you describe how to reproduce that? Does it happen in a particular test, or only occasionally? Are there any specific testing parameters?

Comment by Arshad Hussain [ 12/Apr/20 ]

James, if possible, could you please point me to the patch that fixed it initially? I also hit this while locally testing my own patch (https://review.whamcloud.com/#/c/9275/); test 150b triggered it. I would like some hints about what fixed it and how. This is on the latest master: 742897a LU-13274 uapi: make lnet UAPI headers C99 compliant

[ 6559.142453] LustreError: 18592:0:(osc_cache.c:1124:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed:
[ 6559.142454] LustreError: 18592:0:(osc_cache.c:1124:osc_extent_make_ready()) LBUG
[ 6559.142456] Pid: 18592, comm: ptlrpcd_00_01 3.10.0-957.el7_lustre.x86_64 #1 SMP Fri Dec 21 21:49:33 UTC 2018
[ 6559.142456] Call Trace:
[ 6559.142477] [<ffffffffc0475e4c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 6559.142481] [<ffffffffc0475efc>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 6559.142490] [<ffffffffc0b6334c>] osc_extent_make_ready+0xb4c/0xe40 [osc]
[ 6559.142496] [<ffffffffc0b662eb>] osc_io_unplug0+0xe9b/0x18b0 [osc]
[ 6559.142501] [<ffffffffc0b3f773>] brw_queue_work+0x33/0xd0 [osc]
[ 6559.142537] [<ffffffffc08f27fa>] work_interpreter+0x3a/0xf0 [ptlrpc]
[ 6559.142561] [<ffffffffc08fb220>] ptlrpc_check_set+0x510/0x1ed0 [ptlrpc]
[ 6559.142587] [<ffffffffc0929e0b>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
[ 6559.142611] [<ffffffffc092a0a0>] ptlrpcd+0x1b0/0x6a0 [ptlrpc]
[ 6559.142614] [<ffffffff876c1c31>] kthread+0xd1/0xe0
[ 6559.142617] [<ffffffff87d74c37>] ret_from_fork_nospec_end+0x0/0x39
[ 6559.142631] [<ffffffffffffffff>] 0xffffffffffffffff
[ 6559.142631] Kernel panic - not syncing: LBUG
[ 6559.142634] CPU: 1 PID: 18592 Comm: ptlrpcd_00_01 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7_lustre.x86_64 #1
[ 6559.142635] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 6559.142635] Call Trace:
[ 6559.142639] [<ffffffff87d61dc1>] dump_stack+0x19/0x1b
[ 6559.142641] [<ffffffff87d5b4d0>] panic+0xe8/0x21f
[ 6559.142646] [<ffffffffc0475f4b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 6559.142653] [<ffffffffc0b6334c>] osc_extent_make_ready+0xb4c/0xe40 [osc]
[ 6559.142672] [<ffffffffc0b662eb>] osc_io_unplug0+0xe9b/0x18b0 [osc]
[ 6559.142677] [<ffffffffc0b3f773>] brw_queue_work+0x33/0xd0 [osc]
[ 6559.142699] [<ffffffffc08f27fa>] work_interpreter+0x3a/0xf0 [ptlrpc]
[ 6559.142721] [<ffffffffc08fb220>] ptlrpc_check_set+0x510/0x1ed0 [ptlrpc]
[ 6559.142746] [<ffffffffc0929e0b>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
[ 6559.142769] [<ffffffffc092a0a0>] ptlrpcd+0x1b0/0x6a0 [ptlrpc]
[ 6559.142772] [<ffffffff8762a59e>] ? __switch_to+0xce/0x580
[ 6559.142773] [<ffffffff876c2d00>] ? wake_up_atomic_t+0x30/0x30
[ 6559.142796] [<ffffffffc0929ef0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[ 6559.142798] [<ffffffff876c1c31>] kthread+0xd1/0xe0
[ 6559.142799] [<ffffffff876c1b60>] ? insert_kthread_work+0x40/0x40
[ 6559.142801] [<ffffffff87d74c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 6559.142802] [<ffffffff876c1b60>] ? insert_kthread_work+0x40/0x40
Comment by Alex Zhuravlev [ 11/Jan/21 ]

https://testing.whamcloud.com/test_sessions/0f2a364a-716a-4eec-9786-3fa81b3143c2

Comment by James A Simmons [ 11/Jan/21 ]

I can easily reproduce it with the Linux client (https://github.com/jasimmons1973/lustre). If you run sanity test 398b it crashes every time. Also sanity-flr.sh test 200.

Comment by Mikhail Pershin [ 11/Jan/21 ]

Please note that Alex's report and James's comment are not about DoM, which is what this ticket covers. I assume LU-10407 is the proper place to report non-DoM issues of this kind.

Comment by James A Simmons [ 23/Jan/21 ]

I'm still seeing failures with the Linux client. It might be specific to the Linux version, so please keep this ticket open.
