[LU-13614] replay-single test_117: LBUG: ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed Created: 30/May/20  Updated: 20/Aug/23  Resolved: 12/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Vladimir Saveliev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-11762 replay-single test 0d fails with 'po... Resolved
is related to LU-13339 patch for LU-11762 causes an assertio... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/7ecf29f1-b3ea-45ad-8eaa-e759af2b2c8c

test_117 failed with the following error:

trevis-35vm9 crashed during replay-single test_117

Lustre: 19264:0:(ldlm_lib.c:1893:extend_recovery_timer()) Skipped 1 previous similar message
LustreError: 19264:0:(ldlm_lib.c:2601:replay_request_or_update()) ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed:
LustreError: 19264:0:(ldlm_lib.c:2601:replay_request_or_update()) LBUG
Pid: 19264, comm: tgt_recover_0 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:19:17 UTC 2020
Call Trace:
 [<ffffffffc09901ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
 [<ffffffffc099025c>] lbug_with_loc+0x4c/0xa0 [libcfs]
 [<ffffffffc128bd47>] replay_request_or_update.isra.24+0x867/0x8d0 [ptlrpc]
 [<ffffffffc128c4e5>] target_recovery_thread+0x735/0x11a0 [ptlrpc]
 [<ffffffffbb0c6321>] kthread+0xd1/0xe0
 [<ffffffffbb78ed37>] ret_from_fork_nospec_end+0x0/0x39
 [<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 0 PID: 19264 Comm: tgt_recover_0 Kdump: loaded Tainted: P           OE  ------------   3.10.0-1062.18.1.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
 [<ffffffffbb77b416>] dump_stack+0x19/0x1b
 [<ffffffffbb774a0b>] panic+0xe8/0x21f
 [<ffffffffc09902ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
 [<ffffffffc128bd47>] replay_request_or_update.isra.24+0x867/0x8d0 [ptlrpc]
 [<ffffffffc128c4e5>] target_recovery_thread+0x735/0x11a0 [ptlrpc]
 [<ffffffffc128bdb0>] ? replay_request_or_update.isra.24+0x8d0/0x8d0 [ptlrpc]
 [<ffffffffbb0c6321>] kthread+0xd1/0xe0
 [<ffffffffbb0c6250>] ? insert_kthread_work+0x40/0x40
 [<ffffffffbb78ed37>] ret_from_fork_nospec_begin+0x21/0x21
 [<ffffffffbb0c6250>] ? insert_kthread_work+0x40/0x40

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-single test_117 - trevis-35vm9 crashed during replay-single test_117



 Comments   
Comment by Gerrit Updater [ 29/Jul/20 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/39532
Subject: LU-13614 ldlm: revert LU-11762
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ecd083651da5648936acb73170756a8bd3cd57cf

Comment by Vladimir Saveliev [ 29/Jul/20 ]
LustreError: 19264:0:(ldlm_lib.c:2601:replay_request_or_update()) ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed:

https://review.whamcloud.com/#/c/35627/ could be responsible for this assertion.

That patch tries to make sure that target_recovery_overseer() does not call

    wait_event_timeout(check_for_next_transno)

with an inactive recovery timer.

But that situation seems to be impossible. If the recovery timeout equals the hard recovery timeout and the timer is inactive, then on "goto repeat" target_recovery_overseer() reaches:

        if (obd->obd_recovery_start != 0 && ktime_get_seconds() >=
              (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
...

where the abort recovery flag will be set.
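
To make the argument above concrete, here is a minimal userspace sketch of the hard-timeout check. It is not the Lustre code: struct recovery_model and check_hard_timeout() are hypothetical names introduced only for illustration, the current time is passed in as a plain integer instead of calling ktime_get_seconds(), and the abort flag is modeled as a simple boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the obd fields referenced in the
 * quoted condition; this is not the real struct obd_device. */
struct recovery_model {
	int64_t recovery_start;     /* seconds; 0 means recovery has not started */
	int64_t recovery_time_hard; /* hard limit on recovery duration, seconds */
	bool    abort_recovery;     /* models the abort recovery flag */
};

/* Mirrors the quoted condition: once "now" reaches start + hard limit,
 * the abort flag is set. Returns true if the hard limit was reached. */
static bool check_hard_timeout(struct recovery_model *m, int64_t now)
{
	if (m->recovery_start != 0 &&
	    now >= m->recovery_start + m->recovery_time_hard) {
		m->abort_recovery = true;
		return true;
	}
	return false;
}
```

Under these assumptions, with recovery_start = 100 and recovery_time_hard = 50 the check does nothing at now = 149 and sets the abort flag at now = 150, which is the path the reasoning above relies on.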

So I propose reverting the patch "LU-11762 ldlm: ensure the recovery timer is armed".

If I am missing something, please point out where I am wrong.

Comment by Gerrit Updater [ 12/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39532/
Subject: LU-13614 ldlm: revert LU-11762
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2d24238a80be9ca924369d142148d4f6f1891102

Comment by Peter Jones [ 12/Oct/20 ]

Landed for 2.14

Generated at Sat Feb 10 03:02:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.