[LU-13614] replay-single test_117: LBUG: ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed Created: 30/May/20 Updated: 20/Aug/23 Resolved: 12/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Vladimir Saveliev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/7ecf29f1-b3ea-45ad-8eaa-e759af2b2c8c

test_117 failed with the following error: trevis-35vm9 crashed during replay-single test_117

Lustre: 19264:0:(ldlm_lib.c:1893:extend_recovery_timer()) Skipped 1 previous similar message
LustreError: 19264:0:(ldlm_lib.c:2601:replay_request_or_update()) ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed:
LustreError: 19264:0:(ldlm_lib.c:2601:replay_request_or_update()) LBUG
Pid: 19264, comm: tgt_recover_0 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:19:17 UTC 2020
Call Trace:
[<ffffffffc09901ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[<ffffffffc099025c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffc128bd47>] replay_request_or_update.isra.24+0x867/0x8d0 [ptlrpc]
[<ffffffffc128c4e5>] target_recovery_thread+0x735/0x11a0 [ptlrpc]
[<ffffffffbb0c6321>] kthread+0xd1/0xe0
[<ffffffffbb78ed37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 0 PID: 19264 Comm: tgt_recover_0 Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.18.1.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
[<ffffffffbb77b416>] dump_stack+0x19/0x1b
[<ffffffffbb774a0b>] panic+0xe8/0x21f
[<ffffffffc09902ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
[<ffffffffc128bd47>] replay_request_or_update.isra.24+0x867/0x8d0 [ptlrpc]
[<ffffffffc128c4e5>] target_recovery_thread+0x735/0x11a0 [ptlrpc]
[<ffffffffc128bdb0>] ? replay_request_or_update.isra.24+0x8d0/0x8d0 [ptlrpc]
[<ffffffffbb0c6321>] kthread+0xd1/0xe0
[<ffffffffbb0c6250>] ? insert_kthread_work+0x40/0x40
[<ffffffffbb78ed37>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffffbb0c6250>] ? insert_kthread_work+0x40/0x40 |
| Comments |
| Comment by Gerrit Updater [ 29/Jul/20 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/39532 |
| Comment by Vladimir Saveliev [ 29/Jul/20 ] |
https://review.whamcloud.com/#/c/35627/ could be responsible for this assertion. That patch changed target_recovery_overseer() so that it does not call
wait_event_timeout(check_for_next_transno)
with an inactive recovery timer. But that situation seems to be impossible: if the recovery timeout equals the hard recovery timeout and the timer is inactive, then via "goto repeat" target_recovery_overseer() reaches:
if (obd->obd_recovery_start != 0 && ktime_get_seconds() >=
    (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
        ...
where the abort-recovery flag will be set. So, I propose to revert that change. If I miss something, please point out where I am wrong.
|
| Comment by Gerrit Updater [ 12/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39532/ |
| Comment by Peter Jones [ 12/Oct/20 ] |
|
Landed for 2.14 |