[LU-16159] remove update llog files after recovery abort - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.16.0
Affects Version/s: None
Labels:
- llog
- recovery

Severity: 2

Description

Once recovery is aborted, the existing update logs should be removed, because they are used for recovery only. If a log is corrupt or inaccessible and is kept after the recovery abort, the next recovery will hit the same issue again. In addition, the log files may grow large, and retrieving them during recovery may lead to a recovery timeout.

In addition, LOD device initialization should run a sanity check on all of its update logs. If a log file is not accessible (e.g. because of an OI mapping mismatch, which may also lead to a recovery timeout), delete the log file's FID from the update catalog so that the log file is no longer visible to others. The file itself is not deleted at this point because it is inaccessible.
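The two behaviours described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the Lustre implementation: `UpdateLog`, `UpdateCatalog`, `sanity_check_on_init` and `cancel_on_recovery_abort` are hypothetical names, the `accessible` flag stands in for whatever makes a log unreadable (for example an OI mapping mismatch), and the FIDs are sample values.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateLog:
    """Hypothetical stand-in for one update llog file."""
    fid: str
    accessible: bool = True  # False models e.g. an OI mapping mismatch


@dataclass
class UpdateCatalog:
    """Hypothetical stand-in for the update catalog."""
    entries: dict = field(default_factory=dict)  # FIDs visible via the catalog
    files: dict = field(default_factory=dict)    # log files that still exist

    def add(self, log):
        self.entries[log.fid] = log
        self.files[log.fid] = log


def sanity_check_on_init(catalog):
    """Drop catalog entries whose log cannot be accessed.

    Only the FID is removed from the catalog; the file itself is left
    alone because it is inaccessible.
    """
    dropped = [fid for fid, log in catalog.entries.items() if not log.accessible]
    for fid in dropped:
        del catalog.entries[fid]
    return dropped


def cancel_on_recovery_abort(catalog):
    """Remove every log still listed in the catalog after a recovery abort."""
    removed = list(catalog.entries)
    for fid in removed:
        del catalog.entries[fid]
        del catalog.files[fid]
    return removed


# Illustrative run with sample FIDs.
catalog = UpdateCatalog()
catalog.add(UpdateLog("[0x2000320e0:0x1:0x0]"))
catalog.add(UpdateLog("[0x24000c369:0x1:0x0]", accessible=False))
dropped = sanity_check_on_init(catalog)
removed = cancel_on_recovery_abort(catalog)
```

In this model the inaccessible log disappears from the catalog but its file stays behind, while the accessible log is removed entirely once recovery is aborted.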

Issue Links

is related to

LU-16398 ost-pools: FAIL: remove sub-test dirs failed

Resolved

LU-7011 Kernel part of llog subsystem can do self-repairing in some cases

Resolved

LU-17365 steady LOD update llog connection

Resolved

is related to

LU-16336 LFSCK should fix inconsistencies caused by recovery abort

Open

LU-15934 client refused mount with -EAGAIN because of missing MDT-MDT llog connection

Resolved

LU-16335 "lfs rm_entry" failed to remove broken directories

Resolved

Activity

Andreas Dilger added a comment - 20/May/23 12:35 AM

Is there anything left for this ticket, or can it be resolved?

Gerrit Updater added a comment - 03/Feb/23 6:51 AM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49787/
Subject: ~~LU-16159~~ osp: destroy should not overtake writes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5a5bd5b4dafaf252c641b8afd2cd809de7384f4f

Xing Huang added a comment - 28/Jan/23 8:12 AM

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49787
Subject: ~~LU-16159~~ osp: destroy should not overtake writes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0e0993837d0e13bd31d53daea04e7259be6c1c4c

Gerrit Updater added a comment - 14/Jan/23 12:40 AM

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49633
Subject: ~~LU-16159~~ tests: cleanup replay-single code style
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cb4e810ddfb11aab30ea6ad6b40ff371c04ddac1

Andreas Dilger added a comment - 13/Jan/23 2:03 AM

Reopening this so that it is being tracked for 2.16 due to latest patch.

Gerrit Updater added a comment - 29/Dec/22 1:38 AM

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49526
Subject: ~~LU-16159~~ target: race in update log cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 983a53137bfd26e6baaba4575a9dbb379e358b76

Andreas Dilger added a comment - 16/Dec/22 1:34 AM

But won't that defeat the whole purpose of the fix? Could something else be done, like forcing rollover to new logs and then cancelling the old logs, so that in-use logs are not removed?

Lai Siyao added a comment - 16/Dec/22 12:59 AM

Yes, abort_recov_mdt can be run at any time, but update log cancel will only be done before request/update replay, which means that if replay has already started, abort_recov_mdt won't cancel update logs.
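The ordering rule in this comment can be illustrated with a small state model. It is a sketch under assumptions, not the actual recovery code: the `Stage` values, the `Recovery` class and its methods are hypothetical, and the log names are placeholders.

```python
from enum import Enum, auto


class Stage(Enum):
    BEFORE_REPLAY = auto()  # no request/update replay has started yet
    REPLAY = auto()         # request/update replay is in progress


class Recovery:
    """Hypothetical model of one target's recovery state."""

    def __init__(self, update_logs):
        self.stage = Stage.BEFORE_REPLAY
        self.update_logs = list(update_logs)
        self.aborted = False

    def start_replay(self):
        self.stage = Stage.REPLAY

    def abort(self):
        """Abort recovery and return the update logs that were cancelled."""
        self.aborted = True
        if self.stage is not Stage.BEFORE_REPLAY:
            # Replay may already be writing to update logs, so keep them.
            return []
        cancelled, self.update_logs = self.update_logs, []
        return cancelled


# Abort before replay: the logs are cancelled.
early = Recovery(["log-a", "log-b"])
early_cancelled = early.abort()

# Abort after replay has started: the logs are left in place.
late = Recovery(["log-a", "log-b"])
late.start_replay()
late_cancelled = late.abort()
```

Both calls mark recovery as aborted; only the timing relative to replay decides whether the update logs are cancelled.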

Andreas Dilger added a comment - 15/Dec/22 9:49 PM

Lai, it isn't clear if we can control when abort_recov_mdt is run (before or after recovery is done), because an administrator may run this command tens of minutes after the MDS has restarted and recovery is hung, so it is likely that update log recovery is already finished (as much as possible) at this point.

I looked through the Maloo results, and did see two crashes that look like ~~LU-15139~~ "dt_record_write() ASSERTION( dt->do_body_ops->dbo_write )", per Alex's comment above. Also, patch https://review.whamcloud.com/49335 "LU-16335 test: add fail_abort_cleanup()" v1 crashed 4 times with this LASSERT and I couldn't see any functional difference between the v1 of the patch and the current v3, nor in the parent patch https://review.whamcloud.com/49329 "LU-16335 mdt: skip target check for rm_entry".

Other replay-single crashes in the past few days can be attributed to the other patches themselves.

Lai Siyao added a comment - 15/Dec/22 2:52 PM

It looks like we shouldn't cancel update logs if some update logs have been replayed, because replay may involve update log writes, while update log cancel may destroy empty logs. I will push a patch to move update log cancel before request/update replay.

Alex Zhuravlev added a comment - 15/Dec/22 1:30 PM

Also, I noticed a number of assertions I haven't seen for quite a while:

[ 6613.485294] Lustre: DEBUG MARKER: == replay-single test 119: timeout of normal replay does not cause DNE replay fails ========================================================== 12:56:09 (1671108969)
...
[ 6692.155879] Lustre: 300726:0:(ldlm_lib.c:2305:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[ 6692.156118] Lustre: 300726:0:(ldlm_lib.c:2315:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 6692.156264] Lustre: 300726:0:(ldlm_lib.c:2315:target_recovery_overseer()) Skipped 2 previous similar messages
[ 6692.180534] Lustre: lustre-MDT0000-osd: cancel update llog [0x2000320e0:0x1:0x0]
[ 6692.242056] Lustre: lustre-MDT0001-osp-MDT0000: cancel update llog [0x24000c369:0x1:0x0]
[ 6692.276833] Lustre: 300726:0:(ldlm_lib.c:2859:target_recovery_thread()) too long recovery - read logs
[ 6692.282517] LustreError: 7048:0:(dt_object.h:2296:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: [0x24001212a:0x2:0x0] doesn't exit
[ 6692.282752] LustreError: 7048:0:(dt_object.h:2296:dt_declare_record_write()) LBUG
[ 6692.282914] Pid: 7048, comm: ll_ost_out00_00 4.18.0 #2 SMP Sun Oct 23 17:58:04 UTC 2022
[ 6692.283111] Call Trace TBD:
[ 6692.283190] [<0>] libcfs_call_trace+0x67/0x90 [libcfs]
[ 6692.283312] [<0>] lbug_with_loc+0x3e/0x80 [libcfs]
[ 6692.283488] [<0>] out_write_add_exec+0x175/0x1e0 [ptlrpc]
[ 6692.283652] [<0>] out_write+0x161/0x380 [ptlrpc]
[ 6692.283810] [<0>] out_handle+0x16c0/0x23b0 [ptlrpc]
[ 6692.283970] [<0>] tgt_request_handle+0x977/0x1a40 [ptlrpc]
[ 6692.284121] [<0>] ptlrpc_main+0x1724/0x32c0 [ptlrpc]
[ 6692.284243] [<0>] kthread+0x129/0x140
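To pull the cancelled update llogs out of a console log like the one above, a short script can help. This is a minimal sketch that assumes the message format shown in this log ("cancel update llog" followed by a bracketed FID of the form sequence:object ID:version); the names `CANCEL_RE` and `cancelled_llogs` are illustrative and not part of any Lustre tool.

```python
import re

# Matches lines such as:
#   Lustre: lustre-MDT0000-osd: cancel update llog [0x2000320e0:0x1:0x0]
CANCEL_RE = re.compile(
    r"Lustre: (?P<device>\S+): cancel update llog "
    r"\[(?P<seq>0x[0-9a-f]+):(?P<oid>0x[0-9a-f]+):(?P<ver>0x[0-9a-f]+)\]"
)


def cancelled_llogs(console_text):
    """Return (device, seq, oid, ver) for each cancelled update llog."""
    found = []
    for line in console_text.splitlines():
        match = CANCEL_RE.search(line)
        if match:
            found.append((
                match["device"],
                int(match["seq"], 16),
                int(match["oid"], 16),
                int(match["ver"], 16),
            ))
    return found


sample = """\
[ 6692.180534] Lustre: lustre-MDT0000-osd: cancel update llog [0x2000320e0:0x1:0x0]
[ 6692.242056] Lustre: lustre-MDT0001-osp-MDT0000: cancel update llog [0x24000c369:0x1:0x0]
[ 6692.276833] Lustre: 300726:0:(ldlm_lib.c:2859:target_recovery_thread()) too long recovery - read logs
"""
result = cancelled_llogs(sample)
```

Comparing the cancelled FIDs with the FID in a later assertion message is one way to check whether a write targeted a log that had already been removed.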

People

Assignee: Lai Siyao

Reporter: Lai Siyao

Votes: 0

Watchers: 13

Dates

Created: 14/Sep/22 2:26 PM

Updated: 18/Mar/25 7:19 AM

Resolved: 26/Jun/24 12:13 PM