[LU-16159] remove update logs after recovery abort Created: 14/Sep/22  Updated: 08/Jan/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Lai Siyao Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16336 LFSCK should fix inconsistencies caus... Open
is related to LU-15934 client refused mount with -EAGAIN bec... Resolved
is related to LU-16335 "lfs rm_entry" failed to remove broke... Resolved
is related to LU-16398 ost-pools: FAIL: remove sub-test dirs... Resolved
is related to LU-17365 steady LOD update llog connection Resolved
Severity: 2

 Description   

Once recovery is aborted, the existing update logs should be removed, because they are used for recovery only. If a log is corrupt or inaccessible and is kept after the abort, the next recovery will hit the same issue again. In addition, the log files may grow large, and retrieving them during recovery may lead to a recovery timeout.
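
For illustration only, the intended behaviour on abort can be modelled in a few lines of Python. This is a toy sketch, not Lustre code: `Target`, `update_logs` and `abort_recovery` are hypothetical names, and the FIDs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """Hypothetical MDT holding update logs that exist only for recovery."""
    name: str
    update_logs: dict = field(default_factory=dict)  # FID -> list of records

    def abort_recovery(self):
        """Drop every update log and return the FIDs that were cancelled."""
        # After an abort the logs will never be replayed, so keeping them only
        # lets the next restart trip over the same (possibly corrupt) records.
        cancelled = sorted(self.update_logs)
        self.update_logs.clear()
        return cancelled


mdt = Target("lustre-MDT0000")
mdt.update_logs["log-fid-1"] = ["rec1", "rec2"]
mdt.update_logs["log-fid-2"] = ["rec3"]
cancelled = mdt.abort_recovery()
```

In this model the next restart starts with an empty set of update logs, so a bad record cannot be retried and the logs cannot keep growing across restarts.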

In addition, LOD device initialization should sanity-check all update logs on the device. If a log file is not accessible (e.g. because of an OI mapping mismatch, which may also lead to a recovery timeout), its FID should be deleted from the update catalog so that the log file is no longer visible to others. The file itself is not deleted at this point because it is inaccessible.
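
A rough Python sketch of the sanity check described above, under the assumption that the catalog can be treated as a plain list of FIDs. `sanity_check_catalog` and `can_open` are hypothetical names, not Lustre functions.

```python
def sanity_check_catalog(catalog, can_open):
    """Drop inaccessible log FIDs from `catalog` (mutated in place).

    catalog  -- list of log FIDs referenced by the update catalog (hypothetical)
    can_open -- callable returning True if the log behind a FID is accessible
    Returns (kept, dropped) lists of FIDs.
    """
    kept, dropped = [], []
    for fid in list(catalog):  # iterate over a copy while mutating the original
        if can_open(fid):
            kept.append(fid)
        else:
            # Only the catalog reference is removed; the file itself is left
            # alone because it cannot be accessed.
            catalog.remove(fid)
            dropped.append(fid)
    return kept, dropped


catalog = ["fid-a", "fid-b", "fid-c"]
broken = {"fid-b"}  # stands in for e.g. an OI mapping mismatch
kept, dropped = sanity_check_catalog(catalog, lambda fid: fid not in broken)
```

After the call, `catalog` no longer references `fid-b`, so a later recovery in this model would never try to read it.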



 Comments   
Comment by Andreas Dilger [ 15/Sep/22 ]

I think that this is important and will resolve quite a number of problems that we've seen with DNE recovery. Currently, if there is any problem with the recovery llog, it will be kept and retried on every restart, and eventually grows to be very large and causes MDT recovery to be slow and/or fail repeatedly.

Comment by Andreas Dilger [ 20/Sep/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48584
Subject: LU-16159 lod: remove update llogs upon abort_recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 61697dd867a8a8933f3843fe5977de72ec3be91d

Comment by Gerrit Updater [ 05/Nov/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49052
Subject: LU-16159 test: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: de62d8bb542bfb417422e42671967531c2370dc1

Comment by Gerrit Updater [ 07/Nov/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49059
Subject: LU-16159 lod: debug patch 2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 109c9b8487b800004a4e849aca1a9ad1a3eab6df

Comment by Gerrit Updater [ 13/Dec/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48584/
Subject: LU-16159 lod: cancel update llogs upon recovery abort
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b054fcd7852f6a22f8ec469ce94ddf6f3331ab34

Comment by Peter Jones [ 13/Dec/22 ]

Landed for 2.16

Comment by Alex Zhuravlev [ 14/Dec/22 ]

MDSCOUNT=2 ONLY=100 MDSSIZE=400000 OSTSIZE=1000000 OSTCOUNT=2 REFORMAT=yes REFORMAT=yes bash replay-single.sh

== replay-single test complete, duration 135 sec ========= 13:47:49 (1671025669)
rm: cannot remove '/mnt/lustre/d100c.replay-single': Directory not empty
 replay-single test_135: @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = ./../tests/test-framework.sh:6526:error()
  = ./../tests/test-framework.sh:6010:check_and_cleanup_lustre()
  = replay-single.sh:5043:main()

Checked a few commits around it:

COMMIT          TESTED  PASSED  FAILED          COMMIT DESCRIPTION
624e78ae80      1       0       1       BAD     LU-930 docs: add lfs-rm_entry.8 man page
d1dbf26afd      1       0       1       BAD     LU-16291 build: make kobj_type constant
6f74bb60ff      1       0       1       BAD     LU-16205 sec: reserve flag for fid2path for encrypted files
b054fcd785      1       0       1       BAD     LU-16159 lod: cancel update llogs upon recovery abort
1819f6006f      5       5       0       GOOD    LU-15801 ldiskfs: Server support for RHEL9
88bccc4fa4      5       5       0       GOOD    LU-16114 build: Update security_dentry_init_security args

Comment by Lai Siyao [ 14/Dec/22 ]

In replay-single.sh formatall is called before check_and_cleanup_lustre:

5040 (( $MDS1_VERSION >= $(version_code 2.15.52.63) )) && formatall
5041 
5042 complete $SECONDS
5043 check_and_cleanup_lustre
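
The version gate in the quoted line 5040 can be sketched in Python as follows. This is illustrative only: it assumes that `version_code` packs up to four numeric components into one integer (major, minor, patch, fix, eight bits apart), and `version_code_sketch` and `should_formatall` are hypothetical names.

```python
import re


def version_code_sketch(version):
    """Pack up to four numeric version components into one integer.

    Assumption: components combine as major<<24 | minor<<16 | patch<<8 | fix.
    """
    parts = [int(p) for p in re.findall(r"\d+", version)][:4]
    parts += [0] * (4 - len(parts))  # missing components count as zero
    major, minor, patch, fix = parts
    return (major << 24) | (minor << 16) | (patch << 8) | fix


def should_formatall(mds_version):
    """Mirror the quoted condition: reformat only at or past 2.15.52.63."""
    return version_code_sketch(mds_version) >= version_code_sketch("2.15.52.63")
```

Under that assumption, a server older than 2.15.52.63 skips `formatall`, so the leftover directory is still present when `check_and_cleanup_lustre` runs.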

Comment by Andreas Dilger [ 14/Dec/22 ]

I pushed a patch under LU-16398 to disable this subtest until it is fixed. I think the right fix is to use rm_entry in a stack_trap at the end of test_100c once rm_entry is fixed, and, more properly, for LFSCK to make the directory "usable" again so that it can be removed with rmdir.

Comment by Andreas Dilger [ 14/Dec/22 ]

Note that patch https://review.whamcloud.com/49335 "LU-16335 test: add fail_abort_cleanup()" is already adding the stack_trap cleanup of the bad directory, so the only thing still needed after that is LFSCK to fix it properly (LU-16336).

Comment by Alex Zhuravlev [ 15/Dec/22 ]

Also, I noticed a number of assertions I haven't seen for quite a while:

[ 6613.485294] Lustre: DEBUG MARKER: == replay-single test 119: timeout of normal replay does not cause DNE replay fails ========================================================== 12:56:09 (1671108969)
...
[ 6692.155879] Lustre: 300726:0:(ldlm_lib.c:2305:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[ 6692.156118] Lustre: 300726:0:(ldlm_lib.c:2315:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 6692.156264] Lustre: 300726:0:(ldlm_lib.c:2315:target_recovery_overseer()) Skipped 2 previous similar messages
[ 6692.180534] Lustre: lustre-MDT0000-osd: cancel update llog [0x2000320e0:0x1:0x0]
[ 6692.242056] Lustre: lustre-MDT0001-osp-MDT0000: cancel update llog [0x24000c369:0x1:0x0]
[ 6692.276833] Lustre: 300726:0:(ldlm_lib.c:2859:target_recovery_thread()) too long recovery - read logs
[ 6692.282517] LustreError: 7048:0:(dt_object.h:2296:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: [0x24001212a:0x2:0x0] doesn't exit
[ 6692.282752] LustreError: 7048:0:(dt_object.h:2296:dt_declare_record_write()) LBUG
[ 6692.282914] Pid: 7048, comm: ll_ost_out00_00 4.18.0 #2 SMP Sun Oct 23 17:58:04 UTC 2022
[ 6692.283111] Call Trace TBD:
[ 6692.283190] [<0>] libcfs_call_trace+0x67/0x90 [libcfs]
[ 6692.283312] [<0>] lbug_with_loc+0x3e/0x80 [libcfs]
[ 6692.283488] [<0>] out_write_add_exec+0x175/0x1e0 [ptlrpc]
[ 6692.283652] [<0>] out_write+0x161/0x380 [ptlrpc]
[ 6692.283810] [<0>] out_handle+0x16c0/0x23b0 [ptlrpc]
[ 6692.283970] [<0>] tgt_request_handle+0x977/0x1a40 [ptlrpc]
[ 6692.284121] [<0>] ptlrpc_main+0x1724/0x32c0 [ptlrpc]
[ 6692.284243] [<0>] kthread+0x129/0x140

Comment by Lai Siyao [ 15/Dec/22 ]

It looks like we shouldn't cancel update logs if some update logs have already been replayed, because replay may involve update log writes, while update log cancel may destroy empty logs. I will push a patch to move update log cancel before request/update replay.

Comment by Andreas Dilger [ 15/Dec/22 ]

Lai, it isn't clear that we can control when abort_recov_mdt is run (before or after recovery is done), because an administrator may run this command tens of minutes after the MDS has restarted and recovery has hung, so it is likely that update log recovery has already finished (as much as possible) by that point.

I looked through the Maloo results, and did see two crashes that look like LU-15139 "dt_record_write() ASSERTION( dt->do_body_ops->dbo_write )", per Alex's comment above. Also, patch https://review.whamcloud.com/49335 "LU-16335 test: add fail_abort_cleanup()" v1 crashed 4 times with this LASSERT and I couldn't see any functional difference between the v1 of the patch and the current v3, nor in the parent patch https://review.whamcloud.com/49329 "LU-16335 mdt: skip target check for rm_entry".

Other replay-single crashes in the past few days can be attributed to the other patches themselves.

Comment by Lai Siyao [ 16/Dec/22 ]

Yes, abort_recov_mdt can be run at any time, but update log cancel will only be done before request/update replay. This means that if request/update replay has already started, abort_recov_mdt won't cancel the update logs.
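
A toy Python model of the gating described above, for illustration only. The stage names and the `RecoveryState` class are hypothetical and do not correspond to actual Lustre recovery states.

```python
from enum import Enum


class Stage(Enum):
    WAITING = 1    # before request/update replay has started
    REPLAYING = 2  # request/update replay in progress
    DONE = 3


class RecoveryState:
    """Hypothetical recovery state for one target."""

    def __init__(self, update_logs):
        self.stage = Stage.WAITING
        self.update_logs = list(update_logs)

    def start_replay(self):
        self.stage = Stage.REPLAYING

    def abort_recov_mdt(self):
        """Abort recovery; return True only if the update logs were cancelled."""
        # Once replay has started the logs may still be written to, so they
        # are left alone in that case.
        cancel = self.stage is Stage.WAITING
        if cancel:
            self.update_logs.clear()
        self.stage = Stage.DONE
        return cancel
```

In this model an abort issued after `start_replay()` leaves the logs in place, which is the case the next comment asks about.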

Comment by Andreas Dilger [ 16/Dec/22 ]

But won't that defeat the whole purpose of the fix? Could something else be done, like forcing rollover to new logs and then cancelling the old logs, so that in-use logs are not removed?
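
The rollover idea suggested here could look roughly like the following Python sketch. It only illustrates the suggestion and is not something the ticket confirms was implemented; `UpdateLogSet` and its methods are hypothetical names.

```python
class UpdateLogSet:
    """Hypothetical set of update logs with one current log receiving writes."""

    def __init__(self):
        self.generation = 0
        self.logs = {0: []}  # generation number -> records

    def write(self, record):
        # New records always go to the current generation.
        self.logs[self.generation].append(record)

    def rollover_and_cancel_old(self):
        """Switch writers to a fresh log, then cancel every older log."""
        self.generation += 1
        self.logs[self.generation] = []
        old = sorted(g for g in self.logs if g < self.generation)
        for g in old:
            del self.logs[g]
        return old


logs = UpdateLogSet()
logs.write("a")
cancelled = logs.rollover_and_cancel_old()
logs.write("b")  # lands in the new log, which was never cancelled
```

The point of the ordering is that writers are redirected before anything is cancelled, so no log that is still in use gets removed.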

Comment by Gerrit Updater [ 29/Dec/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49526
Subject: LU-16159 target: race in update log cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 983a53137bfd26e6baaba4575a9dbb379e358b76

Comment by Andreas Dilger [ 13/Jan/23 ]

Reopening this so that it is tracked for 2.16 because of the latest patch.

Comment by Gerrit Updater [ 14/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49633
Subject: LU-16159 tests: cleanup replay-single code style
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cb4e810ddfb11aab30ea6ad6b40ff371c04ddac1

Comment by Xing Huang [ 28/Jan/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49787
Subject: LU-16159 osp: destroy should not overtake writes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0e0993837d0e13bd31d53daea04e7259be6c1c4c

Comment by Gerrit Updater [ 03/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49787/
Subject: LU-16159 osp: destroy should not overtake writes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5a5bd5b4dafaf252c641b8afd2cd809de7384f4f

Comment by Andreas Dilger [ 20/May/23 ]

Is there anything left for this ticket, or can it be resolved?

Comment by Gerrit Updater [ 05/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51223
Subject: LU-16159 lod: cancel update llogs upon recovery abort
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: a544f6c69fe4b2eb81f4b05581654325ecc96f93

Comment by Gerrit Updater [ 05/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51224
Subject: LU-16159 osp: destroy should not overtake writes
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: acfee963c6999785afdd3b25c83015695099a3b9
