[LU-16159] remove update llog files after recovery abort

Details

    Description

      Once recovery is aborted, the existing update logs should be removed, because they are used for recovery only. If a log is corrupt or inaccessible and is kept around after the recovery abort, the next recovery will hit the same issue again. In addition, the log files may grow large, and retrieving them during recovery may lead to a recovery timeout.

      Besides this, LOD device initialization should do a sanity check of all update logs on the device. If a log file is not accessible (e.g. because of an OI mapping mismatch, which may also lead to a recovery timeout), the log file FID should be deleted from the update catalog so that the log file is no longer visible to others (the file itself cannot be deleted at that moment because it is inaccessible).
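
      For illustration, the llog state on an MDT can be inspected from userspace with lctl's llog commands. The sketch below is only an example: lustre-MDT0000 and the log FID are placeholders for whatever llog_catlist actually reports on a given system.

      # list the llog catalogs known to this MDT (the update llog catalog is among them)
      lctl --device lustre-MDT0000 llog_catlist

      # print the records of one log by its FID; a stale or oversized update llog left
      # over from an aborted recovery would show up here
      lctl --device lustre-MDT0000 llog_print '[0x2000320e0:0x1:0x0]'

      # abort a hung recovery on the MDT (the situation this ticket addresses)
      lctl --device lustre-MDT0000 abort_recovery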

          Activity


            adilger Andreas Dilger added a comment -

            But won't that defeat the whole purpose of the fix? Could something else be done, like forcing rollover to new logs and then cancelling the old logs, so that in-use logs are not removed?

            laisiyao Lai Siyao added a comment -

            Yes, abort_recov_mdt can be run at any time, but update log cancel will only be done before request/update replay, which means that if request/update replay has already started, abort_recov_mdt won't cancel the update logs.


            adilger Andreas Dilger added a comment -

            Lai, it isn't clear if we can control when abort_recov_mdt is run (before or after recovery is done), because the administrator may run this command tens of minutes after the MDS has restarted and recovery is hung, so it is likely that update log recovery has already finished (as much as possible) at that point.

            I looked through the Maloo results, and did see two crashes that look like LU-15139 "dt_record_write() ASSERTION( dt->do_body_ops->dbo_write )", per Alex's comment above. Also, patch https://review.whamcloud.com/49335 "LU-16335 test: add fail_abort_cleanup()" v1 crashed 4 times with this LASSERT and I couldn't see any functional difference between the v1 of the patch and the current v3, nor in the parent patch https://review.whamcloud.com/49329 "LU-16335 mdt: skip target check for rm_entry".

            Other replay-single crashes in the past few days can be attributed to the other patches themselves.

            laisiyao Lai Siyao added a comment -

            It looks like we shouldn't cancel update logs if some update logs have already been replayed, because replay may involve update log writes while update log cancel may destroy empty logs. I will push a patch to move update log cancel before request/update replay.

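            As a side note, whether the cancel actually ran can be verified from the MDS console log, since the MDT prints a "cancel update llog [FID]" message for each cancelled log (see the excerpt in Alex's comment below); for example:

            # count the update llogs the MDT reported as cancelled around the recovery abort
            dmesg | grep -c 'cancel update llog'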

            bzzz Alex Zhuravlev added a comment -

            also, I noticed a number of assertions I haven't seen for quite a while:

            [ 6613.485294] Lustre: DEBUG MARKER: == replay-single test 119: timeout of normal replay does not cause DNE replay fails ========================================================== 12:56:09 (1671108969)
            ...
            [ 6692.155879] Lustre: 300726:0:(ldlm_lib.c:2305:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
            [ 6692.156118] Lustre: 300726:0:(ldlm_lib.c:2315:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [ 6692.156264] Lustre: 300726:0:(ldlm_lib.c:2315:target_recovery_overseer()) Skipped 2 previous similar messages
            [ 6692.180534] Lustre: lustre-MDT0000-osd: cancel update llog [0x2000320e0:0x1:0x0]
            [ 6692.242056] Lustre: lustre-MDT0001-osp-MDT0000: cancel update llog [0x24000c369:0x1:0x0]
            [ 6692.276833] Lustre: 300726:0:(ldlm_lib.c:2859:target_recovery_thread()) too long recovery - read logs
            [ 6692.282517] LustreError: 7048:0:(dt_object.h:2296:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: [0x24001212a:0x2:0x0] doesn't exit
            [ 6692.282752] LustreError: 7048:0:(dt_object.h:2296:dt_declare_record_write()) LBUG
            [ 6692.282914] Pid: 7048, comm: ll_ost_out00_00 4.18.0 #2 SMP Sun Oct 23 17:58:04 UTC 2022
            [ 6692.283111] Call Trace TBD:
            [ 6692.283190] [<0>] libcfs_call_trace+0x67/0x90 [libcfs]
            [ 6692.283312] [<0>] lbug_with_loc+0x3e/0x80 [libcfs]
            [ 6692.283488] [<0>] out_write_add_exec+0x175/0x1e0 [ptlrpc]
            [ 6692.283652] [<0>] out_write+0x161/0x380 [ptlrpc]
            [ 6692.283810] [<0>] out_handle+0x16c0/0x23b0 [ptlrpc]
            [ 6692.283970] [<0>] tgt_request_handle+0x977/0x1a40 [ptlrpc]
            [ 6692.284121] [<0>] ptlrpc_main+0x1724/0x32c0 [ptlrpc]
            [ 6692.284243] [<0>] kthread+0x129/0x140
            

            adilger Andreas Dilger added a comment -

            Note that patch https://review.whamcloud.com/49335 "LU-16335 test: add fail_abort_cleanup()" is already adding the stack_trap cleanup of the bad directory, so the only thing still needed after that is LFSCK to fix it properly (LU-16336).
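
            For context, a stack_trap cleanup of this kind could look roughly like the sketch below; the directory path is a placeholder, not necessarily the one used by the patch, and $LFS, $DIR and stack_trap are the usual test-framework.sh conventions:

            # register cleanup to run at test exit so the broken striped directory does not
            # survive into check_and_cleanup_lustre; "lfs rm_entry" removes the directory
            # entry even when the directory itself is no longer usable
            stack_trap "$LFS rm_entry $DIR/$tdir/striped_dir" EXIT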

            adilger Andreas Dilger added a comment -

            I pushed a patch under LU-16398 to disable this subtest until it is fixed. I think the right fix is to use rm_entry in a stack_trap at the end of test_100c once rm_entry is fixed, and then, more properly, for LFSCK to actually make the directory "usable" again so that it can be removed with rmdir.

            laisiyao Lai Siyao added a comment -

            In replay-single.sh, formatall is called before check_and_cleanup_lustre:

            5040 (( $MDS1_VERSION >= $(version_code 2.15.52.63) )) && formatall
            5041 
            5042 complete $SECONDS
            5043 check_and_cleanup_lustre
            
            bzzz Alex Zhuravlev added a comment - - edited

            MDSCOUNT=2 ONLY=100 MDSSIZE=400000 OSTSIZE=1000000 OSTCOUNT=2 REFORMAT=yes REFORMAT=yes bash replay-single.sh

            == replay-single test complete, duration 135 sec ========= 13:47:49 (1671025669)
            rm: cannot remove '/mnt/lustre/d100c.replay-single': Directory not empty
             replay-single test_135: @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = ./../tests/test-framework.sh:6526:error()
              = ./../tests/test-framework.sh:6010:check_and_cleanup_lustre()
              = replay-single.sh:5043:main()
            

            checked a few commits around:

            COMMIT          TESTED  PASSED  FAILED  STATUS  COMMIT DESCRIPTION
            624e78ae80      1       0       1       BAD     LU-930 docs: add lfs-rm_entry.8 man page
            d1dbf26afd      1       0       1       BAD     LU-16291 build: make kobj_type constant
            6f74bb60ff      1       0       1       BAD     LU-16205 sec: reserve flag for fid2path for encrypted files
            b054fcd785      1       0       1       BAD     LU-16159 lod: cancel update llogs upon recovery abort
            1819f6006f      5       5       0       GOOD    LU-15801 ldiskfs: Server support for RHEL9
            88bccc4fa4      5       5       0       GOOD    LU-16114 build: Update security_dentry_init_security args
            
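            For reference, the same kind of search could be driven by git bisect. This is only a sketch: build-and-test.sh is a hypothetical wrapper that rebuilds/installs Lustre for the checked-out commit, runs the replay-single command above, and exits non-zero on failure.

            # bounds taken from the table above: newest known-bad commit, then a known-good one
            git bisect start 624e78ae80 1819f6006f
            # git checks out each candidate and uses the wrapper's exit code to classify it
            git bisect run ./build-and-test.sh
            git bisect reset
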
            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48584/
            Subject: LU-16159 lod: cancel update llogs upon recovery abort
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b054fcd7852f6a22f8ec469ce94ddf6f3331ab34


            People

              Assignee: laisiyao Lai Siyao
              Reporter: laisiyao Lai Siyao
              Votes: 0
              Watchers: 13
