MDT recovery did not finish due to corrupt llog record

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3

    Description

      A broken DNE recovery llog record was preventing MDT-MDT recovery from completing. MDT0003 was permanently unable to finish recovery with MDT0019, looping on:

      llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
      

      There was a bad record in the llog file. Recovery would process the llog (all but one other record had been cancelled successfully), hit the bad record, abort, and then retry.

      Since the DNE recovery llog for MDT0003 is stored on MDT0019, this necessitated "fixing" the llog file on MDT0019 by truncating it to zero bytes, which allowed MDT0003 recovery to finish.

      Retrying recovery can be useful in some cases, such as when the remote MDT is inaccessible. But if there is a single bad record, it makes sense to retry only once (in case the llog was in the middle of being written), then cancel that record and continue with the rest of recovery, or at worst abort recovery with that MDT and cancel the whole llog file. Otherwise, manual intervention is needed to recover from this situation, and that can do no better than cancelling the llog record (pending LU-15937) or deleting the whole llog file.
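The proposed policy (retry a failing record once, then give up on that record only and keep going) can be sketched as below. This is a minimal illustration in Python, not Lustre code: `BadRecord`, `process_llog`, and the `handle` callback are hypothetical names, and real llog processing happens in the kernel in C.

```python
class BadRecord(Exception):
    """Raised by the (hypothetical) record handler when a record fails validation."""


def process_llog(records, handle, max_retries=1):
    """Process every (index, record) pair.

    A failing record is retried up to max_retries times, in case it was
    still being written. After that it is skipped so the remaining records
    can be processed. Returns the skipped indices, which are the candidates
    for cancellation.
    """
    skipped = []
    for index, record in records:
        attempts = 0
        while True:
            try:
                handle(index, record)
                break
            except BadRecord:
                if attempts >= max_retries:
                    skipped.append(index)  # give up on this record only
                    break
                attempts += 1
    return skipped


def demo_handle(index, record):
    # Stand-in validation: a record of None is treated as corrupt.
    if record is None:
        raise BadRecord(index)


print(process_llog([(1, "a"), (2, None), (3, "c")], demo_handle))  # prints [2]
```

With `max_retries=1` the corrupt record is attempted twice in total and recovery still reaches record 3, instead of restarting the whole llog indefinitely.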

          Activity

            [LU-15938] MDT recovery did not finish due to corrupt llog record
            pjones Peter Jones added a comment -

            All patches seem to have merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48112/
            Subject: LU-15938 llog: more checks in llog_reader
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 386ffcdbb4c9b89f798de4c83a51a3f020542c8b

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48341
            Subject: LU-15938 llog: Fix chunk re-read case in llog_process_thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a7015dccd3e960516c95510663626f075191d4bd

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48310
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 02ea0e325eabc57d95051e79ffe1cc87c2243ced

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48286
            Subject: LU-15938 lod: prevent endless retry in recovery thread
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7e2533728b5d574ec8638742cbfe574580c0a063

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47934/
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d914a5b7a49ac6b61c0191a0966d1f684a6957b6

            gerrit Gerrit Updater added a comment -

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48112
            Subject: LU-15938 llog: more checks in llog_reader
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4b619366a07813394cfb7abf1d79bb9512605401

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47698/
            Subject: LU-15938 lod: prevent endless retry in recovery thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1a24dcdce121787428ea820561cfa16ae24bdf82

            gerrit Gerrit Updater added a comment -

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47934
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e6f90aa1b8234121d0fc03ce12f98268ff3fcd29

            tappro Mikhail Pershin added a comment -

            OK, after more analysis I checked llog_reader and found that it doesn't report all inconsistencies. A modified version showed the problem in update_log:

            # llog_reader update_log 
            rec #7707 type=106a0000 len=1160 offset 8670472
            in bitmap: rec #8245 is set!
            llog has 1 records but header count is 2
            Header size : 32768     llh_size : 496
            Time : Thu Apr  7 12:52:40 2022
            Number of records: 2    cat_idx: 9    last_idx: 8244
            Target uuid : 
            -----------------------
            #7707 (1160) id: 0 updatelog record master_transno:45507493823 batchid:37129103921 flags:0x0 u_index:0 u_count:11 p_count:18
                [0x1840003b3b:0x2fce:0x0] type:create/1 params:2 p_0:0 p_1:1 
                [0x1840003b3b:0x2fce:0x0] type:ref_add/3 params:0 
                [0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4 
                [0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4 
                [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8 
                [0x1280002b0e:0x1be36:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10 
                [0x1280002b0e:0x1be36:0x0] type:ref_add/3 params:0 
                [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8 
                [0x1280002b0e:0x1be36:0x0] type:attr_set/5 params:1 p_0:13 
                [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8 
                [0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17 
                p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000005ob\x00000000005ob\x00000000005ob\x0000
                p_1 - 16/\x0E+\x0080120000006\xBE0100000000
                p_2 - 2/.
                p_3 - 0/
                p_4 - 0/
                p_5 - 16384/\x0000000000000400jb\x0000000000@\x0000000000000300jb\x00000000..\x0000000000000C00000000000000trusted.dmv\x00d\x0000000\x00
                p_6 - 0/
                p_7 - 0/
                p_8 - 0/
                p_9 - 0/
                p_10 - 0/
                p_11 - 0/
                p_12 - 0/
                p_13 - 0/
                p_14 - 0/
                p_15 - 0/
                p_16 - 0/
                p_17 - 0/

            This output is correct: the llog has only one record, #7707, but its bitmap also has bit #8245 set. That is why the count is 2 and why retrying doesn't help: it reads the same bitmap and llog data again and again. I will think about how to inject such corruption for test purposes, and will add the modified llog_reader as well.
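The kind of cross-check described in this comment (comparing the records actually found in the file against the bits set in the header bitmap) might look roughly like the following. It is a simplified model in Python, not the llog_reader implementation: `check_llog_bitmap` and its parameters are hypothetical, and the header record and on-disk layout are ignored. The input values are the ones reported in the output above.

```python
def check_llog_bitmap(record_indices, bitmap_indices, last_idx):
    """Return a list of inconsistencies between scanned records and the bitmap.

    record_indices: indices of records actually found when scanning the file
    bitmap_indices: indices whose bit is set in the header bitmap
    last_idx:       last record index recorded in the header
    """
    records = set(record_indices)
    bitmap = set(bitmap_indices)
    problems = []
    for idx in sorted(bitmap - records):
        problems.append(f"bit {idx} is set but no record was found")
    for idx in sorted(records - bitmap):
        problems.append(f"record {idx} exists but its bit is clear")
    for idx in sorted(bitmap):
        if idx > last_idx:
            problems.append(f"bit {idx} is beyond last_idx {last_idx}")
    return problems


# Values from the report: one record (#7707), bits 7707 and 8245 set, last_idx 8244.
for line in check_llog_bitmap([7707], [7707, 8245], last_idx=8244):
    print(line)
# prints:
#   bit 8245 is set but no record was found
#   bit 8245 is beyond last_idx 8244
```

Because the stray bit is persistent on-disk state, re-reading the llog yields the same result every time, which is why a retry cannot clear this condition.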

            tappro Mikhail Pershin added a comment -

            Initial approach that resolves the endless retry loop we observed, which was caused by remote llog short-read handling. The patch also makes the obd_abort_recovery_mdt option abort the update recovery threads as well. That should prevent the known endless-retry cases and allow manual intervention via the abort_recovery_mdt parameter if update recovery gets stuck due to network problems.

            As noted above, more work is needed to remove update llogs when abort_recovery_mdt is set, and it is worth thinking about a limit on the number of retries when the remote server is not accessible. So far I have no idea what to use as the basis; maybe the recovery hard (or soft) timeout value?
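One way to picture the open question about limiting retries is to bound them by a deadline rather than a count, for example a deadline derived from a recovery timeout. The sketch below is only an illustration of that idea in Python: `retry_until_deadline`, `read_remote_llog`, and the use of `ConnectionError` to represent an unreachable server are all assumptions, and nothing here reflects what the merged patches actually do.

```python
import time


def retry_until_deadline(read_remote_llog, timeout_s, delay_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call read_remote_llog() until it succeeds or timeout_s has elapsed.

    Returns True on success, or False if the next retry would start after
    the deadline. clock and sleep are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            read_remote_llog()
            return True
        except ConnectionError:
            # Stop retrying once another attempt would overrun the deadline.
            if clock() + delay_s > deadline:
                return False
            sleep(delay_s)
```

A caller could pass the hard or soft recovery timeout as `timeout_s` and treat a `False` return as the point at which recovery with that MDT is aborted.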

            People

              Assignee: tappro Mikhail Pershin
              Reporter: adilger Andreas Dilger