MDT recovery did not finish due to corrupt llog record

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      A broken DNE recovery llog record was preventing MDT-MDT recovery from completing. MDT0003 was permanently unable to finish recovery with MDT0019, looping on:

      llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
      

      The llog file contained one bad record. Each recovery attempt processed the llog (every record but one other had already been cancelled successfully), hit the bad record, aborted, and then retried.

      Since the DNE recovery llog for MDT0003 is stored on MDT0019, this necessitated "fixing" the llog file on MDT0019 by truncating it to zero bytes, which allowed MDT0003 recovery to finish.

      Retrying recovery can be useful in some cases, such as when the remote MDT is inaccessible. If there is a single bad record, however, it makes sense to retry only once (in case the llog was in the middle of being written), then cancel that record and continue with the rest of recovery, or at worst abort recovery with that MDT and cancel the whole llog file. Otherwise, manual intervention is needed to recover from this situation, and it can't do better than cancelling the llog record (pending LU-15937) or deleting the whole llog file.
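
      The bounded-retry policy proposed above can be illustrated with a small toy model. This is only a sketch of the idea, not Lustre code: `BadRecord`, `replay_llog`, `make_reader`, and the `max_retries` parameter are hypothetical names invented for this example, and the "llog" is just a Python list in which `None` stands for a permanently corrupt record.

```python
class BadRecord(Exception):
    """Raised by the toy reader when a record cannot be parsed."""


def replay_llog(read_record, count, max_retries=1):
    """Replay `count` records, retrying each bad one at most `max_retries` times.

    A record that is still bad after the retries is cancelled (skipped)
    instead of restarting the whole replay, so one corrupt record cannot
    block recovery forever.
    """
    replayed, cancelled = [], []
    for index in range(count):
        for _attempt in range(max_retries + 1):
            try:
                replayed.append(read_record(index))
                break  # record was good, move on to the next one
            except BadRecord:
                continue  # retry once in case the record was mid-write
        else:
            # Every attempt failed: give up on this record only.
            cancelled.append(index)
    return replayed, cancelled


def make_reader(records, flaky=()):
    """Build a toy reader; indices in `flaky` fail on their first read only."""
    seen = set()

    def read_record(index):
        if index in flaky and index not in seen:
            seen.add(index)
            raise BadRecord(index)  # transient failure (record mid-write)
        if records[index] is None:
            raise BadRecord(index)  # permanent corruption
        return records[index]

    return read_record


# Record 1 is permanently corrupt; record 2 fails once, then reads cleanly.
reader = make_reader(["a", None, "c", "d"], flaky={2})
replayed, cancelled = replay_llog(reader, 4)
print(replayed, cancelled)
```

      In this model the transient failure on record 2 is absorbed by the single retry, while the permanently bad record 1 is cancelled and the remaining records are still replayed. The alternative mentioned above, aborting recovery with that MDT and cancelling the whole llog, would correspond to returning early on the first record that exhausts its retries.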

          Activity

            [LU-15938] MDT recovery did not finish due to corrupt llog record

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48310/
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1fa6738b6dd56660058cb146629f0d23e36cdc1d

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51220
            Subject: LU-15938 llog: more checks in llog_reader
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: a0d25b76f6d41e164536a1c1cd46d503338643e7

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51218
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: e43bf0086e8f80f128cf868b5dca6079872f6a62

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51217
            Subject: LU-15938 lod: prevent endless retry in recovery thread
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: d66b517c9207dae3dd6da75266e78e50dfbc3f93

            pjones Peter Jones added a comment -

            All patches seem to have merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48112/
            Subject: LU-15938 llog: more checks in llog_reader
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 386ffcdbb4c9b89f798de4c83a51a3f020542c8b

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48341
            Subject: LU-15938 llog: Fix chunk re-read case in llog_process_thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a7015dccd3e960516c95510663626f075191d4bd

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48310
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 02ea0e325eabc57d95051e79ffe1cc87c2243ced

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48286
            Subject: LU-15938 lod: prevent endless retry in recovery thread
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7e2533728b5d574ec8638742cbfe574580c0a063

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47934/
            Subject: LU-15938 llog: llog_reader to detect more corruptions
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d914a5b7a49ac6b61c0191a0966d1f684a6957b6

            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Andreas Dilger (adilger)
              Votes: 0
              Watchers: 14
