Lustre / LU-15938

MDT recovery did not finish due to corrupt llog record

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.16.0
    • Affects Version: Lustre 2.14.0

    Description

      A corrupt DNE recovery llog record prevented MDT-MDT recovery from completing. MDT0003 was unable to finish recovery with MDT0019 and looped indefinitely on:

      llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
      

      The llog file contained a bad record. Each recovery pass processed the llog (all but one other record had already been cancelled successfully), hit the bad record, aborted, and then retried.

      Since the DNE recovery llog for MDT0003 is stored on MDT0019, this necessitated "fixing" the llog file on MDT0019 by truncating it to zero bytes, which allowed MDT0003 recovery to finish.

      Retrying recovery can be useful in some cases, for example if the remote MDT is inaccessible. However, if there is a single bad record, it makes sense to retry only once (in case the llog was in the middle of being written), then cancel that record and continue with the rest of recovery, or at worst abort recovery with that MDT and cancel the whole llog file. Otherwise, manual intervention is needed to recover from this situation, and that cannot do better than cancelling the llog record (pending LU-15937) or deleting the whole llog file.
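
      The proposed policy (retry a failing record once, then cancel it and carry on) can be illustrated with a small standalone sketch. This is not Lustre code: struct demo_rec, demo_process_rec() and demo_recover() are hypothetical names invented for illustration, and the sketch assumes a corrupt record fails the same way on every read. It does not model the case of an inaccessible remote MDT, where retrying is still the right behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a recovery llog record. */
struct demo_rec {
	int  id;
	bool corrupt;    /* record fails validation on every read */
	bool cancelled;  /* record is already handled and can be ignored */
};

/* Retry once, in case the llog was in the middle of being written. */
enum { DEMO_MAX_RETRIES = 1 };

/* Returns 0 if the record is usable, -1 if it is corrupt. */
static int demo_process_rec(const struct demo_rec *rec)
{
	return rec->corrupt ? -1 : 0;
}

/*
 * Process every pending record. A record that still fails after
 * DEMO_MAX_RETRIES retries is cancelled and skipped, instead of
 * restarting processing of the whole llog.
 * Returns the number of records that had to be skipped.
 */
static int demo_recover(struct demo_rec *recs, size_t count)
{
	int skipped = 0;

	assert(recs != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		int rc = -1;

		if (recs[i].cancelled)
			continue;

		/* One initial attempt plus a bounded number of retries. */
		for (int attempt = 0; attempt <= DEMO_MAX_RETRIES; attempt++) {
			rc = demo_process_rec(&recs[i]);
			if (rc == 0)
				break;
		}

		if (rc != 0) {
			fprintf(stderr, "record %d is corrupt, cancelling it\n",
				recs[i].id);
			skipped++;
		}

		/* Replayed or given up on: either way the record is done. */
		recs[i].cancelled = true;
	}

	return skipped;
}
```

      Given one already-cancelled record, one corrupt record and one valid record, demo_recover() returns 1 and leaves all three records cancelled, so a second pass has nothing left to retry. The fallback mentioned above (abort recovery with that MDT and cancel the whole llog file) would correspond to returning an error from the loop instead of skipping the record.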

            People

              tappro Mikhail Pershin
              adilger Andreas Dilger