Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.14.0
- 3
Description
A broken DNE recovery llog record was preventing MDT-MDT recovery from completing. MDT0003 was permanently unable to finish recovery with MDT0019, looping on:
llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
The llog file contained a bad record. Each recovery pass processed the llog (all but one of the other records had already been cancelled successfully), hit the bad record, aborted, and then retried.
Since the DNE recovery llog for MDT0003 is stored on MDT0019, this necessitated "fixing" the llog file on MDT0019 by truncating it to zero bytes, which allowed MDT0003 recovery to finish.
Retrying recovery can be useful in some cases, such as when the remote MDT is inaccessible. But if there is a single bad record, it makes sense to retry only once (in case the llog was in the middle of being written), then cancel this record and continue with the rest of recovery, or at worst abort recovery with that MDT and cancel the whole llog file. Otherwise, manual intervention is needed to recover from this situation, and that can do no better than cancelling the llog record (pending LU-15937) or deleting the whole llog file.
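The proposed policy (retry a failing record once, then cancel only that record and keep going) can be illustrated with a small toy model. This is a sketch in Python, not Lustre code: `Llog`, `BadRecord`, `process_llog`, and `max_retries` are hypothetical names invented for the illustration, and the real logic would live in the C llog processing path.

```python
from dataclasses import dataclass, field


class BadRecord(Exception):
    """Raised by the (hypothetical) handler when a record cannot be replayed."""


@dataclass
class Llog:
    # Toy model: record index -> payload. Cancelling a record removes it.
    records: dict = field(default_factory=dict)

    def cancel(self, idx):
        self.records.pop(idx, None)


def process_llog(llog, handler, max_retries=1):
    """Replay every record, tolerating bad records after max_retries retries.

    Returns the indices that were cancelled without being replayed,
    so the caller can report them.
    """
    failures = {}  # record index -> number of failed attempts so far
    skipped = []
    while llog.records:
        idx = min(llog.records)  # process records in index order
        try:
            handler(idx, llog.records[idx])
        except BadRecord:
            failures[idx] = failures.get(idx, 0) + 1
            if failures[idx] <= max_retries:
                continue  # retry: the record may have been mid-write
            skipped.append(idx)  # give up on this one record only
        llog.cancel(idx)  # replayed or abandoned: drop it and move on
    return skipped
```

With the default `max_retries=1`, a record that fails twice is cancelled and the remaining records are still replayed, so the loop always terminates instead of retrying forever. The harsher fallback mentioned above (abort recovery with that MDT and cancel the whole llog) would replace the per-record skip with clearing all remaining records.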
Issue Links
- is related to
  - LU-16203 zero records and empty plain llogs in update llog catalog (Resolved)
  - LU-16052 conf-sanity test_106: crash after osp_sync_process_queues failed: -53 (Resolved)
- is related to
  - LU-15937 lctl llog commands do not work for DNE recovery logs (Open)
  - LU-15761 cannot finish MDS recovery (Resolved)
  - LU-15645 gap in recovery llog should not be a fatal error (Resolved)
  - LU-15934 client refused mount with -EAGAIN because of missing MDT-MDT llog connection (Resolved)
  - LU-15139 sanity test_160h: dt_record_write() ASSERTION( dt->do_body_ops->dbo_write ) failed (Resolved)