[LU-15645] gap in recovery llog should not be a fatal error Created: 13/Mar/22 Updated: 08/Dec/22 Resolved: 05/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.15.0, Lustre 2.12.10 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A gap in the MDT recovery llog (of unknown origin) was hit during recovery.

log_process_thread()) lfs02-MDT001e-osp-MDT0000: [0x3:0x1b70:0x4] Invalid record: index 16123 but expected 16122

and this was later confirmed with llog_reader:

rec #15221 type=106a0000 len=1160 offset 17231040
rec #16097 type=106a0000 len=1160 offset 18220168
rec #16098 type=106a0000 len=1160 offset 18221328
rec #16099 type=106a0000 len=1160 offset 18222488
rec #16100 type=106a0000 len=1160 offset 18223648
Previous index is 16121, current 16123, offset 18249168
rec #18718 type=106a0000 len=1160 offset 21180888
rec #20278 type=106a0000 len=1160 offset 22943400

This caused the MDT recovery to fail and all of the clients were evicted from that MDT. It isn't clear whether the global eviction is necessary, or if this should be handled more gracefully? Other MDTs likely have a copy of that operation for replay, and if not then it would be lost.

What is more problematic is that this recovery llog error is persistent, and the same problem happens on every recovery for that MDT. If the clients (and MDTs?) are evicted from recovery, the llog records should at a minimum be cancelled, or the llog file should be cleared. Better yet would be to not treat this gap as a fatal error, since I don't think there is anything that can be done about it at this point. |
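As a rough cross-check of the llog_reader output above, the offsets can be compared against the record indices. The sketch below is not part of Lustre or llog_reader; the function name `index_implied_by_offset` and the regular expressions are hypothetical, and it assumes that every record between rec #16100 and the reported gap is also 1160 bytes long with no padding in between.

```python
import re

# Lines copied from the llog_reader output quoted in this ticket.
LLOG_READER_OUTPUT = """\
rec #16097 type=106a0000 len=1160 offset 18220168
rec #16098 type=106a0000 len=1160 offset 18221328
rec #16099 type=106a0000 len=1160 offset 18222488
rec #16100 type=106a0000 len=1160 offset 18223648
Previous index is 16121, current 16123, offset 18249168
"""

# Hypothetical patterns for the two line shapes seen above.
REC_RE = re.compile(r"rec #(\d+) type=\w+ len=(\d+) offset (\d+)")
GAP_RE = re.compile(r"Previous index is (\d+), current (\d+), offset (\d+)")


def index_implied_by_offset(text):
    """Return (implied_index, reported_index) for the record at the gap offset.

    Assumption: all records between the last listed record and the gap
    offset have the same length as the last listed record.
    """
    recs = [tuple(map(int, m.groups())) for m in REC_RE.finditer(text)]
    last_index, rec_len, last_offset = recs[-1]

    gap = GAP_RE.search(text)
    _previous, current, gap_offset = map(int, gap.groups())

    distance = gap_offset - last_offset
    if distance % rec_len:
        raise ValueError("gap offset is not a whole number of records away")

    # Number of fixed-size records that fit between the two offsets.
    return last_index + distance // rec_len, current


implied, reported = index_implied_by_offset(LLOG_READER_OUTPUT)
print(f"offset implies index {implied}, record carries index {reported}")
```

Under that fixed-length assumption, the record at offset 18249168 sits 22 records after rec #16100, so its position corresponds to index 16122 while it carries index 16123. That would be consistent with a jump in the numbering rather than a hole in the file, but it is only an inference from the quoted output.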
| Comments |
| Comment by Alex Zhuravlev [ 14/Mar/22 ] |
|
I think that VBR checks should ensure that there is no real gap in the transaction (otherwise recovery abort is unavoidable). So there are two major scenarios here: |
| Comment by Andreas Dilger [ 14/Mar/22 ] |
|
I was wondering about the potential sources of a gap in the recovery llog. As you wrote, if there was an actual gap in the updates applied to the MDT objects, then that should be caught by VBR. I think this is a gap in the numbering of the OUT records in the llog, which seems different. That might be caused by the llog header being written non-atomically with the llog body, which I recall was a bug that was fixed by Mike a while ago. However, it isn't clear if this gap in the llog numbering is a "real" problem or not? If there are clients waiting on the recovery of this transaction, wouldn't they have it pending replay in their own recovery logs also? In either case, if the clients are evicted, then the recovery log definitely needs to be cleaned up so that this gap does not cause future problems. |
| Comment by Gerrit Updater [ 16/Mar/22 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46837 |
| Comment by Etienne Aujames [ 18/Mar/22 ] |
|
Hello, |
| Comment by Andreas Dilger [ 18/Mar/22 ] |
|
Etienne, I don't think anything is done to rewrite the llog with the gap; the gap is just skipped without causing the recovery to fail. |
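The "skip the gap instead of failing" behaviour described above can be illustrated with a small sketch. This is not the code from the landed patch; `Record`, `LlogGapError`, `process_records` and the `gap_is_fatal` switch are hypothetical names used only to contrast the two policies, and treating a backwards index as fatal is an assumption of this sketch.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("llog_sketch")


@dataclass
class Record:
    index: int
    payload: str


class LlogGapError(Exception):
    """Raised when an index mismatch is treated as fatal."""


def process_records(records, first_index, gap_is_fatal=False):
    """Process records in order and return (payloads, gaps).

    With gap_is_fatal=True a forward jump in the index aborts processing,
    which mirrors the behaviour reported in this ticket. With the default,
    the jump is logged and processing continues with the next record.
    """
    expected = first_index
    payloads, gaps = [], []
    for rec in records:
        if rec.index != expected:
            # Assumption: an index that goes backwards is always an error.
            if gap_is_fatal or rec.index < expected:
                raise LlogGapError(
                    f"index {rec.index} but expected {expected}")
            log.warning("skipping gap: index %d but expected %d",
                        rec.index, expected)
            gaps.append((expected, rec.index))
        payloads.append(rec.payload)
        expected = rec.index + 1
    return payloads, gaps


records = [Record(16120, "a"), Record(16121, "b"), Record(16123, "c")]
payloads, gaps = process_records(records, first_index=16120)
print(payloads, gaps)
```

Note that nothing is rewritten in this sketch either: the input list is left untouched and the gap is only recorded, so the same warning would appear on each pass over the same records.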
| Comment by Gerrit Updater [ 07/Apr/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47011 |
| Comment by Gerrit Updater [ 05/May/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46837/ |
| Comment by Peter Jones [ 05/May/22 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 20/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47011/ |