[LU-15938] MDT recovery did not finish due to corrupt llog record Created: 12/Jun/22 Updated: 05/Jun/23 Resolved: 08/Dec/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A broken DNE recovery llog record was preventing MDT-MDT recovery from completing. MDT0003 was permanently unable to finish recovery with MDT0019, looping on:
llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
There was a bad record in the llog file: recovery would process the llog (all but one other record had already been cancelled successfully), hit the bad record, abort, and then retry. Since the DNE recovery llog for MDT0003 is stored on MDT0019, this necessitated "fixing" the llog file on MDT0019 by truncating it to zero bytes, which allowed MDT0003 recovery to finish.

Retrying recovery can be useful in some cases, e.g. if the remote MDT is inaccessible, but if there is a single bad record it makes sense to retry only once (in case the llog was in the middle of being written), then cancel that record and continue with the rest of recovery, or at worst abort recovery with that MDT and cancel the whole llog file. Otherwise, this situation needs manual intervention, which can't do better than cancelling the llog record (pending LU-15937) or deleting the whole llog file. |
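The suggested retry policy could be sketched roughly as follows. This is a standalone toy example, not actual Lustre code; all names here (process_record, cancel_record, process_llog) are hypothetical stand-ins for the llog processing primitives:

```c
#include <errno.h>
#include <stdio.h>

/* hypothetical stand-ins for llog processing primitives */
static int  process_record(int idx)  { return idx == 8245 ? -EIO : 0; }
static void cancel_record(int idx)   { printf("cancelled rec #%d\n", idx); }

/* retry a bad record once (it may still have been in flight), then cancel
 * it and continue with the rest of recovery instead of looping forever */
static int process_llog(int first, int last)
{
    for (int idx = first; idx <= last; idx++) {
        int rc = process_record(idx);

        if (rc == -EIO) {
            rc = process_record(idx);      /* one retry only */
            if (rc == -EIO) {
                cancel_record(idx);        /* drop the bad record, go on */
                continue;
            }
        }
        if (rc < 0)
            return rc;                     /* any other error still aborts */
    }
    return 0;
}

int main(void)
{
    return process_llog(7707, 8245) < 0;
}
```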
| Comments |
| Comment by Andreas Dilger [ 12/Jun/22 ] |
|
This is similar to |
| Comment by Alex Zhuravlev [ 12/Jun/22 ] |
|
where can I find that corrupted llog for analysis? |
| Comment by Dmitry Vensko [ 12/Jun/22 ] |
|
attached the requested log |
| Comment by Peter Jones [ 13/Jun/22 ] |
|
Mike, could you please investigate? Thanks, Peter |
| Comment by Andreas Dilger [ 14/Jun/22 ] |
|
The problematic llog file decoded by llog_reader from master shows:
rec #7707 type=106a0000 len=1160 offset 8670472
Header size : 32768 llh_size : 496
Time : Thu Apr 7 10:52:40 2022
Number of records: 2 cat_idx: 9 last_idx: 8244
Target uuid :
-----------------------
#7707 (1160)updatelog record master_transno:45507493823 batchid:37129103921 flags:0x0 u_index:0 u_count:11 p_count:18
[0x1840003b3b:0x2fce:0x0] type:create/1 params:2 p_0:0 p_1:1
[0x1840003b3b:0x2fce:0x0] type:ref_add/3 params:0
[0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4
[0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4
[0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8
[0x1280002b0e:0x1be36:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10
[0x1280002b0e:0x1be36:0x0] type:ref_add/3 params:0
[0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8
[0x1280002b0e:0x1be36:0x0] type:attr_set/5 params:1 p_0:13
[0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8
[0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17
p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000005ob\x00000000005ob\x00000000005ob\x0000
p_1 - 16/\x0E+\x0080120000006\xBE0100000000
p_2 - 2/.
p_3 - 0/
p_4 - 0/
p_5 - 16384/\x0000000000000400jb\x0000000000@\x0000000000000300jb\x00000000..\x0000000000000C00000000000000trusted.dmv\x00d\x0000000\x00
p_6 - 0/
p_7 - 0/
p_8 - 0/
p_9 - 0/
p_10 - 0/
p_11 - 0/
p_12 - 0/
p_13 - 0/
p_14 - 0/
p_15 - 0/
p_16 - 0/
p_17 - 0/
#8244 (1160)NOT SET updatelog record master_transno:45508195003 batchid:37129117505 flags:0x0 u_index:0 u_count:11 p_count:18
[0x18400034de:0x6863:0x0] type:create/1 params:2 p_0:0 p_1:1
[0x18400034de:0x6863:0x0] type:ref_add/3 params:0
[0x18400034de:0x6863:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4
[0x18400034de:0x6863:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4
[0x18400034de:0x6863:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8
[0x128000369b:0x174b8:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10
[0x128000369b:0x174b8:0x0] type:ref_add/3 params:0
[0x18400034de:0x6863:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8
[0x128000369b:0x174b8:0x0] type:attr_set/5 params:1 p_0:13
[0x18400034de:0x6863:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8
[0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17
p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000H5ob\x00000000H5ob\x00000000H5ob\x000000
p_1 - 16/\x9B6\x008012000000B8t\x0100000000
p_2 - 2/.
The recovery from the source MDT reported an error when processing record 8245. |
| Comment by Mikhail Pershin [ 14/Jun/22 ] |
|
Yes, I've found the same. The llog itself is not corrupted: all records are processed by llog_reader without errors, and the llog size also fits the data in the llog. What is not OK so far is "Number of records: 2". At first I thought that was fine because one record is the llog header, but in fact llog_reader handles that and decrements the count, so there is a mismatch: the llog reports 2 live records but has only 1. I am not sure whether that is the problem or not. I am checking the code to see how -5 could be returned; my idea is that we might have read past the llog size somehow.
Another observation is that the server version in use is missing the following patches:
LU-13974 llog: check stale osp object
LU-15645 obdclass: llog to handle gaps
LU-15761 obdclass: fix locking in llog_cat_refresh()
It doesn't look like this is a direct result of any of them, but it would still be better to update. |
| Comment by Mikhail Pershin [ 14/Jun/22 ] |
|
Just an addition to the comment above: the lgh_count mismatch may be caused by undo operations over the llog if a pad record was involved, so it might indicate that undo operations took place on this llog. |
| Comment by Andreas Dilger [ 14/Jun/22 ] |
|
The debug logs show the MDT getting -EIO when trying to read the next llog block, maybe because the llog is short, or it thinks it is short because of the number of records?
00000004:00000001:1.0:1654980511.161770:0:11204:0:(osp_md_object.c:1345:osp_md_read()) Process leaving (rc=4400 : 4400 : 1130)
00000040:00000001:1.0:1654980511.161772:0:11204:0:(llog_osd.c:1089:llog_osd_next_block()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
00000040:00000001:1.0:1654980511.161774:0:11204:0:(lustre_log.h:465:llog_next_block()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00000040:00000001:1.0:1654980511.161775:0:11204:0:(llog.c:572:llog_process_thread()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
00000040:00080000:1.0:1654980511.161777:0:11204:0:(llog.c:761:llog_process_thread()) stop processing plain 0x2:2147506325:0 index 8245 count 3
00000040:00020000:1.0:1654980511.161778:0:11204:0:(llog.c:774:llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
00000040:00000001:1.0:1654980511.161781:0:11204:0:(llog.c:897:llog_process_or_fork()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
00000040:00000001:1.0:1654980511.161782:0:11204:0:(llog_cat.c:904:llog_cat_process_cb()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
00000040:00000001:1.0:1654980511.161784:0:11204:0:(llog.c:742:llog_process_thread()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
00000040:00080000:1.0:1654980511.161785:0:11204:0:(llog.c:761:llog_process_thread()) stop processing catalog 0x1:2147484688:0 index 9 count 12
00000040:00000001:1.0:1654980511.161790:0:11204:0:(llog.c:897:llog_process_or_fork()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
00000040:00000001:1.0:1654980511.161791:0:11204:0:(llog_cat.c:966:llog_cat_process_or_fork()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5) |
| Comment by Mikhail Pershin [ 15/Jun/22 ] |
|
So far it looks like MDT0003 has record #8245 set in its local llog bitmap and is trying to read that record from MDT0019. Meanwhile MDT0019 has it neither in the bitmap nor in the file; the file ends exactly at record #8244. That means the write request for #8245 was sent from MDT0003 to MDT0019 (which is why MDT0003 has it in its bitmap), but for some reason it was never written on MDT0019, while concurrent llog processing is trying to read it. As for the loop in the code: llog_osd_next_block() reads the llog up to its end, gets last_idx #8244, and since that is less than the expected #8245 it reads again, but the cycle exists because the current offset equals llog_size at that moment, causing the -5 error. I am not yet sure what happens next: if the caller retried the read again and again, the needed update should arrive at some moment and the read would succeed, but we don't see that. So either the write was lost completely and never happens, or the caller doesn't retry that read properly. |
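For illustration, the loop described above can be reduced to a standalone toy model (this is not the actual llog_osd_next_block() code; the struct and function below are hypothetical): the local bitmap expects index #8245, the remote file ends at #8244, so every pass runs off the end of the file, returns -5 (-EIO), and the caller simply retries:

```c
#include <errno.h>
#include <stdio.h>

/* hypothetical model of the remote plain llog on MDT0019 */
struct fake_llog {
    long long size;      /* bytes actually present in the llog file */
    int       last_idx;  /* last record index the file really contains */
};

/* look for record 'want_idx' starting at *offset; -EIO once we have
 * scanned past the end of the file without finding it */
static int next_block(struct fake_llog *llog, long long *offset, int want_idx)
{
    if (*offset >= llog->size)
        return -EIO;
    *offset = llog->size;                 /* scanned everything that exists */
    return want_idx <= llog->last_idx ? 0 : -EIO;
}

int main(void)
{
    struct fake_llog llog = { .size = 8670472 + 1160, .last_idx = 8244 };

    /* unbounded retries reproduce the hang; every pass fails the same way */
    for (int attempt = 1; attempt <= 3; attempt++) {
        long long offset = 0;
        printf("attempt %d: rc=%d\n", attempt, next_block(&llog, &offset, 8245));
    }
    return 0;
}
```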
| Comment by Mikhail Pershin [ 15/Jun/22 ] |
|
https://review.whamcloud.com/47003 - could solve a read arriving prior to the related write.
So, at the moment the first problem to address is preventing such an llog processing loop when the needed update is (presumably) lost; at least we can't wait for it forever, and perhaps we could check whether there is a related write in flight. The second problem to address is the abort_recovery_mdt option: when used, it skips update recovery but doesn't remove the related update logs, so if we have an llog corruption situation that needs to be resolved, abort_recovery_mdt doesn't resolve it, and every server remount gets stuck on the same issue. I think that if we decide to abort/skip update recovery there is no sense in keeping the update logs, and they should be truncated. This is to be discussed; I am probably missing some cases where they must stay.
|
| Comment by Mikhail Pershin [ 15/Jun/22 ] |
|
Speaking of abort_recovery_mdt: in lod_sub_recovery_thread() the cycle breaks only when abort_recovery is set, so abort_recovery_mdt doesn't make things better but also doesn't stop that thread. That could explain why setting abort_recovery_mdt doesn't help with stopping a stuck MDT-MDT recovery. |
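A minimal sketch of the fix direction, assuming the recovery loop is made to check both flags (the names below are hypothetical stand-ins, not the actual Lustre variables or the real lod_sub_recovery_thread() code):

```c
#include <stdbool.h>
#include <stdio.h>

/* hypothetical stand-ins for the server-wide abort flags discussed above */
static volatile bool obd_abort_recovery;
static volatile bool obd_abort_recovery_mdt;

/* Before the fix the recovery thread's loop effectively checked only
 * obd_abort_recovery, so setting the MDT-specific flag never stopped a
 * stuck MDT-MDT update recovery thread; checking both flags lets it exit. */
static bool update_recovery_aborted(void)
{
    return obd_abort_recovery || obd_abort_recovery_mdt;
}

int main(void)
{
    obd_abort_recovery_mdt = true;           /* admin aborts MDT recovery */
    printf("thread should exit: %d\n", update_recovery_aborted());
    return 0;
}
```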
| Comment by Andreas Dilger [ 15/Jun/22 ] |
|
Ok, so two problems then. The llog is permanently short for whatever reason (possibly old non-atomic writes?), so retrying forever will never fix it. Having a few retries might be OK in case the write is in progress. Maybe we need a separate error code for a short read? Secondly, abort_recov_mdt should definitely abort the MDT recovery. It also makes sense that this would clear, reset, or delete the recovery logs so that they are not reused on the next recovery. If recovery is already aborted and the filesystem is allowed to be used, then a later replay of old recovery logs may actually be bad. It also means that any problems in the recovery logs that are not "self healing" will be hit on every restart. |
| Comment by Mikhail Pershin [ 15/Jun/22 ] |
|
Yes, the same thoughts here. First I want to figure out why the write was lost; if it wasn't committed, I'd expect it to be replayed anyway. So probably the retry is not done correctly, e.g. the record id is wrong or something else, or this could be the result of an abort_recovery_mdt attempt, which skips update recovery, i.e. the updates are not re-applied and then it gets stuck retrying the missing writes. Anyway, whatever the reason, I think the retry mechanism should assume that a write might be lost and stop at some point. I think that is already covered by Alex's patch https://review.whamcloud.com/#/c/47003/; I need to discuss it with him. To me it looks like with that patch there might be no need for retries at all, at least if we are sure the read error is returned from the other server and is not a network error. |
| Comment by Gerrit Updater [ 22/Jun/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47698 |
| Comment by Mikhail Pershin [ 22/Jun/22 ] |
|
This initial approach resolves the endless retry loop we've observed by handling short reads of the remote llog. The patch also makes the obd_abort_recovery_mdt option abort the update recovery threads as well. That should prevent the known endless retry cases and make manual intervention possible via the abort_recovery_mdt parameter if update recovery gets stuck due to network problems. As noted above, more work is needed to remove the update llogs when abort_recovery_mdt is set, and it is worth thinking about a limit on the number of retries when the remote server is not accessible; so far I have no idea what to use as a basis - the recovery hard (or soft) timeout value, maybe? |
| Comment by Mikhail Pershin [ 06/Jul/22 ] |
|
OK, with more analysis I checked llog_reader and found that it doesn't report all inconsistencies. A modified version shows the problem in update_log:
# llog_reader update_log
rec #7707 type=106a0000 len=1160 offset 8670472
in bitmap: rec #8245 is set!
llog has 1 records but header count is 2
Header size : 32768 llh_size : 496
Time : Thu Apr 7 12:52:40 2022
Number of records: 2 cat_idx: 9 last_idx: 8244
Target uuid :
-----------------------
#7707 (1160) id: 0 updatelog record master_transno:45507493823 batchid:37129103921 flags:0x0 u_index:0 u_count:11 p_count:18
[0x1840003b3b:0x2fce:0x0] type:create/1 params:2 p_0:0 p_1:1
[0x1840003b3b:0x2fce:0x0] type:ref_add/3 params:0
[0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4
[0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4
[0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8
[0x1280002b0e:0x1be36:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10
[0x1280002b0e:0x1be36:0x0] type:ref_add/3 params:0
[0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8
[0x1280002b0e:0x1be36:0x0] type:attr_set/5 params:1 p_0:13
[0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8
[0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17
p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000005ob\x00000000005ob\x00000000005ob\x0000
p_1 - 16/\x0E+\x0080120000006\xBE0100000000
p_2 - 2/.
p_3 - 0/
p_4 - 0/
p_5 - 16384/\x0000000000000400jb\x0000000000@\x0000000000000300jb\x00000000..\x0000000000000C00000000000000trusted.dmv\x00d\x0000000\x00
p_6 - 0/
p_7 - 0/
p_8 - 0/
p_9 - 0/
p_10 - 0/
p_11 - 0/
p_12 - 0/
p_13 - 0/
p_14 - 0/
p_15 - 0/
p_16 - 0/
p_17 - 0/
This is the correct output: the llog has only one record, #7707, but its bitmap also has bit #8245 set. That is why the count is 2, and that is why retrying doesn't help - it reads the same bitmap and llog data again and again. I will think about how to inject such corruption for test purposes and will add the modified llog_reader as well. |
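The extra check the modified llog_reader performs can be sketched roughly as follows (a simplified standalone example, not the real llog_reader source; the helper names are hypothetical): walk the record indices, report any bit set in the header bitmap without a matching record in the file body, and compare the live-record count against the header's count:

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_IDX 8246                      /* enough indices for this llog */

static int bit_is_set(const uint32_t *bitmap, int idx)
{
    return (bitmap[idx / 32] >> (idx % 32)) & 1;
}

static void set_bit(uint32_t *bitmap, int idx)
{
    bitmap[idx / 32] |= 1u << (idx % 32);
}

/* report bitmap bits with no record behind them, and the live-record count */
static void check_llog(const uint32_t *bitmap, const char *in_file,
                       int last_idx, int header_count)
{
    int found = 0;

    /* index 0 is the llog header itself; data records start at index 1 */
    for (int idx = 1; idx <= last_idx + 1; idx++) {
        if (bit_is_set(bitmap, idx) && !in_file[idx])
            printf("in bitmap: rec #%d is set!\n", idx);
        if (in_file[idx])
            found++;
    }
    if (found != header_count)
        printf("llog has %d records but header count is %d\n",
               found, header_count);
}

int main(void)
{
    static uint32_t bitmap[MAX_IDX / 32 + 1];
    static char in_file[MAX_IDX + 1];

    set_bit(bitmap, 7707); in_file[7707] = 1;   /* record present in file */
    set_bit(bitmap, 8245);                      /* bit set, record missing */

    check_llog(bitmap, in_file, 8244, 2);       /* header claims 2 records */
    return 0;
}
```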
| Comment by Gerrit Updater [ 12/Jul/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47934 |
| Comment by Gerrit Updater [ 18/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47698/ |
| Comment by Gerrit Updater [ 03/Aug/22 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48112 |
| Comment by Gerrit Updater [ 08/Aug/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47934/ |
| Comment by Gerrit Updater [ 22/Aug/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48286 |
| Comment by Gerrit Updater [ 23/Aug/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48310 |
| Comment by Gerrit Updater [ 25/Aug/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48341 |
| Comment by Gerrit Updater [ 22/Nov/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48112/ |
| Comment by Peter Jones [ 08/Dec/22 ] |
|
All patches seem to have merged for 2.16 |
| Comment by Gerrit Updater [ 05/Jun/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51217 |
| Comment by Gerrit Updater [ 05/Jun/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51218 |
| Comment by Gerrit Updater [ 05/Jun/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51220 |