[LU-15938] MDT recovery did not finish due to corrupt llog record Created: 12/Jun/22  Updated: 05/Jun/23  Resolved: 08/Dec/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Attachments: HTML File 2022-06-12_15-24-23__DDN-3093_shalustre-lfs02-n26_mdt0019_mdt0003_update_log    
Issue Links:
Related
is related to LU-15937 lctl llog commands do not work for DN... Open
is related to LU-15761 cannot finish MDS recovery Resolved
is related to LU-15645 gap in recovery llog should not be a ... Resolved
is related to LU-15934 client refused mount with -EAGAIN bec... Resolved
is related to LU-15139 sanity test_160h: dt_record_write() A... Resolved
is related to LU-16203 zero records and empty plain llogs in... Resolved
is related to LU-16052 conf-sanity test_106: crash after osp... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A broken DNE recovery llog record was preventing MDT-MDT recovery from completing. MDT0003 was permanently unable to finish recovery with MDT0019, looping on:

llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process

There was a bad record in the llog file; recovery would process the llog (all but one other record had already been successfully cancelled), hit the bad record and abort, and then retry.

Since the DNE recovery llog for MDT0003 is stored on MDT0019, this necessitated "fixing" the llog file on MDT0019 by truncating it to zero bytes, which allowed MDT0003 recovery to finish.

Retrying recovery can be useful in some cases, such as when the remote MDT is inaccessible, but if there is a single bad record it makes sense to retry only once (in case the llog was in the middle of being written) and then cancel that record and continue with the rest of recovery, or at worst abort recovery with that MDT and cancel the whole llog file. Otherwise, this situation needs manual intervention, which can do no better than cancelling the llog record (pending LU-15937) or deleting the whole llog file. A rough sketch of the bounded-retry idea is shown below.
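
Hypothetical sketch of the bounded-retry idea described above (not the actual Lustre code): retry a failed remote llog read a limited number of times, in case the record was still being written, then let the caller cancel the offending record or abort recovery with that MDT instead of looping forever. LLOG_MAX_RETRIES and process_one() are illustrative names, not Lustre symbols.

#include <errno.h>

/* Illustrative only: a limited retry budget for remote llog processing. */
#define LLOG_MAX_RETRIES 1

static int process_llog_with_retry(int (*process_one)(void *llog), void *llog)
{
	int retries = 0;
	int rc;

	do {
		rc = process_one(llog);
		if (rc != -EAGAIN)
			break;		/* success or a hard error */
	} while (retries++ < LLOG_MAX_RETRIES);

	/* if still -EAGAIN here, the retry budget is exhausted: cancel the
	 * bad record or abort recovery with this MDT instead of retrying */
	return rc;
}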



 Comments   
Comment by Andreas Dilger [ 12/Jun/22 ]

This is similar to LU-15761 and LU-15645, but apparently the llog error is different enough that those patches did not allow the recovery to complete automatically.

Comment by Alex Zhuravlev [ 12/Jun/22 ]

where can I find that corrupted llog for analysis?

Comment by Dmitry Vensko [ 12/Jun/22 ]

attached the requested log

Comment by Peter Jones [ 13/Jun/22 ]

Mike

Could you please investigate

Thanks

Peter

Comment by Andreas Dilger [ 14/Jun/22 ]

The problematic llog file decoded by llog_reader from master shows:

rec #7707 type=106a0000 len=1160 offset 8670472
Header size : 32768      llh_size : 496
Time : Thu Apr  7 10:52:40 2022
Number of records: 2    cat_idx: 9      last_idx: 8244
Target uuid : 
-----------------------
#7707 (1160)updatelog record master_transno:45507493823 batchid:37129103921 flags:0x0 u_index:0 u_count:11 p_count:18
        [0x1840003b3b:0x2fce:0x0] type:create/1 params:2 p_0:0 p_1:1 
        [0x1840003b3b:0x2fce:0x0] type:ref_add/3 params:0 
        [0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4 
        [0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4 
        [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8 
        [0x1280002b0e:0x1be36:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10 
        [0x1280002b0e:0x1be36:0x0] type:ref_add/3 params:0 
        [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8 
        [0x1280002b0e:0x1be36:0x0] type:attr_set/5 params:1 p_0:13 
        [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8 
        [0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17 
        p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000005ob\x00000000005ob\x00000000005ob\x0000
        p_1 - 16/\x0E+\x0080120000006\xBE0100000000
        p_2 - 2/.
        p_3 - 0/
        p_4 - 0/
        p_5 - 16384/\x0000000000000400jb\x0000000000@\x0000000000000300jb\x00000000..\x0000000000000C00000000000000trusted.dmv\x00d\x0000000\x00
        p_6 - 0/
        p_7 - 0/
        p_8 - 0/
        p_9 - 0/
        p_10 - 0/
        p_11 - 0/
        p_12 - 0/
        p_13 - 0/
        p_14 - 0/
        p_15 - 0/
        p_16 - 0/
        p_17 - 0/

#8244 (1160)NOT SET updatelog record master_transno:45508195003 batchid:37129117505 flags:0x0 u_index:0 u_count:11 p_count:18
        [0x18400034de:0x6863:0x0] type:create/1 params:2 p_0:0 p_1:1 
        [0x18400034de:0x6863:0x0] type:ref_add/3 params:0 
        [0x18400034de:0x6863:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4 
        [0x18400034de:0x6863:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4 
        [0x18400034de:0x6863:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8 
        [0x128000369b:0x174b8:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10 
        [0x128000369b:0x174b8:0x0] type:ref_add/3 params:0 
        [0x18400034de:0x6863:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8 
        [0x128000369b:0x174b8:0x0] type:attr_set/5 params:1 p_0:13 
        [0x18400034de:0x6863:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8 
        [0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17 
        p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000H5ob\x00000000H5ob\x00000000H5ob\x000000
        p_1 - 16/\x9B6\x008012000000B8t\x0100000000
        p_2 - 2/.

The recovery from the source MDT reported an error when processing record 8245.

Comment by Mikhail Pershin [ 14/Jun/22 ]

Yes, I've found the same. The log itself is not corrupted: all records are processed by llog_reader without errors, and the llog size also fits the data in the llog. What is not OK so far is "Number of records: 2". At first I thought that was OK because one record is the llog header, but in fact llog_reader handles that and decrements the count. So there is a mismatch: the llog reports 2 live records but has only 1. I am not sure yet whether that is a problem or not.

I am checking the code to see how -5 could be returned. My idea is that we might have read past the llog size somehow.

Other observation is that used server version misses the following patches:

LU-13974 llog: check stale osp object
LU-15645 obdclass: llog to handle gaps
LU-15761 obdclass: fix locking in llog_cat_refresh()

It does not look like this is a direct result of any of them, but it would still be better to update.

Comment by Mikhail Pershin [ 14/Jun/22 ]

Just an addition to the comment above: an lgh_count mismatch may be caused by undo operations on the llog if a pad record was involved, so that might indicate there were undo operations on this llog.

Comment by Andreas Dilger [ 14/Jun/22 ]

The debug logs show the MDT is getting -EIO when trying to read the next llog block, maybe because the llog is short, or it thinks it is short because of the number of records?

00000004:00000001:1.0:1654980511.161770:0:11204:0:(osp_md_object.c:1345:osp_md_read()) Process leaving (rc=4400 : 4400 : 1130)
00000040:00000001:1.0:1654980511.161772:0:11204:0:(llog_osd.c:1089:llog_osd_next_block()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
00000040:00000001:1.0:1654980511.161774:0:11204:0:(lustre_log.h:465:llog_next_block()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00000040:00000001:1.0:1654980511.161775:0:11204:0:(llog.c:572:llog_process_thread()) Process leaving via out (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
00000040:00080000:1.0:1654980511.161777:0:11204:0:(llog.c:761:llog_process_thread()) stop processing plain 0x2:2147506325:0 index 8245 count 3
00000040:00020000:1.0:1654980511.161778:0:11204:0:(llog.c:774:llog_process_thread()) lfs02-MDT0019-osp-MDT0003 retry remote llog process
00000040:00000001:1.0:1654980511.161781:0:11204:0:(llog.c:897:llog_process_or_fork()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
00000040:00000001:1.0:1654980511.161782:0:11204:0:(llog_cat.c:904:llog_cat_process_cb()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
00000040:00000001:1.0:1654980511.161784:0:11204:0:(llog.c:742:llog_process_thread()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
00000040:00080000:1.0:1654980511.161785:0:11204:0:(llog.c:761:llog_process_thread()) stop processing catalog 0x1:2147484688:0 index 9 count 12
00000040:00000001:1.0:1654980511.161790:0:11204:0:(llog.c:897:llog_process_or_fork()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
00000040:00000001:1.0:1654980511.161791:0:11204:0:(llog_cat.c:966:llog_cat_process_or_fork()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
Comment by Mikhail Pershin [ 15/Jun/22 ]

So far it looks like MDT0003 has record #8245 in its local llog bitmap and is trying to read that record from MDT0019. Meanwhile MDT0019 has it neither in the bitmap nor in the file; the file ends exactly at record #8244. That means the write request for #8245 was sent from MDT0003 to MDT0019 (which is why MDT0003 has it in its bitmap) but for some reason was never written on MDT0019, while concurrent llog processing is trying to read it.

As for the loop in the code, llog_osd_next_block() reads the llog up to its end and finds last_idx #8244; since that is less than the expected #8245, it reads again, but the cycle exits because the current offset equals the llog size at that point, causing the -5 error. I am not yet sure what happens next: if the caller kept retrying the read, the needed update should arrive at some point and the read would succeed, but we don't see that. So either the write is lost completely and never happens, or the caller doesn't retry that read properly. A simplified illustration of the short-read condition follows.
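
Simplified illustration (not the actual llog_osd_next_block() code) of the condition described above: the reader expects index 8245, but the llog file ends right after record 8244, so the next read starts at an offset equal to the llog size and fails with -EIO (-5) instead of finding the record, and the caller retries. Function and parameter names are invented for the sketch.

#include <errno.h>
#include <sys/types.h>

/* Returns -EIO when the wanted index lies beyond the end of the llog file. */
static int next_block_short_read(off_t cur_offset, off_t llog_size,
				 int last_idx, int next_idx)
{
	if (last_idx < next_idx && cur_offset >= llog_size)
		return -EIO;	/* the rc = -5 seen in the debug log */

	return 0;		/* otherwise there is more data to read */
}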

Comment by Mikhail Pershin [ 15/Jun/22 ]

https://review.whamcloud.com/47003 - could solve the case of a read arriving before the related write.
It is still unclear why (or whether) the write could be lost at all, which would leave the read retrying in a loop forever.

So, at the moment the first problem to address is to prevent such an llog processing loop if the needed update is (presumably) lost; at the very least we can't wait for it forever. We could probably check whether there is a related write in flight, or something similar.

The second problem to address is the abort_recovery_mdt option: when used, it skips update recovery but doesn't remove the related update logs, so if we have an llog corruption situation that needs to be solved, abort_recovery_mdt doesn't resolve it and every server remount gets stuck on the same issue. I think if we have decided to abort/skip update recovery there is no sense in keeping the update logs and they should be truncated. This is to be discussed; probably I am missing some cases where they must stay.

 

Comment by Mikhail Pershin [ 15/Jun/22 ]

Speaking of abort_recovery_mdt: in lod_sub_recovery_thread() the cycle breaks only when abort_recovery is set, so abort_recovery_mdt doesn't make things better but also doesn't stop that thread. That could explain why setting abort_recovery_mdt doesn't help with stopping a stuck MDT-MDT recovery. A hedged sketch of the loop-exit check follows.
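
Hedged sketch of the loop-exit check discussed above; the flag names mirror the obd_device fields in the Lustre sources, but this is not a verbatim excerpt. Before the fix the retry loop in lod_sub_recovery_thread() only honoured obd_abort_recovery, so setting abort_recovery_mdt did not stop the thread; the idea of https://review.whamcloud.com/47698 is to also honour the MDT-specific flag.

#include <stdbool.h>

/* Simplified stand-in for the relevant obd_device flags. */
struct obd_flags {
	bool obd_abort_recovery;	/* abort_recovery was set */
	bool obd_abort_recov_mdt;	/* abort_recovery_mdt was set */
};

static bool should_stop_update_recovery(const struct obd_flags *obd)
{
	/* old behaviour: only obd_abort_recovery broke the loop
	 * return obd->obd_abort_recovery; */

	/* proposed behaviour: abort_recovery_mdt stops the thread too */
	return obd->obd_abort_recovery || obd->obd_abort_recov_mdt;
}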

Comment by Andreas Dilger [ 15/Jun/22 ]

Ok, so two problems then. The llog is permanently short for whatever reason (possibly old non-atomic writes?), so retrying forever will never fix it. Having a few retries might be OK in case the write is in progress. Maybe we need a separate error code for a short read?

Secondly, abort_recov_mdt should definitely abort the MDT recovery. It also makes sense for it to clear, reset, or delete the recovery logs so that they are not re-used on the next recovery. If recovery is already aborted and the filesystem is allowed to be used, then a later replay of old recovery logs may actually be harmful. It also means that any problems in the recovery logs that are not "self healing" will be hit on every restart.

Comment by Mikhail Pershin [ 15/Jun/22 ]

Yes, the same thoughts here. I want to figure out first why the write was lost; if it wasn't committed, I'd expect it to be replayed anyway. So probably the retry is not done correctly, e.g. the record id is wrong or something else, or it could be the result of an abort_recovery_mdt attempt, which skips update recovery, i.e. updates are not re-applied and then it is stuck retrying for missing writes.

Anyway, whatever the reason, I think the retry mechanism should presume that the write might be lost and stop at some point. I think that is already covered by Alex's patch https://review.whamcloud.com/#/c/47003/; I need to discuss it with him. It looks to me like with that patch there might be no need for retry at all, at least if we are sure the read error is returned from the other server and is not a network error.

Comment by Gerrit Updater [ 22/Jun/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47698
Subject: LU-15938 lod: prevent endless retry in recovery thread
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c0135579c350a897313b4ff74a1cc7aea200ec7

Comment by Mikhail Pershin [ 22/Jun/22 ]

This is an initial approach that resolves the endless retry loop we've observed due to remote llog short-read handling. The patch also makes the obd_abort_recovery_mdt option abort the update recovery threads as well. That should prevent the known endless retry cases and make manual intervention possible via the abort_recovery_mdt parameter if update recovery gets stuck due to network problems.

As noted above, more work is needed to remove the update llogs when abort_recovery_mdt is set, and it is worth thinking about a limit on the number of retries when the remote server is not accessible. So far I have no idea what to use as a basis - the recovery hard (or soft) timeout value, maybe?

Comment by Mikhail Pershin [ 06/Jul/22 ]

OK, with more analysis, I checked llog_reader and found out it doesn't report all inconsistencies. A modified version showed the problem in update_log:

# llog_reader update_log 
rec #7707 type=106a0000 len=1160 offset 8670472
in bitmap: rec #8245 is set!
llog has 1 records but header count is 2
Header size : 32768     llh_size : 496
Time : Thu Apr  7 12:52:40 2022
Number of records: 2    cat_idx: 9    last_idx: 8244
Target uuid : 
-----------------------
#7707 (1160) id: 0 updatelog record master_transno:45507493823 batchid:37129103921 flags:0x0 u_index:0 u_count:11 p_count:18
    [0x1840003b3b:0x2fce:0x0] type:create/1 params:2 p_0:0 p_1:1 
    [0x1840003b3b:0x2fce:0x0] type:ref_add/3 params:0 
    [0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:2 p_1:3 p_2:4 
    [0x1840003b3b:0x2fce:0x0] type:insert/10 params:3 p_0:5 p_1:1 p_2:4 
    [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:6 p_1:7 p_2:8 
    [0x1280002b0e:0x1be36:0x0] type:insert/10 params:3 p_0:9 p_1:3 p_2:10 
    [0x1280002b0e:0x1be36:0x0] type:ref_add/3 params:0 
    [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:11 p_1:12 p_2:8 
    [0x1280002b0e:0x1be36:0x0] type:attr_set/5 params:1 p_0:13 
    [0x1840003b3b:0x2fce:0x0] type:xattr_set/7 params:3 p_0:14 p_1:15 p_2:8 
    [0x200000001:0x15:0x0] type:write/12 params:2 p_0:16 p_1:17 
    p_0 - 208/\x8E070000000000000000000000000000000000000000000000000000000000000000000000000000005ob\x00000000005ob\x00000000005ob\x0000
    p_1 - 16/\x0E+\x0080120000006\xBE0100000000
    p_2 - 2/.
    p_3 - 0/
    p_4 - 0/
    p_5 - 16384/\x0000000000000400jb\x0000000000@\x0000000000000300jb\x00000000..\x0000000000000C00000000000000trusted.dmv\x00d\x0000000\x00
    p_6 - 0/
    p_7 - 0/
    p_8 - 0/
    p_9 - 0/
    p_10 - 0/
    p_11 - 0/
    p_12 - 0/
    p_13 - 0/
    p_14 - 0/
    p_15 - 0/
    p_16 - 0/
    p_17 - 0/

This is the correct output: the llog has only one record, #7707, but its bitmap also has bit #8245 set. That is why the count is 2, and that is why retrying doesn't help: it reads the same bitmap and llog data again and again. I will think about how to inject such corruption for test purposes and will add the modified llog_reader as well. A rough sketch of the kind of consistency check involved is shown below.
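
Minimal sketch of the kind of consistency check the modified llog_reader performs (names and layout simplified, not the actual tool code): count the bits set in the header bitmap and compare against both the number of records actually decoded from the file and the header's record count. For the attached log this would report bit #8245 set with no matching record, and a header count of 2 vs. 1 real record.

#include <stdio.h>
#include <stdint.h>

/* Compare the bitmap population and header count against the records found. */
static int check_llog_counts(const uint32_t *bitmap, int bitmap_words,
			     int records_found, int llh_count)
{
	int bits = 0;

	for (int i = 0; i < bitmap_words; i++)
		bits += __builtin_popcount(bitmap[i]);

	bits -= 1;	/* bit 0 marks the llog header itself */

	if (bits != records_found)
		printf("llog has %d records but bitmap has %d bits set\n",
		       records_found, bits);
	if (llh_count - 1 != records_found)
		printf("llog has %d records but header count is %d\n",
		       records_found, llh_count - 1);

	return (bits == records_found && llh_count - 1 == records_found) ? 0 : -1;
}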

Comment by Gerrit Updater [ 12/Jul/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47934
Subject: LU-15938 llog: llog_reader to detect more corruptions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6f90aa1b8234121d0fc03ce12f98268ff3fcd29

Comment by Gerrit Updater [ 18/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47698/
Subject: LU-15938 lod: prevent endless retry in recovery thread
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1a24dcdce121787428ea820561cfa16ae24bdf82

Comment by Gerrit Updater [ 03/Aug/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48112
Subject: LU-15938 llog: more checks in llog_reader
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4b619366a07813394cfb7abf1d79bb9512605401

Comment by Gerrit Updater [ 08/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47934/
Subject: LU-15938 llog: llog_reader to detect more corruptions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d914a5b7a49ac6b61c0191a0966d1f684a6957b6

Comment by Gerrit Updater [ 22/Aug/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48286
Subject: LU-15938 lod: prevent endless retry in recovery thread
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7e2533728b5d574ec8638742cbfe574580c0a063

Comment by Gerrit Updater [ 23/Aug/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48310
Subject: LU-15938 llog: llog_reader to detect more corruptions
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 02ea0e325eabc57d95051e79ffe1cc87c2243ced

Comment by Gerrit Updater [ 25/Aug/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48341
Subject: LU-15938 llog: Fix chunk re-read case in llog_process_thread
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a7015dccd3e960516c95510663626f075191d4bd

Comment by Gerrit Updater [ 22/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48112/
Subject: LU-15938 llog: more checks in llog_reader
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 386ffcdbb4c9b89f798de4c83a51a3f020542c8b

Comment by Peter Jones [ 08/Dec/22 ]

All patches seem to have been merged for 2.16.

Comment by Gerrit Updater [ 05/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51217
Subject: LU-15938 lod: prevent endless retry in recovery thread
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: d66b517c9207dae3dd6da75266e78e50dfbc3f93

Comment by Gerrit Updater [ 05/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51218
Subject: LU-15938 llog: llog_reader to detect more corruptions
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: e43bf0086e8f80f128cf868b5dca6079872f6a62

Comment by Gerrit Updater [ 05/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51220
Subject: LU-15938 llog: more checks in llog_reader
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: a0d25b76f6d41e164536a1c1cd46d503338643e7
