[LU-16336] LFSCK should fix inconsistencies caused by recovery abort Created: 23/Nov/22  Updated: 30/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Lai Siyao Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14470 striped directory layout mismatch aft... Open
is related to LU-10616 replay-single test_70b fails with 'ru... Open
is related to LU-15624 replay-single and ost-pools failed: r... Open
is related to LU-16065 replay-single test_81a: rm remote dir... Open
is related to LU-16159 remove update logs after recovery abort Reopened
is related to LU-16335 "lfs rm_entry" failed to remove broke... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In LU-16159, the update logs are canceled upon recovery abort, which will cause inconsistencies in the filesystem. LFSCK should be able to fix these inconsistencies.

This is visible in tests like replay-single test_70b that sometimes leave an undeletable directory behind after test completion (LU-10616). There are various workarounds (e.g. LU-16335 to use "lfs rm_entry" to unlink the directory from the namespace, or EX-6692 to reformat the filesystem), but it would be much better to have LFSCK fix these directories and/or allow them to actually be unlinked from the filesystem.



 Comments   
Comment by Andreas Dilger [ 30/Mar/23 ]

We've seen that the rm_entry workaround to "hide" the bad entry is only temporary, and running LFSCK on the filesystem will restore the broken entry back to .lustre/lost+found/<fsname>-MDT0000 where it will again be undeletable.

We need either to be able to delete such a directory with missing stripes using "rmdir", if we are sure the MDT is available but the stripe is missing, or to have LFSCK fix the missing stripe in the directory so that it can be removed normally.

Is it possible that patch https://review.whamcloud.com/47385 "LU-14470 dne: striped mkdir replay by request" will avoid such recovery failures by allowing the client to recover the broken directory even when the MDT recovery is aborted?

Comment by Lai Siyao [ 30/Mar/23 ]

Yes, LU-14470 can help with create failures. Beyond that, we need to consider replay of other distributed transactions as well, e.g. migration and restripe. Besides, if client replay is aborted as well, it may still leave dangling name entries.

I haven't tested this yet, but IMHO LFSCK won't simply move dangling name entries to lost+found.
