[LU-10888] 'lctl abort_recovery' allow aborting recovery between MDTs Created: 09/Apr/18  Updated: 25/Nov/19  Due: 09/Jul/19  Resolved: 15/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: Lai Siyao Assignee: Hongchao Zhang
Resolution: Not a Bug Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11111 crash doing LFSCK: orph_index_insert(... Resolved
is related to LU-11419 lfsck does not complete phase2 Resolved
is related to LU-12546 add option to abort recovery between ... Resolved
Epic/Theme: DNE, DNE2, dne
Rank (Obsolete): 9223372036854775807

 Description   

'lctl abort_recovery' does not abort recovery between MDTs. On a single-MDT system, aborting recovery only fails unfinished operations, but aborting recovery between MDTs may break system consistency, so as a tradeoff Lustre chose consistency over availability. There are two major reasons why recovery between MDTs may fail to finish. The first is a network issue; in this case we can wait indefinitely for the network to recover. The second is a software bug, which is difficult for a user to fix manually on the backend filesystem.

Now that lfsck is ready and can fix inconsistencies in the system, we should provide an option that allows the user to abort recovery between MDTs and then fix the resulting inconsistencies.
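
For illustration only, the workflow requested here (check recovery state, abort recovery, then repair with lfsck) might look roughly like the sketch below. It assumes a release in which abort_recovery also covers recovery between MDTs, which is not the case on the affected versions listed in this ticket. The target name lustre-MDT0000, the DRY_RUN variable, and the abort_and_repair helper are hypothetical; by default the script only prints the lctl commands instead of running them.

```shell
#!/bin/sh
# Sketch only, not a tested procedure. With DRY_RUN=1 (the default) each
# command is printed rather than executed, so this is safe to read through
# on a machine without Lustre.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"          # dry run: show the command line
    else
        "$@"               # real run: execute it on the MDS
    fi
}

# abort_and_repair is a hypothetical helper; $1 is the MDT target name.
abort_and_repair() {
    mdt="${1:-lustre-MDT0000}"   # hypothetical default target name

    # 1. Inspect the recovery state of the target.
    run lctl get_param "mdt.${mdt}.recovery_status"

    # 2. Abort recovery on the target.
    run lctl --device "$mdt" abort_recovery

    # 3. Start a namespace LFSCK on all MDTs to repair inconsistencies
    #    that the aborted recovery may have left behind.
    run lctl lfsck_start -M "$mdt" -t namespace -A
}

abort_and_repair "${MDT:-lustre-MDT0000}"
```

Running it with DRY_RUN=0 on the MDS would execute the commands for real; whether step 2 actually interrupts recovery between MDTs depends on the Lustre version in use.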



 Comments   
Comment by Patrick Farrell (Inactive) [ 04/Feb/19 ]

I think it's also possible for this lack of MDS-MDS abort_recovery to cause hangs in certain situations. During some testing at Cray, we had an MDS LBUG that happened during replay on MDS restart, so we tried abort_recovery. That hung in some complex scenario related to cross-MDT communication. We weren't aware at the time that cross-MDT ops weren't handled by abort_recovery, but it seems likely to be related.

Comment by Hongchao Zhang [ 15/Jul/19 ]

abort_recovery has been enabled between MDTs. The new requirement, aborting
recovery between MDTs without aborting recovery between client and MDT, will be
implemented in a new ticket.

Generated at Sat Feb 10 02:39:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.