[LU-15001] improve recovery of interrupted directory migrate Created: 11/Sep/21  Updated: 04/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13492 lfs migrate -m returns Operation not ... Open
is related to LU-14975 DNE3: directory migration in non-recu... Resolved
is related to LU-14719 "lfs migrate -m" creates broken agent... Resolved
is related to LU-14211 DNE3: mechanism to interrupt and resu... Open
is related to LU-11776 add "lfs find" support for directory ... Resolved
is related to LU-15990 "lfs find" to scan for directory hash... Resolved

 Description   

Currently, if a directory migration has been interrupted (e.g. by an MDS crash), nothing is done to resume the migration operation. This is a precaution, in case the migration operation itself caused the problem.

However, to resume/finish the migration, the user must specify exactly the same options to "lfs migrate -m" as in the original request. It would be much more convenient if the migration could be resumed without having to specify the same options again, and the MDS would just "know" these options when asked to migrate the directory again. That simplifies handling for the user, and (AFAICS) does not add any risk, since the migration request from the user will fail if the same options are not specified.
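
A minimal sketch of the proposed behaviour, written in Python purely for illustration (not Lustre code). The names `MigrateOpts`, `MigrateError` and `resolve_migrate_opts` are hypothetical, as is the assumption that the options of an interrupted migration are available to the MDS as a simple record:

```python
from dataclasses import dataclass
from typing import Optional


class MigrateError(Exception):
    """Hypothetical error type for a rejected migrate request."""


@dataclass(frozen=True)
class MigrateOpts:
    """Hypothetical stand-in for the options given to "lfs migrate -m"."""
    mdt_index: int
    stripe_count: int = 1


def resolve_migrate_opts(saved: Optional[MigrateOpts],
                         requested: Optional[MigrateOpts]) -> MigrateOpts:
    """Pick the options to use for a migrate request on one directory.

    saved:     options recorded for an interrupted migration, or None
    requested: options given by the user, or None if none were given
    """
    if saved is None:
        # No interrupted migration: the user has to say where to migrate.
        if requested is None:
            raise MigrateError("no migration options given")
        return requested
    if requested is None:
        # Proposed convenience: resume with the options already recorded.
        return saved
    if requested != saved:
        # Unchanged from today: mismatched options do not resume.
        raise MigrateError("options differ from the interrupted migration")
    return requested
```

Only the middle case is new in this sketch. A request with mismatching options is still rejected, which is why the change should not add risk.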



 Comments   
Comment by Andreas Dilger [ 05/Oct/21 ]

Is there any way to stop migration of a directory tree once it has started? Currently, migration runs as a thread on the MDS, but I'm not aware of any way to safely stop it once it has been started.

It would also be desirable (possibly together with LU-14975) to allow completing the interrupted migration on a single directory without continuing to migrate the rest of the directory tree.

In several cases I've seen, users want to "repair" the partially-migrated directory, but not continue to migrate the rest of the directory tree (often because the initial migration parameters are bad and they want to change/stop the migration).

Comment by Lai Siyao [ 08/Oct/21 ]

Directory migration with the "lfs migrate -m" command migrates sub-files one by one from the client, which is different from directory restripe and directory auto-split.

Your proposal looks reasonable, but how should the partially migrated directory be "repaired"? Add an option to "lfs migrate -m", or let the server continue the migration of the rest of the directory? The latter may fail, because the error that caused the original failure may not have been fixed yet.

Comment by Andreas Dilger [ 10/Oct/21 ]

If letting the MDS continue the migration for that directory is easier, then that is fine with me. If it doesn't happen automatically, then there is less concern if there is a problem with the migration itself. I think in most cases of interrupted migration it is because the MDS rebooted for another reason, especially because recursive migrate of a large directory tree (possibly the whole filesystem) may take a very long time, and there is (AFAIK) no easy way to monitor if it is finished or how close it is to being finished.

I think the most important part is to make it easy to finish the migration (not require users to specify the options that the MDS already knows).

Separately, it probably makes sense for the migration to stop itself when it is close to doing something bad (e.g. target MDT is almost full), and to refuse "bad" migration requests (e.g. migrating a whole directory tree to be striped), but those should be separate patches.
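
As a rough sketch of what such checks might look like (Python for illustration only, not Lustre code): the function name `refuse_reason`, its parameters, and the 5% free-space threshold are all assumptions made for this example.

```python
from typing import Optional


def refuse_reason(target_free_fraction: float,
                  recursive: bool,
                  stripe_count: int,
                  min_free_fraction: float = 0.05) -> Optional[str]:
    """Return why a migrate request should be refused, or None to allow it.

    target_free_fraction: free space on the target MDT, from 0.0 to 1.0
    min_free_fraction:    hypothetical threshold for "almost full"
    """
    if target_free_fraction < min_free_fraction:
        return "target MDT is almost full"
    if recursive and stripe_count > 1:
        return "refusing to stripe every directory in a tree"
    return None
```

The same free-space check could be repeated while a migration is running, so that it stops itself before filling the target MDT.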

Generated at Sat Feb 10 03:14:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.