[LU-15001] improve recovery of interrupted directory migrate Created: 11/Sep/21 Updated: 04/Jul/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Description |
|
Currently, if a directory migration is interrupted (e.g. by an MDS crash), nothing is done to resume it. This is a precaution, in case something in the migration operation itself caused the problem. However, to resume and finish the migration, the user must specify exactly the same options to "lfs migrate -m" as in the original request. It would be much more convenient if the migration could be resumed without repeating those options, with the MDS simply "knowing" them when asked to migrate the directory again. That simplifies handling for the user and (AFAICS) does not add any risk, since a request from the user that specifies different options will fail. |
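To make the proposed behaviour concrete, here is a minimal sketch of the option-resolution rule described above. It is only a model in Python, not Lustre code: `MigrateOptions`, `MigrationMismatch`, `resolve_options` and the two example fields are hypothetical names chosen for illustration, and how the MDS actually records the options of an in-progress migration is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MigrateOptions:
    """Hypothetical stand-in for the options given to a directory migration."""
    mdt_index: int
    stripe_count: int = 1


class MigrationMismatch(Exception):
    """Raised when a resume request conflicts with the recorded options."""


def resolve_options(stored: Optional[MigrateOptions],
                    requested: Optional[MigrateOptions]) -> MigrateOptions:
    """Choose the options for a new or resumed migration.

    stored    -- options recorded for an interrupted migration, or None
    requested -- options given by the user, or None if omitted
    """
    if stored is None:
        # No interrupted migration: the user must say what to do.
        if requested is None:
            raise ValueError("no migration in progress and no options given")
        return requested
    if requested is None:
        # Proposed behaviour: resume with the options already recorded.
        return stored
    if requested != stored:
        # Unchanged behaviour: conflicting options are rejected.
        raise MigrationMismatch(f"expected {stored}, got {requested}")
    return stored
```

Under this rule the only new case is a resume request with no options, which falls back to the recorded ones; a request with conflicting options is rejected exactly as before.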
| Comments |
| Comment by Andreas Dilger [ 05/Oct/21 ] |
|
Is there any way to stop the migration of a directory tree once it has started? Currently, migration runs as a thread on the MDS, but I'm not aware of any way to safely stop it once it has been started. It would also be desirable (possibly together with [...]). In several cases I've seen, users want to "repair" the partially-migrated directory, but not continue migrating the rest of the directory tree (often because the initial migration parameters were bad and they want to change or stop the migration). |
| Comment by Lai Siyao [ 08/Oct/21 ] |
|
Directory migration via the "lfs migrate -m" command migrates sub-files one by one from the client, which differs from directory restripe and directory auto-split. Your proposal looks reasonable, but how should the partially migrated directory be "repaired"? Add an option to "lfs migrate -m", or let the server continue migrating the rest of the directory? The latter may fail because the error that caused the original failure may not have been fixed yet. |
| Comment by Andreas Dilger [ 10/Oct/21 ] |
|
If letting the MDS continue the migration for that directory is easier, then that is fine with me. If it doesn't happen automatically, then there is less concern if there is a problem with the migration itself. I think most interrupted migrations happen because the MDS rebooted for an unrelated reason, especially since a recursive migrate of a large directory tree (possibly the whole filesystem) may take a very long time, and there is (AFAIK) no easy way to monitor whether it has finished or how close it is to finishing. I think the most important part is to make it easy to finish the migration (without requiring users to specify options that the MDS already knows). Separately, it probably makes sense for the migration to stop itself when it is close to doing something bad (e.g. the target MDT is almost full), and to refuse "bad" migration requests (e.g. migrating a whole directory tree to be striped), but those should be separate patches. |