improve recovery of interrupted directory migrate

Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      Currently, if a directory migration is interrupted (e.g. by an MDS crash), nothing is done to resume it. This is precautionary, in case something in the migration operation itself caused the problem.

      However, to resume/finish the migration, the user must specify exactly the same options for "lfs migrate -m" as in the original request. It would be much more convenient if the migration could be resumed without having to specify those options again, and the MDS would just "know" them if asked to migrate the directory again. That simplifies handling for the user, and (AFAICS) does not add any risk, since a migration request that does not specify the same options fails anyway.
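
      The requested behavior can be sketched as a toy model. This is not Lustre code: ToyMDS, MigrateOpts, and their fields are hypothetical names made up for illustration, and the real options accepted by "lfs migrate -m" are not modeled in full. It only shows the rule proposed above: a resume request with no options reuses the saved ones, and a request with different options is still refused.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MigrateOpts:
    """Hypothetical stand-in for the options given to `lfs migrate -m`."""
    mdt_index: int
    stripe_count: int = 1


class ToyMDS:
    """Toy model: remembers the options of each unfinished migration."""

    def __init__(self) -> None:
        # Directory path -> options of the migration still in progress.
        self.pending: Dict[str, MigrateOpts] = {}

    def migrate(self, path: str, opts: Optional[MigrateOpts] = None) -> MigrateOpts:
        """Start a migration, or resume an interrupted one on `path`."""
        saved = self.pending.get(path)
        if saved is None:
            if opts is None:
                raise ValueError("no interrupted migration and no options given")
            self.pending[path] = opts  # kept until finish() is called
            return opts
        # Interrupted migration exists: omitted options mean "use the saved ones".
        if opts is not None and opts != saved:
            raise ValueError("options differ from the interrupted migration")
        return saved

    def finish(self, path: str) -> None:
        """Mark the migration of `path` as complete and forget its options."""
        self.pending.pop(path, None)
```

      In this model, calling migrate("/dir") with no options after an interruption returns the options from the original request, so the user does not have to repeat them.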

          Activity

            [LU-15001] improve recovery of interrupted directory migrate

            adilger Andreas Dilger added a comment -

            If letting the MDS continue the migration for that directory is easier, then that is fine with me. If it doesn't happen automatically, then there is less concern if there is a problem with the migration itself. I think in most cases of interrupted migration it is because the MDS rebooted for another reason, especially because recursive migrate of a large directory tree (possibly the whole filesystem) may take a very long time, and there is (AFAIK) no easy way to monitor if it is finished or how close it is to being finished.

            I think the most important part is to make it easy to finish the migration (not require users to specify the options that the MDS already knows).

            Separately, it probably makes sense for the migration to stop itself when it is close to doing something bad (e.g. target MDT is almost full), and refuse to do "bad" migration requests (e.g. migrate a whole directory tree to be striped), but those should be separate patches.

            laisiyao Lai Siyao added a comment -

            Directory migration with the command "lfs migrate -m" migrates sub-files one by one from the client, which is different from directory restripe and directory auto-split.

            Your proposal looks reasonable, but how should the partially migrated directory be "repaired"? Add an option to "lfs migrate -m"? Or let the server continue migrating the rest of the directory? The latter may fail because the error that caused the failure may not have been fixed yet.

            adilger Andreas Dilger added a comment -

            Is there any way to stop migration of a directory tree once it has started? Currently, migration runs as a thread on the MDS, but I'm not aware of any way to safely stop the migration once it has been started.

            It would also be desirable (possibly together with LU-14975) to allow completing the interrupted migration on a single directory, but not continue to migrate the rest of the directory tree.

            In several cases I've seen, users want to "repair" the partially-migrated directory, but not continue to migrate the rest of the directory tree (often because the initial migration parameters are bad and they want to change/stop the migration).

            People

              wc-triage WC Triage
              adilger Andreas Dilger