Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8820

skip LL_IOC_HSM_COPY_START for HSM removes

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.10.0
    • None
    • 3
    • 9223372036854775807

    Description

      Robert has pointed out that there is no need for the copytool to call the LL_IOC_HSM_COPY_START ioctl (which sends a MDS_HSM_PROGRESS) when processing a remove action.

      Attachments

        Issue Links

          Activity

            [LU-8820] skip LL_IOC_HSM_COPY_START for HSM removes
            bougetq Quentin Bouget (Inactive) added a comment - - edited

            > The action performed by the CT for remove is.

            Yes, you are right.

            >I think there's a race that exists with or without this patch. You can cancel a request after the CT has removed the file but before it has reported success. Then how should the request be marked?

            I think it is up to the copytool to choose whether and how to act upon receiving a cancel request.

            So the only "inconsistency" is that the coordinator can report a request as canceled even though it may not be canceled; in which case one can check the changelog records to find out what really happened.

            bougetq Quentin Bouget (Inactive) added a comment - - edited > The action performed by the CT for remove is. Yes, you are right. >I think there's a race that exists with or without this patch. You can cancel a request after the CT has removed the file but before it has reported success. Then how should the request be marked? I think it is up to the copytool to choose whether and how to act upon receiving a cancel request. So the only "inconsistency" is that the coordinator can report a request as canceled even though it may not be canceled; in which case one can check the changelog records to find out what really happened.
            jhammond John Hammond added a comment -

            > I don't get it, "lfs hsm_remove" is not atomic, so why act as if it were?

            The action performed by the CT for remove is.

            > For tools that rely on parsing the changelogs, this can actually be harmful, a cancelled remove would be interpreted as a successful one, wouldn't it?

            I think there's a race that exists with or without this patch. You can cancel a request after the CT has removed the file but before it has reported success. Then how should the request be marked?

            jhammond John Hammond added a comment - > I don't get it, "lfs hsm_remove" is not atomic, so why act as if it were? The action performed by the CT for remove is. > For tools that rely on parsing the changelogs, this can actually be harmful, a cancelled remove would be interpreted as a successful one, wouldn't it? I think there's a race that exists with or without this patch. You can cancel a request after the CT has removed the file but before it has reported success. Then how should the request be marked?

            I don't get it, "lfs hsm_remove" is not atomic, so why act as if it were?

            For tools that rely on parsing the changelogs, this can actually be harmful, a cancelled remove would be interpreted as a successful one, wouldn't it?

            bougetq Quentin Bouget (Inactive) added a comment - I don't get it, "lfs hsm_remove" is not atomic, so why act as if it were? For tools that rely on parsing the changelogs, this can actually be harmful, a cancelled remove would be interpreted as a successful one, wouldn't it?
            jhammond John Hammond added a comment -

            125 is ECANCELED, 0 is success, so this makes sense. I think the new behavior is correct. It's analogous to hitting Ctrl-C just after starting rm vs cp (on a large file). Since rm (unlink()) is atomic, we wouldn't expect the Ctrl-C to prevent the file from being removed. But file copy is not atomic, so we would expect Ctrl-C to do stop the copy.

            jhammond John Hammond added a comment - 125 is ECANCELED, 0 is success, so this makes sense. I think the new behavior is correct. It's analogous to hitting Ctrl-C just after starting rm vs cp (on a large file). Since rm (unlink()) is atomic, we wouldn't expect the Ctrl-C to prevent the file from being removed. But file copy is not atomic, so we would expect Ctrl-C to do stop the copy.

            Hi John,

            I am working on LU-8987 which aims at enabling test_225 (I merely removed the return statement at the start of the test). The modified test used to pass before this patch. It would seem this patch affected the content of the changelog when a remove request gets canceled. I am just wondering if the new behaviour is the "correct" one or if a regression was masked by test_225 being disabled.

            test_225 expects a changelog record with the flag 0x27D, from what I gathered in "include/lustre/lustre_user.h" the 0x200 part is related to the HSM_REMOVE operation and the 0x7D means the "HSM return code" is 125.
            After this patch the flag is not 0x27D anymore but 0x200.

            bougetq Quentin Bouget (Inactive) added a comment - Hi John, I am working on LU-8987 which aims at enabling test_225 (I merely removed the return statement at the start of the test). The modified test used to pass before this patch. It would seem this patch affected the content of the changelog when a remove request gets canceled. I am just wondering if the new behaviour is the "correct" one or if a regression was masked by test_225 being disabled. test_225 expects a changelog record with the flag 0x27D, from what I gathered in "include/lustre/lustre_user.h" the 0x200 part is related to the HSM_REMOVE operation and the 0x7D means the "HSM return code" is 125. After this patch the flag is not 0x27D anymore but 0x200.
            jhammond John Hammond added a comment -

            Quentin, test_225() is effectively disabled. The way the CDT is implemented it processes the log sequentially, so it sends the remove, then marks the remove action canceled, then sends the cancel. It's a bit weird, but so is most of the way the CDT code is factored. But I don't see us changing this behavior very soon. Are you running a modified version of test_225()? If this functionality is important to you then could you please explain why?

            jhammond John Hammond added a comment - Quentin, test_225() is effectively disabled. The way the CDT is implemented it processes the log sequentially, so it sends the remove, then marks the remove action canceled, then sends the cancel. It's a bit weird, but so is most of the way the CDT code is factored. But I don't see us changing this behavior very soon. Are you running a modified version of test_225()? If this functionality is important to you then could you please explain why?

            Since this patch, test_225 does not pass anymore, has this patch altered the behaviour of changelogs?

            bougetq Quentin Bouget (Inactive) added a comment - Since this patch, test_225 does not pass anymore, has this patch altered the behaviour of changelogs?
            pjones Peter Jones added a comment -

            Landed for 2.10

            pjones Peter Jones added a comment - Landed for 2.10

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23700/
            Subject: LU-8820 hsm: skip LL_IOC_HSM_COPY_START for HSM removes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0cc1756e9fb28609705f5f86b8e2197f57b40513

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23700/ Subject: LU-8820 hsm: skip LL_IOC_HSM_COPY_START for HSM removes Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0cc1756e9fb28609705f5f86b8e2197f57b40513

            People

              jhammond John Hammond
              jhammond John Hammond
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: