[LU-8820] skip LL_IOC_HSM_COPY_START for HSM removes Created: 10/Nov/16  Updated: 24/Apr/17  Resolved: 16/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: hsm

Issue Links:
Related
is related to LU-8987 sanity-hsm: test_225 looks like it wa... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Robert has pointed out that there is no need for the copytool to call the LL_IOC_HSM_COPY_START ioctl (which sends a MDS_HSM_PROGRESS) when processing a remove action.



 Comments   
Comment by Gerrit Updater [ 10/Nov/16 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/23700
Subject: LU-8820 hsm: skip LL_IOC_HSM_COPY_START for HSM removes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b4f5980394bb77242096b7cc73b606da1b1b773c

Comment by Gerrit Updater [ 16/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23700/
Subject: LU-8820 hsm: skip LL_IOC_HSM_COPY_START for HSM removes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0cc1756e9fb28609705f5f86b8e2197f57b40513

Comment by Peter Jones [ 16/Mar/17 ]

Landed for 2.10

Comment by Quentin Bouget [ 21/Apr/17 ]

Since this patch, test_225 does not pass anymore, has this patch altered the behaviour of changelogs?

Comment by John Hammond [ 21/Apr/17 ]

Quentin, test_225() is effectively disabled. The way the CDT is implemented it processes the log sequentially, so it sends the remove, then marks the remove action canceled, then sends the cancel. It's a bit weird, but so is most of the way the CDT code is factored. But I don't see us changing this behavior very soon. Are you running a modified version of test_225()? If this functionality is important to you then could you please explain why?

Comment by Quentin Bouget [ 21/Apr/17 ]

Hi John,

I am working on LU-8987 which aims at enabling test_225 (I merely removed the return statement at the start of the test). The modified test used to pass before this patch. It would seem this patch affected the content of the changelog when a remove request gets canceled. I am just wondering if the new behaviour is the "correct" one or if a regression was masked by test_225 being disabled.

test_225 expects a changelog record with the flag 0x27D, from what I gathered in "include/lustre/lustre_user.h" the 0x200 part is related to the HSM_REMOVE operation and the 0x7D means the "HSM return code" is 125.
After this patch the flag is not 0x27D anymore but 0x200.

Comment by John Hammond [ 24/Apr/17 ]

125 is ECANCELED, 0 is success, so this makes sense. I think the new behavior is correct. It's analogous to hitting Ctrl-C just after starting rm vs cp (on a large file). Since rm (unlink()) is atomic, we wouldn't expect the Ctrl-C to prevent the file from being removed. But file copy is not atomic, so we would expect Ctrl-C to do stop the copy.

Comment by Quentin Bouget [ 24/Apr/17 ]

I don't get it, "lfs hsm_remove" is not atomic, so why act as if it were?

For tools that rely on parsing the changelogs, this can actually be harmful, a cancelled remove would be interpreted as a successful one, wouldn't it?

Comment by John Hammond [ 24/Apr/17 ]

> I don't get it, "lfs hsm_remove" is not atomic, so why act as if it were?

The action performed by the CT for remove is.

> For tools that rely on parsing the changelogs, this can actually be harmful, a cancelled remove would be interpreted as a successful one, wouldn't it?

I think there's a race that exists with or without this patch. You can cancel a request after the CT has removed the file but before it has reported success. Then how should the request be marked?

Comment by Quentin Bouget [ 24/Apr/17 ]

> The action performed by the CT for remove is.

Yes, you are right.

>I think there's a race that exists with or without this patch. You can cancel a request after the CT has removed the file but before it has reported success. Then how should the request be marked?

I think it is up to the copytool to choose whether and how to act upon receiving a cancel request.

So the only "inconsistency" is that the coordinator can report a request as canceled even though it may not be canceled; in which case one can check the changelog records to find out what really happened.

Generated at Sat Feb 10 02:20:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.