[LU-5836]  Error "Device or resource busy" after attempting release of file cleared from 'dirty' Created: 31/Oct/14  Updated: 14/Sep/15

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andrew Moe Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: HSM, patch
Environment:

CentOS 6


Severity: 3
Epic: client, server
Project: HSM
Rank (Obsolete): 16361

 Description   

I can consistently replicate this error message.

# Write a file
$ dd if=/dev/zero of=file01 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00335841 s, 312 MB/s

# Archive the file
$ sudo lfs hsm_archive file01

# Wait for file to archive 

# Write to file to make it 'dirty'
$ echo "0" >> file01

# Clear the dirty state
$ lfs hsm_clear --dirty file01

# Attempt to release the file
$ sudo lfs hsm_release file02
Cannot send HSM request (use of file02): Device or resource busy

I suppose it might make sense that a file should not be released if it is truly dirty, but the error message does not seem appropriate. Perhaps it should not even be possible to clear the 'dirty' state.

Here is a comment from another person I consulted about this:

The error comes from the MDT:

00000004:20000000:0.0:1414607590.182318:0:1585:0:(mdt_open.c:2042:mdt_hsm_release()) [0x2000013c2:0x116c3:0x0] data_version mismatches: packed=4313503196 and on-disk=4313503194

so it doesn't set the OBD_MD_FLRELEASED bit, and the client in turn returns EBUSY.

No idea what that means though.



 Comments   
Comment by Gerrit Updater [ 20/May/15 ]

Ulka Vaze (ulka.vaze@yahoo.in) uploaded a new patch: http://review.whamcloud.com/14874
Subject: LU-5836 hsm: Error "Device or resource busy"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5add76876f45e3a64c0af0af1ecd288ca2cffc3

Comment by Ulka Vaze (Inactive) [ 20/May/15 ]

Here is an analysis of this issue:

1. Because the dirty flag has been cleared, the release request goes through to the coordinator.
2. However, the coordinator compares the data version of the file with the data version of the archived copy, and logs the following error:

00000004:20000000:0.0:1432012500.276456:0:14313:0:(mdt_open.c:1638:mdt_hsm_release()) [0x200000400:0x1:0x0] data_version mismatches: packed=120259085297 and on-disk=68719476768

Because the versions do not match, it fails the request with EPERM.
3. On any error, OBD_MD_FLRELEASED is not set in the reply.
4. So when the reply reaches the client, the client returns EBUSY.
The following snippet is from ll_close_inode_openhandle() (llite/file.c):

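if (rc == 0 && (op_data->op_bias & MDS_HSM_RELEASE)) {
        struct mdt_body *body;

        body = req_capsule_server_get(&req->rq_pill, &RMF_MDT_BODY);
        /* The MDT sets OBD_MD_FLRELEASED only if the release succeeded;
         * any other outcome is reported to the caller as EBUSY. */
        if (!(body->mbo_valid & OBD_MD_FLRELEASED))
                rc = -EBUSY;
}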

5. EBUSY is a misleading error here; we should instead propagate the EPERM coming from the coordinator.
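The client-side half of this analysis reduces to a small decision: the close RPC itself succeeds, so the only signal the client sees is whether the released bit came back. The sketch below models that decision outside Lustre; the `SKETCH_*` bit values and the function name are hypothetical stand-ins for `MDS_HSM_RELEASE`, `OBD_MD_FLRELEASED`, and the check in `ll_close_inode_openhandle()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit values standing in for MDS_HSM_RELEASE and
 * OBD_MD_FLRELEASED; the real values live in Lustre's headers. */
#define SKETCH_HSM_RELEASE	(1U << 0)
#define SKETCH_FLRELEASED	(1ULL << 0)

/*
 * Model of the client check: if the close succeeded (rc == 0) and this
 * was a release, but the MDT did not set the released bit, report
 * -EBUSY. The MDT-side reason (e.g. a data_version mismatch) is lost.
 */
static int sketch_close_result(int rc, uint32_t op_bias, uint64_t reply_valid)
{
	if (rc == 0 && (op_bias & SKETCH_HSM_RELEASE) &&
	    !(reply_valid & SKETCH_FLRELEASED))
		rc = -EBUSY;
	return rc;
}
```

Any unreleased outcome maps to the same -EBUSY, which is why the data_version mismatch surfaces as "Device or resource busy".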

The proposed solution is:
1. Add a new flag, OBD_MD_FLDIRTY, which is set on EPERM.
This is needed because there may be cases where we do need to return EBUSY; the flag distinguishes this case from those.
2. In llite, the function above checks for this flag and returns EPERM.

The code has been pushed for review.

Test logs:

[root@cli-3 ~]# echo "test" >> /mnt/lustre2/testfile2
[root@cli-3 ~]# lfs hsm_state /mnt/lustre2/testfile2
/mnt/lustre2/testfile2: (0x00000000)
[root@cli-3 ~]# lfs hsm_archive /mnt/lustre2/testfile2
***** FD =  3 PATH = /mnt/lustre2/testfile2 *****
[root@cli-3 ~]# lfs hsm_state   /mnt/lustre2/testfile2
/mnt/lustre2/testfile2: (0x00000009) exists archived, archive_id:1
[root@cli-3 ~]# echo xxx >> /mnt/lustre2/testfile2
[root@cli-3 ~]# lfs hsm_state   /mnt/lustre2/testfile2
/mnt/lustre2/testfile2: (0x0000000b) exists dirty archived, archive_id:1
[root@cli-3 ~]# lfs hsm_clear --dirty /mnt/lustre2/testfile2
[root@cli-3 ~]# lfs hsm_state   /mnt/lustre2/testfile2
/mnt/lustre2/testfile2: (0x00000009) exists archived, archive_id:1
[root@cli-3 ~]# lfs hsm_release /mnt/lustre2/testfile2
***** FD =  3 PATH = /mnt/lustre2/testfile2 *****
Cannot send HSM request (use of /mnt/lustre2/testfile2): Operation not permitted
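The hex values printed by `lfs hsm_state` above are bitmasks. The bit values below are, to our understanding, those of Lustre's `HS_*` flags, and they agree with this log: 0x9 is exists|archived and 0xb adds dirty. The decoder is a hypothetical helper, not the `lfs` implementation, and it ignores the other HSM bits. It makes one point visible: `hsm_clear --dirty` only drops the 0x2 bit (0xb to 0x9) and does not change the data version that release compares.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bit values as we understand Lustre's HS_* flags (lustre_user.h). */
#define SKETCH_HS_EXISTS	0x00000001U
#define SKETCH_HS_DIRTY		0x00000002U
#define SKETCH_HS_RELEASED	0x00000004U
#define SKETCH_HS_ARCHIVED	0x00000008U

/* Hypothetical helper: render the state word as space-separated names
 * in bit order. Bits not listed here are ignored. */
static void sketch_hsm_state_str(unsigned int states, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} names[] = {
		{ SKETCH_HS_EXISTS,   "exists" },
		{ SKETCH_HS_DIRTY,    "dirty" },
		{ SKETCH_HS_RELEASED, "released" },
		{ SKETCH_HS_ARCHIVED, "archived" },
	};
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(states & names[i].bit))
			continue;
		/* strncat() appends at most n bytes plus the terminator. */
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, names[i].name, len - strlen(buf) - 1);
	}
}
```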

Comment by Aditya Pandit [ 10/Sep/15 ]

Can you please review it?

Comment by Patrick Farrell (Inactive) [ 10/Sep/15 ]

This looks like an OK way to get the result Ulka explains in the comment above, but I'm not sure it's the correct result. -EPERM is also not a correct error here, is it?

This is a design level question I don't have an answer for:
Shouldn't clearing the dirty flag, even manually, result in being able to release the file? It seems like checking the file version like this does is voiding the intended effect of clearing the dirty flag. If not, how could a user escape the situation, short of deleting the file?

Actually, as I think further, perhaps this is not a bug: I think the ability to clear the dirty flag manually is a debug/emergency recovery option? Use of such a manual intervention tool could be expected to cause problems if it is not used in exactly the right situation. So less than perfect handling of this case may not be something we need to fix.

What were you trying to accomplish by clearing the dirty flag manually?

Comment by Andrew Moe [ 10/Sep/15 ]

This bug was not preventing me from accomplishing anything in particular. I reported it because I performed a sequence of valid operations and got an error message that did not seem appropriate for the situation. I'm not asserting that the file should or should not be released; I agree that this is a design-level question.

Comment by Aditya Pandit [ 11/Sep/15 ]

If you archive the file again, releasing the file works. I have added that case to the test script. Can someone from the design team review this and say what the ideal behavior should be?

Comment by Andrew Moe [ 14/Sep/15 ]

It's been a while since this bug has been reported, so it may take some time to rebuild our test environment.
