Details
Improvement
Resolution: Unresolved
Minor
Description
During some testing today I noticed that Robinhood's delayed-delete feature is
not working with Lemur. With this feature, after a file is deleted on Lustre,
Robinhood schedules removal of the corresponding HSM backend object once a
configurable period has passed.
This may well be a Robinhood problem, but I wondered if you were aware of it
and could maybe comment on what might be going wrong?
My setup uses Robinhood v3.0.0 with the 'lhsm_remove' policy. I create a file
on Lustre, archive it, and then delete it from Lustre. At that point an object
for the file still exists on the HSM backend, and Robinhood lists files deleted
this way in its 'SOFT_RM' table in the database.
Robinhood then submits an HSM_REMOVE operation for this file, and I see the
'remove' job submitted to the coordinator:
lrh=[type=10680000 len=136 idx=1/5] fid=[0x200000bd0:0x3:0x0] dfid=[0x200000bd0:0x3:0x0] compound/cookie=0x591c1c41/0x591c1c3e action=REMOVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
and then I see the following in lhsmd logs:
ALERT 2017/05/17 15:01:19 /root/rpmbuild/BUILD/lemur-.5.1_2_g55041d6/src/github.com/intel-hpdd/lemur/cmd/lhsmd/agent/agent_action.go:127: Error reading UUID: trusted.lhsm_uuid: no such file or directory (id:3 REMOVE [0x200000bd0:0x3:0x0] )
Is this because it's trying to do the equivalent of an 'lfs hsm_remove FID' on
a FID that is no longer in the filesystem?
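To illustrate the failure mode I suspect: the lhsmd error suggests the agent looks for the backend UUID in the file's 'trusted.lhsm_uuid' xattr. The Python sketch below is not Lemur code. The helper name read_backend_uuid is made up, and it assumes the UUID is stored only in that xattr. If that assumption holds, there is nothing left to read once the file has been unlinked:

```python
import errno
import os
import tempfile

UUID_XATTR = "trusted.lhsm_uuid"  # attribute name taken from the lhsmd error above


def read_backend_uuid(path):
    """Hypothetical helper: return the UUID xattr value, or None if the file is gone."""
    try:
        return os.getxattr(path, UUID_XATTR)  # Linux-only call
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            # The file no longer exists, so there is no xattr left to read.
            return None
        raise  # other errors (permissions, missing attribute) are not handled here


def demo_deleted_file():
    # Create a scratch file and delete it, mimicking a file removed from Lustre.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    os.remove(path)
    return read_backend_uuid(path)
```

Calling demo_deleted_file() returns None here, which corresponds to the "no such file or directory" case in the lhsmd log.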
I think this feature works with lhsmtool_posix because it uses the file's FID
to name the file on the HSM backend, so the copytool can clean up the backend
object even after the file is gone from Lustre. Here is what that copytool's
logs show for the same remove operation:
1495099675.796599 lhsmtool_posix[3859]: '[0x200000bd0:0xc:0x0]' action REMOVE reclen 72, cookie=0x591c1c42
1495099675.797789 lhsmtool_posix[3859]: cannot get path of FID [0x200000bd0:0xc:0x0]: No such file or directory (2)
1495099675.797837 lhsmtool_posix[3859]: removing file '/mnt/qstar/rds-s1/hsm/000c/0000/0bd0/0000/0002/0000/0x200000bd0:0xc:0x0'
1495099675.828181 lhsmtool_posix[3859]: Action completed, notifying coordinator cookie=0x591c1c42, FID=[0x200000bd0:0xc:0x0], hp_flags=0 err=0
1495099675.828969 lhsmtool_posix[3859]: llapi_hsm_action_end() on '/rds-s1/.lustre/fid/0x200000bd0:0xc:0x0' ok (rc=0)
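The backend path in the 'removing file' line looks like it can be computed from the FID alone, which would explain why the POSIX copytool does not need the file to still exist. The Python sketch below reproduces that path from the FID's sequence, object ID and version. The layout is inferred from this one log line, not checked against the copytool source, and the function name posix_archive_path is my own:

```python
def posix_archive_path(archive_root, seq, oid, ver):
    """Rebuild the backend path seen in the lhsmtool_posix log from a FID.

    Assumed layout (inferred from the log): four hex digits per directory level,
    taken from 16-bit slices of the object ID and then the sequence number, low
    bits first, followed by the FID itself without brackets.
    """
    slices = [
        oid & 0xFFFF,
        (oid >> 16) & 0xFFFF,
        seq & 0xFFFF,
        (seq >> 16) & 0xFFFF,
        (seq >> 32) & 0xFFFF,
        (seq >> 48) & 0xFFFF,
    ]
    subdirs = "/".join(f"{s:04x}" for s in slices)
    fid_name = f"{seq:#x}:{oid:#x}:{ver:#x}"
    return f"{archive_root}/{subdirs}/{fid_name}"


# FID [0x200000bd0:0xc:0x0] and archive root from the log above.
path = posix_archive_path("/mnt/qstar/rds-s1/hsm", 0x200000BD0, 0xC, 0x0)
```

For the FID in the log this gives '/mnt/qstar/rds-s1/hsm/000c/0000/0bd0/0000/0002/0000/0x200000bd0:0xc:0x0', the same path the copytool removed.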
I'm not sure how the Lemur UUID is constructed. I presume that when an
HSM_REMOVE request like this comes in for a file that has already been deleted
from Lustre, Lemur cannot work out what that file's UUID would have been, and
therefore cannot find the HSM backend object to clean it up?