  1. Lemur
  2. LMR-5

Robinhood lhsm_remove policy not compatible with Lemur

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None

    Description

      During testing today I noticed that Robinhood's delayed-delete feature is not
      working with Lemur. With this feature, after a file is deleted on Lustre,
      Robinhood schedules removal of the corresponding HSM backend object once a
      configurable period has elapsed.

      This may well be a Robinhood problem, but I wondered if you were aware of it
      and could maybe comment on what might be going wrong?

      My setup uses Robinhood v3.0.0 with the 'lhsm_remove' policy. I create a file
      on Lustre, archive it, and then delete it from Lustre. At that point an object
      for this file still exists on the HSM backend, and Robinhood lists files
      deleted this way in its 'SOFT_RM' table in the database.

      Robinhood then submits an HSM_REMOVE operation for this file, and I see the
      'remove' job submitted to the coordinator:

      lrh=[type=10680000 len=136 idx=1/5] fid=[0x200000bd0:0x3:0x0]  dfid=[0x200000bd0:0x3:0x0] compound/cookie=0x591c1c41/0x591c1c3e action=REMOVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0  status=SUCCEED data=[]                                                   
      

      and then I see the following in lhsmd logs:

      ALERT 2017/05/17 15:01:19 /root/rpmbuild/BUILD/lemur-.5.1_2_g55041d6/src/github.com/intel-hpdd/lemur/cmd/lhsmd/agent/agent_action.go:127: Error reading UUID: trusted.lhsm_uuid: no such file or directory (id:3 REMOVE [0x200000bd0:0x3:0x0] )                                                  
      

      Is this because it's trying to do the equivalent of an 'lfs hsm_remove FID' on
      a FID that is no longer in the filesystem?
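The failure mode suggested by the lhsmd error above can be sketched as follows. This is only an illustration, assuming the agent looks up the backend object by reading the `trusted.lhsm_uuid` extended attribute from the file itself. The helper name `read_backend_uuid` and the example path are hypothetical and are not Lemur's code. `os.getxattr` is available on Linux only.

```python
import os
from typing import Optional

# Attribute name taken from the lhsmd log line above.
UUID_XATTR = "trusted.lhsm_uuid"


def read_backend_uuid(path: str) -> Optional[bytes]:
    """Return the stored UUID for `path`, or None if it cannot be read.

    Hypothetical helper: if the file has already been unlinked, the lookup
    fails with ENOENT before any attribute can be read, so there is nothing
    left to map the FID to a backend object.
    """
    try:
        return os.getxattr(path, UUID_XATTR)
    except OSError:
        # Covers ENOENT (file gone), ENODATA (attribute missing), and
        # permission errors on the trusted.* namespace.
        return None


if __name__ == "__main__":
    # A path that does not exist stands in for a FID of a deleted file.
    print(read_backend_uuid("/nonexistent/.lustre/fid/0x200000bd0:0x3:0x0"))
```

If this assumption holds, any remove request that arrives after the unlink has no way to recover the identifier unless it was also recorded somewhere outside the file.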

      I think this feature works with lhsmtool_posix because it uses the file's FID
      to name the file on the HSM backend, so the copytool can still clean up the
      object in the backend after the file is gone. Here is what that copytool's
      log shows for the same kind of remove operation (on a different file, FID
      [0x200000bd0:0xc:0x0]):

      1495099675.796599 lhsmtool_posix[3859]: '[0x200000bd0:0xc:0x0]' action REMOVE reclen 72, cookie=0x591c1c42
      1495099675.797789 lhsmtool_posix[3859]: cannot get path of FID [0x200000bd0:0xc:0x0]: No such file or directory (2)
      1495099675.797837 lhsmtool_posix[3859]: removing file '/mnt/qstar/rds-s1/hsm/000c/0000/0bd0/0000/0002/0000/0x200000bd0:0xc:0x0'
      1495099675.828181 lhsmtool_posix[3859]: Action completed, notifying coordinator cookie=0x591c1c42, FID=[0x200000bd0:0xc:0x0], hp_flags=0 err=0
      1495099675.828969 lhsmtool_posix[3859]: llapi_hsm_action_end() on '/rds-s1/.lustre/fid/0x200000bd0:0xc:0x0' ok (rc=0)
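To illustrate why a FID-named backend object can still be located after the file is gone, here is a minimal sketch that rebuilds the backend path from the FID alone. The directory layout is inferred from the single path in the lhsmtool_posix log above (16-bit slices of the object ID, then of the sequence number), so treat the slicing as an assumption that happens to reproduce that path. The function name `posix_archive_path` is hypothetical.

```python
def posix_archive_path(hsm_root: str, fid: str) -> str:
    """Rebuild a backend path from a FID string such as '[0x200000bd0:0xc:0x0]'.

    Assumed layout (inferred from the log above, not from copytool source):
    <root>/<oid lo16>/<oid hi16>/<seq bits 0-15>/<16-31>/<32-47>/<48-63>/<fid>
    """
    seq_s, oid_s, ver_s = fid.strip("[]").split(":")
    seq, oid, ver = int(seq_s, 16), int(oid_s, 16), int(ver_s, 16)

    # Six 16-bit slices, each rendered as a 4-digit hex directory name.
    slices = [
        oid & 0xFFFF,
        (oid >> 16) & 0xFFFF,
        seq & 0xFFFF,
        (seq >> 16) & 0xFFFF,
        (seq >> 32) & 0xFFFF,
        (seq >> 48) & 0xFFFF,
    ]
    dirs = "/".join(f"{s:04x}" for s in slices)

    # The leaf name is the FID without brackets.
    return f"{hsm_root}/{dirs}/{seq:#x}:{oid:#x}:{ver:#x}"


if __name__ == "__main__":
    print(posix_archive_path("/mnt/qstar/rds-s1/hsm", "[0x200000bd0:0xc:0x0]"))
```

Because every input to this path comes from the remove request itself, nothing needs to be read from the deleted file. That is the property the UUID-on-xattr approach appears to lack.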
      

      I'm not sure how the Lemur UUID is constructed. I presume that when an
      HSM_REMOVE request like this comes in for a file that has already been deleted
      from Lustre, it isn't possible for Lemur to work out what that file's UUID
      would have been, and therefore which HSM backend object to clean up?

          People

            Lemur Triage
            mrb Matt Rásó-Barnett (Inactive)