Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5683

HSM: re-archiving dirty file fails, request for cookie not found

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.6.0
    • Scientific Linux 6.5, Lutre 2.6.0, 1 MDS, 2 OSSs, 2 clients, all VMs hosted on KVM.
      Also reproducible with CentOS 6.5 in other environment.
    • 3
    • 15920

    Description

      During some tests, an issue has been found that when a file is archived, released, changed and re-archived, the re-archival fails under some circumstances. The problem could be reproduced in a very simple way using the lhsmd_posix:

      Boot machines, start Lustre services (MDS, OSS1, OSS2, mount on both clients)

      Start copytool on client 2:

      lhsmtool_posix -A 3 -A 4 -p /tmp /lustre
      

      Start test script on client 1:

      arch=3
      rm -f /lustre/mytest 
      echo thisfile >/lustre/mytest
      lfs hsm_archive -a $arch /lustre/mytest
      sleep 3
      lfs hsm_state /lustre/mytest
      sleep 3
      lfs hsm_release /lustre/mytest
      sleep 3
      lfs hsm_state /lustre/mytest
      sleep 3
      cat /etc/passwd >>/lustre/mytest
      sleep 3
      lfs hsm_state /lustre/mytest
      sleep 3
      lfs hsm_archive -a $arch /lustre/mytest
      sleep 3
      lfs hsm_state /lustre/mytest
      sleep 5
      lfs hsm_state /lustre/mytest
      

      Test script output

      /lustre/mytest: (0x00000009) exists archived, archive_id:3
      /lustre/mytest: (0x0000000d) released exists archived, archive_id:3
      /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
      /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
      /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
      

      Observations while watching the MDS syslog:

      Sep 17 10:39:06 lustre-mds1 kernel: LustreError: 1396:0:(mdt_coordinator.c:1463:mdt_hsm_update_request_state()) lsttest-MDT0000: Cannot find running request for cookie 0x54194821 on fid=[0x200003ab0:0x1:0x0]
      

      Observations on the copytool side:

      lhsmtool_posix[1230]: action=0 src=(null) dst=(null) mount_point=/lustre
      lhsmtool_posix[1230]: waiting for message from kernel
      lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
      lhsmtool_posix[1230]: waiting for message from kernel
      lhsmtool_posix[1231]: '[0x200003ab0:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x5419481f
      lhsmtool_posix[1231]: processing file 'mytest'
      lhsmtool_posix[1231]: archiving '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
      lhsmtool_posix[1231]: saving stripe info of '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov
      lhsmtool_posix[1231]: going to copy data from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
      lhsmtool_posix[1231]: data archiving for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' done
      lhsmtool_posix[1231]: attr file for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' saved to archive '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
      lhsmtool_posix[1231]: fsetxattr of 'trusted.hsm' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
      lhsmtool_posix[1231]: fsetxattr of 'trusted.link' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
      lhsmtool_posix[1231]: fsetxattr of 'trusted.lov' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
      lhsmtool_posix[1231]: fsetxattr of 'trusted.lma' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
      lhsmtool_posix[1231]: fsetxattr of 'lustre.lov' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=-1 (Operation not supported)
      lhsmtool_posix[1231]: xattr file for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' saved to archive '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
      lhsmtool_posix[1231]: symlink '/tmp/shadow/mytest' to '../0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' done
      lhsmtool_posix[1231]: Action completed, notifying coordinator cookie=0x5419481f, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0
      lhsmtool_posix[1231]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' ok (rc=0)
      lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
      lhsmtool_posix[1230]: waiting for message from kernel
      lhsmtool_posix[1232]: '[0x200003ab0:0x1:0x0]' action RESTORE reclen 72, cookie=0x54194820
      lhsmtool_posix[1232]: processing file 'mytest'
      lhsmtool_posix[1232]: reading stripe rules from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0.lov' for '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0'
      lhsmtool_posix[1232]: restoring data from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]'
      lhsmtool_posix[1232]: going to copy data from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]'
      lhsmtool_posix[1232]: data restore from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]' done
      lhsmtool_posix[1232]: Action completed, notifying coordinator cookie=0x54194820, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0
      lhsmtool_posix[1232]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' ok (rc=0)
      lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
      lhsmtool_posix[1230]: waiting for message from kernel
      lhsmtool_posix[1233]: '[0x200003ab0:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x54194821
      lhsmtool_posix[1233]: processing file 'mytest'
      lhsmtool_posix[1233]: archiving '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
      lhsmtool_posix[1233]: saving stripe info of '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov
      lhsmtool_posix[1233]: going to copy data from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
      lhsmtool_posix[1233]: progress ioctl for copy '/lustre/.lustre/fid/0x200003ab0:0x1:0x0'->'/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' failed: No such file or directory (2)
      lhsmtool_posix[1233]: data copy failed from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp': No such file or directory (2)
      lhsmtool_posix[1233]: Action completed, notifying coordinator cookie=0x54194821, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=2
      lhsmtool_posix[1233]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' failed: No such file or directory (2)
      Immediately after the test, mdt/*/hsm/actions shows the following:
      lrh=[type=10680000 len=136 idx=11/3] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x5419481f/0x5419481f action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
      lrh=[type=10680000 len=136 idx=11/6] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194820/0x54194820 action=RESTORE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
      lrh=[type=10680000 len=136 idx=11/9] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194821/0x54194821 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=FAILED data=[]
      

      A second, later retry of the archive request sometimes works fine and it also makes a difference if the copytool and the command line actions run on the same client node or on different nodes.

      I already added this as a comment to the possibly related LU-4004, but as this is closed, it doesn't seem to get attention any longer.

      Attachments

        Issue Links

          Activity

            People

              jhammond John Hammond
              mkukat Michael Kukat
              Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: