  Lustre / LU-5216

HSM: restore doesn't restart for new copytool

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.1
    • Environment: Centos 6.5, Lustre 2.5.56
    • 3
    • 14550

    Description

      When a copytool does not complete a restore properly, the restore is not restarted, and processes accessing the file get stuck in Lustre modules.

      For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

      # dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
      # lfs hsm_archive /mnt/lustre/bigf
      # lfs hsm_release /mnt/lustre/bigf
      # sleep 5
      # md5sum /mnt/lustre/bigf
      

      During the restore, kill the copytool so that no completion event is sent to the MDS.
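
      A minimal sketch of that kill step (assuming the standard lhsmtool_posix copytool; run the pkill on whichever client hosts the copytool, any time while the RESTORE is in flight):

      # md5sum /mnt/lustre/bigf &          # triggers the implicit restore
      # sleep 2                            # give the coordinator time to dispatch the RESTORE
      # pkill -9 lhsmtool_posix            # copytool dies before it can report completion
      # wait                               # md5sum now hangs in ll_layout_refresh (see stack below)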

      Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.
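
      A hedged way to confirm that the copytool really is in uninterruptible sleep rather than just ignoring signals (standard procfs; the copytool name is assumed to be lhsmtool_posix):

      # ps -o pid,stat,wchan:32,cmd -p $(pgrep lhsmtool_posix)    # STAT 'D' = uninterruptible sleep
      # cat /proc/$(pgrep -o lhsmtool_posix)/stack                # kernel stack of the oldest instance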

      When the copytool is restarted, nothing happens. The process trying to read the file is stuck, apparently forever, here:

      # cat /proc/1675/stack 
      [<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
      [<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
      [<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
      [<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81189365>] vfs_read+0xb5/0x1a0
      [<ffffffff811894a1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      It is also unkillable.

      If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

      # cat /proc/1723/stack 
      [<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
      [<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
      [<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
      [<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
      [<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
      [<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
      [<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
      [<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
      [<ffffffff8119df21>] sys_ioctl+0x81/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      However, lfs HSM operations on other files still work.
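
      For example (an illustrative sketch, file name arbitrary), other files can still be archived, queried and released while bigf remains blocked:

      # lfs hsm_archive /mnt/lustre/otherfile
      # lfs hsm_state   /mnt/lustre/otherfile     # should report 'exists archived'
      # lfs hsm_release /mnt/lustre/otherfile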

          Activity

            riauxjb Jean-Baptiste Riaux (Inactive) added a comment -

            New patch (https://review.whamcloud.com/#/c/31105/) is available, including only the first part of the previous patch (https://review.whamcloud.com/24238/, which landed and was then reverted by https://review.whamcloud.com/30615/).

            This patch keeps only the first fix:

            Unexpected client (data mover node) eviction could cause ongoing HSM requests to be stuck in "STARTED" state, as the copytool running on the data mover node is not available anymore and the requests can never be finished. This patch unregisters the copytool and cancels all the requests on the copytool's agent.
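
            A hedged way to observe the effect on the MDS (parameter names per the standard HSM coordinator tunables; the MDT name differs per filesystem):

            [root@mds1 ~]# lctl get_param mdt.*.hsm.agents     # registered copytool agents
            [root@mds1 ~]# lctl get_param mdt.*.hsm.actions    # outstanding HSM requests
            # With the patch, an evicted agent should disappear from hsm.agents and its
            # STARTED requests should be cancelled instead of lingering until active_request_timeout.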


            Jean-Baptiste Riaux (riaux.jb@intel.com) uploaded a new patch: https://review.whamcloud.com/31105
            Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c3d17bd9e25d5d27ba55835801798e2fdea36c49


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30615/
            Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: be4507fb45074ad24208c494f98a00da90b13665


            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30615
            Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a57dcdbca18f9721169f06c312519c619856a656


            pjones Peter Jones added a comment -

            Landed for 2.11


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24238/
            Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 462c7aae05dfc9cd730f44ffdc661c4c36294012


            scherementsev Sergey Cheremencev added a comment -

            Hello!

            Vinayak will not work on this ticket anymore.

            Please let me know if I can help with review or other things to move it forward.

             

             


            riauxjb Jean-Baptiste Riaux (Inactive) added a comment -

            Tests with patch #24238 applied.

            Single node setup, normal kill (no eviction).

            [root@client testfs]# lfs hsm_archive testfile
            [root@client testfs]# lfs hsm_release testfile
            [root@client testfs]# md5sum testfile #killing copytool at this point and restarting it
            md5sum: testfile: No data available
            

            As noted before, in the case of a normal kill (copytool node still online, no eviction): wait until the HSM request times out (active_request_timeout reached) so that processes waiting on HSM actions can finish.
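
            For reference, a hedged sketch of inspecting and lowering that timeout on the MDS (parameter name per the standard HSM coordinator tunables):

            [root@mds1 ~]# lctl get_param mdt.*.hsm.active_request_timeout
            [root@mds1 ~]# lctl set_param mdt.*.hsm.active_request_timeout=600    # seconds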

            Two-CT-node setup: abrupt shutdown, then starting a new copytool on another node (active_request_timeout set to 600 s).

            [root@client2 testfs]# dd if=/dev/urandom of=./testfile4 bs=1M count=1024
            1024+0 records in
            1024+0 records out
            1073741824 bytes (1.1 GB) copied, 62.7363 s, 17.1 MB/s
            [root@client2 testfs]# lfs hsm_archive testfile4
            [root@client2 testfs]# lfs hsm_release testfile4
            [root@client2 testfs]# md5sum testfile4
            
            On copytool:
            1506354374.424365 lhsmtool_posix[2150]: '[0x200000404:0x3:0x0]' action RESTORE reclen 72, cookie=0x59c91bb3
            1506354374.426135 lhsmtool_posix[2150]: processing file 'testfile4'
            1506354374.426183 lhsmtool_posix[2150]: reading stripe rules from '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0.lov' for '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0'
            1506354374.443349 lhsmtool_posix[2150]: restoring data from '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0' to '{VOLATILE}=[0x200000405:0x1:0x0]'
            1506354374.449763 lhsmtool_posix[2150]: start copy of 1073741824 bytes from '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0' to '{VOLATILE}=[0x200000405:0x1:0x0]'
            
            Abrupt shutdown of client running copytool.
            Starting another copytool on another client.
            
            [root@client3 ~]# lhsmtool_posix  --hsm-root /storage/HSM/ --archive=1 /lustre/testfs
            1506354389.090785 lhsmtool_posix[53551]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
            1506354389.094404 lhsmtool_posix[53551]: waiting for message from kernel
            
            Waiting now for the MDS to evict the first node that ran the copytool:
            
            Sep 25 17:49:56 mds1 kernel: Lustre: testfs-MDT0001: haven't heard from client d90b0949-266c-b31e-93a6-e0ea0af9532c (at 192.168.10.131@tcp) in 234 seconds. I think it's dead, and I am evicting it. exp ffff8800b245dc00, cur 1506354596 expire 1506354446 last 1506354362
            
            [root@mds1 lustre]# cat mdt/*/hsm/actions
            lrh=[type=10680000 len=136 idx=1/8] fid=[0x200000404:0x3:0x0] dfid=[0x200000404:0x3:0x0] compound/cookie=0x59c91b6d/0x59c91bb3 action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
            
            The restore is still in STARTED state and will time out when active_request_timeout is reached.
            

            Two-node setup, two copytools running from the start.

            [root@client2 testfs]# lfs hsm_archive testfile5
            [root@client2 testfs]# lfs hsm_release testfile5
            [root@client2 testfs]# md5sum testfile5
            
            # Abrupt shutdown on the node restoring.
            # MDS detecting failure
            Sep 25 18:02:50 mds1 kernel: LustreError: 77432:0:(mdt_hsm_cdt_agent.c:573:mdt_hsm_agent_send()) testfs-MDT0000: agent uuid (d90b0949-266c-b31e-93a6-e0ea0af9532c) not found, unregistering: rc = -2
            # Client evicted
            Sep 25 18:09:12 mds1 kernel: Lustre: MGS: haven't heard from client f23e7274-1b60-37b1-a634-1cbc55f4c8c6 (at 192.168.10.139@tcp) in 232 seconds. I think it's dead, and I am evicting it. exp ffff880046b74400, cur 1506355752 expire 1506355602 last 1506355520
            Sep 25 18:09:19 mds1 kernel: Lustre: testfs-MDT0001: haven't heard from client be0c72ea-cfac-a42f-35a4-2fe303653d88 (at 192.168.10.139@tcp) in 239 seconds. I think it's dead, and I am evicting it. exp ffff880046b75800, cur 1506355759 expire 1506355609 last 1506355520
            [root@mds1 lustre]# cat mdt/*/hsm/actions
            lrh=[type=10680000 len=136 idx=1/11] fid=[0x200000404:0x5:0x0] dfid=[0x200000404:0x5:0x0] compound/cookie=0x59c91b70/0x59c91bb6 action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
            
            But we still had to wait for the timeout for the process to end (other requests to the same file from different clients did work).
            Checking the logs, I see that the action is canceled before the agent is unregistered (as specified in point 1 of the commit message).
            
            00000004:20000000:0.0:1506356124.870413:0:77432:0:(mdt_coordinator.c:1565:mdt_hsm_update_request_state()) Not updating record: fid=[0x200000404:0x5:0x0] cookie=0x59c91bb6 action=RESTORE status=FAILED
            00000004:00020000:0.0:1506356125.870598:0:77432:0:(mdt_hsm_cdt_agent.c:573:mdt_hsm_agent_send()) testfs-MDT0000: agent uuid (be0c72ea-cfac-a42f-35a4-2fe303653d88) not found, unregistering: rc = -2
            00000004:00000001:0.0:1506356125.874145:0:77432:0:(mdt_hsm_cdt_agent.c:198:mdt_hsm_agent_unregister()) Process entered
            00000004:00000010:0.0:1506356125.874151:0:77432:0:(mdt_hsm_cdt_agent.c:217:mdt_hsm_agent_unregister()) kfreed 'ha->ha_archive_id': 4 at ffff880046585990.
            00000004:00000010:0.0:1506356125.874155:0:77432:0:(mdt_hsm_cdt_agent.c:218:mdt_hsm_agent_unregister()) kfreed 'ha': 80 at ffff880108d89480.
            00000004:00000001:0.0:1506356125.874158:0:77432:0:(mdt_hsm_cdt_agent.c:220:mdt_hsm_agent_unregister()) Process leaving via out (rc=0 : 0 : 0x0)
            00000004:20000000:0.0:1506356125.874160:0:77432:0:(mdt_hsm_cdt_agent.c:222:mdt_hsm_agent_unregister()) agent be0c72ea-cfac-a42f-35a4-2fe303653d88 unregistration: 0
            

            One last test remains: an HA setup with a normal kill (no abrupt shutdown, point 2 of the commit message).
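
            As a hedged aside, the debug traces quoted above come from the Lustre kernel debug buffer; assuming the hsm debug flag (D_HSM) is available in this build, they can be captured on the MDS with:

            [root@mds1 ~]# lctl set_param debug=+hsm        # enable HSM debug messages
            [root@mds1 ~]# lctl dk /tmp/lustre-hsm.log      # dump the kernel debug buffer
            [root@mds1 ~]# grep mdt_hsm /tmp/lustre-hsm.log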


            riauxjb Jean-Baptiste Riaux (Inactive) added a comment -

            Started testing with the current master release (patch #24238 not applied), since a lot of HSM code has landed since last year's tests.

            For a single-node copytool setup, the same behavior is observed as before:

            [root@client ~]# dd if=/dev/urandom of=/lustre/testfs/testfile bs=$((1024*1024)) count=1024
            1024+0 records in
            1024+0 records out
            1073741824 bytes (1.1 GB) copied, 62.3971 s, 17.2 MB/s
            [root@client ~]# lfs hsm_archive /lustre/testfs/testfile
            [root@client ~]# lfs hsm_release /lustre/testfs/testfile
            [root@client ~]# md5sum /lustre/testfs/testfile
            # while md5sum is running, kill the copytool (currently restoring)
            [root@client2 ~]# lhsmtool_posix  --hsm-root /lustre/HSM/ --archive=1 /lustre/testfs
            1506345768.797131 lhsmtool_posix[2218]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
            1506345768.801292 lhsmtool_posix[2218]: waiting for message from kernel
            1506345788.128256 lhsmtool_posix[2218]: copytool fs=testfs archive#=1 item_count=1
            1506345788.128356 lhsmtool_posix[2218]: waiting for message from kernel
            1506345788.128407 lhsmtool_posix[2219]: '[0x200000402:0x4:0x0]' action ARCHIVE reclen 72, cookie=0x59c8fec7
            1506345788.129443 lhsmtool_posix[2219]: processing file 'testfile'
            1506345788.182456 lhsmtool_posix[2219]: archiving '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' to '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
            1506345788.184770 lhsmtool_posix[2219]: saving stripe info of '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' in /lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp.lov
            1506345788.186209 lhsmtool_posix[2219]: start copy of 1073741824 bytes from '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' to '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
            1506345802.241652 lhsmtool_posix[2219]: copied 1073741824 bytes in 14.056800 seconds
            1506345802.711506 lhsmtool_posix[2219]: data archiving for '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' to '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' done
            1506345802.711621 lhsmtool_posix[2219]: attr file for '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' saved to archive '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
            1506345802.712621 lhsmtool_posix[2219]: fsetxattr of 'trusted.hsm' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
            1506345802.712645 lhsmtool_posix[2219]: fsetxattr of 'trusted.link' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
            1506345802.712660 lhsmtool_posix[2219]: fsetxattr of 'trusted.lov' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
            1506345802.712673 lhsmtool_posix[2219]: fsetxattr of 'trusted.lma' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
            1506345802.712684 lhsmtool_posix[2219]: fsetxattr of 'lustre.lov' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=-1 (Operation not supported)
            1506345802.712688 lhsmtool_posix[2219]: xattr file for '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' saved to archive '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
            1506345802.713441 lhsmtool_posix[2219]: symlink '/lustre/HSM//shadow/testfile' to '../0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' done
            1506345802.713902 lhsmtool_posix[2219]: Action completed, notifying coordinator cookie=0x59c8fec7, FID=[0x200000402:0x4:0x0], hp_flags=0 err=0
            1506345802.714923 lhsmtool_posix[2219]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' ok (rc=0)
            1506345837.128875 lhsmtool_posix[2218]: copytool fs=testfs archive#=1 item_count=1
            1506345837.128965 lhsmtool_posix[2218]: waiting for message from kernel
            1506345837.128998 lhsmtool_posix[2223]: '[0x200000402:0x4:0x0]' action RESTORE reclen 72, cookie=0x59c8fec8
            1506345837.130409 lhsmtool_posix[2223]: processing file 'testfile'
            1506345837.130450 lhsmtool_posix[2223]: reading stripe rules from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0.lov' for '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0'
            1506345837.139686 lhsmtool_posix[2223]: restoring data from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x3:0x0]'
            1506345837.145800 lhsmtool_posix[2223]: start copy of 1073741824 bytes from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x3:0x0]'
            ^Cexiting: Interrupt
            
            #Restart copytool
            [root@client2 ~]# lhsmtool_posix  --hsm-root /lustre/HSM/ --archive=1 /lustre/testfs
            1506345854.943013 lhsmtool_posix[2224]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
            1506345854.944935 lhsmtool_posix[2224]: waiting for message from kernel
            
            #md5sum process hangs until active_request_timeout is reached (unkillable process)
            [root@client testfs]# cat /proc/2731/stack
            [<ffffffffa08b7791>] ldlm_completion_ast+0x5a1/0x910 [ptlrpc]
            [<ffffffffa08b9448>] ldlm_cli_enqueue_fini+0x938/0xdb0 [ptlrpc]
            [<ffffffffa08bc3b2>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
            [<ffffffffa0afc1df>] mdc_enqueue_base+0x2bf/0x17d0 [mdc]
            [<ffffffffa0afe64b>] mdc_intent_lock+0x26b/0x520 [mdc]
            [<ffffffffa0b2f09f>] lmv_intent_lock+0x5cf/0x1b50 [lmv]
            [<ffffffffa0b6eec3>] ll_layout_refresh+0x503/0xfb0 [lustre]
            [<ffffffffa0bb132b>] vvp_io_fini+0xfb/0x2c0 [lustre]
            [<ffffffffa06d5275>] cl_io_fini+0x75/0x270 [obdclass]
            [<ffffffffa0b61a5b>] ll_file_io_generic+0x41b/0xb50 [lustre]
            [<ffffffffa0b62a47>] ll_file_aio_read+0x347/0x3e0 [lustre]
            [<ffffffffa0b62bbb>] ll_file_read+0xdb/0x3d0 [lustre]
            [<ffffffff811fdf9e>] vfs_read+0x9e/0x170
            [<ffffffff811feb6f>] SyS_read+0x7f/0xe0
            [<ffffffff81696709>] system_call_fastpath+0x16/0x1b
            [<ffffffffffffffff>] 0xffffffffffffffff
            
            # At the 2 min mark
            [root@client ~]# md5sum /lustre/testfs/testfile
            md5sum: /lustre/testfs/testfile: No data available
            
            The copytool ended up restoring the file after the 2 min mark:
            1506346268.134941 lhsmtool_posix[2224]: copytool fs=testfs archive#=1 item_count=1
            1506346268.135206 lhsmtool_posix[2224]: waiting for message from kernel
            1506346268.135286 lhsmtool_posix[2249]: '[0x200000402:0x4:0x0]' action RESTORE reclen 72, cookie=0x59c8fec9
            1506346268.138556 lhsmtool_posix[2249]: processing file 'testfile'
            1506346268.138618 lhsmtool_posix[2249]: reading stripe rules from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0.lov' for '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0'
            1506346268.149043 lhsmtool_posix[2249]: restoring data from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x4:0x0]'
            1506346268.154331 lhsmtool_posix[2249]: start copy of 1073741824 bytes from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x4:0x0]'
            1506346279.502113 lhsmtool_posix[2249]: copied 1073741824 bytes in 11.350319 seconds
            1506346279.502147 lhsmtool_posix[2249]: data restore from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x4:0x0]' done
            1506346279.502152 lhsmtool_posix[2249]: Action completed, notifying coordinator cookie=0x59c8fec9, FID=[0x200000402:0x4:0x0], hp_flags=0 err=0
            1506346279.550123 lhsmtool_posix[2249]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' ok (rc=0)
            

            Test with a two-node copytool setup: the copytool started on the second node after killing the first instance on the primary node never restored the file.
            I had to wait until md5sum failed (active_request_timeout reached) and re-run md5sum to restore the file.

            [root@client testfs]# lfs hsm_release testfile
            [root@client testfs]# md5sum testfile
            
            # killing copytool while restoring
            [root@client2 HSM]# lhsmtool_posix  --hsm-root /storage/HSM/ --archive=1 /lustre/testfs
            1506348240.935844 lhsmtool_posix[2304]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
            1506348240.939058 lhsmtool_posix[2304]: waiting for message from kernel
            1506348298.780877 lhsmtool_posix[2304]: copytool fs=testfs archive#=1 item_count=1
            1506348298.780976 lhsmtool_posix[2304]: waiting for message from kernel
            1506348298.781022 lhsmtool_posix[2305]: '[0x200000402:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x59c90b5a
            1506348298.782327 lhsmtool_posix[2305]: processing file 'testfile'
            [...]
            1506348317.463070 lhsmtool_posix[2305]: Action completed, notifying coordinator cookie=0x59c90b5a, FID=[0x200000402:0x1:0x0], hp_flags=0 err=0
            1506348317.463947 lhsmtool_posix[2305]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x1:0x0' ok (rc=0)
            1506348507.782112 lhsmtool_posix[2304]: copytool fs=testfs archive#=1 item_count=1
            1506348507.783689 lhsmtool_posix[2304]: waiting for message from kernel
            1506348507.783731 lhsmtool_posix[2334]: '[0x200000402:0x1:0x0]' action RESTORE reclen 72, cookie=0x59c90b5b
            1506348507.785689 lhsmtool_posix[2334]: processing file 'testfile'
            1506348507.785800 lhsmtool_posix[2334]: reading stripe rules from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0.lov' for '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0'
            1506348507.799790 lhsmtool_posix[2334]: restoring data from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000403:0x1:0x0]'
            1506348507.805419 lhsmtool_posix[2334]: start copy of 1073741824 bytes from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000403:0x1:0x0]'
            ^Cexiting: Interrupt
            
             # starting another copytool 
            
            [root@client2 HSM]# ssh client3
            Last login: Mon Sep 25 16:02:17 2017 from client2
            [root@client3 ~]# lhsmtool_posix  --hsm-root /storage/HSM/ --archive=1 /lustre/testfs
            1506348526.396743 lhsmtool_posix[2289]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
            1506348526.404385 lhsmtool_posix[2289]: waiting for message from kernel
            
            # actions on the MDS:
            [root@mds1 lustre]# cat mdt/*/hsm/actions
            lrh=[type=10680000 len=136 idx=1/3] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x59c90a8c/0x59c90b5b action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=CANCELED data=[]
            
            # after the 2 min mark
            [root@client testfs]# md5sum testfile
            md5sum: testfile: No data available
            
            # then nothing happened on the copytool (the action was CANCELED)
            # re-running md5sum on testfile works (file restored)
            
            [root@client testfs]# md5sum testfile
            f1f873e4cad553d692cfefa93ff381a9  testfile
            1506348841.783293 lhsmtool_posix[2289]: copytool fs=testfs archive#=1 item_count=1
            1506348841.783481 lhsmtool_posix[2289]: waiting for message from kernel
            1506348841.783565 lhsmtool_posix[2290]: '[0x200000402:0x1:0x0]' action RESTORE reclen 72, cookie=0x59c90b5c
            1506348841.786109 lhsmtool_posix[2290]: processing file 'testfile'
            1506348841.786163 lhsmtool_posix[2290]: reading stripe rules from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0.lov' for '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0'
            1506348841.820678 lhsmtool_posix[2290]: restoring data from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000404:0x1:0x0]'
            1506348841.826554 lhsmtool_posix[2290]: start copy of 1073741824 bytes from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000404:0x1:0x0]'
            1506348853.333816 lhsmtool_posix[2290]: copied 1073741824 bytes in 11.509827 seconds
            1506348853.333871 lhsmtool_posix[2290]: data restore from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000404:0x1:0x0]' done
            1506348853.334950 lhsmtool_posix[2290]: Action completed, notifying coordinator cookie=0x59c90b5c, FID=[0x200000402:0x1:0x0], hp_flags=0 err=0
            1506348853.405275 lhsmtool_posix[2290]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x1:0x0' ok (rc=0)
            

            So it is as Frank said: the MDS should have restarted the restore operation when the new copytool registered.
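
            As a hedged aside on the workaround above: once the stale request has been cancelled, the restore can also be re-queued explicitly instead of re-reading the file (standard lfs HSM sub-commands):

            [root@client testfs]# lfs hsm_restore testfile    # queue a fresh RESTORE request
            [root@client testfs]# lfs hsm_action  testfile    # show the HSM action currently in progress, if any
            [root@client testfs]# lfs hsm_state   testfile    # the 'released' flag disappears once restored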

            Now testing with patch #24238.

            riauxjb Jean-Baptiste Riaux (Inactive) added a comment -

            Patch https://review.whamcloud.com/#/c/24238/ rebased on master.

            Vinayak (vinayakswami.hariharmath@seagate.com) uploaded a new patch: https://review.whamcloud.com/24238
            Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c2c9531d7005b63b1c8afd64c2d5ed22c404526e


            People

              riauxjb Jean-Baptiste Riaux (Inactive)
              fzago Frank Zago (Inactive)
              Votes: 3
              Watchers: 31
