[LU-5216] HSM: restore doesn't restart for new copytool Created: 17/Jun/14  Updated: 25/Aug/18

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Frank Zago (Inactive) Assignee: Jean-Baptiste Riaux (Inactive)
Resolution: Unresolved Votes: 3
Labels: patch
Environment:

Centos 6.5, Lustre 2.5.56


Attachments: Microsoft Word LU-5216_solution.doc    
Issue Links:
Blocker
is blocking LU-10175 DoM:Full support for the LDLM lock co... Resolved
Duplicate
Related
is related to LU-11284 Full lock convert conflicts with HSM Open
is related to LU-8905 tests: sanity-hsm test_3[3-6] does no... Closed
Severity: 3
Rank (Obsolete): 14550

 Description   

When a restoration doesn't complete properly by a copytool, it is not restarted, and processes get stuck in Lustre modules.

For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

# dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
# lfs hsm_archive /mnt/lustre/bigf
# lfs hsm_release  /mnt/lustre/bigf
# sleep 5
# md5sum /mnt/lustre/bigf
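(A hedged aside, not in the original report: before the final md5sum, the release can be confirmed with lfs hsm_state, which should show the released/exists/archived flags seen later in this ticket.)

# lfs hsm_state /mnt/lustre/bigf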

During the restoration, kill the copytool, so no complete event is sent to the MDS.

Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.

When the copytool restarts, nothing happens. The process trying to read the file is stuck (apparently forever) there:

# cat /proc/1675/stack 
[<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
[<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
[<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
[<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
[<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
[<ffffffff81189365>] vfs_read+0xb5/0x1a0
[<ffffffff811894a1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

It is also unkillable.

If I issue "lfs hsm_restore bigf", lfs gets stuck too:

# cat /proc/1723/stack 
[<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
[<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
[<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
[<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
[<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
[<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
[<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
[<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119df21>] sys_ioctl+0x81/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

However, lfs HSM operations still work on other files.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 17/Jun/14 ]

The request timeout is set to 1 hour by default.

Could you find the active request timeout (active_request_timeout) under procfs, set it to a shorter value, and see how it goes?
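For reference, a minimal sketch of adjusting that timeout (parameter path as used in the later comments; the wildcard assumes a single MDT on this MDS):

# lctl set_param mdt.*.hsm.active_request_timeout=600
# lctl get_param mdt.*.hsm.active_request_timeout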

Comment by Frank Zago (Inactive) [ 18/Jun/14 ]

I started clean on the MDS, with a 10-minute timeout.

# cat active_requests
# cat actions
# echo 600 > active_request_timeout

I then recreated the "bigf" file, archived it and released
it. Waited until "stat" reported 1 block for "bigf", and
"active_requests" and "actions" were empty.

Started "md5sum bigf", and pressed Ctrl-C on the copytool when it printed the following:

lhsmtool_posix[1761]: Going to copy 524288000 bytes /vsm/tasfs1/0001/0000/2b10/0000/0002/0000/0x200002b10:0x1:0x0 -> {VOLATILE}=[0x200002b10:0x3:0x0]

The copytool exited properly this time.

The MDS reported:

# cat actions
lrh=[type=10680000 len=136 idx=1/5] fid=[0x200002b10:0x1:0x0] dfid=[0x200002b10:0x1:0x0] compound/cookie=0x53a1adbc/0x53a1adbc action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
[root@tassrv01 hsm]# cat active_requests
fid=[0x200002b10:0x1:0x0] dfid=[0x200002b10:0x3:0x0] compound/cookie=0x53a1adbc/0x53a1adbc action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 data=[] canceled=0 uuid=b9816f33-2782-f150-2539-bfd258250bdc done=0

I then restarted the copytool, after 3 minutes.

At the 10-minute mark, md5sum exited with an error:

# md5sum bigf
md5sum: bigf: No data available

Restarting the command yielded the correct result:

# md5sum bigf
f0bedd9e869415b74b8c2eb5ce0ce863  bigf

The MDS should have restarted the restore operation when the new
copytool registered, which might have prevented md5sum from getting an
ENODATA read error.

Comment by Frank Zago (Inactive) [ 18/Jun/14 ]

Second test, same protocol, but this time the copytool didn't exit after the Ctrl-C.

# ps -aef|grep posix
root      1767  1645  0 10:43 pts/1    00:00:01 [lhsmtool_posix] <defunct>
# cat /proc/1767/stack
[<ffffffff81076fe7>] do_exit+0x5b7/0x870
[<ffffffff810772f8>] do_group_exit+0x58/0xd0
[<ffffffff81077387>] sys_exit_group+0x17/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Another copytool was started.

At the 10-minute mark, both the blocked copytool and md5sum exited.

# md5sum bigf
md5sum: bigf: No data available

Restarting md5sum didn't restore the file this time; it got
stuck. These are all the messages from the copytool, which show it got
stuck early:

# lhsmtool_posix --no-shadow -v --hsm-root /vsm/tasfs1 /mnt/tas01 -v
lhsmtool_posix[1880]: action=0 src=(null) dst=(null) mount_point=/mnt/tas01
lhsmtool_posix[1880]: waiting for message from kernel
lhsmtool_posix[1880]: copytool fs=tas01 archive#=1 item_count=1
lhsmtool_posix[1880]: waiting for message from kernel
lhsmtool_posix[1911]: '[0x200002b10:0x1:0x0]' action RESTORE reclen 72, cookie=0x53a1adbf
lhsmtool_posix[1911]: processing file 'bigf'
lhsmtool_posix[1911]: reading stripe rules from '/vsm/tasfs1/0001/0000/2b10/0000/0002/0000/0x200002b10:0x1:0x0.lov' for '/vsm/tasfs1/0001/0000/2b10/0000/0002/0000/0x200002b10:0x1:0x0'

A successful restore would have had "restoring data from" and "going to copy data from" messages too.

The copytool is stuck, presumably reading from the llog pipe:

# ps -aef|grep posix
root      1880  1867  0 11:06 pts/3    00:00:00 lhsmtool_posix --no-shadow -v --hsm-root /vsm/tasfs1 /mnt/tas01 -v
[root@tasclient01 ~]# cat /proc/1880/stack
[<ffffffff8119368b>] pipe_wait+0x5b/0x80
[<ffffffff81194136>] pipe_read+0x3e6/0x4e0
[<ffffffff81188a7a>] do_sync_read+0xfa/0x140
[<ffffffff81189365>] vfs_read+0xb5/0x1a0
[<ffffffff811894a1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

At the 10-minute mark, md5sum ended in error again:

# md5sum bigf
md5sum: bigf: No data available

The copytool also printed some errors:

lhsmtool_posix[1911]: llapi_hsm_action_begin() on '/mnt/tas01/.lustre/fid/0x200002b10:0x1:0x0' failed: No such file or directory (2)
lhsmtool_posix[1911]: Action completed, notifying coordinator cookie=0x53a1adbf, FID=[0x200002b10:0x1:0x0], hp_flags=0 err=2
lhsmtool_posix[1911]: llapi_hsm_action_end() on '/mnt/tas01/.lustre/fid/0x200002b10:0x1:0x0' failed: No such file or directory (2)

There is no error on the MDS, and both "actions" and "active_requests" are empty.

Once more, I restarted md5sum, and again it got stuck, in the same cycle as before.

At that point I cannot restore that file without shutting down Lustre.

Comment by Joseph Nemeth (Inactive) [ 23/Jul/14 ]

Similar problem. Critical to our applications. Running on RHEL 6.5.

On MDS
lctl set_param mdt.accfs2-MDT0000.hsm_control=enabled
lctl set_param mdt.accfs2-MDT0000.hsm.active_request_timeout=120

On Lustre client
lhsmtool_posix --archive=1 --hsm-root /local /mnt/accfs2/

Note this is running in foreground on this window so I can capture log traces.

I create a small file, /mnt/accfs2/testfile1, then repeatedly perform the following:
lfs hsm_archive /mnt/accfs2/testfile1; sleep 1
lfs hsm_release /mnt/accfs2/testfile1; sleep 1
lfs hsm_restore /mnt/accfs2/testfile1; sleep 1
cat /mnt/accfs2/testfile1; sleep 1
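For reference, the same repeated sequence as a small loop (a sketch; the file path matches the commands above):

while true; do
    lfs hsm_archive /mnt/accfs2/testfile1; sleep 1
    lfs hsm_release /mnt/accfs2/testfile1; sleep 1
    lfs hsm_restore /mnt/accfs2/testfile1; sleep 1
    cat /mnt/accfs2/testfile1 > /dev/null || break; sleep 1   # stop as soon as a read fails
done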

This runs for a while. Then I try a single archive, followed by repeating:
lfs hsm_release /mnt/accfs2/testfile2; sleep 1
lfs hsm_restore /mnt/accfs2/testfile2; sleep 1
cat /mnt/accfs2/testfile2; sleep 1

This also runs for a while. Then I try repeating:
lfs hsm_release /mnt/accfs2/testfile3; sleep 1
cat /mnt/accfs2/testfile3; sleep 1

This invariably hangs on the second cat, for 120 seconds, then fails with a "cat: /mnt/accfs2/testfile3: No data available" error.

There are no interruptions taking place on the copytool. The copytool logging is illustrative.

During the first (successful) cat, I see something like this:

lhsmtool_posix[4230]: copytool fs=accfs2 archive#=1 item_count=1
lhsmtool_posix[4230]: waiting for message from kernel
lhsmtool_posix[4280]: '[0x200002340:0x4:0x0]' action RESTORE reclen 72, cookie=0x53cff58c
lhsmtool_posix[4280]: processing file 'testfile1'
lhsmtool_posix[4280]: reading stripe rules from '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0.lov' for '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0'
lhsmtool_posix[4280]: restoring data from '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0' to '{VOLATILE}=[0x200002340:0x6:0x0]'
lhsmtool_posix[4280]: going to copy data from '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0' to '{VOLATILE}=[0x200002340:0x6:0x0]'
lhsmtool_posix[4280]: data restore from '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0' to '{VOLATILE}=[0x200002340:0x6:0x0]' done
lhsmtool_posix[4280]: Action completed, notifying coordinator cookie=0x53cff58c, FID=[0x200002340:0x4:0x0], hp_flags=0 err=0
lhsmtool_posix[4280]: llapi_hsm_action_end() on '/mnt/accfs2//.lustre/fid/0x200002340:0x4:0x0' ok (rc=0)

During the second (failed) cat, I see something like this:

lhsmtool_posix[4230]: copytool fs=accfs2 archive#=1 item_count=1
lhsmtool_posix[4230]: waiting for message from kernel
lhsmtool_posix[4293]: '[0x200002340:0x4:0x0]' action RESTORE reclen 72, cookie=0x53cff58d
lhsmtool_posix[4293]: processing file 'testfile1'
lhsmtool_posix[4293]: reading stripe rules from '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0.lov' for '/local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0'
(120 second delay)
lhsmtool_posix[4293]: llapi_hsm_action_begin() on '/mnt/accfs2//.lustre/fid/0x200002340:0x4:0x0' failed: No such file or directory (2)
lhsmtool_posix[4293]: Action completed, notifying coordinator cookie=0x53cff58d, FID=[0x200002340:0x4:0x0], hp_flags=0 err=2
lhsmtool_posix[4293]: llapi_hsm_action_end() on '/mnt/accfs2//.lustre/fid/0x200002340:0x4:0x0' failed: No such file or directory (2)

During the 120-second delay, both the copytool and the cat operation are unkillable, and when I have tried to kill them, they become zombies that can only be eliminated by a full node reboot. After the 120-second delay, the cat terminates with failure, and the copytool seems to keep rolling along. However, the testfile is now irrecoverable. For example:

[root@n016 ~]# lfs hsm_state /mnt/accfs2/*
/mnt/accfs2/testfile1: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile10: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile2: (0x00000009) exists archived, archive_id:1
/mnt/accfs2/testfile3: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile4: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile5: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile6: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile7: (0x00000009) exists archived, archive_id:1
/mnt/accfs2/testfile8: (0x0000000d) released exists archived, archive_id:1
/mnt/accfs2/testfile9: (0x0000000d) released exists archived, archive_id:1

[root@n016 ~]# find /local ! -type d |xargs ls -l
-rw-r--r-- 1 root root 47 Jul 23 10:50 /local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0
-rw------- 1 root root 56 Jul 23 10:51 /local/0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0.lov
-rw-r--r-- 1 root root 22 Jul 23 12:13 /local/0004/0000/2341/0000/0002/0000/0x200002341:0x4:0x0
-rw------- 1 root root 56 Jul 23 12:13 /local/0004/0000/2341/0000/0002/0000/0x200002341:0x4:0x0.lov
-rw-r--r-- 1 root root 35 Jul 23 11:09 /local/000d/0000/2340/0000/0002/0000/0x200002340:0xd:0x0
-rw------- 1 root root 56 Jul 23 11:09 /local/000d/0000/2340/0000/0002/0000/0x200002340:0xd:0x0.lov
-rw-r--r-- 1 root root 59 Jul 23 11:26 /local/0013/0000/2340/0000/0002/0000/0x200002340:0x13:0x0
-rw------- 1 root root 56 Jul 23 11:27 /local/0013/0000/2340/0000/0002/0000/0x200002340:0x13:0x0.lov
-rw-r--r-- 1 root root 32 Jul 23 11:30 /local/001b/0000/2340/0000/0002/0000/0x200002340:0x1b:0x0
-rw------- 1 root root 56 Jul 23 11:30 /local/001b/0000/2340/0000/0002/0000/0x200002340:0x1b:0x0.lov
-rw-r--r-- 1 root root 48 Jul 23 11:35 /local/0023/0000/2340/0000/0002/0000/0x200002340:0x23:0x0
-rw------- 1 root root 56 Jul 23 11:35 /local/0023/0000/2340/0000/0002/0000/0x200002340:0x23:0x0.lov
-rw-r--r-- 1 root root 33 Jul 23 11:53 /local/002b/0000/2340/0000/0002/0000/0x200002340:0x2b:0x0
-rw------- 1 root root 56 Jul 23 11:53 /local/002b/0000/2340/0000/0002/0000/0x200002340:0x2b:0x0.lov
-rw------- 1 root root 14 Jul 23 11:58 /local/0039/0000/2340/0000/0002/0000/0x200002340:0x39:0x0_tmp
-rw------- 1 root root 56 Jul 23 11:58 /local/0039/0000/2340/0000/0002/0000/0x200002340:0x39:0x0_tmp.lov
-rw-r--r-- 1 root root 21 Jul 23 12:17 /local/0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0
-rw------- 1 root root 56 Jul 23 12:18 /local/0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0.lov
-rw-r--r-- 1 root root 33 Jul 23 12:19 /local/006f/0000/2341/0000/0002/0000/0x200002341:0x6f:0x0
-rw------- 1 root root 56 Jul 23 12:19 /local/006f/0000/2341/0000/0002/0000/0x200002341:0x6f:0x0.lov
-rw-r--r-- 1 root root 50 Jan 17 2012 /local/.sgimc.partition.format.profile
lrwxrwxrwx 1 root root 52 Jul 23 10:51 /local/shadow/testfile1 -> ../0004/0000/2340/0000/0002/0000/0x200002340:0x4:0x0
lrwxrwxrwx 1 root root 53 Jul 23 12:19 /local/shadow/testfile10 -> ../006f/0000/2341/0000/0002/0000/0x200002341:0x6f:0x0
lrwxrwxrwx 1 root root 52 Jul 23 11:09 /local/shadow/testfile2 -> ../000d/0000/2340/0000/0002/0000/0x200002340:0xd:0x0
lrwxrwxrwx 1 root root 53 Jul 23 11:27 /local/shadow/testfile3 -> ../0013/0000/2340/0000/0002/0000/0x200002340:0x13:0x0
lrwxrwxrwx 1 root root 53 Jul 23 11:30 /local/shadow/testfile4 -> ../001b/0000/2340/0000/0002/0000/0x200002340:0x1b:0x0
lrwxrwxrwx 1 root root 53 Jul 23 11:35 /local/shadow/testfile5 -> ../0023/0000/2340/0000/0002/0000/0x200002340:0x23:0x0
lrwxrwxrwx 1 root root 53 Jul 23 11:53 /local/shadow/testfile6 -> ../002b/0000/2340/0000/0002/0000/0x200002340:0x2b:0x0
lrwxrwxrwx 1 root root 52 Jul 23 12:13 /local/shadow/testfile8 -> ../0004/0000/2341/0000/0002/0000/0x200002341:0x4:0x0
lrwxrwxrwx 1 root root 53 Jul 23 12:18 /local/shadow/testfile9 -> ../0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0

[root@n016 ~]# cat /local/shadow/testfile9
Here's the next try.

So testfile9 is in a released/archived state, and the HSM directory seems to have the intact data. More specifically:

[root@n016 ~]# od -t x1 /local/0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0.lov
0000000 d0 0b d1 0b 01 00 00 00 62 00 00 00 00 00 00 00
0000020 41 23 00 00 02 00 00 00 00 00 10 00 01 00 ff ff
0000040 52 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000060 00 00 00 00 03 00 00 00
0000070

[root@n016 ~]# cat /local/0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0
Here's the next try.

I can't evaluate whether the *.lov values are correct.

Any attempt to restore this fails:

[root@n016 ~]# lfs hsm_restore /local/shadow/testfile9
Cannot read FID of /local/shadow/testfile9: No data available

Copytool is silent.

[root@n016 ~]# cat /mnt/accfs2/testfile9
(120 second delay)
cat: /mnt/accfs2/testfile9: No data available

Copytool reports this:

lhsmtool_posix[3638]: copytool fs=accfs2 archive#=1 item_count=1
lhsmtool_posix[3638]: waiting for message from kernel
lhsmtool_posix[4540]: '[0x200002341:0x62:0x0]' action RESTORE reclen 72, cookie=0x53cff6bd
lhsmtool_posix[4540]: processing file 'testfile9'
lhsmtool_posix[4540]: reading stripe rules from '/local/0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0.lov' for '/local/0062/0000/2341/0000/0002/0000/0x200002341:0x62:0x0'
(120 second delay)
lhsmtool_posix[4540]: llapi_hsm_action_begin() on '/mnt/accfs2//.lustre/fid/0x200002341:0x62:0x0' failed: No such file or directory (2)
lhsmtool_posix[4540]: Action completed, notifying coordinator cookie=0x53cff6bd, FID=[0x200002341:0x62:0x0], hp_flags=0 err=2
lhsmtool_posix[4540]: llapi_hsm_action_end() on '/mnt/accfs2//.lustre/fid/0x200002341:0x62:0x0' failed: No such file or directory (2)

Comment by Robert Read (Inactive) [ 23/Jul/14 ]

I don't know what's going wrong with the copytool here, but I did notice something odd with one of the things you tried. Here it looks like you did an hsm_restore on a file in the shadow directory instead of the Lustre filesystem, and in this case the error and lack of output from the copytool appear to be correct. That doesn't explain why it's not being restored in the other cases, though.

[root@n016 ~]# lfs hsm_restore /local/shadow/testfile9
Cannot read FID of /local/shadow/testfile9: No data available
Copytool is silent.
Comment by Joseph Nemeth (Inactive) [ 23/Jul/14 ]

Oops. Yes, ignore that one.

Comment by Joseph Nemeth (Inactive) [ 24/Jul/14 ]

Some additional information.

Restarting the copytool does not clear the problem. Subsequent attempts to restore the file, once it has failed this way, continue to fail in the same way.

Stopping the copytool, issuing a shutdown to the Lustre HSM coordinator, then re-enabling the coordinator and restarting the copytool, does not clear the problem.

Stopping the copytool, unmounting and remounting the Lustre FS on the client node, then restarting the copytool seems to clear the problem. It is now possible to restore the file. However, releasing the file and then attempting to restore again will (typically) hang.
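For reference, the recovery sequence described above, as a sketch (the mount source is a placeholder for this setup):

# on the copytool node
pkill lhsmtool_posix                                        # stop the copytool, if it is still killable
umount /mnt/accfs2                                          # unmount the Lustre client
mount -t lustre <mgsnode>@tcp:/accfs2 /mnt/accfs2           # remount it
lhsmtool_posix --archive=1 --hsm-root /local /mnt/accfs2/   # restart the copytool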

There seems to be some timing issue involved. This doesn't always fail. For instance, after one remount of the FS, I was able to release/restore several times in a row with no problems, and the copytool always showed completion of the restore operation. Then I tried release/implicit-restore (by trying to read the file), and it hung immediately.

Comment by gaurav mahajan (Inactive) [ 27/Feb/15 ]

If we run "md5sum bigf" again (triggering a restore) after active_request_timeout has expired (with the copytool killed and restarted), the restore happens properly.

After active_request_timeout, the coordinator should pick up the tasks that were stopped when the copytool was killed and start the operations again (resuming an operation does not work, so restarting the stopped tasks from the beginning is the better option).

But once active_request_timeout has passed, the previous HSM requests are removed from the active requests, and no information remains about the stopped (interrupted) tasks.

So do we need one more /proc/fs/lustre entry to keep track of interrupted tasks and restart them if they are not completed within the specified active_request_timeout?
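For what it's worth, the manual workaround today (a sketch, using commands that appear elsewhere in this ticket) is to wait for the timeout to expire and then re-issue the request by hand:

lfs hsm_action /mnt/lustre/bigf     # show the current in-progress HSM action, if any
lfs hsm_state /mnt/lustre/bigf      # should still report "released exists archived"
lfs hsm_restore /mnt/lustre/bigf    # queue a fresh restore for the new copytool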

Comment by gaurav mahajan (Inactive) [ 02/Mar/15 ]

Query:

If a new copytool somehow restarts the restore, isn't that against the purpose of killing the previous copytool? Suppose I killed the copytool intending to stop the hsm_restore, and a new copytool restarts the restore process - that doesn't make sense, right?

Could someone please resolve my query? I'm confused.

Comment by gaurav mahajan (Inactive) [ 04/Mar/15 ]

Please review the proposed solution and guide me

Comment by gaurav mahajan (Inactive) [ 17/Mar/15 ]

http://review.whamcloud.com/14087

Comment by Frank Zago (Inactive) [ 17/Mar/15 ]

Hi Gaurav,

I think your proposed solution is not the right way to do it, because it will only work for the posix copytool, and only if it is running on the MDT.

A much better way would be for the Lustre client on which the copytool runs to tell the MDT when the copytool stops/dies. The MDT could then recycle the existing requests, which could be resent when a new copytool starts. The same thing should happen if the Lustre client itself gets evicted.

Comment by gaurav mahajan (Inactive) [ 18/Mar/15 ]

Thanks for the mighty suggestion, Frank. I will look into it.

Comment by gaurav mahajan (Inactive) [ 19/May/15 ]

Below is my new approach for LU-5216. Please advise.

While unregistering a copytool agent from the coordinator:

1. Find the requests being processed by the particular agent, using the UUID of the copytool in the car_request_list.

2. Once the car (cdt_agent_req) of that copytool agent is found, place the client details (obtained from ptlrpc; take an md5sum of the client name) in car->car_uuid. (This hack of storing the client name is needed so the stuck requests can be run when a new copytool comes alive on that client, since we need to assign the requests only to a copytool running on that particular client.)

While registering a copytool with the coordinator:

3. Determine whether the registration request for a copytool agent comes from a client on which a copytool was killed (compare the md5sum of the client name from ptlrpc with the client details already stored).

4. If the request is from the same client, go through the car_request_list, get the car (cdt_agent_req), remove it from the existing list, and re-add the same request to the list, placing the UUID of the new copytool in car->car_uuid.

Comment by Frank Zago (Inactive) [ 20/May/15 ]

That won't work if the copytool on node A dies, and a new copytool is started on node B (as in an HA setup, for instance). If the MDS is aware that a copytool dies, it should just reset the requests (or not, depending on progress already made?), and give them to the next copytool that requests them like any new request. The biggest code change would be to detect when a copytool is no longer reachable/functional.
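For context, the coordinator's current view of registered copytools can be inspected on the MDS (a sketch; the exact parameter layout may vary by version):

# lctl get_param mdt.*.hsm.agents    # list the agents (copytools) the coordinator believes are registered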

Comment by gaurav mahajan (Inactive) [ 20/May/15 ]

Yes, Frank.
>> That won't work if the copytool on node A dies, and a new copytool is started on node B.
That's why I want to store the md5sum of the client name in car->car_uuid (the place where the copytool's UUID is stored; once the copytool dies this UUID no longer has any meaning, so we can use it as a placeholder for the client name). This helps identify the right node to re-queue the requests to when a new copytool comes alive on the same node.

>> If the MDS is aware that a copytool dies, it should just reset the requests (or not, depending on progress already made?), and give them to the next copytool that requests them like any new request

Yes, this part is also handled. Old requests will be added to the tail of the car_request_list, placing the UUID of the next copytool in car->car_uuid (the next copytool on the same node treats the old requests as new).

>> The biggest code change would be to detect when a copytool is no longer reachable/functional.
Currently, some of the signals caught by the copytool tell the coordinator to unregister the copytool. I need to look into it, but I think the above logic works.

Comment by Frank Zago (Inactive) [ 21/May/15 ]

But the copytool may not restart on the same node, ever. So why keep the uuid of that copytool around?

Comment by gaurav mahajan (Inactive) [ 22/May/15 ]

Okay, Frank, I get your point.
I was thinking about the HSM storage: even if a copytool starts again on the same node, there is no guarantee the same HSM storage is used. I agree that storing the UUID of the CT does not make sense.

Apart from this, do you have any other suggestions, or should I proceed with this solution?

Comment by Vinayak (Inactive) [ 07/Apr/16 ]

I have submitted a patch to address this issue (it also considers the client eviction case). Please review and give your feedback so that all possible scenarios are covered.

Patch can be tracked at : http://review.whamcloud.com/#/c/19369/

Comment by Frank Zago (Inactive) [ 07/Apr/16 ]

Thanks Vinayak. I'll try to test it soon.

Comment by Frank Zago (Inactive) [ 11/Apr/16 ]

Why cancel the requests? Can't they be just reset? IMO, if a copytool fails, we should not cancel those requests, because that will force a user to restart the archiving requests, and every failed restore command will create issues for applications wanting to access the files.

The archive requests can be reset in the queue, and restarted later when a new copytool shows up. As for the restore requests, they have a timeout that should be respected, and they shouldn't fail just because the copytool was terminated.

Comment by Vinayak (Inactive) [ 12/Apr/16 ]

Hello Frank,

I have considered 2 cases here

1. Kill the copytool (a kill without -9)

In this case (as this issue explains), if the copytool is killed during any HSM activity, the copytool unregistration process will wait for the ongoing HSM activity to finish before unregistration continues.
Example: md5sum on a released file triggers an hsm_restore of the file. If we kill the copytool during this restore, the restore gets stuck in state=STARTED and in turn md5sum keeps waiting for the restore to complete until the request timeout. The patch I have submitted waits for this activity to finish so md5sum returns cleanly.

2. The data mover node (client) is powered down - no need to wait for the request to complete

In this case the data mover node itself is down, and no copytool is available at the moment to continue the activity. So I decided to unregister the copytool and cancel all the requests running on that CT, since the process waiting on the HSM action is also killed when the node is powered down. That way, other copytools looking to do HSM activity on the same file can be processed successfully.

Answers to your questions.
>> Why cancel the requests? Can't they be just reset?
1. Does "reset" mean adding the request back to the queue and processing it when a new copytool with the same archive ID comes alive? That would still make the process keep waiting for the HSM action to complete.

2. Point #1 explains this. I am not resetting the requests but waiting for them to complete, which also lets the process waiting on the HSM activity return. The ordering in the patch might be confusing. Earlier I was trying to reassign the running requests to any other copytool with the same archive ID in case the current copytool gets killed, but I stopped there as it is against the present HSM design.

I would like to bring up one more point here: HSMA_CANCEL is not yet implemented in the copytool. Even if we send HSMA_CANCEL to the copytool, it does nothing:

        case HSMA_CANCEL:
                CT_TRACE("cancel not implemented for file system '%s'",
                         opt.o_mnt);
                /* Don't report progress to coordinator for this cookie:
                 * the copy function will get ECANCELED when reporting
                 * progress. */
                err_minor++;
                return 0;
                break;

In many cases, a CT killed during an hsm_restore poses a layout-change problem: if the file is halfway restored when the CT is killed, we cannot really reset the request. The flow is quite copytool-centric, and the sequence of actions is controlled by the copytool.

Please correct me if I am wrong.

Comment by Vinayak (Inactive) [ 18/Apr/16 ]

Hello Frank,

Any updates/feedback from you?

Thanks,

Comment by Frank Zago (Inactive) [ 18/Apr/16 ]

Hi Vinayak,

How can your fix be tested if the restore action completes all the time because Lustre refuses to let the copytool die? I think Lustre preventing a process from being killed is a bad thing, especially when kill -KILL is used. For instance, that prevents a machine from being shut down as long as a restore is in progress. This is not introduced by this fix, though.

Comment by Vinayak (Inactive) [ 19/Apr/16 ]

Hello Frank,

>> 1. How can your fix be tested if the restore action completes all the time because Lustre refuses to let the copytool die?
In the case of a normal kill (machine still up), we need to wait for the HSM actions to complete, otherwise the process waiting on them will get stuck. Example: md5sum on an archived and released file triggers a restore action.

>> 2. For instance, that prevents a machine from being shut down as long as a restore is in progress.
A forced shutdown will cause a client eviction; the actions running on the CT on that node will be cancelled, and the layout lock will be released in the case of an hsm restore.

I agree that resetting the actions is better than cancelling them in the client eviction case (2nd case), but it requires a lot of work, mainly for restore:
1. If a restore is halfway finished and the CT gets killed, IMO we cannot simply put the request back into the queue for another CT to pick up. Restore results in layout changes, so the half-restored file should be cleaned from the Lustre storage first, and only then can we re-queue the request.

2. If no CT is running on any client, the request will stay in the queue until a new CT with the same archive ID comes up.

3. I am not sure about this case, but here are my thoughts: if a new CT comes online with a different archive ID, an HSM request for the same file goes into the WAITING state, as there is already an action in the list.

Please share your thoughts and I will improve the patch.

Comment by Gerrit Updater [ 08/Dec/16 ]

Vinayak (vinayakswami.hariharmath@seagate.com) uploaded a new patch: https://review.whamcloud.com/24238
Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c2c9531d7005b63b1c8afd64c2d5ed22c404526e

Comment by Jean-Baptiste Riaux (Inactive) [ 25/Sep/17 ]

Patch https://review.whamcloud.com/#/c/24238/ rebased on master.

Comment by Jean-Baptiste Riaux (Inactive) [ 25/Sep/17 ]

Started testing with the current master release (patch #24238 not applied), as a lot of HSM code has landed since last year's tests.

For a single-node copytool setup, the same behavior is observed as before:

[root@client ~]# dd if=/dev/urandom of=/lustre/testfs/testfile bs=$((1024*1024)) count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 62.3971 s, 17.2 MB/s
[root@client ~]# lfs hsm_archive /lustre/testfs/testfile
[root@client ~]# lfs hsm_release /lustre/testfs/testfile
[root@client ~]# md5sum /lustre/testfs/testfile
# while md5sum has been started, kill the copytool (restoring)
[root@client2 ~]# lhsmtool_posix  --hsm-root /lustre/HSM/ --archive=1 /lustre/testfs
1506345768.797131 lhsmtool_posix[2218]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
1506345768.801292 lhsmtool_posix[2218]: waiting for message from kernel
1506345788.128256 lhsmtool_posix[2218]: copytool fs=testfs archive#=1 item_count=1
1506345788.128356 lhsmtool_posix[2218]: waiting for message from kernel
1506345788.128407 lhsmtool_posix[2219]: '[0x200000402:0x4:0x0]' action ARCHIVE reclen 72, cookie=0x59c8fec7
1506345788.129443 lhsmtool_posix[2219]: processing file 'testfile'
1506345788.182456 lhsmtool_posix[2219]: archiving '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' to '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
1506345788.184770 lhsmtool_posix[2219]: saving stripe info of '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' in /lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp.lov
1506345788.186209 lhsmtool_posix[2219]: start copy of 1073741824 bytes from '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' to '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
1506345802.241652 lhsmtool_posix[2219]: copied 1073741824 bytes in 14.056800 seconds
1506345802.711506 lhsmtool_posix[2219]: data archiving for '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' to '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' done
1506345802.711621 lhsmtool_posix[2219]: attr file for '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' saved to archive '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
1506345802.712621 lhsmtool_posix[2219]: fsetxattr of 'trusted.hsm' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
1506345802.712645 lhsmtool_posix[2219]: fsetxattr of 'trusted.link' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
1506345802.712660 lhsmtool_posix[2219]: fsetxattr of 'trusted.lov' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
1506345802.712673 lhsmtool_posix[2219]: fsetxattr of 'trusted.lma' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=0 (Success)
1506345802.712684 lhsmtool_posix[2219]: fsetxattr of 'lustre.lov' on '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp' rc=-1 (Operation not supported)
1506345802.712688 lhsmtool_posix[2219]: xattr file for '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' saved to archive '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0_tmp'
1506345802.713441 lhsmtool_posix[2219]: symlink '/lustre/HSM//shadow/testfile' to '../0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' done
1506345802.713902 lhsmtool_posix[2219]: Action completed, notifying coordinator cookie=0x59c8fec7, FID=[0x200000402:0x4:0x0], hp_flags=0 err=0
1506345802.714923 lhsmtool_posix[2219]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' ok (rc=0)
1506345837.128875 lhsmtool_posix[2218]: copytool fs=testfs archive#=1 item_count=1
1506345837.128965 lhsmtool_posix[2218]: waiting for message from kernel
1506345837.128998 lhsmtool_posix[2223]: '[0x200000402:0x4:0x0]' action RESTORE reclen 72, cookie=0x59c8fec8
1506345837.130409 lhsmtool_posix[2223]: processing file 'testfile'
1506345837.130450 lhsmtool_posix[2223]: reading stripe rules from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0.lov' for '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0'
1506345837.139686 lhsmtool_posix[2223]: restoring data from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x3:0x0]'
1506345837.145800 lhsmtool_posix[2223]: start copy of 1073741824 bytes from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x3:0x0]'
^Cexiting: Interrupt

#Restart copytool
[root@client2 ~]# lhsmtool_posix  --hsm-root /lustre/HSM/ --archive=1 /lustre/testfs
1506345854.943013 lhsmtool_posix[2224]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
1506345854.944935 lhsmtool_posix[2224]: waiting for message from kernel

#md5sum process hangs until active_request_timeout is reached (unkillable process)
[root@client testfs]# cat /proc/2731/stack
[<ffffffffa08b7791>] ldlm_completion_ast+0x5a1/0x910 [ptlrpc]
[<ffffffffa08b9448>] ldlm_cli_enqueue_fini+0x938/0xdb0 [ptlrpc]
[<ffffffffa08bc3b2>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
[<ffffffffa0afc1df>] mdc_enqueue_base+0x2bf/0x17d0 [mdc]
[<ffffffffa0afe64b>] mdc_intent_lock+0x26b/0x520 [mdc]
[<ffffffffa0b2f09f>] lmv_intent_lock+0x5cf/0x1b50 [lmv]
[<ffffffffa0b6eec3>] ll_layout_refresh+0x503/0xfb0 [lustre]
[<ffffffffa0bb132b>] vvp_io_fini+0xfb/0x2c0 [lustre]
[<ffffffffa06d5275>] cl_io_fini+0x75/0x270 [obdclass]
[<ffffffffa0b61a5b>] ll_file_io_generic+0x41b/0xb50 [lustre]
[<ffffffffa0b62a47>] ll_file_aio_read+0x347/0x3e0 [lustre]
[<ffffffffa0b62bbb>] ll_file_read+0xdb/0x3d0 [lustre]
[<ffffffff811fdf9e>] vfs_read+0x9e/0x170
[<ffffffff811feb6f>] SyS_read+0x7f/0xe0
[<ffffffff81696709>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# At the 2-minute mark
[root@client ~]# md5sum /lustre/testfs/testfile
md5sum: /lustre/testfs/testfile: No data available

The copytool finished restoring the file after the 2-minute mark:
1506346268.134941 lhsmtool_posix[2224]: copytool fs=testfs archive#=1 item_count=1
1506346268.135206 lhsmtool_posix[2224]: waiting for message from kernel
1506346268.135286 lhsmtool_posix[2249]: '[0x200000402:0x4:0x0]' action RESTORE reclen 72, cookie=0x59c8fec9
1506346268.138556 lhsmtool_posix[2249]: processing file 'testfile'
1506346268.138618 lhsmtool_posix[2249]: reading stripe rules from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0.lov' for '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0'
1506346268.149043 lhsmtool_posix[2249]: restoring data from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x4:0x0]'
1506346268.154331 lhsmtool_posix[2249]: start copy of 1073741824 bytes from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x4:0x0]'
1506346279.502113 lhsmtool_posix[2249]: copied 1073741824 bytes in 11.350319 seconds
1506346279.502147 lhsmtool_posix[2249]: data restore from '/lustre/HSM//0004/0000/0402/0000/0002/0000/0x200000402:0x4:0x0' to '{VOLATILE}=[0x200000403:0x4:0x0]' done
1506346279.502152 lhsmtool_posix[2249]: Action completed, notifying coordinator cookie=0x59c8fec9, FID=[0x200000402:0x4:0x0], hp_flags=0 err=0
1506346279.550123 lhsmtool_posix[2249]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x4:0x0' ok (rc=0)

Test with a two-node copytool setup: the copytool started on the second node (after killing the first instance on the primary node) never restored the file.
I had to wait until md5sum failed (active_request_timeout) and re-trigger md5sum to restore the file.

[root@client testfs]# lfs hsm_release testfile
[root@client testfs]# md5sum testfile

# killing copytool while restoring
[root@client2 HSM]# lhsmtool_posix  --hsm-root /storage/HSM/ --archive=1 /lustre/testfs
1506348240.935844 lhsmtool_posix[2304]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
1506348240.939058 lhsmtool_posix[2304]: waiting for message from kernel
1506348298.780877 lhsmtool_posix[2304]: copytool fs=testfs archive#=1 item_count=1
1506348298.780976 lhsmtool_posix[2304]: waiting for message from kernel
1506348298.781022 lhsmtool_posix[2305]: '[0x200000402:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x59c90b5a
1506348298.782327 lhsmtool_posix[2305]: processing file 'testfile'
[...]
1506348317.463070 lhsmtool_posix[2305]: Action completed, notifying coordinator cookie=0x59c90b5a, FID=[0x200000402:0x1:0x0], hp_flags=0 err=0
1506348317.463947 lhsmtool_posix[2305]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x1:0x0' ok (rc=0)
1506348507.782112 lhsmtool_posix[2304]: copytool fs=testfs archive#=1 item_count=1
1506348507.783689 lhsmtool_posix[2304]: waiting for message from kernel
1506348507.783731 lhsmtool_posix[2334]: '[0x200000402:0x1:0x0]' action RESTORE reclen 72, cookie=0x59c90b5b
1506348507.785689 lhsmtool_posix[2334]: processing file 'testfile'
1506348507.785800 lhsmtool_posix[2334]: reading stripe rules from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0.lov' for '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0'
1506348507.799790 lhsmtool_posix[2334]: restoring data from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000403:0x1:0x0]'
1506348507.805419 lhsmtool_posix[2334]: start copy of 1073741824 bytes from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000403:0x1:0x0]'
^Cexiting: Interrupt

 # starting another copytool 

[root@client2 HSM]# ssh client3
Last login: Mon Sep 25 16:02:17 2017 from client2
[root@client3 ~]# lhsmtool_posix  --hsm-root /storage/HSM/ --archive=1 /lustre/testfs
1506348526.396743 lhsmtool_posix[2289]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
1506348526.404385 lhsmtool_posix[2289]: waiting for message from kernel

# actions on the MDS:
[root@mds1 lustre]# cat mdt/*/hsm/actions
lrh=[type=10680000 len=136 idx=1/3] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x59c90a8c/0x59c90b5b action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=CANCELED data=[]

# after the 2-minute mark
[root@client testfs]# md5sum testfile
md5sum: testfile: No data available

# then nothing happened on the copytool (action CANCELED)
# retrigger md5sum on testfile works (restored)

[root@client testfs]# md5sum testfile
f1f873e4cad553d692cfefa93ff381a9  testfile
1506348841.783293 lhsmtool_posix[2289]: copytool fs=testfs archive#=1 item_count=1
1506348841.783481 lhsmtool_posix[2289]: waiting for message from kernel
1506348841.783565 lhsmtool_posix[2290]: '[0x200000402:0x1:0x0]' action RESTORE reclen 72, cookie=0x59c90b5c
1506348841.786109 lhsmtool_posix[2290]: processing file 'testfile'
1506348841.786163 lhsmtool_posix[2290]: reading stripe rules from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0.lov' for '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0'
1506348841.820678 lhsmtool_posix[2290]: restoring data from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000404:0x1:0x0]'
1506348841.826554 lhsmtool_posix[2290]: start copy of 1073741824 bytes from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000404:0x1:0x0]'
1506348853.333816 lhsmtool_posix[2290]: copied 1073741824 bytes in 11.509827 seconds
1506348853.333871 lhsmtool_posix[2290]: data restore from '/storage/HSM//0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0' to '{VOLATILE}=[0x200000404:0x1:0x0]' done
1506348853.334950 lhsmtool_posix[2290]: Action completed, notifying coordinator cookie=0x59c90b5c, FID=[0x200000402:0x1:0x0], hp_flags=0 err=0
1506348853.405275 lhsmtool_posix[2290]: llapi_hsm_action_end() on '/lustre/testfs/.lustre/fid/0x200000402:0x1:0x0' ok (rc=0)

So it is as Frank said: the MDS should have restarted the restore operation when the new copytool registered.

Now testing with patch #24238.

Comment by Jean-Baptiste Riaux (Inactive) [ 25/Sep/17 ]

Tests with patch #24238 applied

Single node setup, normal kill (no eviction).

[root@client testfs]# lfs hsm_archive testfile
[root@client testfs]# lfs hsm_release testfile
[root@client testfs]# md5sum testfile #killing copytool at this point and restarting it
md5sum: testfile: No data available

As specified before, in the case of a normal kill (server still online, no eviction): wait until the HSM actions complete (active_request_timeout reached) so that processes waiting on HSM actions can end.

Two-CT-node setup: abrupt shutdown, then starting a new copytool on another node (active_request_timeout set to 600s).

[root@client2 testfs]# dd if=/dev/urandom of=./testfile4 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 62.7363 s, 17.1 MB/s
[root@client2 testfs]# lfs hsm_archive testfile4
[root@client2 testfs]# lfs hsm_release testfile4
[root@client2 testfs]# md5sum testfile4

On copytool:
1506354374.424365 lhsmtool_posix[2150]: '[0x200000404:0x3:0x0]' action RESTORE reclen 72, cookie=0x59c91bb3
1506354374.426135 lhsmtool_posix[2150]: processing file 'testfile4'
1506354374.426183 lhsmtool_posix[2150]: reading stripe rules from '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0.lov' for '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0'
1506354374.443349 lhsmtool_posix[2150]: restoring data from '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0' to '{VOLATILE}=[0x200000405:0x1:0x0]'
1506354374.449763 lhsmtool_posix[2150]: start copy of 1073741824 bytes from '/storage/HSM//0003/0000/0404/0000/0002/0000/0x200000404:0x3:0x0' to '{VOLATILE}=[0x200000405:0x1:0x0]'

Abrupt shutdown of client running copytool.
Starting another copytool on another client.

[root@client3 ~]# lhsmtool_posix  --hsm-root /storage/HSM/ --archive=1 /lustre/testfs
1506354389.090785 lhsmtool_posix[53551]: action=0 src=(null) dst=(null) mount_point=/lustre/testfs
1506354389.094404 lhsmtool_posix[53551]: waiting for message from kernel

Now waiting for the MDS to evict the first node that ran the copytool:

Sep 25 17:49:56 mds1 kernel: Lustre: testfs-MDT0001: haven't heard from client d90b0949-266c-b31e-93a6-e0ea0af9532c (at 192.168.10.131@tcp) in 234 seconds. I think it's dead, and I am evicting it. exp ffff8800b245dc00, cur 1506354596 expire 1506354446 last 1506354362

[root@mds1 lustre]# cat mdt/*/hsm/actions
lrh=[type=10680000 len=136 idx=1/8] fid=[0x200000404:0x3:0x0] dfid=[0x200000404:0x3:0x0] compound/cookie=0x59c91b6d/0x59c91bb3 action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]

The restore is still in STARTED state and will time out when active_request_timeout is reached.
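A small sketch of how that can be watched from the MDS while waiting (proc paths as used above):

# poll the coordinator state until the stale RESTORE entry ages out
watch -n 30 "cat /proc/fs/lustre/mdt/*/hsm/active_requests /proc/fs/lustre/mdt/*/hsm/actions"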

Two-node setup, two CTs running at start.

[root@client2 testfs]# lfs hsm_archive testfile5
[root@client2 testfs]# lfs hsm_release testfile5
[root@client2 testfs]# md5sum testfile5

# Abrupt shutdown on the node restoring.
# MDS detecting failure
Sep 25 18:02:50 mds1 kernel: LustreError: 77432:0:(mdt_hsm_cdt_agent.c:573:mdt_hsm_agent_send()) testfs-MDT0000: agent uuid (d90b0949-266c-b31e-93a6-e0ea0af9532c) not found, unregistering: rc = -2
# Client evicted
Sep 25 18:09:12 mds1 kernel: Lustre: MGS: haven't heard from client f23e7274-1b60-37b1-a634-1cbc55f4c8c6 (at 192.168.10.139@tcp) in 232 seconds. I think it's dead, and I am evicting it. exp ffff880046b74400, cur 1506355752 expire 1506355602 last 1506355520
Sep 25 18:09:19 mds1 kernel: Lustre: testfs-MDT0001: haven't heard from client be0c72ea-cfac-a42f-35a4-2fe303653d88 (at 192.168.10.139@tcp) in 239 seconds. I think it's dead, and I am evicting it. exp ffff880046b75800, cur 1506355759 expire 1506355609 last 1506355520
[root@mds1 lustre]# cat mdt/*/hsm/actions
lrh=[type=10680000 len=136 idx=1/11] fid=[0x200000404:0x5:0x0] dfid=[0x200000404:0x5:0x0] compound/cookie=0x59c91b70/0x59c91bb6 action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]

But I still had to wait for the timeout for the process to end (other requests to the same file from different clients do work).
Checking the logs, I see that the action is canceled before the agent is unregistered (as specified in point 1 of the commit message).

00000004:20000000:0.0:1506356124.870413:0:77432:0:(mdt_coordinator.c:1565:mdt_hsm_update_request_state()) Not updating record: fid=[0x200000404:0x5:0x0] cookie=0x59c91bb6 action=RESTORE status=FAILED
00000004:00020000:0.0:1506356125.870598:0:77432:0:(mdt_hsm_cdt_agent.c:573:mdt_hsm_agent_send()) testfs-MDT0000: agent uuid (be0c72ea-cfac-a42f-35a4-2fe303653d88) not found, unregistering: rc = -2
00000004:00000001:0.0:1506356125.874145:0:77432:0:(mdt_hsm_cdt_agent.c:198:mdt_hsm_agent_unregister()) Process entered
00000004:00000010:0.0:1506356125.874151:0:77432:0:(mdt_hsm_cdt_agent.c:217:mdt_hsm_agent_unregister()) kfreed 'ha->ha_archive_id': 4 at ffff880046585990.
00000004:00000010:0.0:1506356125.874155:0:77432:0:(mdt_hsm_cdt_agent.c:218:mdt_hsm_agent_unregister()) kfreed 'ha': 80 at ffff880108d89480.
00000004:00000001:0.0:1506356125.874158:0:77432:0:(mdt_hsm_cdt_agent.c:220:mdt_hsm_agent_unregister()) Process leaving via out (rc=0 : 0 : 0x0)
00000004:20000000:0.0:1506356125.874160:0:77432:0:(mdt_hsm_cdt_agent.c:222:mdt_hsm_agent_unregister()) agent be0c72ea-cfac-a42f-35a4-2fe303653d88 unregistration: 0

One last test remains: an HA setup with a normal kill (no abrupt shutdown, point 2 of the commit message).

Comment by Sergey Cheremencev [ 25/Oct/17 ]

Hello!

Vinayak will not work on this ticket anymore.

Please let me know if I can help with review or other things to move it forward.

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24238/
Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 462c7aae05dfc9cd730f44ffdc661c4c36294012

Comment by Peter Jones [ 17/Dec/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 20/Dec/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30615
Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a57dcdbca18f9721169f06c312519c619856a656

Comment by Gerrit Updater [ 20/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30615/
Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: be4507fb45074ad24208c494f98a00da90b13665

Comment by Gerrit Updater [ 31/Jan/18 ]

Jean-Baptiste Riaux (riaux.jb@intel.com) uploaded a new patch: https://review.whamcloud.com/31105
Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c3d17bd9e25d5d27ba55835801798e2fdea36c49

Comment by Jean-Baptiste Riaux (Inactive) [ 31/Jan/18 ]

A new patch (https://review.whamcloud.com/#/c/31105/) is available, including only the first part of the previous patch (https://review.whamcloud.com/24238/, which landed and was reverted by https://review.whamcloud.com/30615/).

This patch keeps only the first fix:
Unexpected client (data mover node) eviction could
cause on going hsm requests to be stuck in "STARTED"
state as the copy tool running on the data mover node
is not available anymore and requests could not be
finished. This patch unregisters the copy tool and
cancels all the requests on the copytool's agent.

Comment by Cory Spitz [ 06/Mar/18 ]

pjones, I'd like to get this on the 2.11.0 radar. May we set the Fix Version? bevans & sergey from Cray will follow with more to better kick-start the conversation.

Comment by Peter Jones [ 06/Mar/18 ]

spitzcor, why don't we discuss this at the upcoming LWG meeting?

Comment by Cory Spitz [ 09/Mar/18 ]

We discussed this at the 3/8 LWG. JB, we'll try to help you land your patch and work out any issues with John H. Thanks.

Comment by Sergey Cheremencev [ 11/Apr/18 ]

Could someone clarify the question about the second part of the patch?
Do you have plans to land it?

Right now the 1st part of the patch (https://review.whamcloud.com/#/c/31105/) doesn't solve the issue of killing the copytool on an agent. The original sanity-hsm test_62 that demonstrated the problem was changed starting from patch set 3, and that is why it passes - before patch set 3 it always failed.

If you plan to land the 2nd part right after the 1st, I can prepare and send that patch. I suggest adding two sanity-hsm test_62 tests for the 2nd part - "a - Evicting a client should cancel its requests" and "b - Stopping a copytool should cancel its requests".

Please let me know your position.

Thanks Sergey

Comment by Ben Evans (Inactive) [ 17/Apr/18 ]

Sergey, I've got a patch worked up for the second part of the patch, but it depends on the first.  As soon as the first lands, I'll push the second one.

Generated at Sat Feb 10 01:49:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.