Details
Bug
Resolution: Unresolved
Major
None
Lustre 2.5.1
CentOS 6.5, Lustre 2.5.56
3
14550
Description
When a copytool does not complete a restore properly, the restore is not restarted, and processes get stuck in Lustre modules.
For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:
# dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
# lfs hsm_archive /mnt/lustre/bigf
# lfs hsm_release /mnt/lustre/bigf
# sleep 5
# md5sum /mnt/lustre/bigf
During the restoration, kill the copytool, so no complete event is sent to the MDS.
Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.
When the copytool restarts, nothing happens. The process trying to read the file is stuck (apparently forever) there:
# cat /proc/1675/stack
[<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
[<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
[<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
[<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
[<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
[<ffffffff81189365>] vfs_read+0xb5/0x1a0
[<ffffffff811894a1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
It is also unkillable.
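To check other processes for the same symptom, a small helper can test whether a kernel stack dump contains a given symbol. This is only a sketch: the function name stack_has_symbol is hypothetical, and it assumes the "[<address>] symbol+0xoffset/0xsize" layout shown in the trace above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): succeed if the stack dump in
# file $1 contains a frame for the kernel symbol $2.
# Assumes frames look like "[<addr>] symbol+0xoff/0xsize [module]".
stack_has_symbol() {
    # Fixed-string match on "] SYMBOL+0x", so "ll_file_read" does not
    # also match a longer symbol name that merely starts with it.
    grep -qF -- "] ${2}+0x" "$1"
}
```

For the reader above, `stack_has_symbol /proc/1675/stack ll_layout_refresh && echo "blocked in layout refresh"` would flag the process. Reading /proc/<pid>/stack normally requires root.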
If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:
# cat /proc/1723/stack
[<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
[<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
[<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
[<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
[<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
[<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
[<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
[<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119df21>] sys_ioctl+0x81/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
However, lfs hsm operations still work on other files.
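One way to confirm that the interrupted restore is still recorded on the MDS is to filter the coordinator's action list for unfinished RESTORE requests. The following is a sketch under assumptions, not output from this ticket: the function name pending_restores is hypothetical, and it assumes each line of the action list carries "action=" and "status=" key=value fields, with WAITING and STARTED as the unfinished states.

```shell
#!/bin/sh
# Hypothetical filter (not part of Lustre): read HSM action records on
# stdin and print only RESTORE requests that are still pending.
# Assumes key=value fields "action=..." and "status=..." on each line.
pending_restores() {
    grep -F 'action=RESTORE' | grep -E 'status=(WAITING|STARTED)'
}
```

On the MDS this could be fed with `lctl get_param -n mdt.*.hsm.actions | pending_restores`; a RESTORE entry for bigf that stays listed after the copytool has been killed would match the behavior described above.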
Could someone clarify the status of the second part of the patch?
Do you have plans to land it?
Right now the 1st part of the patch (https://review.whamcloud.com/#/c/31105/) doesn't solve the issue with killing the copytool on an agent. The original sanity-hsm_62 test that demonstrated the problem was changed starting with patchset 3, which is why it now passes; before patchset 3 it always failed.
If you plan to land the 2nd part right after the 1st part of the patch, I can prepare and send that patch. For the 2nd part, I suggest adding two sanity-hsm_62 tests: "a - Evicting a client should cancel its requests" and "b - Stopping a copytool should cancel its requests".
Please let me know your position.
Thanks, Sergey