[LU-5216] HSM: restore doesn't restart for new copytool - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.5.1
Labels:
- patch
Environment:
CentOS 6.5, Lustre 2.5.56

Severity:
3
Rank (Obsolete):
14550

Description

When a copytool does not complete a restore properly, the restore is not restarted, and processes get stuck in Lustre modules.

For instance, create a large file that takes a few seconds to archive/restore, then release it, and access it:

# dd if=/dev/urandom of=/mnt/lustre/bigf bs=1M count=1000
# lfs hsm_archive /mnt/lustre/bigf
# lfs hsm_release  /mnt/lustre/bigf
# sleep 5
# md5sum /mnt/lustre/bigf

During the restoration, kill the copytool, so no complete event is sent to the MDS.
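
As a rough sketch of this step: the report does not say which copytool was used, so the helper below assumes the process is named lhsmtool_posix (overridable through the hypothetical COPYTOOL variable) and that it runs on the node where the helper is called. It sends SIGKILL so the copytool has no chance to report completion.

```shell
# Sketch only. COPYTOOL is an assumption: the report does not name the copytool.
COPYTOOL="${COPYTOOL:-lhsmtool_posix}"

kill_copytool() {
    # pgrep -x matches the exact process name and fails if nothing matches.
    pids=$(pgrep -x "$COPYTOOL") || { echo "no $COPYTOOL running" >&2; return 1; }
    # SIGKILL, so the copytool cannot send a completion event to the MDS.
    kill -KILL $pids
}
```

Run kill_copytool on the agent node while the md5sum above is still blocked waiting for the restore.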

Note that at this point, it is possible the copytool is unkillable, and the only fix is to reboot the client running that copytool.

When the copytool restarts, nothing happens. The process trying to read the file stays stuck, apparently forever, at this point:

# cat /proc/1675/stack 
[<ffffffffa09dc04c>] ll_layout_refresh+0x25c/0xfe0 [lustre]
[<ffffffffa0a28240>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa04dab68>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa04dd794>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa04debfd>] cl_io_rw_init+0x8d/0x200 [obdclass]
[<ffffffffa09cbe38>] ll_file_io_generic+0x208/0x710 [lustre]
[<ffffffffa09ccf8f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
[<ffffffffa09cd27c>] ll_file_read+0x16c/0x2a0 [lustre]
[<ffffffff81189365>] vfs_read+0xb5/0x1a0
[<ffffffff811894a1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

It is also unkillable.
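
To check other processes for the same symptom, a small filter over the kernel stack text may help. This is a sketch; the function name is made up for illustration, and it only looks for the ll_layout_refresh frame seen in the trace above.

```shell
# Reads a kernel stack dump (as printed by /proc/<pid>/stack) on stdin and
# succeeds if it contains an ll_layout_refresh frame.
stuck_in_layout_refresh() {
    grep -q 'll_layout_refresh+'
}

# Hypothetical usage (reading /proc/<pid>/stack normally requires root):
#   stuck_in_layout_refresh < /proc/1675/stack && echo "waiting on layout refresh"
```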

If I issue "lfs hsm_restore bigf.bin", lfs gets stuck too:

# cat /proc/1723/stack 
[<ffffffffa06ad29a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
[<ffffffffa06ad8a7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa0868913>] mdc_iocontrol+0x2113/0x27f0 [mdc]
[<ffffffffa0af5265>] obd_iocontrol+0xe5/0x360 [lmv]
[<ffffffffa0b0c145>] lmv_iocontrol+0x1c85/0x2b10 [lmv]
[<ffffffffa09bb235>] obd_iocontrol+0xe5/0x360 [lustre]
[<ffffffffa09c64d7>] ll_dir_ioctl+0x4237/0x5dc0 [lustre]
[<ffffffff8119d802>] vfs_ioctl+0x22/0xa0
[<ffffffff8119d9a4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119df21>] sys_ioctl+0x81/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

However, lfs hsm operations still work on other files.

Attachments

LU-5216_solution.doc
15 kB
04/Mar/15 12:10 PM

Issue Links

is blocking

LU-10175 DoM:Full support for the LDLM lock convert

Resolved

is related to

LU-11284 Full lock convert conflicts with HSM

Open

LU-8905 tests: sanity-hsm test_3[3-6] does not use ps correctly

Closed

Activity

Sergey Cheremencev added a comment - 11/Apr/18 9:37 AM

Could someone clarify the status of the second part of the patch?
Do you plan to land it?

Right now the 1st part of the patch (https://review.whamcloud.com/#/c/31105/) doesn't solve the issue with killing the copytool on the agent. The original sanity-hsm_62 that demonstrated the problem was changed starting with patchset 3, and that is why it passes; before patchset 3 it always failed.

If you plan to land the 2nd part right after the 1st, I can prepare and send that patch. I suggest adding 2 sanity-hsm_62 tests for the 2nd part of the patch: "a - Evicting a client should cancel its requests" and "b - Stopping a copytool should cancel its requests".

Please let me know your position.

Thanks Sergey

Cory Spitz added a comment - 09/Mar/18 4:47 PM

We discussed this at the 3/8 LWG. JB, we'll try to help you land your patch and work out any issues with John H. Thanks.

Peter Jones added a comment - 06/Mar/18 8:59 PM

spitzcor why don't we discuss at the upcoming LWG meeting?

Cory Spitz added a comment - 06/Mar/18 8:36 PM - edited

pjones, I'd like to get this on the 2.11.0 radar. May we set the Fix Version? bevans & sergey from Cray will follow with more to better kick-start the conversation.

Jean-Baptiste Riaux (Inactive) added a comment - 31/Jan/18 3:20 PM

A new patch (https://review.whamcloud.com/#/c/31105/) is available. It includes only the first part of the previous patch (https://review.whamcloud.com/24238/, which landed and was then reverted by https://review.whamcloud.com/30615/).

This patch keeps only the first fix:
Unexpected client (data mover node) eviction could
cause ongoing HSM requests to be stuck in "STARTED"
state, as the copytool running on the data mover node
is no longer available and the requests cannot be
finished. This patch unregisters the copytool and
cancels all the requests on the copytool's agent.
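
To see which requests are sitting in the "STARTED" state, one option is to filter the coordinator's action list on the MDS. The sketch below rests on two assumptions not stated in this ticket: that the list is readable through lctl get_param as mdt.*.hsm.actions, and that each entry carries a status= field. The function name is hypothetical.

```shell
# Reads an HSM action list on stdin and prints the entries marked STARTED.
# The status=... field layout is an assumption about the list format.
started_requests() {
    grep 'status=STARTED'
}

# Hypothetical usage on the MDS:
#   lctl get_param -n 'mdt.*.hsm.actions' | started_requests
```

An entry that stays in this list long after its copytool has gone away would match the situation the patch description refers to.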

Gerrit Updater added a comment - 31/Jan/18 3:18 PM

Jean-Baptiste Riaux (riaux.jb@intel.com) uploaded a new patch: https://review.whamcloud.com/31105
Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c3d17bd9e25d5d27ba55835801798e2fdea36c49

Gerrit Updater added a comment - 20/Dec/17 4:01 PM

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30615/
Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: be4507fb45074ad24208c494f98a00da90b13665

Gerrit Updater added a comment - 20/Dec/17 3:56 PM

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30615
Subject: Revert "LU-5216 hsm: cancel hsm actions running on CT when killed"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a57dcdbca18f9721169f06c312519c619856a656

Peter Jones added a comment - 17/Dec/17 3:59 PM

Landed for 2.11

Gerrit Updater added a comment - 17/Dec/17 6:18 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24238/
Subject: LU-5216 hsm: cancel hsm actions running on CT when killed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 462c7aae05dfc9cd730f44ffdc661c4c36294012

Sergey Cheremencev added a comment - 25/Oct/17 8:54 PM

Hello !

Vinayak will not work on this ticket anymore.

Please let me know if I can help with review or other things to move it forward.

People

Assignee: Jean-Baptiste Riaux (Inactive)

Reporter: Frank Zago (Inactive)

Votes: 3

Watchers: 31

Dates

Created: 17/Jun/14 10:14 PM

Updated: 25/Aug/18 7:30 AM