[LU-4727] Lhsmtool_posix process stuck in ll_layout_refresh() when restoring - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
Labels:
- HSM
- cea

Severity: 3
Rank (Obsolete): 13001

Description

This is easy to reproduce. I hit this problem every time I run the following commands.

rm /mnt/lustre/XXXX -f;
echo XXX > /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_release /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX; # This will restore automatically
lfs hsm_release /mnt/lustre/XXXX;
lfs hsm_restore /mnt/lustre/XXXX; # Lhsmtool_posix actually hangs here
cat /mnt/lustre/XXXX; # this will get stuck

After some time, the following messages showed up.

INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-lustre- D 0000000000000005 0 4106 2 0x00000080
ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
Call Trace:
[<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814fc89b>] mutex_lock+0x2b/0x50
[<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
[<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
[<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
[<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
[<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
[<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
[<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
[<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
[<ffffffff81128d81>] do_writepages+0x21/0x40
[<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
[<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
[<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
[<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
[<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
[<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
[<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81137a66>] bdi_start_fn+0x86/0x100
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81091d66>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20

It seems the copytool is waiting for md_enqueue(MDS_INODELOCK_LAYOUT). Other processes that try to lock lli->lli_layout_mutex will be stuck. The problem does not recover until the lock enqueue times out and the client reconnects.
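
The blocking pattern described above can be illustrated with a small user-space model. This is only a sketch under assumptions, not Lustre code: `InodeModel`, `layout_refresh`, and `can_take_mutex` are hypothetical names, a `threading.Lock` stands in for lli->lli_layout_mutex, and an `Event` stands in for the LAYOUT lock enqueue completing. The point it shows is that a caller which waits for the enqueue while holding the mutex blocks every other caller until the enqueue ends.

```python
import threading


class InodeModel:
    """Hypothetical stand-in for the per-inode state named in the ticket."""

    def __init__(self):
        self.layout_mutex = threading.Lock()     # models lli->lli_layout_mutex
        self.layout_granted = threading.Event()  # models the LAYOUT enqueue completing


def layout_refresh(inode, entered, timeout):
    """Hold the mutex while waiting for the LAYOUT lock enqueue to complete."""
    with inode.layout_mutex:
        entered.set()  # tell the demo that the mutex is now held
        return inode.layout_granted.wait(timeout)


def can_take_mutex(inode, timeout):
    """Second caller: report whether the mutex could be taken within timeout."""
    got = inode.layout_mutex.acquire(timeout=timeout)
    if got:
        inode.layout_mutex.release()
    return got


def demo():
    inode = InodeModel()
    entered = threading.Event()
    holder = threading.Thread(
        target=layout_refresh, args=(inode, entered, 5.0), daemon=True
    )
    holder.start()
    entered.wait()  # make sure the first caller already holds the mutex

    # While the first caller waits on the enqueue, nobody else gets the mutex.
    blocked = not can_take_mutex(inode, timeout=0.2)

    # Model the enqueue finally ending (in the ticket: timeout and reconnect).
    inode.layout_granted.set()
    holder.join()
    recovered = can_take_mutex(inode, timeout=0.2)
    return blocked, recovered


if __name__ == "__main__":
    print(demo())
```

In this model the second caller only times out because it uses a bounded acquire; the kernel trace above shows an unbounded mutex_lock(), which is why the flush thread stays blocked.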

Issue Links

is duplicated by
- LU-5196 HSM: client task stuck waiting for mutex in ll_layout_refresh (Resolved)

is related to
- LUDOC-252 Copytool Recommendations - Add/Clarify (Open)
- LU-4728 NULL pointer dereference in ldlm_cli_enqueue_local when enabling hsm_control after LU-4727 happends (Resolved)
- LU-6460 LLIF_FILE_RESTORING is not cleared at end of restore (Resolved)
- LU-4002 HSM restore vs unlink deadlock (Resolved)

Activity

Gerrit Updater added a comment - 12/Feb/15 7:59 PM

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13750
Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8ec6354ded37f3e1f39d6e0336c9e17b1a97785b

Jinshan Xiong (Inactive) added a comment - 05/Feb/15 5:28 PM

The patch has been in Gerrit for a long time. Please let me know what I can do to move this forward, sigh.

Vinayak Hariharmath (Inactive) added a comment - 19/Dec/14 9:40 AM

http://review.whamcloud.com/13138 solves the problem on a single-node setup on a local VM. Thanks for the patch, Jinshan.

Jinshan Xiong (Inactive) added a comment - 19/Dec/14 4:26 AM

Please try patch 13138 and check if it can fix the problem.

Gerrit Updater added a comment - 19/Dec/14 4:26 AM

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/13138
Subject: LU-4727 hsm: flush UPDATE lock for restore
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bf2e4b958f60cb7eda9303ad0c079fd23ff2d16b

Jinshan Xiong (Inactive) added a comment - 18/Dec/14 11:57 PM - edited

I'm thinking about a solution for this problem.

When you say "flush UPDATE lock", how are you suggesting this be done? Take an update lock on the object, then take the layout lock? If so, when do we release the update lock? Before taking the layout lock, after getting the layout lock, or some other time?

By "flush UPDATE lock", I meant to acquire the UPDATE lock and release it immediately.

Is this comment in error? If not, what layout lock is it referring to/what sort of lock on layout? I ask because if it's a restore request, we take a layout lock, which seems to imply the caller did not have a layout lock already.

That refers to the layout lock taken inside the function, i.e., this code:

                        mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
                        obj = mdt_object_find_lock(mti, &crh->crh_fid,
                                                   &crh->crh_lh,
                                                   MDS_INODELOCK_LAYOUT);

Patrick Farrell (Inactive) added a comment - 12/Aug/14 7:20 PM

Jinshan - Looking at your description of a possible solution...
"So it requests LAYOUT lock and then add the request into a global list, we should change it to:
1. add to global list
2. flush UPDATE lock
3. request LAYOUT lock"

When you say "flush UPDATE lock", how are you suggesting this be done? Take an update lock on the object, then take the layout lock? If so, when do we release the update lock? Before taking the layout lock, after getting the layout lock, or some other time?

Also, this comment at the top of the function is confusing me:
" * in case of restore, caller must hold layout lock"
Is this comment in error? If not, what layout lock is it referring to/what sort of lock on layout? I ask because if it's a restore request, we take a layout lock, which seems to imply the caller did not have a layout lock already.
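
The three-step order quoted above, combined with Jinshan's later clarification that "flush UPDATE lock" means acquiring the UPDATE lock and releasing it immediately, can be sketched as a user-space model. Everything here is an assumption made for illustration: `ObjectLocks`, `flush_update_lock`, `register_restore`, and the `"fid-1"` value are hypothetical, and plain `threading.Lock` objects stand in for the MDT inode locks. It shows only the ordering, not the actual patch.

```python
import threading


class ObjectLocks:
    """Hypothetical per-object locks; not the MDT's real lock handles."""

    def __init__(self):
        self.update = threading.Lock()  # models the UPDATE inode lock
        self.layout = threading.Lock()  # models the LAYOUT inode lock


def flush_update_lock(locks):
    """'Flush' as described in the thread: acquire, then release immediately."""
    locks.update.acquire()
    locks.update.release()


def register_restore(fid, locks, restore_list, trace):
    """Run the three steps in the proposed order, recording each one."""
    restore_list.append(fid)      # 1. add to global list
    trace.append("add_to_list")
    flush_update_lock(locks)      # 2. flush UPDATE lock
    trace.append("flush_update")
    locks.layout.acquire()        # 3. request LAYOUT lock (kept held here)
    trace.append("request_layout")
    return trace


if __name__ == "__main__":
    locks = ObjectLocks()
    restores = []
    print(register_restore("fid-1", locks, restores, []))
```

After the call, the UPDATE lock is free again and only the LAYOUT lock remains held, which matches the "release it immediately" answer to the question of when the UPDATE lock is dropped.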

Li Xi (Inactive) added a comment - 19/May/14 2:24 AM

Hi all,

Is there any progress on this issue? This issue is really annoying when I am testing HSM. Is there at least an easy way to work around it? Using a dedicated mount point for the copytool does not help.

Thanks!

Robert Read added a comment - 11/Mar/14 7:14 PM

I believe the copytool must be run as root, but it can be run on any client. In my case only the copytool process was hung and unkillable. It also prevents the file in question from being restored, at least until the coordinator times out the action request and sends it to another copytool.

Andreas Dilger added a comment - 11/Mar/14 5:24 PM

Robert, are there any restrictions on using this HSM API (e.g. capability needed, only on dedicated agent nodes set up by the sysadmin)? Otherwise, it seems like a potential problem for bad users to be able to lock up the system. Also, what is the extent of the problem? Is it only this one process that is hung (an acceptable loss for a self-inflicted problem) or does it affect the whole client, or even the MDS?

Robert Read added a comment - 07/Mar/14 10:40 PM

It turns out my issue was self-inflicted. (In my version of the copytool I had neglected to call flush() before calling llapi_hsm_action_end(), and this left data in the file pointer buffers that wasn't flushed until after the end call when I closed my file handle. So either you need to flush or just close the volatile file handle before calling end. But that is unrelated to this bug.)
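
The buffering pitfall described in this comment can be reproduced outside Lustre. The sketch below is an illustration under assumptions, not copytool code: `size_seen_at_end` is a hypothetical helper, and an `os.stat()` call stands in for the point where a real copytool would call llapi_hsm_action_end(). It shows that a small write through a buffered handle is not visible in the file until the handle is flushed or closed.

```python
import os
import tempfile


def size_seen_at_end(path, payload, flush_before_end):
    """Write payload via a buffered handle and return the file size visible
    when the (hypothetical) end-of-action step runs, before the close."""
    f = open(path, "wb", buffering=4096)
    try:
        f.write(payload)  # a small write stays in the user-space buffer
        if flush_before_end:
            f.flush()     # push the buffered bytes to the file
        # Stand-in for llapi_hsm_action_end(): just look at the file size.
        return os.stat(path).st_size
    finally:
        f.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "restored")
        without_flush = size_seen_at_end(path, b"XXX\n", flush_before_end=False)
        with_flush = size_seen_at_end(path, b"XXX\n", flush_before_end=True)
    return without_flush, with_flush


if __name__ == "__main__":
    print(demo())
```

Closing the handle before the end step would have the same effect as the explicit flush, which is the second option mentioned in the comment.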

People

Assignee: Jinshan Xiong (Inactive)

Reporter: Li Xi (Inactive)

Votes: 0

Watchers: 19

Dates

Created: 07/Mar/14 3:32 AM

Updated: 25/Jan/22 8:54 PM

Resolved: 23/Apr/15 4:44 PM