[LU-4727] Lhsmtool_posix process stuck in ll_layout_refresh() when restoring Created: 07/Mar/14 Updated: 25/Jan/22 Resolved: 23/Apr/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Li Xi (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM, cea |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13001 |
| Description |
|
This is easy to reproduce; I hit this problem every time I run the following commands:

rm /mnt/lustre/XXXX -f;

After some time, the following messages showed up:

INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.

It seems the copy tool is waiting in md_enqueue(MDS_INODELOCK_LAYOUT). Other processes that try to lock lli->lli_layout_mutex get stuck. The problem does not recover until the lock enqueue times out and the client reconnects. |
| Comments |
| Comment by Robert Read (Inactive) [ 07/Mar/14 ] |
|
I'm seeing a similar hang here. My copytool stack:

Call Trace:
[<ffffffff81058d53>] ? __wake_up+0x53/0x70
[<ffffffffa0d2ec00>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
[<ffffffffa0d334a5>] ldlm_completion_ast+0x545/0x920 [ptlrpc]
[<ffffffff81065df0>] ? default_wake_function+0x0/0x20
[<ffffffffa0d2d9f6>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
[<ffffffffa0d2e2d5>] ldlm_cli_enqueue+0x3a5/0x790 [ptlrpc]
[<ffffffffa0d32f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[<ffffffffa0b30240>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa03bcd3e>] mdc_enqueue+0x2be/0x1a10 [mdc]
[<ffffffffa037da84>] lmv_enqueue+0x2f4/0xfc0 [lmv]
[<ffffffffa0b0b145>] ll_layout_refresh+0x515/0xfe0 [lustre]
[<ffffffffa0b30240>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa0d32f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[<ffffffffa0b57113>] vvp_io_fini+0x173/0x260 [lustre]
[<ffffffffa0c46cc7>] cl_io_fini+0x77/0x280 [obdclass]
[<ffffffffa0afad37>] ll_file_io_generic+0xe7/0x610 [lustre]
[<ffffffffa0afbeb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
[<ffffffffa0afc19c>] ll_file_write+0x16c/0x2a0 [lustre]
[<ffffffff81188f68>] vfs_write+0xb8/0x1a0
[<ffffffff81189861>] sys_write+0x51/0x90
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Jinshan Xiong (Inactive) [ 07/Mar/14 ] |
|
I believe this is broken due to a recent optimization on the MD stack. Will create a workaround patch tomorrow. |
| Comment by John Hammond [ 07/Mar/14 ] |
|
Hi Li Xi, I can only reproduce this if I run the copytool on the same mount point as used for the accesses. Can you confirm this? I strongly recommend using a dedicated mount point for the copytool. This should be somewhere in the HSM documentation. |
| Comment by Li Xi (Inactive) [ 07/Mar/14 ] |
|
Hi John, Yeah, you are correct! I was running the copytool on the same mount point as used for the accesses. Sorry for misusing it. However, I don't understand why we have to use a dedicated mount point for the copytool. Is there any special reason for it? |
| Comment by Robert Read (Inactive) [ 07/Mar/14 ] |
|
I've seen this when running my copytool in a dedicated mount point. |
| Comment by Jinshan Xiong (Inactive) [ 07/Mar/14 ] |
|
I suspect that running the copy tool on a dedicated mount point cannot eliminate the problem, but it can surely reduce the chance of it occurring. The root cause of this problem is as follows: mdt_getattr_internal() checks whether the file is being restored. If so, it marks the reply with MS_RESTORE so that the client won't do a glimpse:

/* if file is released, check if a restore is running */
if ((ma->ma_valid & MA_HSM) && (ma->ma_hsm.mh_flags & HS_RELEASED) &&
    mdt_hsm_restore_is_running(info, mdt_object_fid(o))) {
	repbody->t_state = MS_RESTORE;
	repbody->valid |= OBD_MD_TSTATE;
}

However, when the restore request reaches the MDT, it doesn't flush the UPDATE lock, so if the client has the UPDATE lock cached, it won't see this flag at all. The fix would be to adjust the sequence of adding the RESTORE request into the action list. Right now it does the following:

mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
obj = mdt_object_find_lock(mti, &crh->crh_fid,
&crh->crh_lh,
MDS_INODELOCK_LAYOUT);
if (IS_ERR(obj)) {
rc = PTR_ERR(obj);
CERROR("%s: cannot take layout lock for "
DFID": rc = %d\n", mdt_obd_name(mdt),
PFID(&crh->crh_fid), rc);
OBD_SLAB_FREE_PTR(crh, mdt_hsm_cdt_kmem);
GOTO(out, rc);
}
/* we choose to not keep a reference
 * on the object during the restore time, which can be
 * very long */
mdt_object_put(mti->mti_env, obj);
mutex_lock(&cdt->cdt_restore_lock);
list_add_tail(&crh->crh_list, &cdt->cdt_restore_hdl);
mutex_unlock(&cdt->cdt_restore_lock);
So it requests the LAYOUT lock and then adds the request into a global list; we should change it to: In this way, we can make sure that all clients will see the MS_RESTORE flag. Another way to fix the problem is to not do a getattr in the copy tool process. In that case, we can't call ct_stat_by_fid() in llapi_hsm_action_begin(). The stat fetched by ct_stat_by_fid() is used to set the UID/GID and ATIME/MTIME on the volatile file; I think we can do this in other ways. Robert, as a workaround, I think you can disable the ct_stat_by_fid() call in llapi_hsm_action_begin() and do the corresponding cleanup, so that you can continue your work while we're fixing the problem. John, do you remember why we set the UID/GID on the volatile file? |
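To make the proposed reordering concrete, here is a sketch of one plausible reading of the comment above. This is not the patch that eventually landed; it reuses the identifiers from the snippet quoted in this comment, the MDS_INODELOCK_UPDATE bit is an assumption, and the fragment is kernel code that does not build on its own:

```c
/* Hedged sketch only: publish the restore handle while the lock is
 * still held, so no client can revalidate cached attributes before
 * the restore request is visible to mdt_hsm_restore_is_running(). */
mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
/* Taking UPDATE together with LAYOUT (assumption) cancels cached
 * client UPDATE locks, forcing clients to refetch attributes and
 * therefore see MS_RESTORE. */
obj = mdt_object_find_lock(mti, &crh->crh_fid, &crh->crh_lh,
			   MDS_INODELOCK_LAYOUT | MDS_INODELOCK_UPDATE);
if (IS_ERR(obj))
	GOTO(out, rc = PTR_ERR(obj));

/* Add to the restore list BEFORE any client can re-enqueue. */
mutex_lock(&cdt->cdt_restore_lock);
list_add_tail(&crh->crh_list, &cdt->cdt_restore_hdl);
mutex_unlock(&cdt->cdt_restore_lock);

mdt_object_put(mti->mti_env, obj);
```

The key property is the ordering: a getattr racing with the restore either blocks on the EX lock or, once it proceeds, already finds the handle on cdt_restore_hdl.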
| Comment by John Hammond [ 07/Mar/14 ] |
|
We do it to maintain quota. Specifically, layout swap will fail if the ownership of the two files is not equal. |
| Comment by John Hammond [ 07/Mar/14 ] |
|
Yes, it would be nice if the MDT sent what was needed to the copytool, but I found that it's hard to fit that extra information into the current protocol. Otherwise, a backdoor stat of some kind would be helpful. |
| Comment by Jinshan Xiong (Inactive) [ 07/Mar/14 ] |
|
Another way is to save the attributes in the archive and restore them at restore time (the copy tool may already support this, IIRC). The only problem is the UID/GID setting, because it may change while the file is in the released state; mtime/atime won't be a problem at all. Alternatively, we could use a dedicated ioctl() to get the attributes from the kernel without trying to glimpse the file size. |
| Comment by John Hammond [ 07/Mar/14 ] |
|
If we assume that the copytool has a dedicated mount point (and we're feeling devious) then it may be enough to do:

static int ct_stat_by_fid(const struct hsm_copytool_private *ct,
const struct lu_fid *fid,
struct stat *buf)
{
char fid_name[FID_NOBRACE_LEN + 1];
snprintf(fid_name, sizeof(fid_name), DFID_NOBRACE, PFID(fid));
+ unlinkat(ct->open_by_fid_fd, fid_name, 0);
return fstatat(ct->open_by_fid_fd, fid_name, buf, 0);
}
But that's not something to be proud of. |
| Comment by Robert Read (Inactive) [ 07/Mar/14 ] |
|
It turns out my issue was self-inflicted. (In my version of the copytool I had neglected to call flush() before calling llapi_hsm_action_end(), and this left data in the file pointer buffers that wasn't flushed until after the end call, when I closed my file handle. So you either need to flush or simply close the volatile file handle before calling end. But that is unrelated to this bug.) |
| Comment by Andreas Dilger [ 11/Mar/14 ] |
|
Robert, are there any restrictions on using this HSM API (e.g. capability needed, only on dedicated agent nodes set up by the sysadmin)? Otherwise, it seems like a potential problem for bad users to be able to lock up the system. Also, what is the extent of the problem? Is it only this one process that is hung (an acceptable loss for a self-inflicted problem) or does it affect the whole client, or even the MDS? |
| Comment by Robert Read (Inactive) [ 11/Mar/14 ] |
|
I believe the copytool must be run as root, but it can be run on any client. In my case only the copytool process was hung and unkillable. It also prevents the file in question from being restored, at least until the coordinator times out the action request and sends it to another copytool. |
| Comment by Li Xi (Inactive) [ 19/May/14 ] |
|
Hi all, Is there any progress on this issue? It is really annoying when I am testing HSM. Is there at least an easy way to work around it? Using a dedicated mount point for the copytool is not helping.... Thanks! |
| Comment by Patrick Farrell (Inactive) [ 12/Aug/14 ] |
|
Jinshan - Looking at your description of a possible solution... When you say "flush UPDATE lock", how are you suggesting this be done? Take an update lock on the object, then take the layout lock? If so, when do we release the update lock? Before taking the layout lock, after getting the layout lock, or some other time? Also, this comment at the top of the function is confusing me: |
| Comment by Jinshan Xiong (Inactive) [ 18/Dec/14 ] |
|
I'm thinking about a solution for this problem.
By "flush UPDATE lock", I meant to acquire the UPDATE lock and release it immediately.
That refers to the layout lock taken in the function, i.e., this code:

mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
obj = mdt_object_find_lock(mti, &crh->crh_fid,
&crh->crh_lh,
MDS_INODELOCK_LAYOUT);
|
| Comment by Gerrit Updater [ 19/Dec/14 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/13138 |
| Comment by Jinshan Xiong (Inactive) [ 19/Dec/14 ] |
|
Please try patch 13138 and check if it can fix the problem. |
| Comment by Vinayak Hariharmath (Inactive) [ 19/Dec/14 ] |
|
http://review.whamcloud.com/13138 solves the problem on a single-node setup on a local VM. Thanks for the patch, Jinshan. |
| Comment by Jinshan Xiong (Inactive) [ 05/Feb/15 ] |
|
The patch has been in Gerrit for a long time. Please let me know what I can do to move this forward, sigh. |
| Comment by Gerrit Updater [ 12/Feb/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13750 |
| Comment by Gerrit Updater [ 08/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13750/ |
| Comment by Frank Zago (Inactive) [ 23/Mar/15 ] |
|
Jinshan, I got this trace when using stat on a file I was restoring (or that was restored). I haven't been able to reproduce it so far. Is it the error you mention in your patch? John's patch is applied on that tree, and works well otherwise.

Mar 20 15:00:55 tasclient01 kernel: INFO: task stat:951 blocked for more than 120 seconds.
Mar 20 15:00:55 tasclient01 kernel: Tainted: P --------------- 2.6.32-431.17.1.el6.x86_64 #1
Mar 20 15:00:55 tasclient01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 20 15:00:55 tasclient01 kernel: stat D 0000000000000000 0 951 29769 0x00000084
Mar 20 15:00:55 tasclient01 kernel: ffff8801d0471a58 0000000000000082 0000000000000000 000000000000000d
Mar 20 15:00:55 tasclient01 kernel: 0000000000000004 ffff880237fee800 ffff880116db5610 0000000000000630
Mar 20 15:00:55 tasclient01 kernel: ffff8802143f1098 ffff8801d0471fd8 000000000000fbc8 ffff8802143f1098
Mar 20 15:00:55 tasclient01 kernel: Call Trace:
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09e009a>] ll_layout_refresh+0x1da/0xc60 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a04ab0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa06996a0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04dfe31>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a2d8f0>] vvp_io_init+0x340/0x490 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04df278>] cl_io_init0+0x98/0x160 [obdclass]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04e1ea4>] cl_io_init+0x64/0xe0 [obdclass]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a23161>] cl_glimpse_size0+0x91/0x1d0 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c25>] ll_inode_revalidate_it+0x1a5/0x1d0 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c99>] ll_getattr_it+0x49/0x170 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4df7>] ll_getattr+0x37/0x40 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff81227163>] ? security_inode_getattr+0x23/0x30
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e631>] vfs_getattr+0x51/0x80
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e6c4>] vfs_fstatat+0x64/0xa0
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e76e>] vfs_lstat+0x1e/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e794>] sys_newlstat+0x24/0x50
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810e1cc7>] ? audit_syscall_entry+0x1d7/0x200
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Jinshan Xiong (Inactive) [ 23/Mar/15 ] |
|
What's this process? From the name I guess it's not the copy tool. John's patch only fixes the copy tool case; you will also need patch 13138 to address this case if the restore takes longer than 120 seconds. |
| Comment by Frank Zago (Inactive) [ 23/Mar/15 ] |
|
The process is stat, which does a stat() on a file. |
| Comment by Jinshan Xiong (Inactive) [ 24/Mar/15 ] |
|
Yes, the patch can fix this problem. |
| Comment by Peter Jones [ 23/Apr/15 ] |
|
This patch landed quite a while back. Please speak up (or open a new ticket) if there are still residual issues to address. |
| Comment by Li Xi (Inactive) [ 06/May/15 ] |
|
Hmm, I still get the following error on a recent master branch with the 13750 patch:

May 6 19:13:44 server1 kernel: INFO: task test.py:32411 blocked for more than 120 seconds. |
| Comment by John Hammond [ 06/May/15 ] |
|
> Hmm, I still got following error on recent master branch with 13750 patch. Please post your reproducer. |
| Comment by Li Xi (Inactive) [ 06/May/15 ] |
|
Hi John, Sorry, the problem I got today is a different issue, and I am not sure whether it is a bug or not. I confirmed that I hit the new problem when the HSM tool (which is a private tool, not the standard lhsmtool_posix) exits right after 'lfs hsm_restore' completes, while the action status in "lctl get_param mdt.server1-MDT0000.hsm.actions" is still STARTED, not SUCCEED. As long as the tool keeps itself alive until the action status becomes SUCCEED, the problem is gone. However, this might be an issue too, because the HSM tool can exit at any time, and even if it restarts, the action will never continue. I don't know any way to recover the HSM status other than rebooting the machine. Do you have any advice? Thanks! |