
Lhsmtool_posix process stuck in ll_layout_refresh() when restoring

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
    • 3
    • 13001

    Description

      This is easy to reproduce. I hit this problem every time I try to run the following commands.

      rm /mnt/lustre/XXXX -f;
      echo XXX > /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX;
      lfs hsm_release /mnt/lustre/XXXX;
      cat /mnt/lustre/XXXX; # This will restore automatically
      lfs hsm_release /mnt/lustre/XXXX;
      lfs hsm_restore /mnt/lustre/XXXX; # Lhsmtool_posix actually hangs here
      cat /mnt/lustre/XXXX; # this will get stuck

      After some time, the following messages showed up.

      INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      flush-lustre- D 0000000000000005 0 4106 2 0x00000080
      ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
      0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
      ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
      Call Trace:
      [<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
      [<ffffffff814fc89b>] mutex_lock+0x2b/0x50
      [<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
      [<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
      [<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
      [<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      [<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
      [<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
      [<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      [<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
      [<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
      [<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
      [<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
      [<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
      [<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
      [<ffffffff81128d81>] do_writepages+0x21/0x40
      [<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
      [<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
      [<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
      [<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
      [<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
      [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
      [<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
      [<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
      [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81137a66>] bdi_start_fn+0x86/0x100
      [<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
      [<ffffffff81091d66>] kthread+0x96/0xa0
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffff81091cd0>] ? kthread+0x0/0xa0
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

      It seems the copy tool is waiting for md_enqueue(MDS_INODELOCK_LAYOUT). Other processes trying to lock lli->lli_layout_mutex will also get stuck. The problem does not recover until the lock enqueue times out and the client reconnects.
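
      For reference, while the hang is in progress the state of the stuck restore can be checked from a second shell. This is only a rough sketch; the MDT name lustre-MDT0000 is an example and should be adjusted for the actual filesystem:

      lfs hsm_state /mnt/lustre/XXXX                  # per-file HSM flags (exists, archived, released, ...)
      lfs hsm_action /mnt/lustre/XXXX                 # in-progress HSM action, e.g. RESTORE RUNNING
      lctl get_param mdt.lustre-MDT0000.hsm.actions   # coordinator request list; the RESTORE stays in STARTED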

      Attachments

        Issue Links

          Activity

            [LU-4727] Lhsmtool_posix process stuck in ll_layout_refresh() when restoring
            lixi Li Xi (Inactive) added a comment - - edited

            Hi John,

            Sorry, the problem I got today is a different issue, and I am not sure whether it is a bug or not. It is confirmed that LU-4727 is fixed for me too.

            I got the new problem when the HSM tool (which is a private tool, not the standard lhsmtool_posix) exits right after 'lfs hsm_restore' completes. But the action status in "lctl get_param mdt.server1-MDT0000.hsm.actions" is still STARTED, not SUCCEED. As long as the tool keeps itself alive until the action status becomes SUCCEED, this problem is gone.

            However, this might be an issue too, because the HSM tool can exit at any time, and even if it restarts, the action will never continue. I don't know any way to recover the HSM status other than rebooting the machine. Do you have any advice? Thanks!
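
            If it helps as a stopgap, the "keep the tool alive until SUCCEED" workaround can be scripted. A rough sketch, assuming the MDT name server1-MDT0000 from the get_param line above and that matching the actions list by FID is reliable enough:

            file=/mnt/lustre/XXXX
            fid=$(lfs path2fid "$file")
            lfs hsm_restore "$file"
            # Poll the coordinator's action list until this file's request leaves STARTED;
            # only stop the copytool after that.
            while lctl get_param -n mdt.server1-MDT0000.hsm.actions | grep -F -- "$fid" | grep -q STARTED; do
                sleep 1
            done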

            jhammond John Hammond added a comment -

            > Hmm, I still got the following error on a recent master branch with the 13750 patch.
            > ...

            Please post your reproducer.


            lixi Li Xi (Inactive) added a comment -

            Hmm, I still got the following error on a recent master branch with the 13750 patch.

            May 6 19:13:44 server1 kernel: INFO: task test.py:32411 blocked for more than 120 seconds.
            May 6 19:13:44 server1 kernel: Not tainted 2.6.32-431.29.2.el6_lustre.2.5.29.ddnpf4.x86_64 #1
            May 6 19:13:44 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            May 6 19:13:44 server1 kernel: test.py D 0000000000000000 0 32411 23005 0x00000080
            May 6 19:13:44 server1 kernel: ffff88055238fb08 0000000000000082 0000000000000000 ffff88055351c5c0
            May 6 19:13:44 server1 kernel: ffff88055238fab8 ffffffffa0e0685e ffff88086d4a49d8 ffff880597c9b980
            May 6 19:13:44 server1 kernel: ffff880553ee85f8 ffff88055238ffd8 000000000000fbc8 ffff880553ee85f8
            May 6 19:13:44 server1 kernel: Call Trace:
            May 6 19:13:44 server1 kernel: [<ffffffffa0e0685e>] ? mdc_set_open_replay_data+0x29e/0x500 [mdc]
            May 6 19:13:44 server1 kernel: [<ffffffff8152b9de>] __mutex_lock_slowpath+0x13e/0x180
            May 6 19:13:44 server1 kernel: [<ffffffff8152b87b>] mutex_lock+0x2b/0x50
            May 6 19:13:44 server1 kernel: [<ffffffffa0ec7ea1>] ll_layout_refresh+0x191/0x300 [lustre]
            May 6 19:13:44 server1 kernel: [<ffffffffa0ed78ac>] ll_setattr_raw+0x5cc/0xbe0 [lustre]
            May 6 19:13:44 server1 kernel: [<ffffffffa0ed7f25>] ll_setattr+0x65/0xd0 [lustre]
            May 6 19:13:44 server1 kernel: [<ffffffff811a8158>] notify_change+0x168/0x340
            May 6 19:13:44 server1 kernel: [<ffffffff81187374>] do_truncate+0x64/0xa0
            May 6 19:13:44 server1 kernel: [<ffffffff8119c4b1>] do_filp_open+0x861/0xd20
            May 6 19:13:44 server1 kernel: [<ffffffff81290c5a>] ? strncpy_from_user+0x4a/0x90
            May 6 19:13:44 server1 kernel: [<ffffffff811a9032>] ? alloc_fd+0x92/0x160
            May 6 19:13:44 server1 kernel: [<ffffffff81186039>] do_sys_open+0x69/0x140
            May 6 19:13:44 server1 kernel: [<ffffffff81186150>] sys_open+0x20/0x30
            May 6 19:13:44 server1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
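
            The stack above goes through do_truncate(), so the trigger appears to be an open with O_TRUNC on a released file. A rough, hypothetical minimal equivalent of the step that hangs:

            lfs hsm_release /mnt/lustre/XXXX
            # Opening the released file with O_TRUNC forces a truncate; per the trace,
            # ll_setattr_raw() then takes the ll_layout_refresh() path, so the open
            # blocks until the implicit restore completes.
            : > /mnt/lustre/XXXX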

            pjones Peter Jones added a comment -

            This patch landed quite a while back. Please speak up (or open a new ticket) if there are still residual issues to address.


            jay Jinshan Xiong (Inactive) added a comment -

            Yes, the patch can fix this problem.


            fzago Frank Zago (Inactive) added a comment -

            The process is stat, which does a stat() on a file.
            Is it the problem your patch should fix, or is it fixing something else?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 19

              Dates

                Created:
                Updated:
                Resolved: