Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.6.0, Lustre 2.5.1
-
3
-
13001
Description
This is easy to reproduce. I hit this problem every time when I trying to run following commands.
rm /mnt/lustre/XXXX -f;
echo XXX > /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_release /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX; # This will restore automatically
lfs hsm_release /mnt/lustre/XXXX;
lfs hsm_restore /mnt/lustre/XXXX; # Lhsmtool_posix actually hang here
cat /mnt/lustre/XXXX; # this will stuck
And after some time, following messages shew up.
INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-lustre- D 0000000000000005 0 4106 2 0x00000080
ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
Call Trace:
[<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814fc89b>] mutex_lock+0x2b/0x50
[<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
[<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
[<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
[<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
[<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
[<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
[<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
[<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
[<ffffffff81128d81>] do_writepages+0x21/0x40
[<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
[<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
[<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
[<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
[<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
[<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
[<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81137a66>] bdi_start_fn+0x86/0x100
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81091d66>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
It seems copy tool is waiting for md_enqueue(MDS_INODELOCK_LAYOUT). Other processes who are trying to lock lli->lli_layout_mutex will be stuck. This problem won't recover until lock enque times out and client reconnects.
Attachments
Issue Links
- is duplicated by
-
LU-5196 HSM: client task stuck waiting for mutex in ll_layout_refresh
-
- Resolved
-
- is related to
-
LUDOC-252 Copytool Recommendations - Add/Clarify
-
- Open
-
-
LU-4728 NULL pointer dereference in ldlm_cli_enqueue_local when enabling hsm_control after LU-4727 happends
-
- Resolved
-
-
LU-6460 LLIF_FILE_RESTORING is not cleared at end of restore
-
- Resolved
-
-
LU-4002 HSM restore vs unlink deadlock
-
- Resolved
-
- mentioned in
-
Page Loading...
What's this process? From the name I guess it's not copy tool. John's patch can only fix copy tool case.
You will need patch 13138 too to address this case if the restoring will take longer than 120 seconds.