Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
- Severity: 3
- Rank: 13001
Description
This is easy to reproduce. I hit this problem every time I try to run the following commands:
rm /mnt/lustre/XXXX -f;
echo XXX > /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_release /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX; # This will restore automatically
lfs hsm_release /mnt/lustre/XXXX;
lfs hsm_restore /mnt/lustre/XXXX; # lhsmtool_posix actually hangs here
cat /mnt/lustre/XXXX; # this will get stuck
After some time, the following messages showed up.
INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-lustre- D 0000000000000005 0 4106 2 0x00000080
ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
Call Trace:
[<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814fc89b>] mutex_lock+0x2b/0x50
[<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
[<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
[<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
[<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
[<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
[<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
[<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
[<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
[<ffffffff81128d81>] do_writepages+0x21/0x40
[<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
[<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
[<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
[<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
[<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
[<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
[<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81137a66>] bdi_start_fn+0x86/0x100
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81091d66>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
It seems the copytool is waiting in md_enqueue(MDS_INODELOCK_LAYOUT). Other processes that try to lock lli->lli_layout_mutex get stuck behind it. The problem does not recover until the lock enqueue times out and the client reconnects.
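To make the blocking pattern concrete, below is a minimal userspace analogue (purely illustrative, not Lustre code). One thread stands in for the task that entered ll_layout_refresh(), took lli_layout_mutex, and is stuck waiting for the layout-lock enqueue to complete; the main thread stands in for flush-lustre (or any other task) that then blocks on the same per-inode mutex. All names are placeholders chosen to mirror the stack traces.

/* build: cc -pthread -o layout_mutex_sketch layout_mutex_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for lli->lli_layout_mutex. */
static pthread_mutex_t lli_layout_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the task that entered ll_layout_refresh(), took the
 * mutex, and is now stuck because the layout-lock enqueue never
 * completes while the copytool hangs. */
static void *enqueue_side(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lli_layout_mutex);
        pause();                /* the "md_enqueue" that never returns */
        pthread_mutex_unlock(&lli_layout_mutex);
        return NULL;
}

int main(void)
{
        pthread_t t;
        struct timespec deadline;

        pthread_create(&t, NULL, enqueue_side, NULL);
        sleep(1);               /* let enqueue_side grab the mutex first */

        /* Stand-in for flush-lustre (first trace) or test.py (second
         * trace) entering ll_layout_refresh() and blocking on the same
         * per-inode mutex. A 5-second timeout replaces the 120-second
         * hung-task report. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;
        if (pthread_mutex_timedlock(&lli_layout_mutex, &deadline) != 0)
                printf("second task blocked on lli_layout_mutex, "
                       "mirroring the hung-task messages above\n");
        return 0;
}

In the real code path the second task has no timeout, so it stays blocked until the enqueue times out and the client reconnects, which matches the 120-second hung-task warnings above.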
Issue Links
- is duplicated by:
  - LU-5196 HSM: client task stuck waiting for mutex in ll_layout_refresh (Resolved)
- is related to:
  - LUDOC-252 Copytool Recommendations - Add/Clarify (Open)
  - LU-4728 NULL pointer dereference in ldlm_cli_enqueue_local when enabling hsm_control after LU-4727 happends (Resolved)
  - LU-6460 LLIF_FILE_RESTORING is not cleared at end of restore (Resolved)
  - LU-4002 HSM restore vs unlink deadlock (Resolved)
- mentioned in:
Hmm, I still got the following error on a recent master branch with the 13750 patch.
May 6 19:13:44 server1 kernel: INFO: task test.py:32411 blocked for more than 120 seconds.
May 6 19:13:44 server1 kernel: Not tainted 2.6.32-431.29.2.el6_lustre.2.5.29.ddnpf4.x86_64 #1
May 6 19:13:44 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 6 19:13:44 server1 kernel: test.py D 0000000000000000 0 32411 23005 0x00000080
May 6 19:13:44 server1 kernel: ffff88055238fb08 0000000000000082 0000000000000000 ffff88055351c5c0
May 6 19:13:44 server1 kernel: ffff88055238fab8 ffffffffa0e0685e ffff88086d4a49d8 ffff880597c9b980
May 6 19:13:44 server1 kernel: ffff880553ee85f8 ffff88055238ffd8 000000000000fbc8 ffff880553ee85f8
May 6 19:13:44 server1 kernel: Call Trace:
May 6 19:13:44 server1 kernel: [<ffffffffa0e0685e>] ? mdc_set_open_replay_data+0x29e/0x500 [mdc]
May 6 19:13:44 server1 kernel: [<ffffffff8152b9de>] __mutex_lock_slowpath+0x13e/0x180
May 6 19:13:44 server1 kernel: [<ffffffff8152b87b>] mutex_lock+0x2b/0x50
May 6 19:13:44 server1 kernel: [<ffffffffa0ec7ea1>] ll_layout_refresh+0x191/0x300 [lustre]
May 6 19:13:44 server1 kernel: [<ffffffffa0ed78ac>] ll_setattr_raw+0x5cc/0xbe0 [lustre]
May 6 19:13:44 server1 kernel: [<ffffffffa0ed7f25>] ll_setattr+0x65/0xd0 [lustre]
May 6 19:13:44 server1 kernel: [<ffffffff811a8158>] notify_change+0x168/0x340
May 6 19:13:44 server1 kernel: [<ffffffff81187374>] do_truncate+0x64/0xa0
May 6 19:13:44 server1 kernel: [<ffffffff8119c4b1>] do_filp_open+0x861/0xd20
May 6 19:13:44 server1 kernel: [<ffffffff81290c5a>] ? strncpy_from_user+0x4a/0x90
May 6 19:13:44 server1 kernel: [<ffffffff811a9032>] ? alloc_fd+0x92/0x160
May 6 19:13:44 server1 kernel: [<ffffffff81186039>] do_sys_open+0x69/0x140
May 6 19:13:44 server1 kernel: [<ffffffff81186150>] sys_open+0x20/0x30
May 6 19:13:44 server1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b