HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3697] racing HSM release and restore vs cat leaves cat hung in ll_layout_refresh() Created: 05/Aug/13  Updated: 05/Sep/13  Resolved: 05/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Technical task Priority: Major
Reporter: John Hammond Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Rank (Obsolete): 9543

 Description   

Using 2.4.52-90-g8b8b7b3 with patch set 12 of the coordinator main thread patch, I see what looks like a lost agent request or a missing layout lock release. To reproduce, start HSM with two client mounts (/mnt/lustre and /mnt/lustre2) and one mount for the CT (/mnt/lustre-hsm). Then do

# cd /mnt/lustre
# touch f0
# lfs hsm_archive f0
# # Wait for archive to complete.
# while true; do cat f0; done
# # In a second shell, on the other client mount:
# cd /mnt/lustre2
# while true; do lfs hsm_release f0; lfs hsm_restore f0; done
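The two loops have to run concurrently, one through each client mount. A minimal sketch of the same reproducer as a single script, assuming the mount points above; the readiness poll on `hsm_state` and the availability guard (so the sketch is safe to run off-cluster) are additions, not part of the original recipe:

```shell
#!/bin/sh
# Hedged sketch of the reproducer: race a cat loop against a
# release/restore loop on the same file, each through a different
# client mount. Paths are taken from the description above.
reproduce() {
    if ! command -v lfs >/dev/null 2>&1; then
        echo "lfs not found: run this on a Lustre client"
        return 0
    fi
    cd /mnt/lustre || return 1
    touch f0
    lfs hsm_archive f0
    # Wait for the archive to complete before starting the race.
    until lfs hsm_state f0 | grep -q archived; do sleep 1; done
    # Reader loop on the first client mount, in the background.
    ( while :; do cat /mnt/lustre/f0 >/dev/null; done ) &
    # Release/restore loop through the second client mount.
    while :; do
        lfs hsm_release /mnt/lustre2/f0
        lfs hsm_restore /mnt/lustre2/f0
    done
}
reproduce
```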

After a few seconds, cat hangs in ll_layout_refresh() while the CT, the coordinator, and all of the mdt/ldlm threads sit idle:

q:lustre2# p-all cat
22461 cat
[<ffffffffa04e866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa07f61fa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa07f0746>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
[<ffffffffa07f1025>] ldlm_cli_enqueue+0x3a5/0x770 [ptlrpc]
[<ffffffffa0a6ca2e>] mdc_enqueue+0x2ce/0x1a00 [mdc]
[<ffffffffa0a1da76>] lmv_enqueue+0x336/0x1060 [lmv]
[<ffffffffa0ee1e26>] ll_layout_refresh+0x556/0x1150 [lustre]
[<ffffffffa0f2fc4b>] vvp_io_fini+0x16b/0x260 [lustre]
[<ffffffffa0f310ec>] vvp_io_read_fini+0x5c/0x70 [lustre]
[<ffffffffa06adcf7>] cl_io_fini+0x77/0x280 [obdclass]
[<ffffffffa0ed0687>] ll_file_io_generic+0xe7/0x610 [lustre]
[<ffffffffa0ed0cef>] ll_file_aio_read+0x13f/0x2c0 [lustre]
[<ffffffffa0ed158c>] ll_file_read+0x16c/0x2a0 [lustre]
[<ffffffff81182e05>] vfs_read+0xb5/0x1a0
[<ffffffff81182f41>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

q:lustre2# p-all hsm
6332 hsm_cdtr
[<ffffffffa077d641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa05ebe0c>] mdt_coordinator+0xcac/0x1820 [mdt]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

6344 lhsmtool_posix
[<ffffffff8118cc3b>] pipe_wait+0x5b/0x80
[<ffffffff8118d6e6>] pipe_read+0x3e6/0x4e0
[<ffffffff8118251a>] do_sync_read+0xfa/0x140
[<ffffffff81182e05>] vfs_read+0xb5/0x1a0
[<ffffffff81182f41>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

5792 mdt00_000
[<ffffffffa077d66e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa0a5e6da>] ptlrpc_wait_event+0x28a/0x290 [ptlrpc]
[<ffffffffa0a681a7>] ptlrpc_main+0x7f7/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

...

On /mnt/lustre the HSM state of f0 shows as exists and archived. The coordinator does not show any actions or requests in flight.
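The hung-but-idle state described above can be checked from both sides. A sketch assuming the mount point from the reproducer and the coordinator's `mdt.*.hsm.actions` parameter; the availability guard is an addition so the commands are safe to run where Lustre tools are absent:

```shell
#!/bin/sh
# Hedged sketch: confirm that the file looks healthy while cat is
# still blocked and the coordinator has nothing in flight.
check_hsm_idle() {
    if ! command -v lfs >/dev/null 2>&1; then
        echo "lfs not found"
        return 0
    fi
    # Client view: f0 reports "exists archived", yet cat stays blocked.
    lfs hsm_state /mnt/lustre/f0
    # Coordinator view: no queued or in-flight actions on any MDT.
    lctl get_param mdt.*.hsm.actions
}
check_hsm_idle
```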



 Comments   
Comment by Jinshan Xiong (Inactive) [ 05/Aug/13 ]

I have seen this kind of problem quite often before. The root cause was that the CT hit errors, so the layout lock was never released, which left the process on the client blocked.

Did you see any error messages printed by the CT daemon?

Comment by John Hammond [ 06/Aug/13 ]

No.

Comment by Jinshan Xiong (Inactive) [ 05/Sep/13 ]

This is resolved.

Generated at Sat Feb 10 01:36:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.