Details
Technical task
Resolution: Fixed
Major
Lustre 2.5.0
9543
Description
Using 2.4.52-90-g8b8b7b3 with patch set 12 of the coordinator main thread, I see what looks like a lost agent request or a missing layout lock release. To reproduce, start HSM with two client mounts (/mnt/lustre and /mnt/lustre2) and one mount for the CT (/mnt/lustre-hsm). Then run:
# cd /mnt/lustre
# touch f0
# lfs hsm_archive f0
# # Wait for archive to complete.
# while true; do cat f0; done

# cd /mnt/lustre
# while true; do lfs hsm_release f0; lfs hsm_restore f0; done
After a few seconds, cat hangs in layout refresh, while the CT, the coordinator, and all of the mdt/ldlm threads are idle:
q:lustre2# p-all cat
22461 cat
[<ffffffffa04e866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa07f61fa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa07f0746>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
[<ffffffffa07f1025>] ldlm_cli_enqueue+0x3a5/0x770 [ptlrpc]
[<ffffffffa0a6ca2e>] mdc_enqueue+0x2ce/0x1a00 [mdc]
[<ffffffffa0a1da76>] lmv_enqueue+0x336/0x1060 [lmv]
[<ffffffffa0ee1e26>] ll_layout_refresh+0x556/0x1150 [lustre]
[<ffffffffa0f2fc4b>] vvp_io_fini+0x16b/0x260 [lustre]
[<ffffffffa0f310ec>] vvp_io_read_fini+0x5c/0x70 [lustre]
[<ffffffffa06adcf7>] cl_io_fini+0x77/0x280 [obdclass]
[<ffffffffa0ed0687>] ll_file_io_generic+0xe7/0x610 [lustre]
[<ffffffffa0ed0cef>] ll_file_aio_read+0x13f/0x2c0 [lustre]
[<ffffffffa0ed158c>] ll_file_read+0x16c/0x2a0 [lustre]
[<ffffffff81182e05>] vfs_read+0xb5/0x1a0
[<ffffffff81182f41>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
q:lustre2# p-all hsm
6332 hsm_cdtr
[<ffffffffa077d641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa05ebe0c>] mdt_coordinator+0xcac/0x1820 [mdt]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
6344 lhsmtool_posix
[<ffffffff8118cc3b>] pipe_wait+0x5b/0x80
[<ffffffff8118d6e6>] pipe_read+0x3e6/0x4e0
[<ffffffff8118251a>] do_sync_read+0xfa/0x140
[<ffffffff81182e05>] vfs_read+0xb5/0x1a0
[<ffffffff81182f41>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
5792 mdt00_000
[<ffffffffa077d66e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa0a5e6da>] ptlrpc_wait_event+0x28a/0x290 [ptlrpc]
[<ffffffffa0a681a7>] ptlrpc_main+0x7f7/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
...
On /mnt/lustre the HSM state of f0 shows as exists and archived. The coordinator does not show any actions or requests in flight.