HSM _not only_ small fixes and to do list goes here (LU-3647)
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Technical task |
| Priority: | Major |
| Reporter: | John Hammond |
| Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed |
| Votes: | 0 |
| Labels: | HSM |
| Rank (Obsolete): | 9543 |
| Description |
Using 2.4.52-90-g8b8b7b3 with patch set 12 of the coordinator main thread, I see what looks like a lost agent request / missing layout lock release. To reproduce, start HSM with two client mounts (/mnt/lustre and /mnt/lustre2) and one mount for the CT (/mnt/lustre-hsm). Then do:

# cd /mnt/lustre
# touch f0
# lfs hsm_archive f0
# # Wait for archive to complete.
# while true; do cat f0; done

and in another shell:

# cd /mnt/lustre
# while true; do lfs hsm_release f0; lfs hsm_restore f0; done

After a few seconds cat will hang in layout refresh, while the CT, the coordinator, and all of the mdt/ldlm threads are idle:

q:lustre2# p-all cat
22461 cat
[<ffffffffa04e866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa07f61fa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa07f0746>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
[<ffffffffa07f1025>] ldlm_cli_enqueue+0x3a5/0x770 [ptlrpc]
[<ffffffffa0a6ca2e>] mdc_enqueue+0x2ce/0x1a00 [mdc]
[<ffffffffa0a1da76>] lmv_enqueue+0x336/0x1060 [lmv]
[<ffffffffa0ee1e26>] ll_layout_refresh+0x556/0x1150 [lustre]
[<ffffffffa0f2fc4b>] vvp_io_fini+0x16b/0x260 [lustre]
[<ffffffffa0f310ec>] vvp_io_read_fini+0x5c/0x70 [lustre]
[<ffffffffa06adcf7>] cl_io_fini+0x77/0x280 [obdclass]
[<ffffffffa0ed0687>] ll_file_io_generic+0xe7/0x610 [lustre]
[<ffffffffa0ed0cef>] ll_file_aio_read+0x13f/0x2c0 [lustre]
[<ffffffffa0ed158c>] ll_file_read+0x16c/0x2a0 [lustre]
[<ffffffff81182e05>] vfs_read+0xb5/0x1a0
[<ffffffff81182f41>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

q:lustre2# p-all hsm
6332 hsm_cdtr
[<ffffffffa077d641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa05ebe0c>] mdt_coordinator+0xcac/0x1820 [mdt]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

6344 lhsmtool_posix
[<ffffffff8118cc3b>] pipe_wait+0x5b/0x80
[<ffffffff8118d6e6>] pipe_read+0x3e6/0x4e0
[<ffffffff8118251a>] do_sync_read+0xfa/0x140
[<ffffffff81182e05>] vfs_read+0xb5/0x1a0
[<ffffffff81182f41>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

5792 mdt00_000
[<ffffffffa077d66e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa0a5e6da>] ptlrpc_wait_event+0x28a/0x290 [ptlrpc]
[<ffffffffa0a681a7>] ptlrpc_main+0x7f7/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

...

On /mnt/lustre the HSM state of f0 shows as exists and archived. The coordinator does not show any actions or requests in flight.
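For reference, the client and coordinator state mentioned above can be cross-checked with the stock commands below; a minimal sketch, assuming the standard 2.5 lfs subcommands and mdt.*.hsm parameter names (exact output format may differ):

# # Client: show the HSM flags and any in-progress action on the file.
# lfs hsm_state /mnt/lustre/f0
# lfs hsm_action /mnt/lustre/f0
# # MDS: dump the coordinator's request queue; a lost agent request
# # would either be missing here or stuck in STARTED.
# lctl get_param mdt.*.hsm.actions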
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 05/Aug/13 ] |
I saw this kind of problem quite often before; the root cause was that the CT hit errors, so the layout lock failed to be released, which blocked the process on the client. Did you see any error messages printed by the CT daemon?
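For reference, CT-side failures can usually be spotted without instrumenting the copytool; a minimal sketch, assuming the standard mdt.*.hsm parameter names (the exact status strings may vary by version):

# # MDS: confirm the agent is still registered with the coordinator.
# lctl get_param mdt.*.hsm.agents
# # MDS: look for requests that completed with an error.
# lctl get_param mdt.*.hsm.actions | grep -i fail
# # CT node: check the console for copytool/llapi error messages.
# dmesg | tail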
| Comment by John Hammond [ 06/Aug/13 ] |
No.
| Comment by Jinshan Xiong (Inactive) [ 05/Sep/13 ] |
This is solved.