HSM _not only_ small fixes and to do list goes here
(LU-3647)
|
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Technical task | Priority: | Blocker |
| Reporter: | John Hammond | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Rank (Obsolete): | 10387 |
| Description |
|
To reproduce, set up HSM and run:
# cd /mnt/lustre
# touch f0
# lfs hsm_archive f0
# # Wait for the archive to complete.
# echo +hsm > /proc/sys/lnet/debug
# lctl clear; while true; do lfs hsm_release f0; done
...
Cannot send HSM request (use of f0): Device or resource busy
...
## In another shell:
# cd /mnt/lustre
# while true; do sys_open f0 r; done ## calls open("f0", O_RDONLY)
Soon the MDT handler for open() will wedge trying to get a CR OPEN lock on f0.

# stack lfs
21597 lfs
[<ffffffffa03029bd>] mdc_enqueue+0x22d/0x1a10 [mdc]
[<ffffffffa030439d>] mdc_intent_lock+0x1fd/0x64a [mdc]
[<ffffffffa02c01b3>] lmv_intent_open+0x213/0x8d0 [lmv]
[<ffffffffa02c0b2b>] lmv_intent_lock+0x2bb/0x380 [lmv]
[<ffffffffa0923b25>] ll_revalidate_it+0x275/0x1b20 [lustre]
[<ffffffffa0925503>] ll_revalidate_nd+0x133/0x3e0 [lustre]
[<ffffffff81191cf6>] do_lookup+0x66/0x230
[<ffffffff811925f4>] __link_path_walk+0x734/0x1030
[<ffffffff8119317a>] path_walk+0x6a/0xe0
[<ffffffff8119334b>] do_path_lookup+0x5b/0xa0
[<ffffffff8119428b>] do_filp_open+0xfb/0xdc0
[<ffffffff8117f849>] do_sys_open+0x69/0x140
[<ffffffff8117f960>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# stack sys_open
21596 sys_open
[<ffffffffa0de0641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa1080730>] ptlrpc_set_wait+0x2f0/0x8c0 [ptlrpc]
[<ffffffffa1080d87>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa105cee5>] ldlm_cli_enqueue+0x365/0x790 [ptlrpc]
[<ffffffffa0302a4e>] mdc_enqueue+0x2be/0x1a10 [mdc]
[<ffffffffa030439d>] mdc_intent_lock+0x1fd/0x64a [mdc]
[<ffffffffa02c01b3>] lmv_intent_open+0x213/0x8d0 [lmv]
[<ffffffffa02c0b2b>] lmv_intent_lock+0x2bb/0x380 [lmv]
[<ffffffffa0923b25>] ll_revalidate_it+0x275/0x1b20 [lustre]
[<ffffffffa0925503>] ll_revalidate_nd+0x133/0x3e0 [lustre]
[<ffffffff81191cf6>] do_lookup+0x66/0x230
[<ffffffff811925f4>] __link_path_walk+0x734/0x1030
[<ffffffff8119317a>] path_walk+0x6a/0xe0
[<ffffffff8119334b>] do_path_lookup+0x5b/0xa0
[<ffffffff8119428b>] do_filp_open+0xfb/0xdc0
[<ffffffff8117f849>] do_sys_open+0x69/0x140
[<ffffffff8117f960>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# stack mdt00_002
20391 mdt00_002
[<ffffffffa0de0641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa10620ad>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa10617d0>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
[<ffffffffa0726c9b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa0727514>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0750084>] mdt_object_open_lock+0x744/0x990 [mdt]
[<ffffffffa0757a3f>] mdt_reint_open+0xf8f/0x20a0 [mdt]
[<ffffffffa0740e71>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0728c63>] mdt_reint_internal+0x4c3/0x780 [mdt]
[<ffffffffa07291ed>] mdt_intent_reint+0x1ed/0x520 [mdt]
[<ffffffffa07248ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa1042441>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa106b0ef>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa0724d96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa072ba5a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0765775>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa109aaa5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa109bded>] ptlrpc_main+0xacd/0x1710 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

I think the close handler (for HSM release) cancels the EX OPEN lock before the LDLM_BL_CALLBACK can respond. Since the lock is already cancelled, the resource never gets reprocessed, and so the normal open lock is not granted.

--- Resource: [0x200000400:0x2:0x0].0 (ffff880198cc9800) refcount = 4
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff88021a008dc0/0x9af58bd61d165d82 lrc: 2/0,0 mode: CR/CR res: [0x200000400:0x2:0x0].0 bits 0x9 rrc: 4 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x9af58bd61d165d66 expref: 10 pid: 20722 timeout: 0 lvb_type: 0
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff880198c29940/0x9af58bd61d165d5f lrc: 2/0,0 mode: PR/PR res: [0x200000400:0x2:0x0].0 bits 0x1b rrc: 4 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x9af58bd61d165d43 expref: 10 pid: 20722 timeout: 0 lvb_type: 0
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801a6497980/0x9af58bd61d16ee59 lrc: 3/1,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x4 rrc: 4 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20391 timeout: 0 lvb_type: 0 |
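
The suspected race can be sketched with a toy model (this is illustrative Python, not Lustre code; the Resource class, its method names, and the simplified mode-conflict rules are all invented stand-ins for the LDLM resource, its wait queue, and the server's reprocess-on-cancel step): if the cancel path drops a granted lock without reprocessing the resource, a now-compatible waiter is never granted and its enqueuer hangs, exactly as the mdt00_002 thread does above.

```python
import threading

class Resource:
    """Toy model of a lock resource: granted modes plus a wait queue."""
    def __init__(self):
        self.granted = set()   # modes currently granted
        self.waiting = []      # (mode, threading.Event) pairs

    def conflicts(self, mode):
        # Simplified matrix: EX conflicts with everything,
        # other modes conflict only with EX.
        return "EX" in self.granted or (mode == "EX" and bool(self.granted))

    def enqueue(self, mode):
        ev = threading.Event()  # set once the lock is granted
        if not self.conflicts(mode):
            self.granted.add(mode)
            ev.set()
        else:
            self.waiting.append((mode, ev))
        return mode, ev

    def reprocess(self):
        # Grant any waiters whose conflicts have gone away.
        still_waiting = []
        for mode, ev in self.waiting:
            if not self.conflicts(mode):
                self.granted.add(mode)
                ev.set()
            else:
                still_waiting.append((mode, ev))
        self.waiting = still_waiting

    def cancel(self, mode, reprocess=True):
        self.granted.discard(mode)
        if reprocess:          # the step the buggy ordering skips
            self.reprocess()

res = Resource()
res.enqueue("EX")             # HSM release holds the EX OPEN lock
_, cr_ev = res.enqueue("CR")  # open() waits for a CR OPEN lock

# Buggy ordering: the EX lock is cancelled, but because the blocking
# callback finds it already gone, no reprocess runs on the resource.
res.cancel("EX", reprocess=False)
print(cr_ev.is_set())         # False: the CR waiter hangs forever

res.reprocess()               # what a correct cancel path would do
print(cr_ev.is_set())         # True: the CR lock is now granted
```

The point of the sketch is only the ordering: cancellation must be paired with a reprocess of the resource, or waiters queued behind the cancelled lock are stranded.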
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 11/Sep/13 ] |
|
Indeed. Can you please work out a patch for this? |
| Comment by John Hammond [ 11/Sep/13 ] |
|
Please see http://review.whamcloud.com/7621. |