Details
-
Technical task
-
Resolution: Fixed
-
Blocker
-
Lustre 2.5.0
-
10387
Description
To reproduce, set up HSM and run:
# cd /mnt/lustre
# touch f0
# lfs hsm_archive f0
# # Wait for archive to complete.
# echo +hsm > /proc/sys/lnet/debug
# lctl clear; while true; do lfs hsm_release f0; done
...
Cannot send HSM request (use of f0): Device or resource busy
...
## In another shell:
# cd /mnt/lustre
# while true; do sys_open f0 r; done ## calls open("f0", O_RDONLY)
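For anyone without the sys_open test helper at hand, the second shell's loop can be approximated as below. This is only a sketch: open_loop and its iterations parameter are hypothetical names, and it assumes the helper does nothing beyond open("f0", O_RDONLY) followed by a close.

```python
import itertools
import os


def open_loop(path, iterations=None):
    """Open `path` read-only and close it again, `iterations` times.

    iterations=None loops until interrupted, like the shell `while true` loop.
    Returns the number of successful opens when the loop is bounded.
    """
    counter = itertools.count() if iterations is None else range(iterations)
    opened = 0
    for _ in counter:
        fd = os.open(path, os.O_RDONLY)  # same flags as open("f0", O_RDONLY)
        os.close(fd)
        opened += 1
    return opened
```

Calling open_loop("/mnt/lustre/f0") from a second shell while the hsm_release loop is running should exercise the same open path as the original reproducer.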
Soon the MDT handler for open() will wedge trying to get a CR OPEN lock on f0.
# stack lfs
21597 lfs
[<ffffffffa03029bd>] mdc_enqueue+0x22d/0x1a10 [mdc]
[<ffffffffa030439d>] mdc_intent_lock+0x1fd/0x64a [mdc]
[<ffffffffa02c01b3>] lmv_intent_open+0x213/0x8d0 [lmv]
[<ffffffffa02c0b2b>] lmv_intent_lock+0x2bb/0x380 [lmv]
[<ffffffffa0923b25>] ll_revalidate_it+0x275/0x1b20 [lustre]
[<ffffffffa0925503>] ll_revalidate_nd+0x133/0x3e0 [lustre]
[<ffffffff81191cf6>] do_lookup+0x66/0x230
[<ffffffff811925f4>] __link_path_walk+0x734/0x1030
[<ffffffff8119317a>] path_walk+0x6a/0xe0
[<ffffffff8119334b>] do_path_lookup+0x5b/0xa0
[<ffffffff8119428b>] do_filp_open+0xfb/0xdc0
[<ffffffff8117f849>] do_sys_open+0x69/0x140
[<ffffffff8117f960>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# stack sys_open
21596 sys_open
[<ffffffffa0de0641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa1080730>] ptlrpc_set_wait+0x2f0/0x8c0 [ptlrpc]
[<ffffffffa1080d87>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa105cee5>] ldlm_cli_enqueue+0x365/0x790 [ptlrpc]
[<ffffffffa0302a4e>] mdc_enqueue+0x2be/0x1a10 [mdc]
[<ffffffffa030439d>] mdc_intent_lock+0x1fd/0x64a [mdc]
[<ffffffffa02c01b3>] lmv_intent_open+0x213/0x8d0 [lmv]
[<ffffffffa02c0b2b>] lmv_intent_lock+0x2bb/0x380 [lmv]
[<ffffffffa0923b25>] ll_revalidate_it+0x275/0x1b20 [lustre]
[<ffffffffa0925503>] ll_revalidate_nd+0x133/0x3e0 [lustre]
[<ffffffff81191cf6>] do_lookup+0x66/0x230
[<ffffffff811925f4>] __link_path_walk+0x734/0x1030
[<ffffffff8119317a>] path_walk+0x6a/0xe0
[<ffffffff8119334b>] do_path_lookup+0x5b/0xa0
[<ffffffff8119428b>] do_filp_open+0xfb/0xdc0
[<ffffffff8117f849>] do_sys_open+0x69/0x140
[<ffffffff8117f960>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# stack mdt00_002
20391 mdt00_002
[<ffffffffa0de0641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa10620ad>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa10617d0>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
[<ffffffffa0726c9b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa0727514>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0750084>] mdt_object_open_lock+0x744/0x990 [mdt]
[<ffffffffa0757a3f>] mdt_reint_open+0xf8f/0x20a0 [mdt]
[<ffffffffa0740e71>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0728c63>] mdt_reint_internal+0x4c3/0x780 [mdt]
[<ffffffffa07291ed>] mdt_intent_reint+0x1ed/0x520 [mdt]
[<ffffffffa07248ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa1042441>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa106b0ef>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
[<ffffffffa0724d96>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa072ba5a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0765775>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa109aaa5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa109bded>] ptlrpc_main+0xacd/0x1710 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
I think the close handler (for HSM release) cancels the EX OPEN lock before the LDLM_BL_CALLBACK can respond. Since the lock is already cancelled, the resource never gets reprocessed, so the normal open lock is never granted.
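The suspected sequence can be illustrated with a toy model. This is not Lustre code: Resource, enqueue, cancel, and the reprocess flag are made-up names, and the compatibility rule is reduced to "EX conflicts with everything". It only shows why a waiter stays stuck when a cancel does not trigger reprocessing of the resource; how the real code reaches that state is not modelled.

```python
EX, CR = "EX", "CR"


def compatible(a, b):
    # Simplifying assumption: only EX conflicts, and it conflicts with everything.
    return EX not in (a, b)


class Resource:
    """Toy lock resource with a granted list and a waiting list."""

    def __init__(self):
        self.granted = []
        self.waiting = []

    def enqueue(self, mode):
        # Grant immediately if nothing granted conflicts, else queue the request.
        if all(compatible(mode, g) for g in self.granted):
            self.granted.append(mode)
            return True
        self.waiting.append(mode)
        return False

    def reprocess(self):
        # Re-check every waiter against the current granted list.
        still_waiting = []
        for mode in self.waiting:
            if all(compatible(mode, g) for g in self.granted):
                self.granted.append(mode)
            else:
                still_waiting.append(mode)
        self.waiting = still_waiting

    def cancel(self, mode, reprocess=True):
        self.granted.remove(mode)
        if reprocess:
            self.reprocess()


def demo(reprocess):
    res = Resource()
    res.enqueue(EX)  # HSM release holds the EX open lock
    res.enqueue(CR)  # open() asks for CR and has to wait
    res.cancel(EX, reprocess=reprocess)
    return res.waiting
```

With demo(True) the CR request is granted once the EX lock goes away. With demo(False) it stays on the waiting list even though nothing granted conflicts with it, which matches the state seen in this report.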
Dumping the locks on the MDT-side resource for f0 confirms this. We have a waiting CR lock which does not conflict with the granted locks.
--- Resource: [0x200000400:0x2:0x0].0 (ffff880198cc9800) refcount = 4
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff88021a008dc0/0x9af58bd61d165d82 lrc: 2/0,0 mode: CR/CR res: [0x200000400:0x2:0x0].0 bits 0x9 rrc: 4 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x9af58bd61d165d66 expref: 10 pid: 20722 timeout: 0 lvb_type: 0
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff880198c29940/0x9af58bd61d165d5f lrc: 2/0,0 mode: PR/PR res: [0x200000400:0x2:0x0].0 bits 0x1b rrc: 4 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x9af58bd61d165d43 expref: 10 pid: 20722 timeout: 0 lvb_type: 0
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801a6497980/0x9af58bd61d16ee59 lrc: 3/1,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x4 rrc: 4 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20391 timeout: 0 lvb_type: 0
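The "does not conflict" claim can be checked by hand from the modes and bits in the dump. The sketch below assumes the usual LDLM mode compatibility table (written from memory of lustre_dlm.h, with NL omitted, so verify it against the tree you are debugging) and assumes that two inodebits locks conflict only when their modes are incompatible and their bits overlap. COMPAT and conflicts are illustrative names, not Lustre symbols.

```python
# Modes each requested mode can coexist with (assumed table, NL left out).
COMPAT = {
    "EX": set(),
    "PW": {"CR"},
    "PR": {"CR", "PR"},
    "CW": {"CR", "CW"},
    "CR": {"CR", "CW", "PR", "PW"},
}


def conflicts(req_mode, req_bits, granted_mode, granted_bits):
    """True if the request must wait behind the granted lock."""
    mode_ok = granted_mode in COMPAT[req_mode]
    return (not mode_ok) and bool(req_bits & granted_bits)


# Values copied from the lock dump above.
granted = [("CR", 0x9), ("PR", 0x1b)]
waiting = ("CR", 0x4)

# Granted locks that would legitimately block the waiting request.
blockers = [g for g in granted if conflicts(*waiting, *g)]
```

Here blockers comes out empty: CR is compatible with both CR and PR, and bit 0x4 overlaps neither 0x9 nor 0x1b, so nothing on the granted list justifies keeping the request waiting.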