Lustre / LU-3647 HSM _not only_ small fixes and to do list goes here / LU-3931

mdt_hsm_release() calls ldlm_lock_cancel() but does not reprocess resource


Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 10387

    Description

      To reproduce, set up HSM and run:

      # cd /mnt/lustre
      # touch f0
      # lfs hsm_archive f0
      # # Wait for archive to complete.
      # echo +hsm > /proc/sys/lnet/debug
      # lctl clear; while true; do lfs hsm_release f0; done
      ...
      Cannot send HSM request (use of f0): Device or resource busy
      ...
      
      ## In another shell:
      # cd /mnt/lustre
      # while true; do sys_open f0 r; done ## calls open("f0", O_RDONLY)
      
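      `sys_open` is not a standard utility; the only thing stated above is that it calls open("f0", O_RDONLY). A minimal stand-in might look like the sketch below. The function name, the mapping of the `r` argument to O_RDONLY, and the immediate close are all assumptions made for illustration, not the actual test helper.

```python
import os

# Hypothetical stand-in for the sys_open helper used in the second shell.
# Assumption: the "r" argument selects O_RDONLY; the other letters are
# illustrative only and do not come from the report.
MODES = {"r": os.O_RDONLY, "w": os.O_WRONLY, "rw": os.O_RDWR}

def sys_open(path, mode="r"):
    """Open path with the requested flags, then close it straight away."""
    fd = os.open(path, MODES[mode])  # thin wrapper over open(2)
    os.close(fd)
    return 0
```

      Calling `sys_open("f0", "r")` in a tight loop from /mnt/lustre would play the role of the second shell in the reproducer.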

      Soon the MDT handler for open() will wedge trying to get a CR OPEN lock on f0.

      # stack lfs
      21597 lfs
      [<ffffffffa03029bd>] mdc_enqueue+0x22d/0x1a10 [mdc]
      [<ffffffffa030439d>] mdc_intent_lock+0x1fd/0x64a [mdc]
      [<ffffffffa02c01b3>] lmv_intent_open+0x213/0x8d0 [lmv]
      [<ffffffffa02c0b2b>] lmv_intent_lock+0x2bb/0x380 [lmv]
      [<ffffffffa0923b25>] ll_revalidate_it+0x275/0x1b20 [lustre]
      [<ffffffffa0925503>] ll_revalidate_nd+0x133/0x3e0 [lustre]
      [<ffffffff81191cf6>] do_lookup+0x66/0x230
      [<ffffffff811925f4>] __link_path_walk+0x734/0x1030
      [<ffffffff8119317a>] path_walk+0x6a/0xe0
      [<ffffffff8119334b>] do_path_lookup+0x5b/0xa0
      [<ffffffff8119428b>] do_filp_open+0xfb/0xdc0
      [<ffffffff8117f849>] do_sys_open+0x69/0x140
      [<ffffffff8117f960>] sys_open+0x20/0x30
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      # stack sys_open
      21596 sys_open
      [<ffffffffa0de0641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa1080730>] ptlrpc_set_wait+0x2f0/0x8c0 [ptlrpc]
      [<ffffffffa1080d87>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
      [<ffffffffa105cee5>] ldlm_cli_enqueue+0x365/0x790 [ptlrpc]
      [<ffffffffa0302a4e>] mdc_enqueue+0x2be/0x1a10 [mdc]
      [<ffffffffa030439d>] mdc_intent_lock+0x1fd/0x64a [mdc]
      [<ffffffffa02c01b3>] lmv_intent_open+0x213/0x8d0 [lmv]
      [<ffffffffa02c0b2b>] lmv_intent_lock+0x2bb/0x380 [lmv]
      [<ffffffffa0923b25>] ll_revalidate_it+0x275/0x1b20 [lustre]
      [<ffffffffa0925503>] ll_revalidate_nd+0x133/0x3e0 [lustre]
      [<ffffffff81191cf6>] do_lookup+0x66/0x230
      [<ffffffff811925f4>] __link_path_walk+0x734/0x1030
      [<ffffffff8119317a>] path_walk+0x6a/0xe0
      [<ffffffff8119334b>] do_path_lookup+0x5b/0xa0
      [<ffffffff8119428b>] do_filp_open+0xfb/0xdc0
      [<ffffffff8117f849>] do_sys_open+0x69/0x140
      [<ffffffff8117f960>] sys_open+0x20/0x30
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      # stack mdt00_002
      20391 mdt00_002
      [<ffffffffa0de0641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa10620ad>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
      [<ffffffffa10617d0>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      [<ffffffffa0726c9b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      [<ffffffffa0727514>] mdt_object_lock+0x14/0x20 [mdt]
      [<ffffffffa0750084>] mdt_object_open_lock+0x744/0x990 [mdt]
      [<ffffffffa0757a3f>] mdt_reint_open+0xf8f/0x20a0 [mdt]
      [<ffffffffa0740e71>] mdt_reint_rec+0x41/0xe0 [mdt]
      [<ffffffffa0728c63>] mdt_reint_internal+0x4c3/0x780 [mdt]
      [<ffffffffa07291ed>] mdt_intent_reint+0x1ed/0x520 [mdt]
      [<ffffffffa07248ce>] mdt_intent_policy+0x3ae/0x770 [mdt]
      [<ffffffffa1042441>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
      [<ffffffffa106b0ef>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
      [<ffffffffa0724d96>] mdt_enqueue+0x46/0xe0 [mdt]
      [<ffffffffa072ba5a>] mdt_handle_common+0x52a/0x1470 [mdt]
      [<ffffffffa0765775>] mds_regular_handle+0x15/0x20 [mdt]
      [<ffffffffa109aaa5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      [<ffffffffa109bded>] ptlrpc_main+0xacd/0x1710 [ptlrpc]
      [<ffffffff81096a36>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      I think the close handler (for HSM release) cancels the EX OPEN lock before the LDLM_BL_CALLBACK can respond. Since the lock is already cancelled by then, the resource never gets reprocessed, so the normal open lock is never granted.
      Dumping the locks on the MDT-side resource for f0 confirms this: there is a waiting CR lock that does not conflict with any of the granted locks.

      --- Resource: [0x200000400:0x2:0x0].0 (ffff880198cc9800) refcount = 4
      ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff88021a008dc0/0x9af58bd61d165d82 lrc: 2/0,0 mode: CR/CR res: [0x200000400:0x2:0x0].0 bits 0x9 rrc: 4 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x9af58bd61d165d66 expref: 10 pid: 20722 timeout: 0 lvb_type: 0
      ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff880198c29940/0x9af58bd61d165d5f lrc: 2/0,0 mode: PR/PR res: [0x200000400:0x2:0x0].0 bits 0x1b rrc: 4 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x9af58bd61d165d43 expref: 10 pid: 20722 timeout: 0 lvb_type: 0
      ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801a6497980/0x9af58bd61d16ee59 lrc: 3/1,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x4 rrc: 4 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20391 timeout: 0 lvb_type: 0
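      The situation in the dump can be illustrated with a toy model of a single resource. This is a sketch, not Lustre code: the `Lock` and `Resource` classes and the `scenario` helper are hypothetical, the compatibility table is the usual LDLM mode matrix as I understand it (NL, GROUP and COS omitted), and the EX lock is assumed to cover only the OPEN bit (0x4). The modes and bits of the other three locks are taken from the dump above.

```python
from dataclasses import dataclass, field

# For each mode, the set of modes it can coexist with (assumed standard matrix).
COMPAT = {
    "EX": set(),
    "PW": {"CR"},
    "PR": {"PR", "CR"},
    "CW": {"CW", "CR"},
    "CR": {"CR", "CW", "PR", "PW"},
}

@dataclass(eq=False)
class Lock:
    mode: str
    bits: int

def conflicts(a, b):
    """Inodebits locks conflict only if the modes clash AND the bits overlap."""
    return b.mode not in COMPAT[a.mode] and (a.bits & b.bits) != 0

@dataclass
class Resource:
    granted: list = field(default_factory=list)
    waiting: list = field(default_factory=list)

    def enqueue(self, lock):
        if any(conflicts(lock, g) for g in self.granted):
            self.waiting.append(lock)  # stays here until the queue is reprocessed
        else:
            self.granted.append(lock)

    def reprocess(self):
        """Grant waiters in order, stopping at the first one still blocked."""
        while self.waiting and not any(
            conflicts(self.waiting[0], g) for g in self.granted
        ):
            self.granted.append(self.waiting.pop(0))

    def cancel(self, lock, reprocess):
        self.granted.remove(lock)
        if reprocess:
            self.reprocess()

def scenario(reprocess_on_cancel):
    """Return the number of locks still waiting after the EX lock is cancelled."""
    res = Resource()
    res.enqueue(Lock("CR", 0x9))    # granted CR lock from the dump
    res.enqueue(Lock("PR", 0x1b))   # granted PR lock from the dump
    ex_open = Lock("EX", 0x4)       # assumed: release holds EX on the OPEN bit
    res.enqueue(ex_open)
    res.enqueue(Lock("CR", 0x4))    # open() handler's request, blocked by EX
    res.cancel(ex_open, reprocess_on_cancel)
    return len(res.waiting)
```

      In this model `scenario(False)` returns 1, which matches the stuck `--/CR` lock in the dump, while `scenario(True)` returns 0 because the waiter is granted as soon as the queue is rescanned after the cancel.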
      

          People

            jhammond John Hammond