racing HSM release and restore vs cat leaves cat hung in ll_layout_refresh()

Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 9543

    Description

      Using 2.4.52-90-g8b8b7b3 with patch set 12 of the coordinator main thread patch, I see what looks like a lost agent request or a missing layout lock release. To reproduce, start HSM with two client mounts (/mnt/lustre and /mnt/lustre2) and one mount for the CT (/mnt/lustre-hsm). Then run:

      # cd /mnt/lustre
      # touch f0
      # lfs hsm_archive f0
      # # Wait for archive to complete.
      # while true; do cat f0; done
      
      # cd /mnt/lustre
      # while true; do lfs hsm_release f0; lfs hsm_restore f0; done
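
The two loops above can be combined into one script that stops as soon as a read stalls. The following is only a sketch: the function names, the HANG_SECS threshold, and the choice to poll with `kill -0` are assumptions and not part of the original report. A legitimate restore can also block a read for a while, so the threshold should be generous.

```shell
# read_hangs FILE SECS: start `cat FILE` in the background and poll it.
# Returns 0 if cat is still running after SECS seconds (treated as hung),
# non-zero if it finished in time. The pid of the reader is left in $pid
# so that a hung reader can be inspected afterwards instead of being killed.
read_hangs() {
    cat "$1" > /dev/null 2>&1 &
    pid=$!
    i=0
    while [ "$i" -lt "$2" ]; do
        kill -0 "$pid" 2>/dev/null || { wait "$pid"; return 1; }
        sleep 1
        i=$((i + 1))
    done
    kill -0 "$pid" 2>/dev/null
}

# churn FILE: release and restore FILE forever, as in the second loop.
churn() {
    while true; do lfs hsm_release "$1"; lfs hsm_restore "$1"; done
}

# repro READ_PATH CHURN_PATH: run churn in the background and keep reading
# until one read does not complete within HANG_SECS (default 30, an
# arbitrary illustrative value).
repro() {
    churn "$2" > /dev/null 2>&1 &
    churn_pid=$!
    until read_hangs "$1" "${HANG_SECS:-30}"; do :; done
    kill "$churn_pid"
    echo "cat (pid $pid) appears hung on $1"
}
```

After archiving f0 as shown above, something like `repro /mnt/lustre/f0 /mnt/lustre/f0` would run both loops. The hung cat is deliberately left alive so its kernel stack can still be dumped.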
      

      After a few seconds, cat hangs in layout refresh, while the CT, the coordinator, and all of the mdt/ldlm threads are idle:

      q:lustre2# p-all cat
      22461 cat
      [<ffffffffa04e866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa07f61fa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      [<ffffffffa07f0746>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
      [<ffffffffa07f1025>] ldlm_cli_enqueue+0x3a5/0x770 [ptlrpc]
      [<ffffffffa0a6ca2e>] mdc_enqueue+0x2ce/0x1a00 [mdc]
      [<ffffffffa0a1da76>] lmv_enqueue+0x336/0x1060 [lmv]
      [<ffffffffa0ee1e26>] ll_layout_refresh+0x556/0x1150 [lustre]
      [<ffffffffa0f2fc4b>] vvp_io_fini+0x16b/0x260 [lustre]
      [<ffffffffa0f310ec>] vvp_io_read_fini+0x5c/0x70 [lustre]
      [<ffffffffa06adcf7>] cl_io_fini+0x77/0x280 [obdclass]
      [<ffffffffa0ed0687>] ll_file_io_generic+0xe7/0x610 [lustre]
      [<ffffffffa0ed0cef>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa0ed158c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81182e05>] vfs_read+0xb5/0x1a0
      [<ffffffff81182f41>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      q:lustre2# p-all hsm
      6332 hsm_cdtr
      [<ffffffffa077d641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa05ebe0c>] mdt_coordinator+0xcac/0x1820 [mdt]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      6344 lhsmtool_posix
      [<ffffffff8118cc3b>] pipe_wait+0x5b/0x80
      [<ffffffff8118d6e6>] pipe_read+0x3e6/0x4e0
      [<ffffffff8118251a>] do_sync_read+0xfa/0x140
      [<ffffffff81182e05>] vfs_read+0xb5/0x1a0
      [<ffffffff81182f41>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      5792 mdt00_000
      [<ffffffffa077d66e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa0a5e6da>] ptlrpc_wait_event+0x28a/0x290 [ptlrpc]
      [<ffffffffa0a681a7>] ptlrpc_main+0x7f7/0x1700 [ptlrpc]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      ...
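
The `p-all` command used above appears to be a local helper and is not defined in this report. A rough approximation, assuming a Linux kernel that exposes /proc/PID/stack (reading it normally requires root), might look like the sketch below. The name `stacks_of` is hypothetical, and matching is done with `pgrep`, which may select a different set of processes than the original helper did.

```shell
# stacks_of PATTERN: for every process whose name matches PATTERN, print
# "PID NAME" followed by its kernel stack from /proc/PID/stack.
stacks_of() {
    for p in $(pgrep "$1"); do
        echo "$p $(ps -o comm= -p "$p")"
        cat "/proc/$p/stack"
        echo
    done
}
```

For example, `stacks_of cat` on the client and `stacks_of hsm` on the server would collect output similar to the dumps above.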
      

      On /mnt/lustre the HSM state of f0 shows as exists and archived. The coordinator does not show any actions or requests in flight.
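
When the hang is being checked from a script, the flag words can be pulled out of the `lfs hsm_state` output. This sketch assumes the output has the form `f0: (0x00000009) exists archived, archive_id:1`; the report does not show the raw output, so that format and the helper name `hsm_flags` are assumptions.

```shell
# hsm_flags: read `lfs hsm_state` output on stdin and print only the flag
# words that follow the hexadecimal state mask, dropping the archive id.
hsm_flags() {
    sed -n 's/^.*: (0x[0-9a-fA-F]*) *\([^,]*\).*$/\1/p'
}
```

Usage would be `lfs hsm_state f0 | hsm_flags`, which in the hung state described here should print the exists and archived flags. `lfs hsm_action f0` can be used alongside it to check whether any HSM action is still recorded for the file.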

        Activity

          [LU-3697] racing HSM release and restore vs cat leaves cat hung in ll_layout_refresh()

          People

            jay Jinshan Xiong (Inactive)
            jhammond John Hammond