Lustre / LU-3647 HSM _not only_ small fixes and to do list goes here / LU-3697

racing HSM release and restore vs cat leaves cat hung in ll_layout_refresh()

Details

    • Technical task
    • Resolution: Fixed
    • Major
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 9543

    Description

      Using 2.4.52-90-g8b8b7b3 with patch set 12 of the coordinator main thread, I see what looks like a lost agent request or a missing layout lock release. To reproduce, start HSM with two client mounts (/mnt/lustre and /mnt/lustre2) and one mount for the CT (/mnt/lustre-hsm). Then run the following two command sequences in separate shells, so that the cat loop and the release/restore loop race:

      # cd /mnt/lustre
      # touch f0
      # lfs hsm_archive f0
      # # Wait for archive to complete.
      # while true; do cat f0; done
      
      # cd /mnt/lustre
      # while true; do lfs hsm_release f0; lfs hsm_restore f0; done
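
      Both loops above run forever, so the hang has to be spotted by eye. A minimal sketch of a reader with a deadline follows. The names `read_completes` and `HUNG_PID` and the 30-second default are illustrative assumptions, not part of the original reproducer. The function polls instead of waiting on cat, so a stuck reader is left running for inspection rather than killed.

```shell
# Hypothetical helper: read a file in the background and stop waiting
# after a deadline instead of blocking on the reader forever.
# Returns 0 if the read finished, 1 if it is still running at the deadline.
# On return 1, HUNG_PID holds the PID of the reader that did not finish.
read_completes() {
    _file=$1
    _limit=${2:-30}          # seconds; 30 is an arbitrary default
    cat "$_file" > /dev/null 2>&1 &
    _pid=$!
    _waited=0
    # kill -0 only tests whether the process still exists; it sends no signal
    while kill -0 "$_pid" 2>/dev/null; do
        if [ "$_waited" -ge "$_limit" ]; then
            HUNG_PID=$_pid
            return 1
        fi
        sleep 1
        _waited=$((_waited + 1))
    done
    return 0
}
```

      It could stand in for the cat loop as `while read_completes f0; do :; done`, after which `$HUNG_PID` identifies the process whose stack is worth dumping.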
      

      After a few seconds cat will hang in layout refresh, while the CT, the coordinator, and all of the mdt/ldlm threads will be idle:

      q:lustre2# p-all cat
      22461 cat
      [<ffffffffa04e866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa07f61fa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      [<ffffffffa07f0746>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
      [<ffffffffa07f1025>] ldlm_cli_enqueue+0x3a5/0x770 [ptlrpc]
      [<ffffffffa0a6ca2e>] mdc_enqueue+0x2ce/0x1a00 [mdc]
      [<ffffffffa0a1da76>] lmv_enqueue+0x336/0x1060 [lmv]
      [<ffffffffa0ee1e26>] ll_layout_refresh+0x556/0x1150 [lustre]
      [<ffffffffa0f2fc4b>] vvp_io_fini+0x16b/0x260 [lustre]
      [<ffffffffa0f310ec>] vvp_io_read_fini+0x5c/0x70 [lustre]
      [<ffffffffa06adcf7>] cl_io_fini+0x77/0x280 [obdclass]
      [<ffffffffa0ed0687>] ll_file_io_generic+0xe7/0x610 [lustre]
      [<ffffffffa0ed0cef>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa0ed158c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff81182e05>] vfs_read+0xb5/0x1a0
      [<ffffffff81182f41>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      q:lustre2# p-all hsm
      6332 hsm_cdtr
      [<ffffffffa077d641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa05ebe0c>] mdt_coordinator+0xcac/0x1820 [mdt]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      6344 lhsmtool_posix
      [<ffffffff8118cc3b>] pipe_wait+0x5b/0x80
      [<ffffffff8118d6e6>] pipe_read+0x3e6/0x4e0
      [<ffffffff8118251a>] do_sync_read+0xfa/0x140
      [<ffffffff81182e05>] vfs_read+0xb5/0x1a0
      [<ffffffff81182f41>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      5792 mdt00_000
      [<ffffffffa077d66e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa0a5e6da>] ptlrpc_wait_event+0x28a/0x290 [ptlrpc]
      [<ffffffffa0a681a7>] ptlrpc_main+0x7f7/0x1700 [ptlrpc]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      ...
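
      The `p-all` command used for the traces above is not defined in this ticket and appears to be a local helper. A rough sketch of something similar, assuming a Linux kernel that exposes `/proc/<pid>/comm` and `/proc/<pid>/stack`, is shown here. The name `dump_stacks` and the substring matching are assumptions, and reading the stack files normally requires root.

```shell
# Hypothetical stand-in for a helper like p-all: print the kernel stack of
# every process whose command name contains the given string.
dump_stacks() {
    _pattern=$1
    for _dir in /proc/[0-9]*; do
        # A process may exit while we iterate; skip it if comm is gone
        _comm=$(cat "$_dir/comm" 2>/dev/null) || continue
        case "$_comm" in
            *"$_pattern"*) ;;
            *) continue ;;
        esac
        # Header line: "<pid> <command name>"
        echo "${_dir#/proc/} $_comm"
        cat "$_dir/stack" 2>/dev/null || echo "(stack not readable)"
        echo
    done
}
```

      For example, `dump_stacks cat` on the client and `dump_stacks hsm` on the server would cover the processes listed above.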
      

      On /mnt/lustre the HSM state of f0 shows as exists and archived. The coordinator does not show any actions or requests in flight.
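
      The state check described above can be scripted. The sketch below relies on `lfs hsm_state` printing flag words such as `exists`, `archived`, and `released` after the file name. The function name `hsm_is_archived_online` is hypothetical, and the exact output format should be confirmed against the running version.

```shell
# Hypothetical check: succeed only if the file is flagged "exists" and
# "archived" but not "released", which is the state reported in this ticket.
# Returns 2 if lfs itself fails.
hsm_is_archived_online() {
    _state=$(lfs hsm_state "$1") || return 2
    # Drop the leading "<path>: " so the file name cannot match a flag word
    _state=${_state#*: }
    case "$_state" in
        *released*) return 1 ;;
        *exists*archived*) return 0 ;;
        *) return 1 ;;
    esac
}
```

      On the MDS, the coordinator's pending actions can usually be listed with `lctl get_param 'mdt.*.hsm.actions'`. That parameter name is an assumption about the version in use here.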

          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: jhammond John Hammond