  1. Lustre
  2. LU-3647 HSM _not only_ small fixes and to do list goes here
  3. LU-3601

HSM release causes running restore to hang, hangs itself


Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0
    • 9136

    Description

      Running the HSM stack as of July 15, 2013, I see a hang when a release is issued while a restore is still running. To reproduce, I run the following script:

      #!/bin/bash
      
      export MOUNT_2=n
      export MDSCOUNT=1
      export PTLDEBUG="super inode ioctl warning dlmtrace error emerg ha rpctrace vfstrace config console"
      export DEBUG_SIZE=512
      
      hsm_root=/tmp/hsm_root
      
      rm -rf $hsm_root
      mkdir $hsm_root
      
      llmount.sh
      
      lctl conf_param lustre-MDT0000.mdt.hsm_control=enabled
      # lctl conf_param lustre-MDT0001.mdt.hsm_control=enabled
      sleep 10
      lhsmtool_posix --verbose --hsm_root=$hsm_root --bandwidth 1 lustre
      
      lctl dk > ~/hsm-0-mount.dk
      
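      set -x
      cd /mnt/lustre
      # Create a 2-stripe file and fill it with 100 MB of random data.
      lfs setstripe -c2 f0
      dd if=/dev/urandom of=f0 bs=1M count=100
      lctl dk > ~/hsm-1-dd.dk
      
      # Archive f0, wait 10 seconds, then dump the lock namespaces and debug log.
      lfs hsm_archive f0
      sleep 10
      echo > /proc/fs/lustre/ldlm/dump_namespaces
      lctl dk > ~/hsm-2-archive.dk
      
      # First release of f0.
      lfs hsm_release f0
      echo > /proc/fs/lustre/ldlm/dump_namespaces
      lctl dk > ~/hsm-3-release.dk
      
      # Request a restore of f0.
      lfs hsm_restore f0
      echo > /proc/fs/lustre/ldlm/dump_namespaces
      lctl dk > ~/hsm-4-restore.dk
      
      # Second release, issued while the restore is still running: this hangs.
      lfs hsm_release f0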
      

      The last command never returns. The stack of the MDS_CLOSE handler looks like

      10070
      [<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      [<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
      [<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      [<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
      [<ffffffffa08f9551>] mdt_mfd_close+0x351/0xde0 [mdt]
      [<ffffffffa08fb372>] mdt_close+0x662/0xa60 [mdt]
      [<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
      [<ffffffffa090c9e5>] mds_readpage_handle+0x15/0x20 [mdt]
      [<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      while the MDS_HSM_PROGRESS handler looks like:

      10065
      [<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      [<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
      [<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      [<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
      [<ffffffffa08cf721>] mdt_object_find_lock+0x61/0x170 [mdt]
      [<ffffffffa091dc22>] hsm_get_md_attr+0x62/0x270 [mdt]
      [<ffffffffa0923253>] mdt_hsm_update_request_state+0x4d3/0x1c20 [mdt]
      [<ffffffffa091ae6e>] mdt_hsm_coordinator_update+0x3e/0xe0 [mdt]
      [<ffffffffa090931b>] mdt_hsm_progress+0x21b/0x330 [mdt]
      [<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
      [<ffffffffa090ca05>] mds_regular_handle+0x15/0x20 [mdt]
      [<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
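
The report does not say how these stacks were collected; their shape matches what /proc/<pid>/stack prints. The following is a minimal sketch for gathering the same kind of output. It assumes the MDT service threads have names beginning with "mdt" (an assumption, check with ps on the MDS). The helper name dump_stacks_by_name and its optional second argument are hypothetical, introduced only for illustration.

```shell
#!/bin/bash
# Print "<pid>" followed by the kernel stack for every task whose name
# matches a shell glob pattern.
#   $1: glob pattern matched against the "Name:" field of /proc/<pid>/status
#   $2: optional proc root (hypothetical override for trying the function
#       on a fake directory tree); defaults to /proc
dump_stacks_by_name() {
    local pattern=$1
    local proc_root=${2:-/proc}
    local dir name
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/status" ] || continue
        # The Name: line looks like "Name:<TAB>thread_name".
        name=$(awk '/^Name:/ { print $2; exit }' "$dir/status")
        case $name in
            $pattern) ;;        # unquoted on purpose: treated as a glob
            *) continue ;;
        esac
        echo "${dir##*/}"
        # Reading /proc/<pid>/stack normally requires root.
        cat "$dir/stack" 2>/dev/null
        echo
    done
}
```

On the MDS this could be run as root with, for example, dump_stacks_by_name 'mdt*'. Matching on the thread name rather than on the task state is a deliberate choice here, since a thread waiting for a lock is not necessarily in uninterruptible sleep.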
      

      The close handler is waiting on an EX layout lock on f0, while the
      progress handler is waiting on a PW update lock on f0. The dump_namespaces output does not show the UPDATE lock as granted.
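
Because the final lfs hsm_release never returns, the reproducer as written blocks forever. A minimal sketch for bounding that step, so that the debug logs can still be collected afterwards, is shown here. It relies on the timeout utility from GNU coreutils, which exits with status 124 when the time limit is reached. The helper name run_with_limit is hypothetical and not part of Lustre.

```shell
#!/bin/bash
# Run a command with a time limit and report if the limit was hit.
#   $1: limit in seconds
#   remaining arguments: the command to run
run_with_limit() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local rc=$?
    # GNU timeout exits with 124 when it had to stop the command.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

In the reproducer the last line could then become run_with_limit 60 lfs hsm_release f0, followed by another lctl dk. The 60 second limit is an arbitrary choice, and if the client process cannot be interrupted by the signal that timeout sends, the wrapper will not return either.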

      For reference I'm using the following changes:

      # LU-2919 hsm: Implementation of exclusive open
      # http://review.whamcloud.com/#/c/6730
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6730/13 && git cherry-pick FETCH_HEAD
       
      # LU-1333 hsm: Add hsm_release feature.
      # http://review.whamcloud.com/#/c/6526
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/26/6526/9 && git cherry-pick FETCH_HEAD
       
      # LU-3339 mdt: HSM on disk actions record
      # http://review.whamcloud.com/#/c/6529
      # MERGED
       
      # LU-3340 mdt: HSM memory requests management
      # http://review.whamcloud.com/#/c/6530
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6530/8 && git cherry-pick FETCH_HEAD
       
      # LU-3341 mdt: HSM coordinator client interface
      # http://review.whamcloud.com/#/c/6532
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/32/6532/13 && git cherry-pick FETCH_HEAD
      # Needs rebase in sanity-hsm.sh
       
      # LU-3342 mdt: HSM coordinator agent interface
      # http://review.whamcloud.com/#/c/6534
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/34/6534/8 && git cherry-pick FETCH_HEAD
       
      # LU-3343 mdt: HSM coordinator main thread
      # http://review.whamcloud.com/#/c/6912
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/12/6912/3 && git cherry-pick FETCH_HEAD
      # lustre/mdt/mdt_internal.h
       
      # LU-3561 tests: HSM sanity test suite
      # http://review.whamcloud.com/#/c/6913/
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/13/6913/4 && git cherry-pick FETCH_HEAD
      # lustre/tests/sanity-hsm.sh
       
      # LU-3432 llite: Access to released file trigs a restore
      # http://review.whamcloud.com/#/c/6537
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/6537/11 && git cherry-pick FETCH_HEAD
       
      # LU-3363 api: HSM import uses new released pattern
      # http://review.whamcloud.com/#/c/6536
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/36/6536/8 && git cherry-pick FETCH_HEAD
       
      # LU-2062 utils: HSM Posix CopyTool
      # http://review.whamcloud.com/#/c/4737
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/4737/18 && git cherry-pick FETCH_HEAD
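
The refs in the commands above follow the usual Gerrit layout refs/changes/<last two digits of the change number>/<change number>/<patch set>. A small sketch that builds the ref from those two numbers may make the list easier to maintain when a patch set is refreshed. The helper names gerrit_ref and pick_change are hypothetical; the repository URL is the one used above.

```shell
#!/bin/bash
# Build the Gerrit ref for a change number and a patch set number.
gerrit_ref() {
    local change=$1 patchset=$2
    printf 'refs/changes/%02d/%d/%d\n' "$((change % 100))" "$change" "$patchset"
}

# Fetch one patch set and cherry-pick it onto the current branch.
pick_change() {
    git fetch http://review.whamcloud.com/fs/lustre-release \
        "$(gerrit_ref "$1" "$2")" &&
        git cherry-pick FETCH_HEAD
}
```

For example, pick_change 6730 13 would run the same fetch and cherry-pick as the LU-2919 entry above.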
      

            People

              jay Jinshan Xiong (Inactive)
              jhammond John Hammond