Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0
    • 9136

    Description

      Running the HSM stack as of July 15, 2013, I see a hang when a release is issued while a restore is still running. To reproduce, I run the following:

      #!/bin/bash
      
      export MOUNT_2=n
      export MDSCOUNT=1
      export PTLDEBUG="super inode ioctl warning dlmtrace error emerg ha rpctrace vfstrace config console"
      export DEBUG_SIZE=512
      
      hsm_root=/tmp/hsm_root
      
      rm -rf $hsm_root
      mkdir $hsm_root
      
      llmount.sh
      
      lctl conf_param lustre-MDT0000.mdt.hsm_control=enabled
      # lctl conf_param lustre-MDT0001.mdt.hsm_control=enabled
      sleep 10
      lhsmtool_posix --verbose --hsm_root=$hsm_root --bandwidth 1 lustre
      
      lctl dk > ~/hsm-0-mount.dk
      
      set -x
      cd /mnt/lustre
      lfs setstripe -c2 f0
      dd if=/dev/urandom of=f0 bs=1M count=100
      lctl dk > ~/hsm-1-dd.dk
      
      lfs hsm_archive f0
      sleep 10
      echo > /proc/fs/lustre/ldlm/dump_namespaces
      lctl dk > ~/hsm-2-archive.dk
      
      lfs hsm_release f0
      echo > /proc/fs/lustre/ldlm/dump_namespaces
      lctl dk > ~/hsm-3-release.dk
      
      lfs hsm_restore f0
      echo > /proc/fs/lustre/ldlm/dump_namespaces
      lctl dk > ~/hsm-4-restore.dk
      
      lfs hsm_release f0
      

      with the last command never returning. The stack of the MDS_CLOSE handler looks like:

      10070
      [<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      [<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
      [<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      [<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
      [<ffffffffa08f9551>] mdt_mfd_close+0x351/0xde0 [mdt]
      [<ffffffffa08fb372>] mdt_close+0x662/0xa60 [mdt]
      [<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
      [<ffffffffa090c9e5>] mds_readpage_handle+0x15/0x20 [mdt]
      [<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      while the stack of the MDS_HSM_PROGRESS handler looks like:

      10065
      [<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
      [<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
      [<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      [<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
      [<ffffffffa08cf721>] mdt_object_find_lock+0x61/0x170 [mdt]
      [<ffffffffa091dc22>] hsm_get_md_attr+0x62/0x270 [mdt]
      [<ffffffffa0923253>] mdt_hsm_update_request_state+0x4d3/0x1c20 [mdt]
      [<ffffffffa091ae6e>] mdt_hsm_coordinator_update+0x3e/0xe0 [mdt]
      [<ffffffffa090931b>] mdt_hsm_progress+0x21b/0x330 [mdt]
      [<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
      [<ffffffffa090ca05>] mds_regular_handle+0x15/0x20 [mdt]
      [<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      [<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
      [<ffffffff81096936>] kthread+0x96/0xa0
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffffffffff>] 0xffffffffffffffff
      
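      For reference, the numbers above the traces (10070 and 10065) appear to be the PIDs of the stuck MDT service threads. A minimal sketch, assuming those PIDs, of how such stacks can be captured on the MDS via /proc:

      # Dump the kernel stacks of the stuck service threads; the PIDs are the
      # ones from the traces above, substitute the real ones on your MDS.
      for pid in 10070 10065; do
          echo "=== pid $pid ==="
          cat /proc/$pid/stack
      done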

      The close handler is waiting for an EX LAYOUT lock on f0, while the progress handler is waiting for a PW UPDATE lock on f0. dump_namespaces does not show that the UPDATE lock has been granted.
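
      As a rough way to double-check the lock state (a sketch, not part of the reproducer above), the namespace dump can be saved to a file and searched for f0's resource, which is named after its FID; the exact dump format varies between branches:

      lfs path2fid /mnt/lustre/f0                  # note f0's FID
      echo > /proc/fs/lustre/ldlm/dump_namespaces  # dump all LDLM namespaces to the debug log
      lctl dk /tmp/hsm-namespaces.dk               # flush the debug log to a file
      less /tmp/hsm-namespaces.dk                  # find f0's resource, compare granted vs. waiting locks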

      For reference, I'm using the following changes:

      # LU-2919 hsm: Implementation of exclusive open
      # http://review.whamcloud.com/#/c/6730
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6730/13 && git cherry-pick FETCH_HEAD
       
      # LU-1333 hsm: Add hsm_release feature.
      # http://review.whamcloud.com/#/c/6526
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/26/6526/9 && git cherry-pick FETCH_HEAD
       
      # LU-3339 mdt: HSM on disk actions record
      # http://review.whamcloud.com/#/c/6529
      # MERGED
       
      # LU-3340 mdt: HSM memory requests management
      # http://review.whamcloud.com/#/c/6530
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6530/8 && git cherry-pick FETCH_HEAD
       
      # LU-3341 mdt: HSM coordinator client interface
      # http://review.whamcloud.com/#/c/6532
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/32/6532/13 && git cherry-pick FETCH_HEAD
      # Needs rebase in sanity-hsm.sh
       
      # LU-3342 mdt: HSM coordinator agent interface
      # http://review.whamcloud.com/#/c/6534
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/34/6534/8 && git cherry-pick FETCH_HEAD
       
      # LU-3343 mdt: HSM coordinator main thread
      # http://review.whamcloud.com/#/c/6912
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/12/6912/3 && git cherry-pick FETCH_HEAD
      # lustre/mdt/mdt_internal.h
       
      # LU-3561 tests: HSM sanity test suite
      # http://review.whamcloud.com/#/c/6913/
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/13/6913/4 && git cherry-pick FETCH_HEAD
      # lustre/tests/sanity-hsm.sh
       
      # LU-3432 llite: Access to released file trigs a restore
      # http://review.whamcloud.com/#/c/6537
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/6537/11 && git cherry-pick FETCH_HEAD
       
      # LU-3363 api: HSM import uses new released pattern
      # http://review.whamcloud.com/#/c/6536
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/36/6536/8 && git cherry-pick FETCH_HEAD
       
      # LU-2062 utils: HSM Posix CopyTool
      # http://review.whamcloud.com/#/c/4737
      git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/4737/18 && git cherry-pick FETCH_HEAD
      
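      As an aside, a small shell helper of the following kind (hypothetical, not part of the series above) wraps the repeated fetch-and-cherry-pick pattern:

      # pick_change REF: fetch a Gerrit change by ref and cherry-pick it.
      pick_change() {
          git fetch http://review.whamcloud.com/fs/lustre-release "$1" &&
              git cherry-pick FETCH_HEAD
      }

      # Example with the first two changes from the list above:
      pick_change refs/changes/30/6730/13   # LU-2919 hsm: exclusive open
      pick_change refs/changes/26/6526/9    # LU-1333 hsm: hsm_release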

          Activity

            [LU-3601] HSM release causes running restore to hang, hangs itself

            jlevi Jodi Levi (Inactive) added a comment -

            Should Change 7148 be landed or abandoned?

            jcl jacques-charles lafoucriere added a comment -

            sanity-hsm #33 hits the same bug, but it was not designed to test concurrent access to a file during the restore phase. We also do not currently test rename/rm during a restore.

            adegremont Aurelien Degremont (Inactive) added a comment -

            We already have such a test. The sanity-hsm #33 deadlock was hitting this bug, and John's patch was fixing it. I will confirm on Monday that the latest coordinator, without John's patch, no longer triggers this deadlock, but I'm confident.

            jcl jacques-charles lafoucriere added a comment -

            We will add sanity-hsm tests for the two simple use cases. It will be safer for future changes.
            jhammond John Hammond added a comment -

            Since the removal of UPDATE lock use from the coordinator, I can no longer reproduce these issues.

            jhammond John Hammond added a comment - - edited

            A similar hang can be triggered by trying to read a file while a restore is still running. To see this, add --bandwidth=1 to the copytool options and do:

            # cd /mnt/lustre
            # dd if=/dev/urandom of=f0 bs=1M count=10
            # lfs hsm_archive f0
            # # Wait for archive to complete.
            # sleep 15
            # lfs hsm_release f0
            # lfs hsm_restore f0
            # cat f0 > /dev/null
            

            This is addressed by http://review.whamcloud.com/#/c/7148/.

            However, even with the latest version (patch set 9) of http://review.whamcloud.com/#/c/6912/ we have an easily exploited race between restore and rename which is not addressed by the change in 7148. A rename onto f0 during a restore will hang:

            cd /mnt/lustre
            dd if=/dev/urandom of=f0 bs=1M count=10
            lfs hsm_archive f0
            # Wait for archive to complete.
            sleep 15
            lfs hsm_state f0
            lfs hsm_release f0
            lfs hsm_restore f0; touch f1; sys_rename f1 f0
            
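            (sys_rename above is presumably a small test wrapper around rename(2); if it is not available, a plain mv, which uses rename(2) for a same-directory move, should exercise the same MDT rename path:)

            cd /mnt/lustre
            lfs hsm_restore f0
            touch f1
            mv f1 f0    # rename onto f0 while the restore is still in flight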

            Since this rename takes MDS_INODELOCK_FULL on f0, I doubt that the choice of using LAYOUT, UPDATE, or other in hsm_get_md_attr() matters very much. But I could be wrong.

            jhammond John Hammond added a comment -

            Please see http://review.whamcloud.com/7148 for the LDLM patch we discussed.


            jay Jinshan Xiong (Inactive) added a comment -

            I will fix the lock issue above.

            The close sounds like a real issue here; we shouldn't block the close request from finishing. Let's use the try version of mdt_object_lock() in close.
            jhammond John Hammond added a comment - - edited

            Another issue here is that it may be unsafe to access the mount point being used by the copytool, especially to perform manual HSM requests, since the MDC's cl_close_lock prevents multiple concurrent closes. In particular, we can have a releasing close block (on the EX LAYOUT lock) because a restore is running, which in turn prevents the restore from completing, because any close will block on cl_close_lock.

            jhammond John Hammond added a comment -

            Here is a simpler situation where we can get stuck (and one that is more likely to occur). Consider the following release vs. open race; assume the file F has already been archived. A rough command-line sketch follows the steps.

            1. Client R starts HSM release on file F.
            2. In lfs_hsm_request, R stats F, the MDT returns a PR LOOKUP,UPDATE,LAYOUT,PERM lock on F.
            3. In lfs_hsm_request, R opens F for path2fid, the MDT returns a CR LOOKUP,LAYOUT lock on F.
            4. In ll_hsm_release/ll_lease_open, R leases F, the MDT returns an EX OPEN lock on F.
            5. Client W tries to open F with MDS_OPEN_LOCK set, the MDT adds a CW OPEN lock to the waiting list.
            6. In ll_hsm_release, client R closes F.
            7. In mdt_hsm_release, the MDT requests a local EX LAYOUT on F. This conflicts with the PR and CR locks already held by R, the server sends blocking ASTs to R for these locks.
            8. The MDT reprocesses the waiting queue for F. Granted list contains the EX OPEN lock. The waiting list contains the CW OPEN, followed by the EX LAYOUT.
            9. As responses to the blocking ASTs come in, the resource for F is reprocessed, but since there is a blocked CW OPEN lock at the head of the waiting list, the locks after it (including the EX LAYOUT) are not considered.
            10. The EX OPEN lock times out and client R is evicted.
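
            A rough two-client sketch of the race above, purely illustrative: it assumes a second client mount (e.g. MOUNT_2=y giving /mnt/lustre and /mnt/lustre2) and that the concurrent open requests an OPEN lock, which depends on the open flags and the branch in use:

            # Client R: release the already-archived file.
            lfs hsm_release /mnt/lustre/f0 &

            # Client W: concurrently open the same file through the second mount.
            dd if=/dev/zero of=/mnt/lustre2/f0 bs=4k count=1 conv=notrunc &

            wait
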
            jhammond John Hammond added a comment -

            I believe that this situation exposes a limitation of LDLM for inodebits locks. All locks below are on f0.

            1. Starting the restore takes an EX LAYOUT lock on the server.
            2. When the releasing close RPC is sent, the client holds a PR LOOKUP|UPDATE|PERM lock.
            3. The release handler on the server blocks attempting to take an EX LAYOUT lock.
            4. When the restore completes, the update progress handler blocks attempting to take a PW UPDATE lock.
            5. The client releases the PR LOOKUP|UPDATE|PERM lock.
            6. The resource (f0) gets reprocessed, but the first waiting lock (EX LAYOUT) cannot be granted, so ldlm_process_inodebits_lock() returns LDLM_ITER_STOP, causing ldlm_reprocess_queue() to stop processing the resource. In particular, it never checks whether the PW UPDATE lock is compatible with all of the granted locks and all of the locks ahead of it in the waiting list.

            It also appears that the skip list optimizations in ldlm_inodebits_compat_queue() could be extended/improved by computing compatibility one mode-bits-bunch at a time and by granting locks in bunches.


            People

              Assignee:
              jay Jinshan Xiong (Inactive)
              Reporter:
              jhammond John Hammond
              Votes:
              0
              Watchers:
              17

              Dates

                Created:
                Updated:
                Resolved: