Details
-
Technical task
-
Resolution: Fixed
-
Major
-
Lustre 2.5.0
-
9136
Description
Running the HSM stack as of July 15, 2013, I see a hang when a release is issued while a restore is still running. To reproduce, I run the following:
#!/bin/bash

export MOUNT_2=n
export MDSCOUNT=1
export PTLDEBUG="super inode ioctl warning dlmtrace error emerg ha rpctrace vfstrace config console"
export DEBUG_SIZE=512

hsm_root=/tmp/hsm_root

rm -rf $hsm_root
mkdir $hsm_root

llmount.sh

lctl conf_param lustre-MDT0000.mdt.hsm_control=enabled
# lctl conf_param lustre-MDT0001.mdt.hsm_control=enabled
sleep 10
lhsmtool_posix --verbose --hsm_root=$hsm_root --bandwidth 1 lustre

lctl dk > ~/hsm-0-mount.dk

set -x
cd /mnt/lustre
lfs setstripe -c2 f0
dd if=/dev/urandom of=f0 bs=1M count=100
lctl dk > ~/hsm-1-dd.dk

lfs hsm_archive f0
sleep 10
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-2-archive.dk

lfs hsm_release f0
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-3-release.dk

lfs hsm_restore f0
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-4-restore.dk

lfs hsm_release f0
with the last command never returning. The stack of the MDS_CLOSE handler looks like:
10070
[<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa08f9551>] mdt_mfd_close+0x351/0xde0 [mdt]
[<ffffffffa08fb372>] mdt_close+0x662/0xa60 [mdt]
[<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa090c9e5>] mds_readpage_handle+0x15/0x20 [mdt]
[<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
while the MDS_HSM_PROGRESS handler looks like:
10065
[<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa08cf721>] mdt_object_find_lock+0x61/0x170 [mdt]
[<ffffffffa091dc22>] hsm_get_md_attr+0x62/0x270 [mdt]
[<ffffffffa0923253>] mdt_hsm_update_request_state+0x4d3/0x1c20 [mdt]
[<ffffffffa091ae6e>] mdt_hsm_coordinator_update+0x3e/0xe0 [mdt]
[<ffffffffa090931b>] mdt_hsm_progress+0x21b/0x330 [mdt]
[<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa090ca05>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
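One way to collect per-thread kernel stacks for a set of PIDs is to read /proc/<pid>/stack. The sketch below assumes the kernel exposes that file (it usually requires root); dump_stacks and the PROC_ROOT override are hypothetical names introduced here only for illustration and testing, not part of Lustre.

```shell
#!/bin/bash
# Print the kernel stack of each given PID, one block per PID.
# PROC_ROOT is a hypothetical override so the helper can be exercised
# against a fake directory tree; it defaults to the real /proc.
dump_stacks() {
    local proc_root=${PROC_ROOT:-/proc}
    local pid
    for pid in "$@"; do
        echo "$pid"
        # Reading /proc/<pid>/stack usually requires root privileges.
        cat "$proc_root/$pid/stack" 2>/dev/null || echo "  (stack unavailable)"
    done
}
```

For example, dump_stacks 10070 10065 would print the stacks of the two service threads discussed here, provided they are still alive.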
The close handler is waiting on an EX layout lock on f0, while the progress handler is waiting on a PW update lock on f0. The dump_namespaces output does not show that the UPDATE lock is granted.
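When scripting the reproducer, the final release can be bounded so that the script reports a suspected hang rather than blocking forever. This is a minimal sketch, assuming GNU coreutils timeout(1) is available; run_bounded is a hypothetical helper name. If the client is blocked uninterruptibly in the kernel, the signal sent by timeout may not take effect, so treat this as a convenience rather than a guarantee.

```shell
#!/bin/bash
# Run a command with a time limit and report a suspected hang.
# run_bounded is a hypothetical helper; timeout(1) exits with status 124
# when the time limit expires.
run_bounded() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG? '$*' did not return within ${limit}s" >&2
    fi
    return "$rc"
}
```

In the script above, the last step could then be written as run_bounded 60 lfs hsm_release f0, followed by the usual dump_namespaces and lctl dk collection.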
For reference I'm using the following changes:
# LU-2919 hsm: Implementation of exclusive open
# http://review.whamcloud.com/#/c/6730
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6730/13 && git cherry-pick FETCH_HEAD

# LU-1333 hsm: Add hsm_release feature.
# http://review.whamcloud.com/#/c/6526
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/26/6526/9 && git cherry-pick FETCH_HEAD

# LU-3339 mdt: HSM on disk actions record
# http://review.whamcloud.com/#/c/6529
# MERGED

# LU-3340 mdt: HSM memory requests management
# http://review.whamcloud.com/#/c/6530
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6530/8 && git cherry-pick FETCH_HEAD

# LU-3341 mdt: HSM coordinator client interface
# http://review.whamcloud.com/#/c/6532
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/32/6532/13 && git cherry-pick FETCH_HEAD
# Needs rebase in sanity-hsm.sh

# LU-3342 mdt: HSM coordinator agent interface
# http://review.whamcloud.com/#/c/6534
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/34/6534/8 && git cherry-pick FETCH_HEAD

# LU-3343 mdt: HSM coordinator main thread
# http://review.whamcloud.com/#/c/6912
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/12/6912/3 && git cherry-pick FETCH_HEAD
# lustre/mdt/mdt_internal.h

# LU-3561 tests: HSM sanity test suite
# http://review.whamcloud.com/#/c/6913/
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/13/6913/4 && git cherry-pick FETCH_HEAD
# lustre/tests/sanity-hsm.sh

# LU-3432 llite: Access to released file trigs a restore
# http://review.whamcloud.com/#/c/6537
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/6537/11 && git cherry-pick FETCH_HEAD

# LU-3363 api: HSM import uses new released pattern
# http://review.whamcloud.com/#/c/6536
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/36/6536/8 && git cherry-pick FETCH_HEAD

# LU-2062 utils: HSM Posix CopyTool
# http://review.whamcloud.com/#/c/4737
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/4737/18 && git cherry-pick FETCH_HEAD
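The refs in the list above follow Gerrit's refs/changes/<last two digits of change>/<change>/<patch set> layout, so the repeated fetch-and-cherry-pick pattern can be sketched as a small helper. The function names gerrit_ref and pick_change are hypothetical and introduced here only for illustration; the repository URL is the one used above.

```shell
#!/bin/bash
# Build the Gerrit ref for a change number and patch set:
# refs/changes/<last two digits of change>/<change>/<patch set>
gerrit_ref() {
    local change=$1 patchset=$2
    printf 'refs/changes/%02d/%d/%d\n' "$((change % 100))" "$change" "$patchset"
}

# Fetch one patch set and cherry-pick it onto the current branch.
# Stops at the first failure so conflicts can be resolved by hand.
pick_change() {
    local repo=http://review.whamcloud.com/fs/lustre-release
    git fetch "$repo" "$(gerrit_ref "$1" "$2")" && git cherry-pick FETCH_HEAD
}
```

For example, pick_change 6730 13 corresponds to the first command in the list, and gerrit_ref 4737 18 yields the ref used in the last one.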