Details
- Type: Technical task
- Resolution: Fixed
- Priority: Major
- Lustre 2.5.0
- 9136
Description
Running the HSM stack as of July 15, 2013, I see a hang when a release is issued while a restore is still running. To reproduce, I run the following:
#!/bin/bash
export MOUNT_2=n
export MDSCOUNT=1
export PTLDEBUG="super inode ioctl warning dlmtrace error emerg ha rpctrace vfstrace config console"
export DEBUG_SIZE=512

hsm_root=/tmp/hsm_root
rm -rf $hsm_root
mkdir $hsm_root

llmount.sh
lctl conf_param lustre-MDT0000.mdt.hsm_control=enabled
# lctl conf_param lustre-MDT0001.mdt.hsm_control=enabled
sleep 10
lhsmtool_posix --verbose --hsm_root=$hsm_root --bandwidth 1 lustre
lctl dk > ~/hsm-0-mount.dk

set -x
cd /mnt/lustre
lfs setstripe -c2 f0
dd if=/dev/urandom of=f0 bs=1M count=100
lctl dk > ~/hsm-1-dd.dk

lfs hsm_archive f0
sleep 10
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-2-archive.dk

lfs hsm_release f0
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-3-release.dk

lfs hsm_restore f0
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-4-restore.dk

lfs hsm_release f0
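For anyone reproducing this, it can also help to watch the file's HSM state between the hsm_* steps. A minimal sketch; the extra polling commands below are my addition, not part of the original reproducer:

# Optional instrumentation between the hsm_archive/hsm_release/hsm_restore
# steps above: show the file's HSM flags and any in-flight HSM action so
# the progress of the archive/restore is visible while the hang develops.
cd /mnt/lustre
lfs hsm_state f0     # flags such as exists/archived/released
lfs hsm_action f0    # current HSM action (e.g. RESTORE), if any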
The last command never returns. The MDS_CLOSE handler looks like:
10070
[<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa08f9551>] mdt_mfd_close+0x351/0xde0 [mdt]
[<ffffffffa08fb372>] mdt_close+0x662/0xa60 [mdt]
[<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa090c9e5>] mds_readpage_handle+0x15/0x20 [mdt]
[<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
while the MDS_HSM_PROGRESS handler looks like:
10065
[<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa08cf721>] mdt_object_find_lock+0x61/0x170 [mdt]
[<ffffffffa091dc22>] hsm_get_md_attr+0x62/0x270 [mdt]
[<ffffffffa0923253>] mdt_hsm_update_request_state+0x4d3/0x1c20 [mdt]
[<ffffffffa091ae6e>] mdt_hsm_coordinator_update+0x3e/0xe0 [mdt]
[<ffffffffa090931b>] mdt_hsm_progress+0x21b/0x330 [mdt]
[<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa090ca05>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
The close handler is waiting for an EX layout lock on f0, while the progress handler is waiting for a PW UPDATE lock on f0. dump_namespaces does not show the UPDATE lock as granted.
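When the hang reproduces, the blocked MDT service threads and the lock state can be captured with something like the sketch below. The thread-name pattern and output file names are assumptions; the dump_namespaces and lctl dk commands are the same ones used in the reproducer.

#!/bin/bash
# Hedged diagnostic sketch (thread-name pattern and output paths assumed).
# Dump the stack of every MDT service thread to see which handlers block.
for pid in $(pgrep mdt); do
    echo "== PID $pid =="
    cat /proc/$pid/stack
done > ~/hsm-mdt-stacks.txt

# Dump the LDLM namespaces and collect the debug log, to check whether
# the LAYOUT/UPDATE locks on f0 appear as granted or still waiting.
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-5-hang.dk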
For reference, I'm using the following changes:
# LU-2919 hsm: Implementation of exclusive open
# http://review.whamcloud.com/#/c/6730
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6730/13 && git cherry-pick FETCH_HEAD

# LU-1333 hsm: Add hsm_release feature.
# http://review.whamcloud.com/#/c/6526
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/26/6526/9 && git cherry-pick FETCH_HEAD

# LU-3339 mdt: HSM on disk actions record
# http://review.whamcloud.com/#/c/6529
# MERGED

# LU-3340 mdt: HSM memory requests management
# http://review.whamcloud.com/#/c/6530
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6530/8 && git cherry-pick FETCH_HEAD

# LU-3341 mdt: HSM coordinator client interface
# http://review.whamcloud.com/#/c/6532
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/32/6532/13 && git cherry-pick FETCH_HEAD
# Needs rebase in sanity-hsm.sh

# LU-3342 mdt: HSM coordinator agent interface
# http://review.whamcloud.com/#/c/6534
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/34/6534/8 && git cherry-pick FETCH_HEAD

# LU-3343 mdt: HSM coordinator main thread
# http://review.whamcloud.com/#/c/6912
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/12/6912/3 && git cherry-pick FETCH_HEAD
# lustre/mdt/mdt_internal.h

# LU-3561 tests: HSM sanity test suite
# http://review.whamcloud.com/#/c/6913/
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/13/6913/4 && git cherry-pick FETCH_HEAD
# lustre/tests/sanity-hsm.sh

# LU-3432 llite: Access to released file trigs a restore
# http://review.whamcloud.com/#/c/6537
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/6537/11 && git cherry-pick FETCH_HEAD

# LU-3363 api: HSM import uses new released pattern
# http://review.whamcloud.com/#/c/6536
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/36/6536/8 && git cherry-pick FETCH_HEAD

# LU-2062 utils: HSM Posix CopyTool
# http://review.whamcloud.com/#/c/4737
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/4737/18 && git cherry-pick FETCH_HEAD
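The same series can be applied with a small loop over the Gerrit refs listed above. This is only a convenience sketch: the ref list mirrors the cherry-pick commands above (LU-3339 is already merged and skipped) and would need updating as new patch sets are pushed.

#!/bin/bash
# Apply the HSM patch stack above in order; stop on the first conflict.
remote=http://review.whamcloud.com/fs/lustre-release
refs="
refs/changes/30/6730/13
refs/changes/26/6526/9
refs/changes/30/6530/8
refs/changes/32/6532/13
refs/changes/34/6534/8
refs/changes/12/6912/3
refs/changes/13/6913/4
refs/changes/37/6537/11
refs/changes/36/6536/8
refs/changes/37/4737/18
"
for ref in $refs; do
    git fetch "$remote" "$ref" && git cherry-pick FETCH_HEAD || break
done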
Oleg - We hit it while testing NFS-exported Lustre during a large-ish test run, with tests drawn primarily from the Linux Test Project. The problem is that we don't always hit it with the same test.
The test engineer who's been handling it thinks one way to hit it is concurrent runs of fsx-linux with different command-line options, run against an NFS export of Lustre. He's going to try to pin that down this afternoon; I'll update if he's able to be more specific.
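Until the exact command lines are pinned down, here is a rough sketch of the kind of load described. The NFS mount point, file names, and the particular fsx-linux option mix are my guesses, not the engineer's actual configuration.

#!/bin/bash
# Guessed approximation of the reported load: several concurrent fsx-linux
# instances against an NFS client mount of the Lustre export.
nfs_mnt=/mnt/nfs-lustre          # assumed NFS mount of the Lustre export
fsx-linux -N 10000 $nfs_mnt/fsx-a &
fsx-linux -N 10000 -W $nfs_mnt/fsx-b &   # -W disables mapped writes
fsx-linux -N 10000 -R $nfs_mnt/fsx-c &   # -R disables mapped reads
wait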