Details
Technical task
Resolution: Fixed
Blocker
None
10098
Description
I ran a quota test today and found a problem with hsm_release. The test script is as follows:
#!/bin/bash

setup() {
    ( cd srcs/lustre/lustre/tests; sh llmount.sh )
    lctl set_param mdt.*.hsm_control=enabled
    rm -rf /tmp/arc
    mkdir /tmp/arc
    ~/srcs/lustre/lustre/utils/lhsmtool_posix --daemon --hsm-root /tmp/arc /mnt/lustre
    lctl conf_param lustre.quota.ost=u
    lctl conf_param lustre.quota.mdt=u
}

LFS=~/srcs/lustre/lustre/utils/lfs
file=/mnt/lustre/testfile

setup
rm -f $file
dd if=/dev/zero of=$file bs=1M count=30
chown tstusr.tstusr $file

set -x
$LFS hsm_archive $file
while $LFS hsm_state $file | grep -qv archived; do
    sleep 1
done
$LFS hsm_state $file

lctl set_param debug=-1
lctl set_param debug_mb=500
lctl dk > /dev/null

count=0
while :; do
    lctl mark "############# $count"
    count=$((count+1))
    $LFS hsm_release $file
    $LFS hsm_state $file
    $LFS hsm_restore $file
    $LFS hsm_state $file
    sleep 1
done
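The wait-for-archive loop in the script above polls without an upper bound, so a stalled copytool would block the test before it ever reaches hsm_release. A minimal sketch of a bounded poll is shown here; `wait_until` and `is_archived` are hypothetical helper names, not part of the Lustre test framework, and `is_archived` assumes `$LFS` is set as in the script.

```shell
# wait_until RETRIES COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, and on failure sleeps one second and
# retries, up to RETRIES times. Returns 0 as soon as COMMAND succeeds,
# 1 if it never does.
wait_until() {
    wu_remaining=$1
    shift
    until "$@"; do
        [ "$wu_remaining" -gt 0 ] || return 1
        sleep 1
        wu_remaining=$((wu_remaining - 1))
    done
    return 0
}

# Hypothetical predicate: succeeds once hsm_state reports "archived".
# Assumes LFS points at the lfs binary, as in the script above.
is_archived() {
    "$LFS" hsm_state "$1" | grep -q archived
}
```

In the script this could replace the unbounded loop with something like `wait_until 60 is_archived "$file" || echo "archive did not complete"`. Note that it tests for the presence of "archived" rather than using `grep -qv`, which is equivalent only when hsm_state prints a single line.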
The output on the console before the script hung:
+ /Users/jinxiong/srcs/lustre/lustre/utils/lfs hsm_state /mnt/lustre/testfile
+ grep -qv archived
+ /Users/jinxiong/srcs/lustre/lustre/utils/lfs hsm_state /mnt/lustre/testfile
/mnt/lustre/testfile: (0x00000009) exists archived, archive_id:1
+ lctl set_param debug=-1
debug=-1
+ lctl set_param debug_mb=500
debug_mb=500
+ lctl dk
+ count=0
+ :
+ lctl mark '############# 0'
+ count=1
+ /Users/jinxiong/srcs/lustre/lustre/utils/lfs hsm_release /mnt/lustre/testfile
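Because the last traced command is the hsm_release call that never returns, the reproduction loop simply blocks. One way to make the hang visible in the script output is to run each command under a deadline. This is only a sketch: `run_with_deadline` is a hypothetical wrapper, it relies on the coreutils `timeout` utility (exit status 124 on expiry), and it assumes the blocked client process can still be interrupted by a signal, which may not hold for every hang.

```shell
# run_with_deadline SECONDS COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND under coreutils `timeout` and prints a
# marker on stderr if the deadline expires (timeout exits with status 124).
# The command's own exit status is passed through otherwise.
run_with_deadline() {
    rwd_limit=$1
    shift
    rwd_rc=0
    timeout "$rwd_limit" "$@" || rwd_rc=$?
    if [ "$rwd_rc" -eq 124 ]; then
        echo "TIMEOUT after ${rwd_limit}s: $*" >&2
    fi
    return "$rwd_rc"
}
```

In the loop above, `run_with_deadline 60 $LFS hsm_release $file || break` would stop the test at the first iteration that stalls, leaving the debug log close to the point of failure.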
It looks like the MDT thread hung while looking up the local root object: for some unknown reason, the local root object was being deleted. This sounds impossible, but it happened:
LNet: Service thread pid 2945 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 2945, comm: mdt_rdpg00_001
Call Trace:
[<ffffffffa03c466e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa056ffa7>] lu_object_find_at+0xb7/0x360 [obdclass]
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffffa0570266>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0bf5b16>] mdt_object_find+0x56/0x170 [mdt]
[<ffffffffa0c264ef>] mdt_mfd_close+0x15ef/0x1b60 [mdt]
[<ffffffffa03d3900>] ? libcfs_debug_vmsg2+0xba0/0xbb0 [libcfs]
[<ffffffffa0c27e32>] mdt_close+0x682/0xac0 [mdt]
[<ffffffffa0bffa4a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0c39365>] mds_readpage_handle+0x15/0x20 [mdt]
[<ffffffffa0709a55>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa03c454e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa03d540f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa03d3951>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff81055ad3>] ? __wake_up+0x53/0x70
[<ffffffffa070ad9d>] ptlrpc_main+0xacd/0x1710 [ptlrpc]
[<ffffffffa070a2d0>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
I suspect this issue is related to quota, because everything works correctly when I turn quota off.
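To compare the two cases, the quota settings from setup() can be switched with a small helper. This is a sketch under assumptions: `set_quota_enforcement` is a hypothetical name, the `LCTL` and `FSNAME` variables are introduced here for illustration, and `none` as the value that turns enforcement off is taken from the Lustre manual's `quota.ost|mdt=u|g|ug|none` syntax rather than from this report.

```shell
# Both variables are illustrative; the defaults match the test script,
# which calls lctl directly and uses the filesystem name "lustre".
LCTL=${LCTL:-lctl}
FSNAME=${FSNAME:-lustre}

# set_quota_enforcement MODE
# Hypothetical helper: applies the same quota mode to OSTs and MDTs.
# MODE is assumed to be one of: u, g, ug, none.
set_quota_enforcement() {
    sqe_mode=$1
    "$LCTL" conf_param "$FSNAME.quota.ost=$sqe_mode" &&
        "$LCTL" conf_param "$FSNAME.quota.mdt=$sqe_mode"
}
```

Running the reproduction once after `set_quota_enforcement u` and once after `set_quota_enforcement none` would show whether the hang depends on user quota enforcement alone.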