Details
Technical task
Resolution: Fixed
Blocker
None
10098
Description
I ran a quota test today and found a problem with hsm_release. The test script is as follows:
#!/bin/bash

setup() {
    ( cd srcs/lustre/lustre/tests; sh llmount.sh )
    lctl set_param mdt.*.hsm_control=enabled
    rm -rf /tmp/arc
    mkdir /tmp/arc
    ~/srcs/lustre/lustre/utils/lhsmtool_posix --daemon --hsm-root /tmp/arc /mnt/lustre
    lctl conf_param lustre.quota.ost=u
    lctl conf_param lustre.quota.mdt=u
}

LFS=~/srcs/lustre/lustre/utils/lfs
file=/mnt/lustre/testfile

setup

rm -f $file
dd if=/dev/zero of=$file bs=1M count=30
chown tstusr.tstusr $file

set -x
$LFS hsm_archive $file
while $LFS hsm_state $file |grep -qv archived; do
    sleep 1
done
$LFS hsm_state $file

lctl set_param debug=-1
lctl set_param debug_mb=500
lctl dk > /dev/null

count=0
while :; do
    lctl mark "############# $count"
    count=$((count+1))
    $LFS hsm_release $file
    $LFS hsm_state $file
    $LFS hsm_restore $file
    $LFS hsm_state $file
    sleep 1
done
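The wait for the archive to finish in the script above has no upper bound, so a stalled copytool would look the same as a hung test. A minimal sketch of a bounded wait follows; `wait_until` and `is_archived` are hypothetical helper names that are not part of Lustre or of the original script, and the timeout is approximated by counting one-second sleeps.

```shell
# Hypothetical helper: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds.
# Returns 0 on success, 1 once roughly TIMEOUT_SECONDS have passed.
wait_until() {
    wu_timeout=$1
    shift
    wu_elapsed=0
    until "$@"; do
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            return 1
        fi
        sleep 1
        wu_elapsed=$((wu_elapsed + 1))
    done
    return 0
}

# Hypothetical wrapper: succeeds once hsm_state reports the file as archived.
# Assumes the lfs binary is given by $LFS, as in the test script.
is_archived() {
    ${LFS:-lfs} hsm_state "$1" | grep -q archived
}
```

In the test script, `wait_until 120 is_archived "$file" || exit 1` could stand in for the unbounded loop; the 120-second limit is an arbitrary choice for illustration.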
The output on the console before the script hung:
+ /Users/jinxiong/srcs/lustre/lustre/utils/lfs hsm_state /mnt/lustre/testfile
+ grep -qv archived
+ /Users/jinxiong/srcs/lustre/lustre/utils/lfs hsm_state /mnt/lustre/testfile
/mnt/lustre/testfile: (0x00000009) exists archived, archive_id:1
+ lctl set_param debug=-1
debug=-1
+ lctl set_param debug_mb=500
debug_mb=500
+ lctl dk
+ count=0
+ :
+ lctl mark '############# 0'
+ count=1
+ /Users/jinxiong/srcs/lustre/lustre/utils/lfs hsm_release /mnt/lustre/testfile
It looks like the MDT thread hung while looking up the local root object because, for some unknown reason, that object was being deleted. This sounds impossible, but it happened:
LNet: Service thread pid 2945 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 2945, comm: mdt_rdpg00_001
Call Trace:
 [<ffffffffa03c466e>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa056ffa7>] lu_object_find_at+0xb7/0x360 [obdclass]
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0570266>] lu_object_find+0x16/0x20 [obdclass]
 [<ffffffffa0bf5b16>] mdt_object_find+0x56/0x170 [mdt]
 [<ffffffffa0c264ef>] mdt_mfd_close+0x15ef/0x1b60 [mdt]
 [<ffffffffa03d3900>] ? libcfs_debug_vmsg2+0xba0/0xbb0 [libcfs]
 [<ffffffffa0c27e32>] mdt_close+0x682/0xac0 [mdt]
 [<ffffffffa0bffa4a>] mdt_handle_common+0x52a/0x1470 [mdt]
 [<ffffffffa0c39365>] mds_readpage_handle+0x15/0x20 [mdt]
 [<ffffffffa0709a55>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa03c454e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa03d540f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa03d3951>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa070ad9d>] ptlrpc_main+0xacd/0x1710 [ptlrpc]
 [<ffffffffa070a2d0>] ? ptlrpc_main+0x0/0x1710 [ptlrpc]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
I suspect this issue is related to quota, because everything worked fine once I turned quota off.
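To compare runs with and without quota, the two `lctl conf_param` calls from `setup` can be wrapped so the enforcement mode is a single argument. This is only a sketch: `set_quota_mode` is a hypothetical helper, the `LCTL` and `FSNAME` variables are assumptions added for illustration, and `none` is assumed to be the value that disables enforcement (alongside `u`, `g`, and `ug`).

```shell
# Hypothetical helper: set_quota_mode MODE
# Applies the same quota enforcement mode to the OSTs and the MDTs.
# LCTL and FSNAME are assumed overrides; defaults match the test script.
set_quota_mode() {
    sq_mode=$1
    case "$sq_mode" in
        u|g|ug|none) ;;
        *)
            echo "invalid quota mode: $sq_mode" >&2
            return 2
            ;;
    esac
    ${LCTL:-lctl} conf_param "${FSNAME:-lustre}.quota.ost=$sq_mode" &&
        ${LCTL:-lctl} conf_param "${FSNAME:-lustre}.quota.mdt=$sq_mode"
}
```

Calling `set_quota_mode u` would match what `setup` does in the script, and `set_quota_mode none` would give the quota-off run for comparison.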