[LU-4176] Test failure on test suite sanity-hsm, subtest test_31a Created: 29/Oct/13 Updated: 03/Apr/16 Resolved: 23/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Severity: | 3 |
| Rank (Obsolete): | 11307 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6bbd2b34-407a-11e3-af95-52540035b04c. The sub-test test_31a failed with the following error:
Info required for matching: sanity-hsm 31a This test was EXCEPT'd prior to http://review.whamcloud.com/7374 ( |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 02/Nov/13 ] |
|
After I found similar problem with sanity-hsm/test_31c during test session https://maloo.whamcloud.com/test_sets/bf8b7f8a-4396-11e3-942a-52540035b04c, I wanted to create a new ticket but since I thought that both sub-tests 31a/31b may have experienced the same problem, after some search I 1st found The issue seems to come from the same sub-function restore_and_check_size() all, and only, these 3 sub-tests are using to check the restore of a big file : restore_and_check_size() {
local f=$1
local fid=$2
local s=$(stat -c "%s" $f)
local n=$s
local st=$(get_hsm_flags $f)
local err=0
local cpt=0
$LFS hsm_restore $f
while [[ "$st" != "0x00000009" && $cpt -le 10 ]]
do
n=$(stat -c "%s" $f)
# we echo in both cases to show stat is not
# hang
if [[ $n != $s ]]; then
echo "size seen is $n != $s"
err=1
else
echo "size seen is right: $n == $s"
fi
st=$(get_hsm_flags $f)
sleep 10
cpt=$((cpt + 1))
done
if [[ $cpt -lt 10 ]]; then
echo " restore is too long"
else
echo " "done
fi
wait_request_state $fid RESTORE SUCCEED
return $err
}
and if we look at the tests error logs they all look similar and here is 31a log for the error that caused this ticket creation : == sanity-hsm test 31a: Import a large file and check size during restore == 22:35:58 (1383024958)
CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-10vm5
CMD: wtm-10vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/*
Starting copytool agt1 on wtm-10vm5
CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1
CMD: wtm-10vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.copytool_log.wtm-10vm5.log 2>&1
CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31
CMD: wtm-10vm5 dd if=/dev/urandom of=/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a count=32 bs=1000000
wtm-10vm5: 32+0 records in
wtm-10vm5: 32+0 records out
wtm-10vm5: 32000000 bytes (32 MB) copied, 4.95968 s, 6.5 MB/s
CMD: wtm-10vm5 lhsmtool_posix --archive 2 --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --import d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre
wtm-10vm5: lhsmtool_posix[23455]: action=1 src=d0.sanity-hsm/d31/f.sanity-hsm.31a dst=/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a mount_point=/mnt/lustre
wtm-10vm5: lhsmtool_posix[23455]: importing '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a'
wtm-10vm5: lhsmtool_posix[23455]: imported '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/00f2/0000/0402/0000/0002/0000/0x200000402:0xf2:0x0'=='/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a'
wtm-10vm5: lhsmtool_posix[23455]: process finished, errs: 0 major, 0 minor, rc=0 (Success)
CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix
Wakeup copytool agt1 on wtm-10vm5
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
restore is too long
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 100 secs for update
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
…………. <<<< spending the whole 100s printing the same
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got ''
sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000402:0xf2:0x0] is not SUCCEED
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4320:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1907:test_31a()
= /usr/lib64/lustre/tests/test-framework.sh:4575:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4463:run_test()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1914:main()
Dumping lctl log to /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.*.1383025137.log
CMD: wtm-10vm3,wtm-10vm4,wtm-10vm5,wtm-10vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.debug_log.\$(hostname -s).1383025137.log;
dmesg > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.dmesg.\$(hostname -s).1383025137.log
CMD: wtm-10vm5 pkill -INT -x lhsmtool_posix
Copytool is stopped on wtm-10vm5
here is the 31b log for the error in == sanity-hsm test 31b: Restore a large unaligned file and check size during restore == 13:50:02 (1379451002)
CMD: wtm-14vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-14vm5
CMD: wtm-14vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1/*
Starting copytool agt1 on wtm-14vm5
CMD: wtm-14vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1
CMD: wtm-14vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.copytool_log.wtm-14vm5.log 2>&1
39+0 records in
39+0 records out
39000000 bytes (39 MB) copied, 15.8537 s, 2.5 MB/s
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
…………… <<<< spending the whole 100s printing the same
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got ''
sanity-hsm test_31b: @@@@@@ FAIL: request on 0x200008101:0x21:0x0 is not SUCCEED
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4266:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4293:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:444:wait_request_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1696:restore_and_check_size()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1734:test_31b()
= /usr/lib64/lustre/tests/test-framework.sh:4547:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4580:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4435:run_test()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1741:main()
Dumping lctl log to /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.*.1379451241.log
CMD: wtm-14vm3,wtm-14vm4,wtm-14vm5,wtm-14vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.debug_log.\$(hostname -s).1379451241.log;
dmesg > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.dmesg.\$(hostname -s).1379451241.log
CMD: wtm-14vm5 pkill -INT -x lhsmtool_posix
Copytool is stopped on wtm-14vm5
and here is the 31c log for the error I just reported : == sanity-hsm test 31c: Restore a large aligned file and check size during restore == 18:03:34 (1383354214)
CMD: wtm-19vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-19vm5
CMD: wtm-19vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1/*
Starting copytool agt1 on wtm-19vm5
CMD: wtm-19vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1
CMD: wtm-19vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.copytool_log.wtm-19vm5.log 2>&1
/usr/lib64/lustre/tests/sanity-hsm.sh: line 402: [: /mnt/lustre: integer expression expected
33+0 records in
33+0 records out
34603008 bytes (35 MB) copied, 5.01832 s, 6.9 MB/s
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 80 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Updated after 29s: wanted 'SUCCEED' got 'SUCCEED'
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
restore is too long
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 100 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
…………… <<<< spending the whole 100s printing the same
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got ''
sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000401:0x28:0x0 is not SUCCEED
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4320:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1950:test_31c()
= /usr/lib64/lustre/tests/test-framework.sh:4575:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4463:run_test()
= /usr/lib64/lustre/tests/sanity-hsm.sh:1957:main()
Dumping lctl log to /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.*.1383354428.log
CMD: wtm-19vm3,wtm-19vm4,wtm-19vm5,wtm-19vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.debug_log.\$(hostname -s).1383354428.log;
dmesg > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.dmesg.\$(hostname -s).1383354428.log
CMD: wtm-19vm5 pkill -INT -x lhsmtool_posix
Copytool is stopped on wtm-19vm5
This clearly shows that : _ restore_and_check_size() must be fixed to exchange "too long" and "done" outputs to reflect the real restore status. _ the following+common "wait_request_state $fid RESTORE SUCCEED" may only be required for the "too long" case, since it seems to be unnecessary and buggy if the RESTORE/SUCCESS record has already been deleted or never written ??… This is what seems to happen in all 3 cases where it is obvious that the file hsm_flags as it HSM_RELEASED bit cleared, meaning the restore was completed. _ thus a possible way to fix could be to only "wait_request_state $fid RESTORE SUCCEED" for "too long" restore case. An other would be to ensure llog RESTORE/SUCCESS record is written/updated (which should not occur only upon request timeout or upon some llog layer error) and will be kept long enough. |
| Comment by jacques-charles lafoucriere [ 04/Nov/13 ] |
|
As the function tests the hsm flags, and trigs error if not as expected, we can remove the wait_request_state $fid RESTORE SUCCEED, so this will manage the "deleted record" case |
| Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ] |
|
Submitted patch http://review.whamcloud.com/8154. |
| Comment by Jodi Levi (Inactive) [ 14/Nov/13 ] |
|
Patch landed to Master. IF more work is needed in this ticket, let m know and I'll reopen this ticket. |
| Comment by Jian Yu [ 28/Nov/13 ] |
|
The same failure occurred on Lustre b2_5 branch: |
| Comment by James Nunez (Inactive) [ 03/Jan/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/8720/ |
| Comment by Jian Yu [ 04/Jan/14 ] |
|
The same failure occurred on Lustre b2_5 branch again: |
| Comment by Bob Glossman (Inactive) [ 07/Jan/14 ] |
|
seen again in b2_5: |
| Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ] |
|
Even if patches/fixes have been landed for both b2_5 and master, test_31a (only!! and surprisingly not test_31b and test_31c also when they did trigger this same problem ...) is still disabled respectively after patches #9134 and #8122 (for Patch for master to re-enable test_31a is at http://review.whamcloud.com/9577. |
| Comment by Andreas Dilger [ 02/Oct/14 ] |
|
Reopen bug until patch to enable test is landed. |
| Comment by Bruno Faccini (Inactive) [ 10/Nov/14 ] |
|
It just has, so do we need to wait some more and verify its behavior in auto-tests before to close ? |
| Comment by Bruno Faccini (Inactive) [ 22/Jan/15 ] |
|
Andreas, do you agree than we can close this ticket, now that test_31[a,b,c] have been all re-enabled within master since November and none of these 3 sub-tests have reported the failure addressed in this ticket until now? |
| Comment by Gerrit Updater [ 05/Feb/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13662 |