Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.6.0
-
3
-
11307
Description
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6bbd2b34-407a-11e3-af95-52540035b04c.
The sub-test test_31a failed with the following error:
request on [0x200000402:0xf2:0x0] is not SUCCEED
Info required for matching: sanity-hsm 31a
This test was EXCEPT'd prior to http://review.whamcloud.com/7374 (LU-3815) re-enabled most of the EXCEPT'd tests.
Attachments
Activity
Even if patches/fixes have been landed for both b2_5 and master, test_31a (only!! and surprisingly not test_31b and test_31c also when they did trigger this same problem ...) is still disabled respectively after patches #9134 and #8122 (for LU-4178) have been landed to these 2 branches.
Patch for master to re-enable test_31a is at http://review.whamcloud.com/9577.
seen again in b2_5:
https://maloo.whamcloud.com/test_sets/2181559c-772d-11e3-b181-52540035b04c
The same failure occurred on Lustre b2_5 branch again:
https://maloo.whamcloud.com/test_sets/5d3e042e-74ce-11e3-8ad9-52540035b04c
The same failure occurred on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/4bc47d48-57d4-11e3-919d-52540035b04c
Patch landed to Master. IF more work is needed in this ticket, let m know and I'll reopen this ticket.
As the function tests the hsm flags, and trigs error if not as expected, we can remove the wait_request_state $fid RESTORE SUCCEED, so this will manage the "deleted record" case
After I found similar problem with sanity-hsm/test_31c during test session https://maloo.whamcloud.com/test_sets/bf8b7f8a-4396-11e3-942a-52540035b04c, I wanted to create a new ticket but since I thought that both sub-tests 31a/31b may have experienced the same problem, after some search I 1st found LU-3969 for test_31b similar problem, which has been dup'ed to LU-3761 with no real conclusion about the exact scenario, and also found this one for same error with test_31a !!
The issue seems to come from the same sub-function restore_and_check_size() all, and only, these 3 sub-tests are using to check the restore of a big file :
restore_and_check_size() { local f=$1 local fid=$2 local s=$(stat -c "%s" $f) local n=$s local st=$(get_hsm_flags $f) local err=0 local cpt=0 $LFS hsm_restore $f while [[ "$st" != "0x00000009" && $cpt -le 10 ]] do n=$(stat -c "%s" $f) # we echo in both cases to show stat is not # hang if [[ $n != $s ]]; then echo "size seen is $n != $s" err=1 else echo "size seen is right: $n == $s" fi st=$(get_hsm_flags $f) sleep 10 cpt=$((cpt + 1)) done if [[ $cpt -lt 10 ]]; then echo " restore is too long" else echo " "done fi wait_request_state $fid RESTORE SUCCEED return $err }
and if we look at the tests error logs they all look similar and here is 31a log for the error that caused this ticket creation :
== sanity-hsm test 31a: Import a large file and check size during restore == 22:35:58 (1383024958) CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix Purging archive on wtm-10vm5 CMD: wtm-10vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/* Starting copytool agt1 on wtm-10vm5 CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 CMD: wtm-10vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.copytool_log.wtm-10vm5.log 2>&1 CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31 CMD: wtm-10vm5 dd if=/dev/urandom of=/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a count=32 bs=1000000 wtm-10vm5: 32+0 records in wtm-10vm5: 32+0 records out wtm-10vm5: 32000000 bytes (32 MB) copied, 4.95968 s, 6.5 MB/s CMD: wtm-10vm5 lhsmtool_posix --archive 2 --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --import d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre wtm-10vm5: lhsmtool_posix[23455]: action=1 src=d0.sanity-hsm/d31/f.sanity-hsm.31a dst=/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a mount_point=/mnt/lustre wtm-10vm5: lhsmtool_posix[23455]: importing '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a' wtm-10vm5: lhsmtool_posix[23455]: imported '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/00f2/0000/0402/0000/0002/0000/0x200000402:0xf2:0x0'=='/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a' wtm-10vm5: lhsmtool_posix[23455]: process finished, errs: 0 major, 0 minor, rc=0 (Success) CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix Wakeup copytool agt1 on wtm-10vm5 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 restore is too long CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 100 secs for update CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= …………. <<<< spending the whole 100s printing the same CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 10 secs for update CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Update not seen after 100s: wanted 'SUCCEED' got '' sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000402:0xf2:0x0] is not SUCCEED Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4320:error() = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size() = /usr/lib64/lustre/tests/sanity-hsm.sh:1907:test_31a() = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test() = /usr/lib64/lustre/tests/sanity-hsm.sh:1914:main() Dumping lctl log to /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.*.1383025137.log CMD: wtm-10vm3,wtm-10vm4,wtm-10vm5,wtm-10vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.debug_log.\$(hostname -s).1383025137.log; dmesg > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.dmesg.\$(hostname -s).1383025137.log CMD: wtm-10vm5 pkill -INT -x lhsmtool_posix Copytool is stopped on wtm-10vm5
here is the 31b log for the error in LU-3969 :
== sanity-hsm test 31b: Restore a large unaligned file and check size during restore == 13:50:02 (1379451002) CMD: wtm-14vm5 pkill -CONT -x lhsmtool_posix Purging archive on wtm-14vm5 CMD: wtm-14vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1/* Starting copytool agt1 on wtm-14vm5 CMD: wtm-14vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 CMD: wtm-14vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.copytool_log.wtm-14vm5.log 2>&1 39+0 records in 39+0 records out 39000000 bytes (39 MB) copied, 15.8537 s, 2.5 MB/s CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Changed after 0s: from '' to 'STARTED' Waiting 100 secs for update CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= …………… <<<< spending the whole 100s printing the same CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 10 secs for update CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Update not seen after 100s: wanted 'SUCCEED' got '' sanity-hsm test_31b: @@@@@@ FAIL: request on 0x200008101:0x21:0x0 is not SUCCEED Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4266:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4293:error() = /usr/lib64/lustre/tests/sanity-hsm.sh:444:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:1696:restore_and_check_size() = /usr/lib64/lustre/tests/sanity-hsm.sh:1734:test_31b() = /usr/lib64/lustre/tests/test-framework.sh:4547:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4580:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4435:run_test() = /usr/lib64/lustre/tests/sanity-hsm.sh:1741:main() Dumping lctl log to /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.*.1379451241.log CMD: wtm-14vm3,wtm-14vm4,wtm-14vm5,wtm-14vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.debug_log.\$(hostname -s).1379451241.log; dmesg > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.dmesg.\$(hostname -s).1379451241.log CMD: wtm-14vm5 pkill -INT -x lhsmtool_posix Copytool is stopped on wtm-14vm5
and here is the 31c log for the error I just reported :
== sanity-hsm test 31c: Restore a large aligned file and check size during restore == 18:03:34 (1383354214) CMD: wtm-19vm5 pkill -CONT -x lhsmtool_posix Purging archive on wtm-19vm5 CMD: wtm-19vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1/* Starting copytool agt1 on wtm-19vm5 CMD: wtm-19vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 CMD: wtm-19vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.copytool_log.wtm-19vm5.log 2>&1 /usr/lib64/lustre/tests/sanity-hsm.sh: line 402: [: /mnt/lustre: integer expression expected 33+0 records in 33+0 records out 34603008 bytes (35 MB) copied, 5.01832 s, 6.9 MB/s CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Changed after 0s: from '' to 'STARTED' Waiting 100 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 80 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Updated after 29s: wanted 'SUCCEED' got 'SUCCEED' size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 restore is too long CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 100 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= …………… <<<< spending the whole 100s printing the same CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 10 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Update not seen after 100s: wanted 'SUCCEED' got '' sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000401:0x28:0x0 is not SUCCEED Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4320:error() = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size() = /usr/lib64/lustre/tests/sanity-hsm.sh:1950:test_31c() = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test() = /usr/lib64/lustre/tests/sanity-hsm.sh:1957:main() Dumping lctl log to /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.*.1383354428.log CMD: wtm-19vm3,wtm-19vm4,wtm-19vm5,wtm-19vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.debug_log.\$(hostname -s).1383354428.log; dmesg > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.dmesg.\$(hostname -s).1383354428.log CMD: wtm-19vm5 pkill -INT -x lhsmtool_posix Copytool is stopped on wtm-19vm5
This clearly shows that :
_ restore_and_check_size() must be fixed to exchange "too long" and "done" outputs to reflect the real restore status.
_ the following+common "wait_request_state $fid RESTORE SUCCEED" may only be required for the "too long" case, since it seems to be unnecessary and buggy if the RESTORE/SUCCESS record has already been deleted or never written ??… This is what seems to happen in all 3 cases where it is obvious that the file hsm_flags as it HSM_RELEASED bit cleared, meaning the restore was completed.
_ thus a possible way to fix could be to only "wait_request_state $fid RESTORE SUCCEED" for "too long" restore case. An other would be to ensure llog RESTORE/SUCCESS record is written/updated (which should not occur only upon request timeout or upon some llog layer error) and will be kept long enough.
Reopen bug until patch to enable test is landed.