Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4176

Test failure on test suite sanity-hsm, subtest test_31a

Details

    • 3
    • 11307

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6bbd2b34-407a-11e3-af95-52540035b04c.

      The sub-test test_31a failed with the following error:

      request on [0x200000402:0xf2:0x0] is not SUCCEED

      Info required for matching: sanity-hsm 31a

      This test was EXCEPT'd prior to http://review.whamcloud.com/7374 (LU-3815) re-enabled most of the EXCEPT'd tests.

      Attachments

        Activity

          [LU-4176] Test failure on test suite sanity-hsm, subtest test_31a

          Reopen bug until patch to enable test is landed.

          adilger Andreas Dilger added a comment - Reopen bug until patch to enable test is landed.

          Even if patches/fixes have been landed for both b2_5 and master, test_31a (only!! and surprisingly not test_31b and test_31c also when they did trigger this same problem ...) is still disabled respectively after patches #9134 and #8122 (for LU-4178) have been landed to these 2 branches.

          Patch for master to re-enable test_31a is at http://review.whamcloud.com/9577.

          bfaccini Bruno Faccini (Inactive) added a comment - Even if patches/fixes have been landed for both b2_5 and master, test_31a (only!! and surprisingly not test_31b and test_31c also when they did trigger this same problem ...) is still disabled respectively after patches #9134 and #8122 (for LU-4178 ) have been landed to these 2 branches. Patch for master to re-enable test_31a is at http://review.whamcloud.com/9577 .
          bogl Bob Glossman (Inactive) added a comment - seen again in b2_5: https://maloo.whamcloud.com/test_sets/2181559c-772d-11e3-b181-52540035b04c
          yujian Jian Yu added a comment -

          The same failure occurred on Lustre b2_5 branch again:
          https://maloo.whamcloud.com/test_sets/5d3e042e-74ce-11e3-8ad9-52540035b04c

          yujian Jian Yu added a comment - The same failure occurred on Lustre b2_5 branch again: https://maloo.whamcloud.com/test_sets/5d3e042e-74ce-11e3-8ad9-52540035b04c
          jamesanunez James Nunez (Inactive) added a comment - Patch for b2_5 at http://review.whamcloud.com/#/c/8720/
          yujian Jian Yu added a comment -

          The same failure occurred on Lustre b2_5 branch:
          https://maloo.whamcloud.com/test_sets/4bc47d48-57d4-11e3-919d-52540035b04c

          yujian Jian Yu added a comment - The same failure occurred on Lustre b2_5 branch: https://maloo.whamcloud.com/test_sets/4bc47d48-57d4-11e3-919d-52540035b04c

          Patch landed to Master. IF more work is needed in this ticket, let m know and I'll reopen this ticket.

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. IF more work is needed in this ticket, let m know and I'll reopen this ticket.
          bfaccini Bruno Faccini (Inactive) added a comment - Submitted patch http://review.whamcloud.com/8154 .

          As the function tests the hsm flags, and trigs error if not as expected, we can remove the wait_request_state $fid RESTORE SUCCEED, so this will manage the "deleted record" case

          jcl jacques-charles lafoucriere added a comment - As the function tests the hsm flags, and trigs error if not as expected, we can remove the wait_request_state $fid RESTORE SUCCEED, so this will manage the "deleted record" case
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          After I found similar problem with sanity-hsm/test_31c during test session https://maloo.whamcloud.com/test_sets/bf8b7f8a-4396-11e3-942a-52540035b04c, I wanted to create a new ticket but since I thought that both sub-tests 31a/31b may have experienced the same problem, after some search I 1st found LU-3969 for test_31b similar problem, which has been dup'ed to LU-3761 with no real conclusion about the exact scenario, and also found this one for same error with test_31a !!

          The issue seems to come from the same sub-function restore_and_check_size() all, and only, these 3 sub-tests are using to check the restore of a big file :

          restore_and_check_size() {
                  local f=$1
                  local fid=$2
                  local s=$(stat -c "%s" $f)
                  local n=$s
                  local st=$(get_hsm_flags $f)
                  local err=0
                  local cpt=0
                  $LFS hsm_restore $f
                  while [[ "$st" != "0x00000009" && $cpt -le 10 ]]
                  do
                          n=$(stat -c "%s" $f)
                          # we echo in both cases to show stat is not
                          # hang
                          if [[ $n != $s ]]; then
                                  echo "size seen is $n != $s"
                                  err=1
                          else
                                  echo "size seen is right: $n == $s"
                          fi
                          st=$(get_hsm_flags $f)
                          sleep 10
                          cpt=$((cpt + 1))
                  done
                  if [[ $cpt -lt 10 ]]; then
                          echo " restore is too long"
                  else
                          echo " "done
                  fi
                  wait_request_state $fid RESTORE SUCCEED
                  return $err
          }
          

          and if we look at the tests error logs they all look similar and here is 31a log for the error that caused this ticket creation :

          == sanity-hsm test 31a: Import a large file and check size during restore == 22:35:58 (1383024958)
          CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix
          Purging archive on wtm-10vm5
          CMD: wtm-10vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/*
          Starting copytool agt1 on wtm-10vm5
          CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1
          CMD: wtm-10vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.copytool_log.wtm-10vm5.log 2>&1
          CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31
          CMD: wtm-10vm5 dd if=/dev/urandom of=/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a count=32 bs=1000000
          wtm-10vm5: 32+0 records in
          wtm-10vm5: 32+0 records out
          wtm-10vm5: 32000000 bytes (32 MB) copied, 4.95968 s, 6.5 MB/s
          CMD: wtm-10vm5 lhsmtool_posix --archive 2 --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1		--import d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre
          wtm-10vm5: lhsmtool_posix[23455]: action=1 src=d0.sanity-hsm/d31/f.sanity-hsm.31a dst=/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a mount_point=/mnt/lustre
          wtm-10vm5: lhsmtool_posix[23455]: importing '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a'
          wtm-10vm5: lhsmtool_posix[23455]: imported '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/00f2/0000/0402/0000/0002/0000/0x200000402:0xf2:0x0'=='/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a'
          wtm-10vm5: lhsmtool_posix[23455]: process finished, errs: 0 major, 0 minor, rc=0 (Success)
          CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix
          Wakeup copytool agt1 on wtm-10vm5
          size seen is right: 32000000 == 32000000
          size seen is right: 32000000 == 32000000
          size seen is right: 32000000 == 32000000
          size seen is right: 32000000 == 32000000
          size seen is right: 32000000 == 32000000
           restore is too long
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 100 secs for update
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 90 secs for update
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          
          …………. <<<< spending the whole 100s printing the same
          
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 10 secs for update
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Update not seen after 100s: wanted 'SUCCEED' got ''
           sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000402:0xf2:0x0] is not SUCCEED 
            Trace dump:
            = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
            = /usr/lib64/lustre/tests/test-framework.sh:4320:error()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1907:test_31a()
            = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one()
            = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged()
            = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1914:main()
          Dumping lctl log to /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.*.1383025137.log
          CMD: wtm-10vm3,wtm-10vm4,wtm-10vm5,wtm-10vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.debug_log.\$(hostname -s).1383025137.log;
                   dmesg > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.dmesg.\$(hostname -s).1383025137.log
          CMD: wtm-10vm5 pkill -INT -x lhsmtool_posix
          Copytool is stopped on wtm-10vm5
          

          here is the 31b log for the error in LU-3969 :

          == sanity-hsm test 31b: Restore a large unaligned file and check size during restore == 13:50:02 (1379451002)
          CMD: wtm-14vm5 pkill -CONT -x lhsmtool_posix
          Purging archive on wtm-14vm5
          CMD: wtm-14vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1/*
          Starting copytool agt1 on wtm-14vm5
          CMD: wtm-14vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1
          CMD: wtm-14vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.copytool_log.wtm-14vm5.log 2>&1
          39+0 records in
          39+0 records out
          39000000 bytes (39 MB) copied, 15.8537 s, 2.5 MB/s
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          Changed after 0s: from '' to 'STARTED'
          Waiting 100 secs for update
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          Waiting 90 secs for update
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          
          …………… <<<< spending the whole 100s printing the same
          
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 10 secs for update
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Update not seen after 100s: wanted 'SUCCEED' got ''
           sanity-hsm test_31b: @@@@@@ FAIL: request on 0x200008101:0x21:0x0 is not SUCCEED 
            Trace dump:
            = /usr/lib64/lustre/tests/test-framework.sh:4266:error_noexit()
            = /usr/lib64/lustre/tests/test-framework.sh:4293:error()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:444:wait_request_state()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1696:restore_and_check_size()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1734:test_31b()
            = /usr/lib64/lustre/tests/test-framework.sh:4547:run_one()
            = /usr/lib64/lustre/tests/test-framework.sh:4580:run_one_logged()
            = /usr/lib64/lustre/tests/test-framework.sh:4435:run_test()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1741:main()
          Dumping lctl log to /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.*.1379451241.log
          CMD: wtm-14vm3,wtm-14vm4,wtm-14vm5,wtm-14vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.debug_log.\$(hostname -s).1379451241.log;
                   dmesg > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.dmesg.\$(hostname -s).1379451241.log
          CMD: wtm-14vm5 pkill -INT -x lhsmtool_posix
          Copytool is stopped on wtm-14vm5
          

          and here is the 31c log for the error I just reported :

          == sanity-hsm test 31c: Restore a large aligned file and check size during restore == 18:03:34 (1383354214)
          CMD: wtm-19vm5 pkill -CONT -x lhsmtool_posix
          Purging archive on wtm-19vm5
          CMD: wtm-19vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1/*
          Starting copytool agt1 on wtm-19vm5
          CMD: wtm-19vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1
          CMD: wtm-19vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.copytool_log.wtm-19vm5.log 2>&1
          /usr/lib64/lustre/tests/sanity-hsm.sh: line 402: [: /mnt/lustre: integer expression expected
          33+0 records in
          33+0 records out
          34603008 bytes (35 MB) copied, 5.01832 s, 6.9 MB/s
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          Changed after 0s: from '' to 'STARTED'
          Waiting 100 secs for update
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          Waiting 90 secs for update
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          Waiting 80 secs for update
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
          Updated after 29s: wanted 'SUCCEED' got 'SUCCEED'
          size seen is right: 34603008 == 34603008
          size seen is right: 34603008 == 34603008
          size seen is right: 34603008 == 34603008
          size seen is right: 34603008 == 34603008
          size seen is right: 34603008 == 34603008
           restore is too long
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 100 secs for update
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 90 secs for update
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          
          …………… <<<< spending the whole 100s printing the same
          
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 10 secs for update
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Update not seen after 100s: wanted 'SUCCEED' got ''
           sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000401:0x28:0x0 is not SUCCEED 
            Trace dump:
            = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
            = /usr/lib64/lustre/tests/test-framework.sh:4320:error()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1950:test_31c()
            = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one()
            = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged()
            = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test()
            = /usr/lib64/lustre/tests/sanity-hsm.sh:1957:main()
          Dumping lctl log to /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.*.1383354428.log
          CMD: wtm-19vm3,wtm-19vm4,wtm-19vm5,wtm-19vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.debug_log.\$(hostname -s).1383354428.log;
                   dmesg > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.dmesg.\$(hostname -s).1383354428.log
          CMD: wtm-19vm5 pkill -INT -x lhsmtool_posix
          Copytool is stopped on wtm-19vm5
          

          This clearly shows that :

          _ restore_and_check_size() must be fixed to exchange "too long" and "done" outputs to reflect the real restore status.

          _ the following+common "wait_request_state $fid RESTORE SUCCEED" may only be required for the "too long" case, since it seems to be unnecessary and buggy if the RESTORE/SUCCESS record has already been deleted or never written ??… This is what seems to happen in all 3 cases where it is obvious that the file hsm_flags as it HSM_RELEASED bit cleared, meaning the restore was completed.

          _ thus a possible way to fix could be to only "wait_request_state $fid RESTORE SUCCEED" for "too long" restore case. An other would be to ensure llog RESTORE/SUCCESS record is written/updated (which should not occur only upon request timeout or upon some llog layer error) and will be kept long enough.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited After I found similar problem with sanity-hsm/test_31c during test session https://maloo.whamcloud.com/test_sets/bf8b7f8a-4396-11e3-942a-52540035b04c , I wanted to create a new ticket but since I thought that both sub-tests 31a/31b may have experienced the same problem, after some search I 1st found LU-3969 for test_31b similar problem, which has been dup'ed to LU-3761 with no real conclusion about the exact scenario, and also found this one for same error with test_31a !! The issue seems to come from the same sub-function restore_and_check_size() all, and only, these 3 sub-tests are using to check the restore of a big file : restore_and_check_size() { local f=$1 local fid=$2 local s=$(stat -c "%s" $f) local n=$s local st=$(get_hsm_flags $f) local err=0 local cpt=0 $LFS hsm_restore $f while [[ "$st" != "0x00000009" && $cpt -le 10 ]] do n=$(stat -c "%s" $f) # we echo in both cases to show stat is not # hang if [[ $n != $s ]]; then echo "size seen is $n != $s" err=1 else echo "size seen is right: $n == $s" fi st=$(get_hsm_flags $f) sleep 10 cpt=$((cpt + 1)) done if [[ $cpt -lt 10 ]]; then echo " restore is too long" else echo " "done fi wait_request_state $fid RESTORE SUCCEED return $err } and if we look at the tests error logs they all look similar and here is 31a log for the error that caused this ticket creation : == sanity-hsm test 31a: Import a large file and check size during restore == 22:35:58 (1383024958) CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix Purging archive on wtm-10vm5 CMD: wtm-10vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/* Starting copytool agt1 on wtm-10vm5 CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 CMD: wtm-10vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.copytool_log.wtm-10vm5.log 2>&1 CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31 CMD: wtm-10vm5 dd if=/dev/urandom of=/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a count=32 bs=1000000 wtm-10vm5: 32+0 records in wtm-10vm5: 32+0 records out wtm-10vm5: 32000000 bytes (32 MB) copied, 4.95968 s, 6.5 MB/s CMD: wtm-10vm5 lhsmtool_posix --archive 2 --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --import d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre wtm-10vm5: lhsmtool_posix[23455]: action=1 src=d0.sanity-hsm/d31/f.sanity-hsm.31a dst=/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a mount_point=/mnt/lustre wtm-10vm5: lhsmtool_posix[23455]: importing '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a' wtm-10vm5: lhsmtool_posix[23455]: imported '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/00f2/0000/0402/0000/0002/0000/0x200000402:0xf2:0x0'=='/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a' wtm-10vm5: lhsmtool_posix[23455]: process finished, errs: 0 major, 0 minor, rc=0 (Success) CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix Wakeup copytool agt1 on wtm-10vm5 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 size seen is right: 32000000 == 32000000 restore is too long CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 100 secs for update CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= …………. <<<< spending the whole 100s printing the same CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 10 secs for update CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Update not seen after 100s: wanted 'SUCCEED' got '' sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000402:0xf2:0x0] is not SUCCEED Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4320:error() = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size() = /usr/lib64/lustre/tests/sanity-hsm.sh:1907:test_31a() = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test() = /usr/lib64/lustre/tests/sanity-hsm.sh:1914:main() Dumping lctl log to /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.*.1383025137.log CMD: wtm-10vm3,wtm-10vm4,wtm-10vm5,wtm-10vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.debug_log.\$(hostname -s).1383025137.log; dmesg > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.dmesg.\$(hostname -s).1383025137.log CMD: wtm-10vm5 pkill -INT -x lhsmtool_posix Copytool is stopped on wtm-10vm5 here is the 31b log for the error in LU-3969 : == sanity-hsm test 31b: Restore a large unaligned file and check size during restore == 13:50:02 (1379451002) CMD: wtm-14vm5 pkill -CONT -x lhsmtool_posix Purging archive on wtm-14vm5 CMD: wtm-14vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1/* Starting copytool agt1 on wtm-14vm5 CMD: wtm-14vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 CMD: wtm-14vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.copytool_log.wtm-14vm5.log 2>&1 39+0 records in 39+0 records out 39000000 bytes (39 MB) copied, 15.8537 s, 2.5 MB/s CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Changed after 0s: from '' to 'STARTED' Waiting 100 secs for update CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= …………… <<<< spending the whole 100s printing the same CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 10 secs for update CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Update not seen after 100s: wanted 'SUCCEED' got '' sanity-hsm test_31b: @@@@@@ FAIL: request on 0x200008101:0x21:0x0 is not SUCCEED Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4266:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4293:error() = /usr/lib64/lustre/tests/sanity-hsm.sh:444:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:1696:restore_and_check_size() = /usr/lib64/lustre/tests/sanity-hsm.sh:1734:test_31b() = /usr/lib64/lustre/tests/test-framework.sh:4547:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4580:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4435:run_test() = /usr/lib64/lustre/tests/sanity-hsm.sh:1741:main() Dumping lctl log to /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.*.1379451241.log CMD: wtm-14vm3,wtm-14vm4,wtm-14vm5,wtm-14vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.debug_log.\$(hostname -s).1379451241.log; dmesg > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.dmesg.\$(hostname -s).1379451241.log CMD: wtm-14vm5 pkill -INT -x lhsmtool_posix Copytool is stopped on wtm-14vm5 and here is the 31c log for the error I just reported : == sanity-hsm test 31c: Restore a large aligned file and check size during restore == 18:03:34 (1383354214) CMD: wtm-19vm5 pkill -CONT -x lhsmtool_posix Purging archive on wtm-19vm5 CMD: wtm-19vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1/* Starting copytool agt1 on wtm-19vm5 CMD: wtm-19vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 CMD: wtm-19vm5 lhsmtool_posix --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.copytool_log.wtm-19vm5.log 2>&1 /usr/lib64/lustre/tests/sanity-hsm.sh: line 402: [: /mnt/lustre: integer expression expected 33+0 records in 33+0 records out 34603008 bytes (35 MB) copied, 5.01832 s, 6.9 MB/s CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Changed after 0s: from '' to 'STARTED' Waiting 100 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Waiting 80 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d= Updated after 29s: wanted 'SUCCEED' got 'SUCCEED' size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 size seen is right: 34603008 == 34603008 restore is too long CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 100 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 90 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= …………… <<<< spending the whole 100s printing the same CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Waiting 10 secs for update CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d= Update not seen after 100s: wanted 'SUCCEED' got '' sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000401:0x28:0x0 is not SUCCEED Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4320:error() = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state() = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size() = /usr/lib64/lustre/tests/sanity-hsm.sh:1950:test_31c() = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test() = /usr/lib64/lustre/tests/sanity-hsm.sh:1957:main() Dumping lctl log to /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.*.1383354428.log CMD: wtm-19vm3,wtm-19vm4,wtm-19vm5,wtm-19vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.debug_log.\$(hostname -s).1383354428.log; dmesg > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.dmesg.\$(hostname -s).1383354428.log CMD: wtm-19vm5 pkill -INT -x lhsmtool_posix Copytool is stopped on wtm-19vm5 This clearly shows that : _ restore_and_check_size() must be fixed to exchange "too long" and "done" outputs to reflect the real restore status. _ the following+common "wait_request_state $fid RESTORE SUCCEED" may only be required for the "too long" case, since it seems to be unnecessary and buggy if the RESTORE/SUCCESS record has already been deleted or never written ??… This is what seems to happen in all 3 cases where it is obvious that the file hsm_flags as it HSM_RELEASED bit cleared, meaning the restore was completed. _ thus a possible way to fix could be to only "wait_request_state $fid RESTORE SUCCEED" for "too long" restore case. An other would be to ensure llog RESTORE/SUCCESS record is written/updated (which should not occur only upon request timeout or upon some llog layer error) and will be kept long enough.

          People

            bfaccini Bruno Faccini (Inactive)
            maloo Maloo
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: