[LU-4176] Test failure on test suite sanity-hsm, subtest test_31a Created: 29/Oct/13  Updated: 03/Apr/16  Resolved: 23/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Severity: 3
Rank (Obsolete): 11307

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6bbd2b34-407a-11e3-af95-52540035b04c.

The sub-test test_31a failed with the following error:

request on [0x200000402:0xf2:0x0] is not SUCCEED

Info required for matching: sanity-hsm 31a

This test was EXCEPT'd prior to http://review.whamcloud.com/7374 (LU-3815) re-enabled most of the EXCEPT'd tests.



 Comments   
Comment by Bruno Faccini (Inactive) [ 02/Nov/13 ]

After I found similar problem with sanity-hsm/test_31c during test session https://maloo.whamcloud.com/test_sets/bf8b7f8a-4396-11e3-942a-52540035b04c, I wanted to create a new ticket but since I thought that both sub-tests 31a/31b may have experienced the same problem, after some search I 1st found LU-3969 for test_31b similar problem, which has been dup'ed to LU-3761 with no real conclusion about the exact scenario, and also found this one for same error with test_31a !!

The issue seems to come from the same sub-function restore_and_check_size() all, and only, these 3 sub-tests are using to check the restore of a big file :

restore_and_check_size() {
        local f=$1
        local fid=$2
        local s=$(stat -c "%s" $f)
        local n=$s
        local st=$(get_hsm_flags $f)
        local err=0
        local cpt=0
        $LFS hsm_restore $f
        while [[ "$st" != "0x00000009" && $cpt -le 10 ]]
        do
                n=$(stat -c "%s" $f)
                # we echo in both cases to show stat is not
                # hang
                if [[ $n != $s ]]; then
                        echo "size seen is $n != $s"
                        err=1
                else
                        echo "size seen is right: $n == $s"
                fi
                st=$(get_hsm_flags $f)
                sleep 10
                cpt=$((cpt + 1))
        done
        if [[ $cpt -lt 10 ]]; then
                echo " restore is too long"
        else
                echo " "done
        fi
        wait_request_state $fid RESTORE SUCCEED
        return $err
}

and if we look at the tests error logs they all look similar and here is 31a log for the error that caused this ticket creation :

== sanity-hsm test 31a: Import a large file and check size during restore == 22:35:58 (1383024958)
CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-10vm5
CMD: wtm-10vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/*
Starting copytool agt1 on wtm-10vm5
CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1
CMD: wtm-10vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.copytool_log.wtm-10vm5.log 2>&1
CMD: wtm-10vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31
CMD: wtm-10vm5 dd if=/dev/urandom of=/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a count=32 bs=1000000
wtm-10vm5: 32+0 records in
wtm-10vm5: 32+0 records out
wtm-10vm5: 32000000 bytes (32 MB) copied, 4.95968 s, 6.5 MB/s
CMD: wtm-10vm5 lhsmtool_posix --archive 2 --hsm-root /home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1		--import d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a /mnt/lustre
wtm-10vm5: lhsmtool_posix[23455]: action=1 src=d0.sanity-hsm/d31/f.sanity-hsm.31a dst=/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a mount_point=/mnt/lustre
wtm-10vm5: lhsmtool_posix[23455]: importing '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a'
wtm-10vm5: lhsmtool_posix[23455]: imported '/mnt/lustre/d0.sanity-hsm/d31/f.sanity-hsm.31a' from '/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/00f2/0000/0402/0000/0002/0000/0x200000402:0xf2:0x0'=='/home/cgearing/.autotest/shared_dir/2013-10-28/112321-69843391629520/arc1/d0.sanity-hsm/d31/f.sanity-hsm.31a'
wtm-10vm5: lhsmtool_posix[23455]: process finished, errs: 0 major, 0 minor, rc=0 (Success)
CMD: wtm-10vm5 pkill -CONT -x lhsmtool_posix
Wakeup copytool agt1 on wtm-10vm5
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
size seen is right: 32000000 == 32000000
 restore is too long
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 100 secs for update
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=

…………. <<<< spending the whole 100s printing the same

CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-10vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000402:0xf2:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got ''
 sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000402:0xf2:0x0] is not SUCCEED 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4320:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1907:test_31a()
  = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1914:main()
Dumping lctl log to /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.*.1383025137.log
CMD: wtm-10vm3,wtm-10vm4,wtm-10vm5,wtm-10vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.debug_log.\$(hostname -s).1383025137.log;
         dmesg > /logdir/test_logs/2013-10-28/lustre-reviews-el6-x86_64--review--1_2_1__19069__-69843391629520-112320/sanity-hsm.test_31a.dmesg.\$(hostname -s).1383025137.log
CMD: wtm-10vm5 pkill -INT -x lhsmtool_posix
Copytool is stopped on wtm-10vm5

here is the 31b log for the error in LU-3969 :

== sanity-hsm test 31b: Restore a large unaligned file and check size during restore == 13:50:02 (1379451002)
CMD: wtm-14vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-14vm5
CMD: wtm-14vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1/*
Starting copytool agt1 on wtm-14vm5
CMD: wtm-14vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1
CMD: wtm-14vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-09-17/024917-69984331480660/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.copytool_log.wtm-14vm5.log 2>&1
39+0 records in
39+0 records out
39000000 bytes (39 MB) copied, 15.8537 s, 2.5 MB/s
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=

…………… <<<< spending the whole 100s printing the same

CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-14vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.agent_actions | awk '/'0x200008101:0x21:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got ''
 sanity-hsm test_31b: @@@@@@ FAIL: request on 0x200008101:0x21:0x0 is not SUCCEED 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4266:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4293:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:444:wait_request_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1696:restore_and_check_size()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1734:test_31b()
  = /usr/lib64/lustre/tests/test-framework.sh:4547:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4580:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4435:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1741:main()
Dumping lctl log to /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.*.1379451241.log
CMD: wtm-14vm3,wtm-14vm4,wtm-14vm5,wtm-14vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.debug_log.\$(hostname -s).1379451241.log;
         dmesg > /logdir/test_logs/2013-09-17/lustre-reviews-el6-x86_64--review-zfs--1_3_1__18234__-69984331480660-024916/sanity-hsm.test_31b.dmesg.\$(hostname -s).1379451241.log
CMD: wtm-14vm5 pkill -INT -x lhsmtool_posix
Copytool is stopped on wtm-14vm5

and here is the 31c log for the error I just reported :

== sanity-hsm test 31c: Restore a large aligned file and check size during restore == 18:03:34 (1383354214)
CMD: wtm-19vm5 pkill -CONT -x lhsmtool_posix
Purging archive on wtm-19vm5
CMD: wtm-19vm5 rm -rf /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1/*
Starting copytool agt1 on wtm-19vm5
CMD: wtm-19vm5 mkdir -p /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1
CMD: wtm-19vm5 lhsmtool_posix  --daemon --hsm-root /home/cgearing/.autotest/shared_dir/2013-11-01/053728-70102366359140/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.copytool_log.wtm-19vm5.log 2>&1
/usr/lib64/lustre/tests/sanity-hsm.sh: line 402: [: /mnt/lustre: integer expression expected
33+0 records in
33+0 records out
34603008 bytes (35 MB) copied, 5.01832 s, 6.9 MB/s
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Waiting 80 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Updated after 29s: wanted 'SUCCEED' got 'SUCCEED'
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
size seen is right: 34603008 == 34603008
 restore is too long
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 100 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=

…………… <<<< spending the whole 100s printing the same

CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 10 secs for update
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: wtm-19vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000401:0x28:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got ''
 sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000401:0x28:0x0 is not SUCCEED 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4289:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4320:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:470:wait_request_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1891:restore_and_check_size()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1950:test_31c()
  = /usr/lib64/lustre/tests/test-framework.sh:4575:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4609:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4463:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1957:main()
Dumping lctl log to /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.*.1383354428.log
CMD: wtm-19vm3,wtm-19vm4,wtm-19vm5,wtm-19vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.debug_log.\$(hostname -s).1383354428.log;
         dmesg > /logdir/test_logs/2013-11-01/lustre-reviews-el6-x86_64--review--1_2_1__19165__-70102366359140-053726/sanity-hsm.test_31c.dmesg.\$(hostname -s).1383354428.log
CMD: wtm-19vm5 pkill -INT -x lhsmtool_posix
Copytool is stopped on wtm-19vm5

This clearly shows that :

_ restore_and_check_size() must be fixed to exchange "too long" and "done" outputs to reflect the real restore status.

_ the following+common "wait_request_state $fid RESTORE SUCCEED" may only be required for the "too long" case, since it seems to be unnecessary and buggy if the RESTORE/SUCCESS record has already been deleted or never written ??… This is what seems to happen in all 3 cases where it is obvious that the file hsm_flags as it HSM_RELEASED bit cleared, meaning the restore was completed.

_ thus a possible way to fix could be to only "wait_request_state $fid RESTORE SUCCEED" for "too long" restore case. An other would be to ensure llog RESTORE/SUCCESS record is written/updated (which should not occur only upon request timeout or upon some llog layer error) and will be kept long enough.

Comment by jacques-charles lafoucriere [ 04/Nov/13 ]

As the function tests the hsm flags, and trigs error if not as expected, we can remove the wait_request_state $fid RESTORE SUCCEED, so this will manage the "deleted record" case

Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ]

Submitted patch http://review.whamcloud.com/8154.

Comment by Jodi Levi (Inactive) [ 14/Nov/13 ]

Patch landed to Master. IF more work is needed in this ticket, let m know and I'll reopen this ticket.

Comment by Jian Yu [ 28/Nov/13 ]

The same failure occurred on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/4bc47d48-57d4-11e3-919d-52540035b04c

Comment by James Nunez (Inactive) [ 03/Jan/14 ]

Patch for b2_5 at http://review.whamcloud.com/#/c/8720/

Comment by Jian Yu [ 04/Jan/14 ]

The same failure occurred on Lustre b2_5 branch again:
https://maloo.whamcloud.com/test_sets/5d3e042e-74ce-11e3-8ad9-52540035b04c

Comment by Bob Glossman (Inactive) [ 07/Jan/14 ]

seen again in b2_5:
https://maloo.whamcloud.com/test_sets/2181559c-772d-11e3-b181-52540035b04c

Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ]

Even if patches/fixes have been landed for both b2_5 and master, test_31a (only!! and surprisingly not test_31b and test_31c also when they did trigger this same problem ...) is still disabled respectively after patches #9134 and #8122 (for LU-4178) have been landed to these 2 branches.

Patch for master to re-enable test_31a is at http://review.whamcloud.com/9577.

Comment by Andreas Dilger [ 02/Oct/14 ]

Reopen bug until patch to enable test is landed.

Comment by Bruno Faccini (Inactive) [ 10/Nov/14 ]

It just has, so do we need to wait some more and verify its behavior in auto-tests before to close ?

Comment by Bruno Faccini (Inactive) [ 22/Jan/15 ]

Andreas, do you agree than we can close this ticket, now that test_31[a,b,c] have been all re-enabled within master since November and none of these 3 sub-tests have reported the failure addressed in this ticket until now?

Comment by Gerrit Updater [ 05/Feb/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13662
Subject: LU-4176 tests: re-enable sanity-hsm/test_31a
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 5f3b35ef6b1673c7ae5e23cd3c5a00f9cc07b3e5

Generated at Sat Feb 10 01:40:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.