  Lustre / LU-4613

Failure on test suite sanity-hsm test_12o: request on 0x200000bd1:0xf:0x0 is not SUCCEED on mds1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Environment: client and server: lustre-master build # 1876;
      client is SLES11 SP3
    • 3
    • 12623

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6ac389ea-90ce-11e3-91ee-52540035b04c.

      The sub-test test_12o failed with the following error:

      request on 0x200000bd1:0xf:0x0 is not SUCCEED on mds1

      Info required for matching: sanity-hsm 12o

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.5.1 and 2.6

          bogl Bob Glossman (Inactive) added a comment - hit again: https://maloo.whamcloud.com/test_sets/d436c73c-9855-11e3-8828-52540035b04c
          yujian Jian Yu added a comment -

          Patch for b2_5 branch: http://review.whamcloud.com/9295

          yujian Jian Yu added a comment -

          While testing patch http://review.whamcloud.com/9288 on Lustre b2_5 branch, the same failure occurred:
          https://maloo.whamcloud.com/test_sets/a20caf5e-984d-11e3-8a38-52540035b04c

          As per Bruno, this regression was introduced by the patch for LU-3834 that landed in Lustre b2_5 build #25.

          So the failure also needs to be fixed on the Lustre b2_5 branch.

          bfaccini Bruno Faccini (Inactive) added a comment - Patch is at http://review.whamcloud.com/9235.

          bfaccini Bruno Faccini (Inactive) added a comment -

          +1 for https://maloo.whamcloud.com/test_sets/e17437d2-934f-11e3-9f1b-52540035b04c, during an auto-test session for patch http://review.whamcloud.com/9212 (the b2_5 version of LU-3834).

          sanity-hsm/test_12o (which I introduced for LU-3834!) reports a failure while waiting for the success of the 2nd diff command (i.e., once the fault injection of the layout swap has been disabled) for the same file/FID, and we can see the following traces in the test log:

          ………
          
          CMD: client-12vm3 lctl set_param fail_loc=0x152
          fail_loc=0x152
          CMD: client-12vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA
          mdt.lustre-MDT0000.hsm.policy=+NRA
          diff: /mnt/lustre/d12o.sanity-hsm/f12o.sanity-hsm: No data available
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=-NRA
          mdt.lustre-MDT0000.hsm.policy=-NRA
          CMD: client-12vm3 lctl set_param fail_loc=0
          fail_loc=0
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 100 secs for update
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Waiting 90 secs for update
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Changed after 17s: from 'FAILED
          SUCCEED' to ''
          CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          
          ……….
          
          

          So it seems that, because the check looks up the HSM operation result for the same FID as a previous failed operation that may not yet have been purged from the log, it gets confused when it receives both the old and the new results.
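
          For reference, here is roughly what the per-FID state check in the log above boils down to (an illustrative sketch modelled on the get_param/awk pipeline shown in the trace, not the exact sanity-hsm helper):

          # Illustrative sketch: how the RESTORE state lookup behaves when two
          # records for the same FID are still present in hsm.actions.
          fid=0x200000bd1:0xf:0x0
          state=$(lctl get_param -n mdt.lustre-MDT0000.hsm.actions |
                  awk -v fid="$fid" '$0 ~ fid && /action=RESTORE/ {print $13}' |
                  cut -f2 -d=)
          # With both the old FAILED and the new SUCCEED request still listed,
          # $state spans two lines and never compares equal to the single
          # expected value, so the wait loop keeps retrying until it times out.
          [[ "$state" == "SUCCEED" ]] || echo "request on $fid is not SUCCEED on mds1"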

          Here is the current sanity-hsm/test_12o code:

          test_12o() {
                  # test needs a running copytool
                  copytool_setup
          
                  mkdir -p $DIR/$tdir
                  local f=$DIR/$tdir/$tfile
                  local fid=$(copy_file /etc/hosts $f)
          
                  $LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f
                  wait_request_state $fid ARCHIVE SUCCEED
                  $LFS hsm_release $f || error "release of $f failed"
          
          #define OBD_FAIL_MDS_HSM_SWAP_LAYOUTS           0x152
                  do_facet $SINGLEMDS lctl set_param fail_loc=0x152
          
                  # set no retry action mode
                  cdt_set_no_retry
          
                  diff -q /etc/hosts $f
                  local st=$?
          
                  # we check we had a restore failure
                  wait_request_state $fid RESTORE FAILED
          
                  [[ $st -eq 0 ]] && error "Restore must fail"
          
                  # remove no retry action mode
                  cdt_clear_no_retry
          
                  # check file is still released
                  check_hsm_flags $f "0x0000000d"
          
                  # retry w/o failure injection
                  do_facet $SINGLEMDS lctl set_param fail_loc=0
          
                  diff -q /etc/hosts $f
                  st=$?
          
                  # we check we had a restore done
                  wait_request_state $fid RESTORE SUCCEED
          
                  [[ $st -eq 0 ]] || error "Restored file differs"
          
                  copytool_cleanup
          }
          run_test 12o "Layout-swap failure during Restore leaves file released"
          

          I will push a patch to introduce a "cdt_purge + wait_for_grace_delay" between the failing (due to fault injection) and the succeeding (normal conditions) RESTORE attempts, to allow the old/1st result to disappear from the log. The current grace_delay is set to 10s by sanity-hsm, which is too long given the short interval between the two RESTORE requests here.
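
          For illustration, here is a minimal sketch of where such a step might be inserted in test_12o, between the failed and the retried restore. It paraphrases the proposal above and assumes the cdt_purge and wait_for_grace_delay helpers named there, so it is not necessarily the exact landed patch:

                  # check file is still released
                  check_hsm_flags $f "0x0000000d"

                  # drop the FAILED record of the 1st RESTORE and wait out the
                  # grace delay, so the next wait_request_state only sees the
                  # state of the new request (proposed fix, sketch only)
                  cdt_purge
                  wait_for_grace_delay

                  # retry w/o failure injection
                  do_facet $SINGLEMDS lctl set_param fail_loc=0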


          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 7
