sanity-hsm: $? verification is not valid with set -e

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • Lustre 2.7.0
    • 3
    • HSM
    • 9223372036854775807

    Description

      Part of the fix for LU-5622:

      while (( SECONDS < end_wait )); do
              sleep 2
              do_nodesv $agents "pgrep -x $HSMTOOL_BASE"
              if [ $? -ne 0 ]; then
                      echo "Copytool is stopped on $agents"
                      break
              fi
              echo "Copytool still running on $agents"
      done
      if do_nodesv $agents "pgrep -x $HSMTOOL_BASE"; then
              error "Copytool failed to stop in ${TIMEOUT}s ..."
      else
              echo "Copytool has stopped in " \
                   "$((TIMEOUT - (end_wait - SECONDS)))s."
      fi
      

      causes a failure in sanity-hsm.
      sanity-hsm.sh runs with set -e, so a command that returns non-zero can terminate the script before the separate $? check that follows it is reached.
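
      The behaviour described above can be reproduced outside the test framework. The following is a minimal sketch, not code from sanity-hsm.sh: probe and run_with_errexit are hypothetical names, and probe simply stands in for a command that returns non-zero (as pgrep does when nothing matches).

```shell
# Run a snippet in a child bash with set -e enabled and report what happened.
# probe is a hypothetical command that always fails (like pgrep with no match).
run_with_errexit() {
	local out rc=0
	out=$(bash -c 'set -e; probe() { return 1; }; '"$1") || rc=$?
	echo "output='$out' exit=$rc"
}

# Separate check of $?: the shell exits at the failing command,
# so the if statement is never evaluated.
sep=$(run_with_errexit 'probe; if [ $? -ne 0 ]; then echo stopped; fi')

# Command used directly as the if condition: set -e does not apply there.
cond=$(run_with_errexit 'if ! probe; then echo stopped; fi')

echo "separate \$? check: $sep"
echo "direct condition:   $cond"
```

      In the first case the child shell exits with the status of the failing command and prints nothing; in the second it prints "stopped" and exits with 0.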

          Activity

            [LU-7312] sanity-hsm: $? verification is not valid with set -e

            It seems there are other places in sanity-hsm.sh and sanity.sh that use the same error test (and maybe in other test scripts as well).
            Why do we need set -e? (It is in all test scripts.)

            jcl jacques-charles lafoucriere added a comment

            Yes, IMO the replacement should be done as mentioned.
            Initially, when I ran a single test, say test_400, the script would exit because of the exit status check.
            Before the patch, the check

             if [ $? -ne 0 ]; then

            leads to an exit:

            [root@Bhagyesh tests]# ONLY=400 ./sanity-hsm.sh
            Logging to shared log directory: /tmp/test_logs/1445246722
            Starting client Bhagyesh:  -o user_xattr,flock Bhagyesh@tcp:/lustre /mnt/lustre2
            Started clients Bhagyesh:
            Bhagyesh@tcp:/lustre on /mnt/lustre2 type lustre (rw,user_xattr,flock)
            Bhagyesh: Checking config lustre mounted on /mnt/lustre
            Checking servers environments
            Checking clients Bhagyesh environments
            Using TIMEOUT=20
            disable quota as required
            osd-ldiskfs.track_declares_assert=1
            running as uid/gid/euid/egid 500/500/500/500, groups:
             [touch] [/mnt/lustre/d0_runas_test/f3594]
            excepting tests: 34 35 36
            Killing existing copytools on Bhagyesh
            Set HSM on and start
            Start copytool
            Purging archive on Bhagyesh
            Starting copytool agt1 on Bhagyesh
            Set sanity-hsm HSM policy
            
            
            == sanity-hsm test 400: Single request is sent to the right MDT == 14:55:43 (1445246743)
            
             SKIP: sanity-hsm test_400 needs >= 2 MDTs
            Resetting fail_loc on all nodes...done.
            SKIP 400 (0s)
            [root@Bhagyesh tests]#
            

            After the patch:

            [root@Bhagyesh tests]# ONLY=400 ./sanity-hsm.sh
            Logging to shared log directory: /tmp/test_logs/1445246819
            Bhagyesh: Checking config lustre mounted on /mnt/lustre2
            Checking servers environments
            Checking clients Bhagyesh environments
            Bhagyesh: Checking config lustre mounted on /mnt/lustre
            Checking servers environments
            Checking clients Bhagyesh environments
            Using TIMEOUT=20
            disable quota as required
            osd-ldiskfs.track_declares_assert=1
            running as uid/gid/euid/egid 500/500/500/500, groups:
             [touch] [/mnt/lustre/d0_runas_test/f4943]
            excepting tests: 34 35 36
            Killing existing copytools on Bhagyesh
            Set HSM on and start
            Start copytool
            Purging archive on Bhagyesh
            Starting copytool agt1 on Bhagyesh
            Set sanity-hsm HSM policy
            
            
            == sanity-hsm test 400: Single request is sent to the right MDT == 14:57:20 (1445246840)
            
             SKIP: sanity-hsm test_400 needs >= 2 MDTs
            Resetting fail_loc on all nodes...done.
            SKIP 400 (0s)
            Copytool is stopped on Bhagyesh
            Copytool has stopped in  2s.
            mdt.lustre-MDT0000.hsm_control=shutdown
            Waiting 20 secs for update
            mdt.lustre-MDT0000.hsm_control=enabled
            == sanity-hsm test complete, duration 24 sec == 14:57:23 (1445246843)
            Stopping clients: Bhagyesh /mnt/lustre2 (opts:)
            Stopping client Bhagyesh /mnt/lustre2 opts:
            [root@Bhagyesh tests]#
            
            529964 Bhagyesh Dudhediya (Inactive) added a comment - edited

            Hello Bhagyesh,
            I have checked recent auto-test runs of sanity-hsm, and they all show full script execution with no problem, with multiple "Copytool is stopped on ..."/"Copytool has stopped in ..." messages in their output (in fact, one for each sub-test that uses the copytool_cleanup() function).
            Can you describe the problem you encountered in more detail?

            OTOH, re-reading the Bash documentation for set (https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html), I understand that when "set -e" is in use, a simple command failure can cause the shell/script to exit, even though this does not seem to happen here ... Is this what you mean?
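
            One possible reason, not confirmed in this thread, why the exit may not occur in some runs: the same Bash documentation states that when a function executes in a context where -e is being ignored (for example on the left of || or as an if condition), none of the commands in the function body are affected by -e. A minimal sketch of that rule, using a hypothetical check_stopped function rather than any code from the test framework:

```shell
# check_stopped is a hypothetical function that runs a failing command and
# then inspects $? on the next line, like the loop body in the description.
define_fn='check_stopped() { false; if [ $? -ne 0 ]; then echo "reached check"; fi; }'

# Called as a plain command: set -e aborts inside the function at `false`,
# so nothing is printed.
plain=$(bash -c "set -e; $define_fn; check_stopped; echo done" 2>&1) || true

# Called on the left of ||: bash ignores set -e for the whole function body,
# so the $? check is reached and the script continues.
guarded=$(bash -c "set -e; $define_fn; check_stopped || echo failed; echo done" 2>&1) || true

echo "plain call:   '${plain}'"
echo "guarded call: '${guarded}'"
```

            Whether the sanity-hsm functions are actually invoked in such a context depends on how the test framework calls them, which this sketch does not cover.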

            And that the

                do_nodesv $agents "pgrep -x $HSMTOOL_BASE"
                if [ $? -ne 0 ]; then

            construct should be replaced by

                if ! do_nodesv $agents "pgrep -x $HSMTOOL_BASE"; then

            to avoid any problem?
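
            Applied to a polling loop, the proposed construct might look like the sketch below. It is an illustration only: copytool_running is a hypothetical stub replacing the do_nodesv/pgrep call, and the loop counts polls instead of using SECONDS and TIMEOUT so that it runs deterministically on its own.

```shell
set -e

# Hypothetical stub standing in for: do_nodesv $agents "pgrep -x $HSMTOOL_BASE".
# It succeeds (copytool "found") for the first two polls, then fails.
polls=0
copytool_running() {
	polls=$((polls + 1))
	[ "$polls" -le 2 ]
}

# Poll until the stub reports the copytool gone, or max_polls is reached.
wait_for_stop() {
	max_polls=$1
	while [ "$polls" -lt "$max_polls" ]; do
		# The command is the if condition itself, so its non-zero
		# status is consumed by the test and set -e does not fire.
		if ! copytool_running; then
			echo "Copytool is stopped after $polls polls"
			return 0
		fi
		echo "Copytool still running"
		sleep 1
	done
	echo "Copytool failed to stop within $max_polls polls"
	return 1
}

wait_for_stop 5
```

            With set -e active, the script still reaches the "Copytool is stopped" message on the third poll, because the failing command is never executed as a standalone statement.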

            bfaccini Bruno Faccini (Inactive) added a comment

            Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/16866
            Subject: LU-7312 test: With set -e exit status in sanity-hsm was not valid
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cbfec6f59b5239c093f5dc74266f859a1eb0f54f

            gerrit Gerrit Updater added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              529964 Bhagyesh Dudhediya (Inactive)