[LU-7312] sanity-hsm: $? verification is not valid with set -e Created: 19/Oct/15  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Bhagyesh Dudhediya (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Not a Bug Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-5622 copytool_cleanup function should chec... Resolved
Severity: 3
Project: HSM
Rank (Obsolete): 9223372036854775807

 Description   

The following part of the fix for LU-5622:

while (( SECONDS < end_wait )); do
        sleep 2
        do_nodesv $agents "pgrep -x $HSMTOOL_BASE"
        if [ $? -ne 0 ]; then
                echo "Copytool is stopped on $agents"
                break
        fi
        echo "Copytool still running on $agents"
done
if do_nodesv $agents "pgrep -x $HSMTOOL_BASE"; then
        error "Copytool failed to stop in ${TIMEOUT}s ..."
else
        echo "Copytool has stopped in " \
             "$((TIMEOUT - (end_wait - SECONDS)))s."
fi

causes a failure in sanity-hsm.
sanity-hsm.sh sets the -e option, so a command that returns non-zero can terminate the script before the separate $? check on the next line is reached.
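
The difference between the two ways of checking the exit status can be reproduced outside the test suite. The sketch below is only an illustration: no_match is a hypothetical stand-in for a "pgrep" that finds no process, and each pattern runs in its own bash process so that set -e starts from a clean state.

```shell
#!/bin/bash
# Pattern 1: run the command, then test $? on the next line, under set -e.
# no_match is a hypothetical stand-in for a pgrep that finds nothing.
separate_check='
set -e
no_match() { return 1; }
no_match
if [ $? -ne 0 ]; then
	echo "stopped"
fi
echo "end reached"
'

# Pattern 2: test the command directly in the "if" condition.
direct_check='
set -e
no_match() { return 1; }
if ! no_match; then
	echo "stopped"
fi
echo "end reached"
'

# Run each pattern in a fresh bash process and record output and status.
rc_separate=0
out_separate=$(bash -c "$separate_check") || rc_separate=$?
rc_direct=0
out_direct=$(bash -c "$direct_check") || rc_direct=$?

echo "separate \$? check: rc=$rc_separate output='$out_separate'"
echo "direct if check:   rc=$rc_direct output='$out_direct'"
```

With the separate check, the child shell exits with status 1 as soon as no_match fails and prints nothing. With the direct check it prints "stopped" and "end reached" and exits with status 0.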



 Comments   
Comment by Gerrit Updater [ 19/Oct/15 ]

Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/16866
Subject: LU-7312 test: With set -e exit status in sanity-hsm was not valid
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cbfec6f59b5239c093f5dc74266f859a1eb0f54f

Comment by Bruno Faccini (Inactive) [ 19/Oct/15 ]

Hello Bhagyesh,
I have checked recent auto-test runs of sanity-hsm, and they all show the script running to completion without problems, with multiple "Copytool is stopped on ..."/"Copytool has stopped in ..." messages in their output (in fact, one for each sub-test that uses the copytool_cleanup() function).
Can you describe the problem you encountered in more detail?

OTOH, re-reading the Bash "set" documentation (https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html), I understand that when "set -e" is in use, the failure of a simple command can cause the shell/script to exit, even though this does not seem to happen in practice. Is this what you mean?

And, therefore, that the:

                do_nodesv $agents "pgrep -x $HSMTOOL_BASE"
                if [ $? -ne 0 ]; then

construct should be replaced by:

                if ! do_nodesv $agents "pgrep -x $HSMTOOL_BASE"; then

to avoid any problem?
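
As a minimal sketch of the proposed form inside a polling loop: copytool_running below is a hypothetical stand-in for do_nodesv $agents "pgrep -x $HSMTOOL_BASE", and wait_secs and the 1-second sleep are illustrative values, not the ones used by sanity-hsm.sh.

```shell
#!/bin/bash
set -e

# Hypothetical stand-in for: do_nodesv $agents "pgrep -x $HSMTOOL_BASE".
# It reports a running process for the first two polls, then none.
polls=0
copytool_running() {
	polls=$((polls + 1))
	[ "$polls" -le 2 ]
}

# Poll until the process is gone or the deadline passes.
wait_secs=10
end_wait=$((SECONDS + wait_secs))
stopped=no
while (( SECONDS < end_wait )); do
	# The command is the "if" condition itself, so a non-zero
	# status is consumed by "if" and does not trigger set -e.
	if ! copytool_running; then
		stopped=yes
		break
	fi
	echo "still running (poll $polls)"
	sleep 1
done
echo "stopped=$stopped after $polls polls"
```

Because the status is consumed by the "if" itself, the loop behaves the same whether or not set -e is in effect.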

Comment by Bhagyesh Dudhediya (Inactive) [ 19/Oct/15 ]

Yes, IMO the replacement should be made as you describe.
Initially, when I ran, say, test_400, the script would exit because of the exit status check.
Before the patch, the check

 if [ $? -ne 0 ]; then 

leads to exit.

[root@Bhagyesh tests]# ONLY=400 ./sanity-hsm.sh
Logging to shared log directory: /tmp/test_logs/1445246722
Starting client Bhagyesh:  -o user_xattr,flock Bhagyesh@tcp:/lustre /mnt/lustre2
Started clients Bhagyesh:
Bhagyesh@tcp:/lustre on /mnt/lustre2 type lustre (rw,user_xattr,flock)
Bhagyesh: Checking config lustre mounted on /mnt/lustre
Checking servers environments
Checking clients Bhagyesh environments
Using TIMEOUT=20
disable quota as required
osd-ldiskfs.track_declares_assert=1
running as uid/gid/euid/egid 500/500/500/500, groups:
 [touch] [/mnt/lustre/d0_runas_test/f3594]
excepting tests: 34 35 36
Killing existing copytools on Bhagyesh
Set HSM on and start
Start copytool
Purging archive on Bhagyesh
Starting copytool agt1 on Bhagyesh
Set sanity-hsm HSM policy


== sanity-hsm test 400: Single request is sent to the right MDT == 14:55:43 (1445246743)

 SKIP: sanity-hsm test_400 needs >= 2 MDTs
Resetting fail_loc on all nodes...done.
SKIP 400 (0s)
[root@Bhagyesh tests]#

After the patch :

[root@Bhagyesh tests]# ONLY=400 ./sanity-hsm.sh
Logging to shared log directory: /tmp/test_logs/1445246819
Bhagyesh: Checking config lustre mounted on /mnt/lustre2
Checking servers environments
Checking clients Bhagyesh environments
Bhagyesh: Checking config lustre mounted on /mnt/lustre
Checking servers environments
Checking clients Bhagyesh environments
Using TIMEOUT=20
disable quota as required
osd-ldiskfs.track_declares_assert=1
running as uid/gid/euid/egid 500/500/500/500, groups:
 [touch] [/mnt/lustre/d0_runas_test/f4943]
excepting tests: 34 35 36
Killing existing copytools on Bhagyesh
Set HSM on and start
Start copytool
Purging archive on Bhagyesh
Starting copytool agt1 on Bhagyesh
Set sanity-hsm HSM policy


== sanity-hsm test 400: Single request is sent to the right MDT == 14:57:20 (1445246840)

 SKIP: sanity-hsm test_400 needs >= 2 MDTs
Resetting fail_loc on all nodes...done.
SKIP 400 (0s)
Copytool is stopped on Bhagyesh
Copytool has stopped in  2s.
mdt.lustre-MDT0000.hsm_control=shutdown
Waiting 20 secs for update
mdt.lustre-MDT0000.hsm_control=enabled
== sanity-hsm test complete, duration 24 sec == 14:57:23 (1445246843)
Stopping clients: Bhagyesh /mnt/lustre2 (opts:)
Stopping client Bhagyesh /mnt/lustre2 opts:
[root@Bhagyesh tests]#

Comment by jacques-charles lafoucriere [ 20/Oct/15 ]

There seem to be other places in sanity-hsm.sh and sanity.sh with the same exit status check (and maybe in other test scripts as well).
Why do we need set -e? (It is in all the test scripts.)
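
Whether such a check actually aborts a script also depends on how the enclosing function is called: per the Bash manual, set -e is ignored inside a function that is invoked as part of an "if" condition or an && / || list. This is one possible reason, not confirmed in this ticket, why the same pattern fails in some runs and not in others. The sketch below uses hypothetical names (no_match, cleanup_demo) and runs each case in its own bash process.

```shell
#!/bin/bash
# cleanup_demo is a hypothetical function using the "command, then $?" pattern.
script_body='
set -e
no_match() { return 1; }
cleanup_demo() {
	no_match
	if [ $? -ne 0 ]; then
		echo "stopped"
	fi
}
'

# Called as a plain command: set -e is active, so the shell exits at no_match.
out_plain=$(bash -c "$script_body"'cleanup_demo; echo "done"') || true

# Called on the left of "||": set -e is ignored inside the function body.
out_guarded=$(bash -c "$script_body"'cleanup_demo || true; echo "done"')

# Capturing the status explicitly does not depend on the calling context.
out_captured=$(bash -c 'set -e
no_match() { return 1; }
rc=0
no_match || rc=$?
echo "rc=$rc"')

echo "plain call:    '$out_plain'"
echo "guarded call:  '$out_guarded'"
echo "captured call: '$out_captured'"
```

The plain call produces no output, the guarded call prints "stopped" and "done", and the explicit capture prints "rc=1".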

Generated at Sat Feb 10 02:07:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.