[LU-10461] sanity 77c failure due to variables called in cleanup routine are out of scope Created: 05/Jan/18  Updated: 02/Apr/19  Resolved: 09/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: tests

Issue Links:
Related
is related to LU-9474 tests: register traps in a correct order Resolved
is related to LU-12146 remove "skip_env" for FSTYPE=zfs Open
is related to LU-11010 Remove call to ‘return’ after call to... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In sanity test 77c, we call cleanup_77c() at the end of the test to cleanup at the end of the test and a trap is set to call the clean up function when the test is exited. cleanup_77c() uses variables from test_77c which are declared local. From sanity.sh:

6375 cleanup_77c() {
6376         trap 0
6377         set_checksums 0
6378         $LCTL set_param osc.*osc-[^mM]*.checksum_dump=0
6379         $check_ost &&
6380                 do_facet ost1 $LCTL set_param obdfilter.*-OST*.checksum_dump=0
6381         [ -n $osc_file_prefix ] && rm -f ${osc_file_prefix}*
6382         $check_ost && [ -n $ost_file_prefix ] &&
6383                 do_facet ost1 rm -f ${ost_file_prefix}\*
6384 }
6385 
6386 test_77c() {
6387         [ $PARALLEL == "yes" ] && skip "skip parallel run" && return
6388         $GSS && skip "could not run with gss" && return
6389 
6390         local bad1
6391         local osc_file_prefix
6392         local osc_file
6393         local check_ost=false
6394         local ost_file_prefix
6395         local ost_file
6396         local orig_cksum
6397         local dump_cksum
6398         local fid
6399 
…
6417         osc_file_prefix=$($LCTL get_param -n debug_path)
6418         osc_file_prefix=${osc_file_prefix}-checksum_dump-osc-\\${fid}
6419 
6420         trap cleanup_77c EXIT
…

When the test exits normally and cleanup_77c() is called at the end of the test, there is no issue; the variables from the test are still in scope.

There is a problem when we exit test_77c when we encounter an error or exit the test before calling cleanup_77c. In these cases, the trap specified in line 6420 is invoked and cleanup_77c() is called. Yet, the variables that are local to test_77c are no longer visible to the cleanup routine.

One fix for this is to increase the scope of the variables, remove “local”, and this should fix the problem. What are other solutions?

Logs for this test failing are at:
https://testing.hpdd.intel.com/test_sets/9dbb758a-e767-11e7-8027-52540065bddc
https://testing.hpdd.intel.com/test_sets/6958de42-f002-11e7-854b-52540065bddc

I suspect there are more examples of this issue in our test suites.



 Comments   
Comment by Bruno Faccini (Inactive) [ 05/Jan/18 ]

Hmm, I am sure to be far to know all bash tricky behaviors but I remember that for particular sanity/test_77c case I made some testing to verify the scope of local vs exit/trap context like following :

[root@eagle-29 test_trap_with_local]# cat foo.sh
#!/bin/bash

footrap() {
trap 0
echo "foo=" $foo
}

fooerror() {
exit 1
}

foofunc() {
local foo
trap footrap EXIT
foo=foo

#exit
fooerror
}

foofunc

[root@eagle-29 test_trap_with_local]# bash foo.sh
foo= foo
[root@eagle-29 test_trap_with_local]# 
Comment by James Nunez (Inactive) [ 05/Jan/18 ]

Bruno - I agree that your example works and the variable scope in the original test code worked. Patch https://review.whamcloud.com/#/c/30653/ is adding a "return 0" after the call to skip so we exit the test code early at:

	$check_ost || skip "No need to check cksum dump on OSS"

This changes the behavior of the test to:

#!/usr/bin/bash

footrap() {
trap 0
echo "foo=" $foo
}

fooerror() {
exit 1
}

foofunc() {
local foo
trap footrap EXIT

foo=foo

# exit early
[ $foo == 'foo' ] && return 0

fooerror
}

foofunc

Perhaps I should word the description/title of this ticket differently.

Comment by Bruno Faccini (Inactive) [ 06/Jan/18 ]

Ok, I better understand now how the local can become out-of-scope! And may be I have also been wrong originally to make the assumption that skip() wad ending with an "exit 0", like is error() doing with an "exit 1".

So, may be we should not "return 0" after calling skip(), as done now in https://review.whamcloud.com/#/c/[30402,30653]/ but instead "exit 0" !! What do you think?

Comment by Andreas Dilger [ 10/Jan/18 ]

Quentin and I discussed this same issue in patch https://review.whamcloud.com/29653 "LU-9474 tests: add the stack_trap() utility function", and the conclusion was that the local variables would remain in scope for the exit trap.

I'm not sure if stack_trap itself avoids the local issue, or that by using stack_trap we can just avoid having complex cleanup functions that reference local variables.

Comment by Quentin Bouget [ 10/Jan/18 ]

James is right in that on "normal exit" (with return, rather than exit), local variables are wiped before the trap is run (I did not know until now).

I can see two ways out of this:

  • use exit 0 instead of return 0 (in the skip function);
  • use stack_trap() to evaluate variables at registration.

For example in test_77c, that would give something like :

test_77c() {
    [...]
    stack_trap "set_checksums 0" EXIT
    set_checksums 1
    [...]
    stack_trap "$LCTL set_param osc.*osc-[^mM]*.checksum_dump=0" EXIT
    $LCTL set_param osc.*osc-[^mM]*.checksum_dump=1
    [...]
    if $check_ost; then
        stack_trap "do_facet ost1 \
                    $LCTL set_param obdfilter.*-OST*.checkum_dump=0" EXIT
        do_facet ost1 $LCTL set_param obdfilter.*-OST*.checksum_dump=1
    fi
    [...]
    # I am not really sure where the last two traps should go so this needs checking
    [ -n "$osc_file_prefix" ] && stack_trap "rm -f '$osc_file_prefix'*" EXIT
    dd if=$DIR/$tfile of=/dev/null bs=1M || error "dd read error: $?"
    [...]
    $check_ost || skip "No need to check cksum dump on OSS"
    [ -n "$ost_file_prefix" ] &&
        stack_trap "do_facet ost1 \"rm -f '$ost_file_prefix'*\"" EXIT
}

Note that until this patch lands, using single quotes inside traps does not work as intended for stack_trap().

Comment by Bruno Faccini (Inactive) [ 10/Jan/18 ]

Andreas, since Quentin is also ok with my previous fix idea to "exit 0" in/after skip(), may be we can just go for it as it looks pretty simple. What do you think?

Comment by Quentin Bouget [ 10/Jan/18 ]

For the record, why isn't there a skip_noexit() and why isn't skip() exiting the test itself?

Comment by James Nunez (Inactive) [ 10/Jan/18 ]

I agree that we are being inconsistent with error() calling exit and skip() not calling exit. Let me look into adding exit to skip and if we need a skip_noexit(). Thank you all for the comments on this issue.

Comment by Gerrit Updater [ 21/Jan/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30964
Subject: LU-10461 tests: exit in the skip routine
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e85b9f63926d775db60b9af7f1be3039d766b121

Comment by Gerrit Updater [ 09/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30964/
Subject: LU-10461 tests: call exit in the skip routine
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 33aad7829de56353cca5bbd082d95d6821a7be9c

Comment by Peter Jones [ 09/Apr/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 10/Apr/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/31942
Subject: Revert "LU-10461 tests: call exit in the skip routine"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b210d5d2db976070b46b90867d9dc275bb329f3f

Comment by Gerrit Updater [ 11/Jan/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34015
Subject: LU-10461 tests: call exit in the skip routine
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 3b92d212abebcaf1c955b03ba4f0ecfcc125e901

Comment by Gerrit Updater [ 14/Jan/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34023
Subject: LU-10461 tests: call exit in the skip routine
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: f1b695016446fa0d7169c2caf2149f0593c2abd6

Comment by Gerrit Updater [ 19/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34023/
Subject: LU-10461 tests: call exit in the skip routine
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 8c6308656bb4de6b661ab60226a45ec8ae7729fe

Generated at Sat Feb 10 02:35:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.