[LU-2897] the lbuild "trap 'xxxx' ERR" must be unset before exit Created: 01/Mar/13 Updated: 24/Apr/15 Resolved: 24/Apr/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Bruce Korb (Inactive) | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6983 |
| Description |
|
If not cleared, the backtrace and bug reporting function gets invoked. Also a dinkleberry version of "usage()" was left in lbuild. |
| Comments |
| Comment by Bruce Korb (Inactive) [ 01/Mar/13 ] |
|
Um, wrong web site. Sorry. |
| Comment by Christopher Morrone [ 02/Mar/13 ] |
|
Even so, you are not wrong. Even just running "lbuild -h" trips the trap, dumps backtrace, and tries to send email to whamcloud. So I would say this site is a fine place to discuss and work on the problem. I'm not sure which fatal() issue you are seeing, but I fixed one in a commit on github. I have a patch in gerrit to relocate the lbuild files, so I waiting to submit other lbuild changes to gerrit until that one lands. That fix addresses a bug that makes fatal() throw an exception any time a message is not given as an option to fatal(). Perhaps that is the problem you are seeing? That commit and a couple of other lbuild cleanup changes are on my |
| Comment by Bruce Korb (Inactive) [ 02/Mar/13 ] |
|
In that case, a patch is on its way. I've fixed it in the Xyratex sources and presumed that it was me and didn't test the original. (I had made another change to make the configuring of lbuild comprehensible.) Given the utter incomprehensibility of lbuild, I'm a bit surprised that it hasn't been seen before. Anyway, the fix is to have fatal() disarm the uncaught error trap: trap '' ERR The fix I am preparing will also fix inconsistencies between the help text and the command line parsing, remove help text for the two no-op options and document the "extra" dozen configurables for which there are no command line options. I also create this config file template: ># lbuild configuration file (template - rename to '.lbuildrc') ># command line option: -d ROOT | --root-dir=ROOT ># command line option: -l LOCALDIR | --localdir=LOCALDIR ># command line option: --external-patches=PATCHES ># command line option: --extraversion=EXTRA ># command line option: --timestamp=TIMESTAMP ># command line option: --reuserpm=DIR ># command line option: --reusebuild=DIR ># command line option: --kernelrpm=DIR ># command line option: --ccache ># command line option: --norpm ># command line option: --patchless ># command line option: --distro=DISTRO ># command line option: --kernelurl=URL ># command line option: --kerneldir=KERNELDIR ># command line option: --kerneltree=KERNELTREE ># command line option: --linux=LINUX ># command line option: --lustre=LUSTRE ># command line option: --nodownload ># command line option: --noiokit ># command line option: --release ># command line option: --src ># command line option: --work=DIR ># command line option: --stage=DIR ># command line option: --tag=TAG ># command line option: --target=TARGET ># command line option: --target-arch=HOST_ARCH ># command line option: --target-archs=TARGET_ARCHS ># command line option: --disable-datestamp ># command line option: --xen ># command line option: --kvm ># command line option: -D DATE | --date=DATE ># command line option: --ofed-version=VER ># other configurables: |
| Comment by Christopher Morrone [ 02/Mar/13 ] |
|
I don't think you need to disable the trap. They just need to stop using comparison (&&) as control flow in that function. Unless, perhaps your bug is different then mine. I can't argue with your characterization of lbuild. I think it has gone this long without being fixed because those who do the most work on Lustre outside of Intel probably don't use lbuild. At LLNL we put together a whole Linux distro based on RHEL, which means that we build our own kernel, our own IB drivers, etc. The lbuild view of the world, i.e. Lustre owns and builds everything, just doesn't jive with a world view where Lustre is just one component of an overall system. I am only working on lbuild at all because I am overhauling the spec files in |
| Comment by Bruce Korb (Inactive) [ 02/Mar/13 ] |
|
No, you don't need to disable the trap, as long as you are careful to not have any commands "accidentally" finishing with a non-zero exit code. Still, if you have entered a function that is fail-exiting due to a discovered issue (beit just a request for "help" or misuse of options), it makes logical sense to me to disable the trap. I'm futzing with the thing because Xyratex has a setup similar to Intel in having automated builds get kicked off upon review submissions. You'll be surprised to learn that we've had to adapt it. Much of the adaptation we plan on pushing back to WC/Intel since the changes ought to work there as well. |
| Comment by Bruce Korb (Inactive) [ 04/Mar/13 ] |
|
Yes, you do need to disable the trap. lbuild is full of constructs like: str_val=$(myfunc ...) cleanup() { # cleanup of uncaught error trap # force cleanup of uncaught error trap in parent process |
| Comment by Brian Murrell (Inactive) [ 04/Mar/13 ] |
No you don't. You need to do error checking.
Which should be error checked if myfunc() can return an error. If they it is not, it is an unexpected error and a stacktrace should be generated.
Not if you trap (i.e. error check) the return. If myfunc() can fail then the caller must check it's status in order to avoid the stacktracing error function. That the caller did not check the error return is exactly what that function is supposed to call attention to. So yes: str_val=$(myfunc ...) will cause the trap to trigger if myfunc() returns an error, however: if ! str_val=$(myfunc ...); then
Will not cause a stacktrace because the error condition is checked. $ myfunc(){
> echo "return value"
> false
> }
$ trap 'echo "trapped an error"' ERR
$ false
trapped an error
$ str_val=$(myfunc)
trapped an error
$ if ! str_val=$(myfunc); then
> echo "no error trap"
> fi
no error trap
|
| Comment by Bruce Korb (Inactive) [ 05/Mar/13 ] |
|
str_val=$(myfunc) || exit 1 I think we both know how to write code that does not trap on errors. 1. report the error (as it does now) the reasons being several fold: 1. "fatal()" is reporting the error, so it cannot be considered "unknown" The real solution is to not use subshells, except when genuinely needed. |
| Comment by Christopher Morrone [ 31/May/13 ] |
|
My changes to fix the error() function and to remove the code to send email on any trapped error landed in commit 05403115a680adf51990bc9439d78947e8a7beba, " Bruce, I would not suspect that there is something special about subshells that makes fatal() trigger a trap. I think the problem was that there was a bug in error() that caused a trap any time fatal() was called without a message. My commit fixes that. While I'm not necessarily opposed to turning off the trap in fatal(), I don't think there is any need to do so now. Furthermore, if I understand you correctly, you are complaining that "val=$(myfunc)", where "myfunc" decides to call fatal(), results in a trap and stack dump. I don't think that can be fixed by disabling the trap inside of fatal(). The subshell called fatal(), so the trap is only disabled in the subshell. When the parent shell sees the error code that is unhandled, it will still have its trap enabled. I agree that the use of subshells is less than optimal in lbuild. But as you say, that requires more of a rewrite than folks probably want to do at the moment. So unless I have missed something, I would say that the issue stated in the title of this ticket no longer applies, and this ticket can be closed. |
| Comment by Bruce Korb (Inactive) [ 31/May/13 ] |
|
fatal exits non-zero, triggering the ERR trap. I tend to make submissions and not reverify their applicability after new checkins. > Furthermore, if I understand you correctly, you are complaining that "val=$(myfunc)", Hmm. Child to parent communication is a nuisance. So there needs to be some sort of exit_flag_file=$(mktemp ${TMPDIR:-/tmp}/err-exit-XXXXXX)
unknown_err_handler() {
test -f ${exit_flag_file} || return 0
rm -f ${exit_flag_file}
<<stack trace>>
}
trap 'srcfile=${BASH_SOURCE} srcline=${BASH_LINENO] unknown_err_handler' ERR
That eliminates the need to mess with this: if [ $n = 1 ]; then
let lineno-=11
fi
In any event, there were other fixes in my patch, too. viz. corrections in usage text |
| Comment by Christopher Morrone [ 31/May/13 ] |
I'm sorry, but I'm pretty sure you have that wrong. For example: $cat exit.sh
#!/bin/bash
hello () {
echo hello
}
trap hello ERR
fatal () {
exit 1
}
fatal
$cat return.sh
#!/bin/bash
hello () {
echo hello
}
trap hello ERR
fatal () {
return 1
}
fatal
$./exit.sh
$./return.sh
hello
$
As you can see, the ERR trap is only triggered when a simple command has a non-zero exit status (see bash man page for more details). "exit" is not a "simple command", it is a special command that terminates the entire process. exit has no return code within the script, because it never returns. |
| Comment by Bruce Korb (Inactive) [ 02/Jun/13 ] |
#! /bin/bash
typeset -r prog=$(basename "$0" .sh)
typeset -r program=$(basename "$0")
typeset -r progdir=$(\cd $(dirname "$0") && pwd -P)
typeset -r progpid=$$
typeset -r fflag=$(mktemp ${TMPDIR:-/tmp}/${prog}-fail-XXXXXX)
die() {
test -f $fflag || return 1
rm -f $fflag
echo "${prog} error: $*" >&2
kill -TERM ${progpid}
exit 1
}
bad_copy() {
echo BROKEN 1>&2
rm -f $fflag
exit 1
}
trap 'die "trap on ERR"' ERR
set -e
txt=$(bad_copy)
echo WE SHOULD HAVE FAILED
rm -f $fflag
and when run: $ bash fail.sh BROKEN $ echo $? 1 I presume we were misunderstanding each other since I am guessing we both know how to write shell scripts. |
| Comment by Christopher Morrone [ 03/Jun/13 ] |
|
I'd argue that you're on the path to violating the KISS principle. But frankly, I don't care that much about lbuild. I'm going to drop off this ticket. |
| Comment by Andreas Dilger [ 24/Apr/15 ] |
|
No more interested parties for this issue. |