[LU-2897] the lbuild "trap 'xxxx' ERR" must be unset before exit Created: 01/Mar/13  Updated: 24/Apr/15  Resolved: 24/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Bruce Korb (Inactive) Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6983

 Description   

If not cleared, the backtrace and bug reporting function gets invoked.
Since calls to fatal() imply an error was detected, it is not
an "uncaught error" but a well-known error.

Also a dinkleberry version of "usage()" was left in lbuild.



 Comments   
Comment by Bruce Korb (Inactive) [ 01/Mar/13 ]

Um, wrong web site. Sorry.

Comment by Christopher Morrone [ 02/Mar/13 ]

Even so, you are not wrong. Even just running "lbuild -h" trips the trap, dumps backtrace, and tries to send email to whamcloud. So I would say this site is a fine place to discuss and work on the problem.

I'm not sure which fatal() issue you are seeing, but I fixed one in a commit on github. I have a patch in gerrit to relocate the lbuild files, so I waiting to submit other lbuild changes to gerrit until that one lands.

That fix addresses a bug that makes fatal() throw an exception any time a message is not given as an option to fatal(). Perhaps that is the problem you are seeing?

That commit and a couple of other lbuild cleanup changes are on my LU-1199 work-in-progress branch here:

https://github.com/chaos/lustre/commits/LU-1199-wip

Comment by Bruce Korb (Inactive) [ 02/Mar/13 ]

In that case, a patch is on its way. I've fixed it in the Xyratex sources and presumed that it was me and didn't test the original. (I had made another change to make the configuring of lbuild comprehensible.)

Given the utter incomprehensibility of lbuild, I'm a bit surprised that it hasn't been seen before.

Anyway, the fix is to have fatal() disarm the uncaught error trap: trap '' ERR
and presto! no problem. And, yes, it is triggered by: [ -n '' ]
resulting in a command failure.

The fix I am preparing will also fix inconsistencies between the help text and the command line parsing, remove help text for the two no-op options and document the "extra" dozen configurables for which there are no command line options. I also create this config file template:

># lbuild configuration file (template - rename to '.lbuildrc')

># command line option: -d ROOT | --root-dir=ROOT
>#
># Specifies the SCM root directory to use when extracting files.
># ${SCMROOT:${CVSROOT:${GITROOT}}} is used if this option is not
># present.
>#
># SCMROOT=${SCMROOT:${CVSROOT:${GITROOT}}}

># command line option: -l LOCALDIR | --localdir=LOCALDIR
>#
># Specifies local directory with lustre sources.
>#
># LOCALDIR=

># command line option: --external-patches=PATCHES
>#
># Directory similar to lustre/lustre/kernel_patches/ that lbuild should
># look for seres and config files in before looking in the lustre tree.
>#
># EXTERNAL_PATCHES=

># command line option: --extraversion=EXTRA
>#
># Text to use for the rpm release and kernel extraversion.
>#
># EXTRA_VERSION=

># command line option: --timestamp=TIMESTAMP
>#
># Date of building lustre in format YYYYMMDDhhmmss
>#
># TIMESTAMP=

># command line option: --reuserpm=DIR
>#
># Try to reuse old kernel RPMs from DIR
>#
># REUSERPM=

># command line option: --reusebuild=DIR
>#
># Try to reuse old kernel builds from DIR
>#
># REUSEBUILD=

># command line option: --kernelrpm=DIR
>#
># Path to distro kernel RPM collection
>#
># KERNELRPMSBASE=

># command line option: --ccache
>#
># Use ccache
>#
># CCACHE=

># command line option: --norpm
>#
># Do not build RPMs (compile only mode)
>#
># NORPM=false

># command line option: --patchless
>#
># Build lustre client only
>#
># PATCHLESS=false

># command line option: --distro=DISTRO
>#
># Which distro using. Autodetect by default
>#
># DISTRO=

># command line option: --kernelurl=URL
>#
># URL for obtaining Linux source tarballs referenced by target files.
>#
># KERNELURL='http://downloads.lustre.org/public/kernels/$target/old'

># command line option: --kerneldir=KERNELDIR
>#
># Directory containing Linux source tarballs referenced by target
># files.
>#
># KERNELDIR=

># command line option: --kerneltree=KERNELTREE
>#
># Directory containing dirs with Linux source tarballs referenced by
># target files. Dir names in format kernel version ('2.6.9', etc.)
>#
># KERNELTREE=

># command line option: --linux=LINUX
>#
># Directory of Linux kernel sources. When this option is used, only
># Lustre modules and userspace are built.
>#
># LINUX=

># command line option: --lustre=LUSTRE
>#
># Path to an existing lustre source tarball to use instead of
># pulling from a source code archive.
>#
># LUSTRE=

># command line option: --nodownload
>#
># Do not try to download a kernel
>#
># DOWNLOAD=true

># command line option: --noiokit
>#
># Do not build lustre-iokit RPM. Build by default
>#
># IOKITRPM=true

># command line option: --release
>#
># Specifies that the files generated do not include timestamps,
># and that this is an official release.
>#
># RELEASE=false

># command line option: --src
>#
># Build a .src.rpm, a full kernel patch, and a patched kernel tarball.
># some recent hacking has pretty much neutered this option.
># search through this file (and lbuild.old_school – but that will
># be going away soon) for "-bb" and see how many places
># simply don't account for this option
>#
># DO_SRC=true

># command line option: --work=DIR
>#
># Directory to store work files during build process.
># The current directory by default.
>#
># WORKDIR=$PWD

># command line option: --stage=DIR
>#
># Directory used to stage packages for release. RPMs will be placed more
># or less in DIR/<target>-<arch>, and the tarball will be placed in DIR.
>#
># STAGEDIR=$PWD

># command line option: --tag=TAG
>#
># An SCM branch/tag/reference specification name to build from.
># If 'TAG' begins with 'refs/' or 'HEAD:refs/', then it must be
># a GIT ref spec.
>#
># TAG=

># command line option: --target=TARGET
>#
># The name of the target to build. The available targets are listed
># below.
>#
># TARGET=

># command line option: --target-arch=HOST_ARCH
>#
># Force a local architecture to value
>#
># TARGET_ARCH=$(uname -m)

># command line option: --target-archs=TARGET_ARCHS
>#
># A (space delimited) list of architectures to build. By default, all
># of the archs supported by the TARGET will be built, in addition to a
># .src.rpm. This option can limit those, for machines that can only
># build certain archs or if you only want a certain arch built (for
># testing, or a one-off kernel).
>#
># Also note that by using a non-base arch (eg, i386) only kernels
># will be built - there will be no lustre-lite-utils package.
>#
># TARGET_ARCHS=${TARGET_ARCH}

># command line option: --disable-datestamp
>#
># Prevents the datestamp flag (-D) from being passed to cvs for
># checkouts. This is a workaround for a problem encountered when
># using lbuild with tinderbox.
>#
># USE_DATESTAMP=1

># command line option: --xen
>#
># Builds a Xen domX kernel.
>#
># XEN=false

># command line option: --kvm
>#
># Builds a kernel with additional KVM features.
>#
># KVM=false

># command line option: -D DATE | --date=DATE
>#
># Overrides the current date.
>#
># DATE=$(date)

># command line option: --ofed-version=VER
>#
># Version for the Open Fabrics EDition ??
>#
># OFED_VERSION=

># other configurables:
>#
># Top build directory – the directory that contains "lbuild".
># TOPDIR=$(cd $(dirname $0) >/dev/null && pwd)
>#
># Some distros have many names for essentially the same target.
># # used for staging directory names and download RPM selection
># CANONICAL_TARGET=
>#
># change default behavior to only build for the current arch
># TARGET_ARCHS_ALL="$TARGET_ARCH"
># [ "$TARGET_ARCH" = "i686" ] && TARGET_ARCHS_ALL="i686 i586 i386"
>#
># Directory for finding old build products
># REUSEDIR="${WORKDIR}"
>#
># Temporary directory
># TMPDIR=${TMPDIR:-"/var/tmp"}
>#
># should cached products be used or force rebuilding?
># USE_BUILD_CACHE=true
>#
># Pattern used to decide when to skip the diskfs RPM
># SKIPLDISKFSRPM="v1_4_* b1_4"
>#
># linux kernel "object" (executable)
># LINUXOBJ=
>#
># default to not adding lustre into the kernel RPM package names
># KERNEL_LUSTRE_NAMING=false
>#
># default not use kabi check.
># USE_KABI=false
>#
># patchless build
># RPMSMPTYPE=
>#
># names of patch series which must be available
># SERIES=
>#
># Extra architectures, from target file
># BASE_ARCHS=
># BIGMEM_ARCHS=
># BOOT_ARCHS=
># JENSEN_ARCHS=
># SMP_ARCHS=
># BIGSMP_ARCHS=
># PSERIES64_ARCHS=
># UP_ARCHS=
>#
># build the lustre-tests rpm?
># LUSTRE_TESTS=true
>#
># CC=${CC:-gcc} # C-compiler

Comment by Christopher Morrone [ 02/Mar/13 ]

I don't think you need to disable the trap. They just need to stop using comparison (&&) as control flow in that function. Unless, perhaps your bug is different then mine.

I can't argue with your characterization of lbuild.

I think it has gone this long without being fixed because those who do the most work on Lustre outside of Intel probably don't use lbuild. At LLNL we put together a whole Linux distro based on RHEL, which means that we build our own kernel, our own IB drivers, etc. The lbuild view of the world, i.e. Lustre owns and builds everything, just doesn't jive with a world view where Lustre is just one component of an overall system.

I am only working on lbuild at all because I am overhauling the spec files in LU-1199, and I can't get any of my work landed until I make lbuild work with it and make the Intel build farm happy.

Comment by Bruce Korb (Inactive) [ 02/Mar/13 ]

No, you don't need to disable the trap, as long as you are careful to not have any commands "accidentally" finishing with a non-zero exit code. Still, if you have entered a function that is fail-exiting due to a discovered issue (beit just a request for "help" or misuse of options), it makes logical sense to me to disable the trap.

I'm futzing with the thing because Xyratex has a setup similar to Intel in having automated builds get kicked off upon review submissions. You'll be surprised to learn that we've had to adapt it. Much of the adaptation we plan on pushing back to WC/Intel since the changes ought to work there as well.

Comment by Bruce Korb (Inactive) [ 04/Mar/13 ]

Yes, you do need to disable the trap. lbuild is full of constructs like: str_val=$(myfunc ...)
If "myfunc" can fail, you will ALWAYS get a back trace, along with the error message.
The correct fix is to eliminate that construct and return strings in variables. Not today.
The short term fix is to set up a SIGUSR1 signal handler that disables the ERR trap. e.g.:
trap 'trap "" ERR ; sleep 1 ; exit 1' USR1
and then "kill -USR1 $$" from the subshells. That will disable the "unhandled error" trap and still have the parent process exit non-zero on subshell errors. Uglier than all get-out, but:

cleanup() {
set -x +e +E

# cleanup of uncaught error trap
#
trap '' ERR TERM

# force cleanup of uncaught error trap in parent process
#
local v=${BASH_VERSION%%.*}
if test ${v:-0} -gt 3
then v=${BASHPID}
else v=$(bash -c 'echo $PPID')
fi
test ${progpid} -eq ${v} || kill -USR1 ${progpid}
}

Comment by Brian Murrell (Inactive) [ 04/Mar/13 ]

Yes, you do need to disable the trap.

No you don't. You need to do error checking.

lbuild is full of constructs like: str_val=$(myfunc ...)

Which should be error checked if myfunc() can return an error. If they it is not, it is an unexpected error and a stacktrace should be generated.

If "myfunc" can fail, you will ALWAYS get a back trace, along with the error message.

Not if you trap (i.e. error check) the return. If myfunc() can fail then the caller must check it's status in order to avoid the stacktracing error function. That the caller did not check the error return is exactly what that function is supposed to call attention to.

So yes:

str_val=$(myfunc ...)

will cause the trap to trigger if myfunc() returns an error, however:

if ! str_val=$(myfunc ...); then

Will not cause a stacktrace because the error condition is checked.

$ myfunc(){
> echo "return value"
> false
> }
$ trap 'echo "trapped an error"' ERR
$ false
trapped an error
$ str_val=$(myfunc)
trapped an error
$ if ! str_val=$(myfunc); then
> echo "no error trap"
> fi
no error trap
Comment by Bruce Korb (Inactive) [ 05/Mar/13 ]

str_val=$(myfunc) || exit 1

I think we both know how to write code that does not trap on errors.
The question is not so much, "Can we make it work?" but rather
"What constructs are least prone to mistakes?" Given that the
lbuild code is liberally laced with val=$(myfunc) stuff that
trigger stack traces and email to home whenever someone's environment
causes "myfunc()" to fail, I think the "fail()" function should
pretty clearly take the responsibility for:

1. report the error (as it does now)
2. disarm the "you have an error that I don't know about" trap

the reasons being several fold:

1. "fatal()" is reporting the error, so it cannot be considered "unknown"
2. When "fatal()" is called from the parent process level, no stack
trace is triggered, but it is triggered from subshells.
This is a discrepancy that does not have a good reason.
3. fixing all the places where subshells can exit non-zero is
laborious and error prone.

The real solution is to not use subshells, except when genuinely needed.
Return text results should be in myfunc_result type variables.
That won't happen today. That is even more laborious and error prone,
unless you are starting a new project. Subshells ought to be constrained
to mostly co-processes and invoking a simple command where you need
to capture stdout. Not today.

Comment by Christopher Morrone [ 31/May/13 ]

My changes to fix the error() function and to remove the code to send email on any trapped error landed in commit 05403115a680adf51990bc9439d78947e8a7beba, "LU-1199 lbuild: Fix error handling".

Bruce, I would not suspect that there is something special about subshells that makes fatal() trigger a trap. I think the problem was that there was a bug in error() that caused a trap any time fatal() was called without a message. My commit fixes that.

While I'm not necessarily opposed to turning off the trap in fatal(), I don't think there is any need to do so now.

Furthermore, if I understand you correctly, you are complaining that "val=$(myfunc)", where "myfunc" decides to call fatal(), results in a trap and stack dump. I don't think that can be fixed by disabling the trap inside of fatal(). The subshell called fatal(), so the trap is only disabled in the subshell. When the parent shell sees the error code that is unhandled, it will still have its trap enabled.

I agree that the use of subshells is less than optimal in lbuild. But as you say, that requires more of a rewrite than folks probably want to do at the moment.

So unless I have missed something, I would say that the issue stated in the title of this ticket no longer applies, and this ticket can be closed.

Comment by Bruce Korb (Inactive) [ 31/May/13 ]

fatal exits non-zero, triggering the ERR trap.

I tend to make submissions and not reverify their applicability after new checkins.
So I have not looked at your "Fix error handling" fix.

> Furthermore, if I understand you correctly, you are complaining that "val=$(myfunc)",
> where "myfunc" decides to call fatal(), results in a trap and stack dump.
> I don't think that can be fixed by disabling the trap inside of fatal().

Hmm. Child to parent communication is a nuisance. So there needs to be some sort of
temp file lock. Create the thing and remove it when reporting an error and when
exiting cleanly. If it exists inside the error trap, THEN create the stack dump.
Probably best to leave off automated email, instead of fixing it and leaving it
off makes the stack trace suppression less crucial, but still should be done.
The "correct" way to do it is to put all the coding into a function and:

    exit_flag_file=$(mktemp ${TMPDIR:-/tmp}/err-exit-XXXXXX)
    unknown_err_handler() {
      test -f ${exit_flag_file} || return 0
      rm -f ${exit_flag_file}
      <<stack trace>>
    }
    trap 'srcfile=${BASH_SOURCE} srcline=${BASH_LINENO] unknown_err_handler' ERR

That eliminates the need to mess with this:

        if [ $n = 1 ]; then
            let lineno-=11
        fi

In any event, there were other fixes in my patch, too. viz. corrections in usage text
and consistency in handling various options and environment variable parameter
passing. (Never ever use one name as a long option name and a completely
different name as the environment variable version, and usage text really ought to
match what gets parsed.)

Comment by Christopher Morrone [ 31/May/13 ]

fatal exits non-zero, triggering the ERR trap.

I'm sorry, but I'm pretty sure you have that wrong. For example:

$cat exit.sh
#!/bin/bash

hello () {
  echo hello
}

trap hello ERR

fatal () {
  exit 1
}

fatal
$cat return.sh
#!/bin/bash

hello () {
  echo hello
}

trap hello ERR

fatal () {
  return 1
}

fatal
$./exit.sh
$./return.sh
hello
$

As you can see, the ERR trap is only triggered when a simple command has a non-zero exit status (see bash man page for more details). "exit" is not a "simple command", it is a special command that terminates the entire process. exit has no return code within the script, because it never returns.

Comment by Bruce Korb (Inactive) [ 02/Jun/13 ]
#! /bin/bash

typeset -r prog=$(basename "$0" .sh)
typeset -r program=$(basename "$0")
typeset -r progdir=$(\cd $(dirname "$0") && pwd -P)
typeset -r progpid=$$
typeset -r fflag=$(mktemp ${TMPDIR:-/tmp}/${prog}-fail-XXXXXX)

die() {
    test -f $fflag || return 1
    rm -f $fflag
    echo "${prog} error:  $*" >&2
    kill -TERM ${progpid}
    exit 1
}

bad_copy() {
    echo BROKEN 1>&2
    rm -f $fflag
    exit 1
}

trap 'die "trap on ERR"' ERR

set -e

txt=$(bad_copy)

echo WE SHOULD HAVE FAILED
rm -f $fflag

and when run:

$ bash fail.sh 
BROKEN
$ echo $?
1

I presume we were misunderstanding each other since I am guessing we both know how to write shell scripts.

Comment by Christopher Morrone [ 03/Jun/13 ]

I'd argue that you're on the path to violating the KISS principle. But frankly, I don't care that much about lbuild. I'm going to drop off this ticket.

Comment by Andreas Dilger [ 24/Apr/15 ]

No more interested parties for this issue.

Generated at Sat Feb 10 01:29:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.