[LU-11142] test-framework.sh run_e2fsck masks return code Created: 11/Jul/18 Updated: 21/Jan/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream, Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
run_e2fsck ... ... "-n" does not return a non-zero exit status when file system errors are found, which makes the consistency checks done through run_e2fsck almost useless. e2fsck checks appear in many tests in the sanity and conf-sanity scripts:
[zam@vm1 lustre-wc-rel]$ grep -e "run_e2fsck.*-n" lustre/tests/*.sh
lustre/tests/conf-sanity.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
lustre/tests/conf-sanity.sh: run_e2fsck $mds1host $mds1dev "-n"
lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n" |
lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n"
lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds${mds_index}) $devname -n
lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$mdt_index) $devname -n ||
lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$idx) $dev -n ||
[zam@vm1 lustre-wc-rel]$
for idx in $(seq $MDSCOUNT); do
	dev=$(mdsdevname $idx)
	rc=0
	stop mds${idx}
	run_e2fsck $(facet_active_host mds$idx) $dev -n || rc=$?
	start mds${idx} $dev $MDS_MOUNT_OPTS ||
		error "mount mds$idx failed"
	df $MOUNT > /dev/null 2>&1
	# e2fsck should not return error
	[ $rc -eq 0 ] ||
		error "e2fsck detected error on MDT${idx}: rc=$rc"
done
This code will never fail, because the e2fsck exit code is lost in the run_e2fsck function.
Another example is a test that runs e2fsck at the end:
#umount
umount_client $MOUNT || error "umount client failed"
stop_mds || error "stop mds failed"
stop_ost || error "stop ost failed"
#run e2fsck
run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
}
The intention is to check the file system and fail the test if it is corrupted. There is no attempt to parse the fsck output; only the exit code is checked.
However, run_e2fsck converts all exit codes less than or equal to 4 (FSCK_MAX_ERR) to 0:
# Run e2fsck on MDT or OST device.
run_e2fsck() {
	local node=$1
	local target_dev=$2
	local extra_opts=$3
	local cmd="$E2FSCK -d -v -t -t -f $extra_opts $target_dev"
	local log=$TMP/e2fsck.log
	local rc=0

	echo $cmd
	do_node $node $cmd 2>&1 | tee $log
	rc=${PIPESTATUS[0]}
	if [ -n "$(grep "DNE mode isn't supported" $log)" ]; then
		rm -f $log
		if [ $MDSCOUNT -gt 1 ]; then
			skip "DNE mode isn't supported!"
			cleanupall
			exit_status
		else
			error "It's not DNE mode."
		fi
	fi
	rm -f $log

	[ $rc -le $FSCK_MAX_ERR ] ||
		error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"

	return 0
}
The function should end with return $rc instead of return 0.
FYI, the e2fsck exit codes are as follows. The exit code returned by e2fsck is the sum of the following conditions:
0 - No errors
1 - File system errors corrected
2 - File system errors corrected, system should
be rebooted
4 - File system errors left uncorrected
8 - Operational error
16 - Usage or syntax error
32 - E2fsck canceled by user request
128 - Shared library error
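Because the values are summed, the exit status is effectively a bit mask; for example, 12 means 4 + 8. The following is only a sketch of how such a status could be decoded. decode_e2fsck_rc is a hypothetical helper name and is not part of test-framework.sh.

```shell
# Hypothetical helper (not in test-framework.sh): decode an e2fsck exit
# status into the conditions from the list above.
decode_e2fsck_rc() {
	local rc=$1
	local out=""

	if [ "$rc" -eq 0 ]; then
		echo "no errors"
		return 0
	fi
	# Each condition is a separate bit in the summed exit status.
	[ $((rc & 1)) -ne 0 ] && out="${out}errors corrected; "
	[ $((rc & 2)) -ne 0 ] && out="${out}errors corrected, reboot needed; "
	[ $((rc & 4)) -ne 0 ] && out="${out}errors left uncorrected; "
	[ $((rc & 8)) -ne 0 ] && out="${out}operational error; "
	[ $((rc & 16)) -ne 0 ] && out="${out}usage or syntax error; "
	[ $((rc & 32)) -ne 0 ] && out="${out}canceled by user; "
	[ $((rc & 128)) -ne 0 ] && out="${out}shared library error; "
	# Strip the trailing separator before printing.
	echo "${out%; }"
}

decode_e2fsck_rc 4
decode_e2fsck_rc 12
```

With a threshold of 4, the status for "errors left uncorrected" passes the [ $rc -le $FSCK_MAX_ERR ] check, which is why the caller never sees it.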
A workaround is to set FSCK_MAX_ERR to 0 before calling run_e2fsck, but none of the tests do this. Alternatively, the variable may be set globally in the Maloo setup ... in which case the default setting should be changed.
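To make the difference concrete, here is a small stand-alone sketch of the masking, the proposed return $rc change, and the FSCK_MAX_ERR=0 workaround. Everything in it is a stand-in for illustration: fake_e2fsck pretends to be an e2fsck run that exits with 4, error is assumed to abort the test (modelled with exit 1, so it is called inside a subshell), and the two run_e2fsck_* functions keep only the tail of the real function.

```shell
# Stand-ins for illustration only; these are not the real
# test-framework.sh implementations.
FSCK_MAX_ERR=4

# Pretends to be an e2fsck run that leaves errors uncorrected (status 4).
fake_e2fsck() {
	return 4
}

# Assumption: error() aborts the caller, modelled here with exit 1.
error() {
	echo "ERROR: $*" >&2
	exit 1
}

# Current behaviour: any status <= FSCK_MAX_ERR is turned into 0.
run_e2fsck_masked() {
	local rc=0

	fake_e2fsck || rc=$?
	[ $rc -le $FSCK_MAX_ERR ] ||
		error "e2fsck returned $rc, should be <= $FSCK_MAX_ERR"
	return 0
}

# Proposed behaviour: same threshold check, but the real status is returned.
run_e2fsck_fixed() {
	local rc=0

	fake_e2fsck || rc=$?
	[ $rc -le $FSCK_MAX_ERR ] ||
		error "e2fsck returned $rc, should be <= $FSCK_MAX_ERR"
	return $rc
}

rc=0; run_e2fsck_masked || rc=$?
echo "masked: rc=$rc"

rc=0; run_e2fsck_fixed || rc=$?
echo "fixed: rc=$rc"

# Workaround with the unmodified function: tighten the threshold so that
# error() fires. The subshell keeps exit 1 from ending this script.
rc=0; ( FSCK_MAX_ERR=0; run_e2fsck_masked ) || rc=$?
echo "masked with FSCK_MAX_ERR=0: rc=$rc"
```

In this sketch only the fixed variant lets a caller written as run_e2fsck ... || rc=$? observe the status 4; the workaround fails through error() instead, so the caller cannot distinguish individual exit codes.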
|
| Comments |
| Comment by Gerrit Updater [ 11/Jul/18 ] |
|
Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/32807 |
| Comment by Alexander Zarochentsev [ 11/Jul/18 ] |
|
I uploaded patch https://review.whamcloud.com/32807 as an attempt to fix the problem and to catch all cases where a non-zero run_e2fsck exit code is expected and should be ignored. There are at least two test cases where e2fsck is used to move unconnected inodes to /lost+found. |
| Comment by Alexander Zarochentsev [ 18/Jul/18 ] |
|
I filed LU-11155 for sanity test 804 failures. |