[LU-11142] test-framework.sh run_e2fsck masks return code Created: 11/Jul/18  Updated: 21/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Upstream, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Unresolved Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-11155 hidden failures of sanity 804 test Open
is related to LU-4651 run e2fsck after every test script Open
is related to LU-11485 MDS allows "lfs setstripe" to mark la... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

run_e2fsck ... ... "-n" does not return a non-zero exit status when file system errors are found.

This makes the file system consistency checks done through run_e2fsck almost useless.

I see e2fsck checks in many tests in sanity and conf-sanity scripts:

[zam@vm1 lustre-wc-rel]$ grep -e "run_e2fsck.*-n" lustre/tests/*.sh
lustre/tests/conf-sanity.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
lustre/tests/conf-sanity.sh: run_e2fsck $mds1host $mds1dev "-n"
lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n" |
lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n"
lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds${mds_index}) $devname -n
lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$mdt_index) $devname -n ||
lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$idx) $dev -n ||
[zam@vm1 lustre-wc-rel]$
 

 
For example, test_804 in sanity.sh:

        for idx in $(seq $MDSCOUNT); do
                dev=$(mdsdevname $idx)
                rc=0

                stop mds${idx}
                run_e2fsck $(facet_active_host mds$idx) $dev -n ||
                        rc=$?
                start mds${idx} $dev $MDS_MOUNT_OPTS ||
                        error "mount mds$idx failed"
                df $MOUNT > /dev/null 2>&1

                # e2fsck should not return error
                [ $rc -eq 0 ] ||
                        error "e2fsck detected error on MDT${idx}: rc=$rc"
        done

This check can never fail, because the e2fsck exit code is lost inside the run_e2fsck function.

Another example is the test for LU-2634, which is described as:

Short symlinks on MDT filesystems formatted with the "extents" feature appear to be created with the EXT4_EXTENTS_FL in osd-ldiskfs, but that shouldn't be happening. e2fsck considers this a corruption and deletes the symlink.

The test runs e2fsck at the end:

        #umount
        umount_client $MOUNT || error "umount client failed"
        stop_mds || error "stop mds failed"
        stop_ost || error "stop ost failed"

        #run e2fsck
        run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
}

The intention is to check the file system and fail the test if it is corrupted. The test makes no attempt to parse the e2fsck output; it relies only on the exit code.

 

However, run_e2fsck converts all exit codes less than or equal to 4 (FSCK_MAX_ERR) to 0:

# Run e2fsck on MDT or OST device.
run_e2fsck() {
        local node=$1
        local target_dev=$2
        local extra_opts=$3
        local cmd="$E2FSCK -d -v -t -t -f $extra_opts $target_dev"
        local log=$TMP/e2fsck.log
        local rc=0

        echo $cmd
        do_node $node $cmd 2>&1 | tee $log
        rc=${PIPESTATUS[0]}
        if [ -n "$(grep "DNE mode isn't supported" $log)" ]; then
                rm -f $log
                if [ $MDSCOUNT -gt 1 ]; then
                        skip "DNE mode isn't supported!"
                        cleanupall
                        exit_status
                else
                        error "It's not DNE mode."
                fi
        fi
        rm -f $log

        [ $rc -le $FSCK_MAX_ERR ] ||
                error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"

        return 0
}

The function should end with "return $rc" instead of "return 0".
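The effect of the final return statement can be reproduced outside the test framework. The following is only a minimal sketch: fake_e2fsck, run_masked and run_propagating are hypothetical stand-ins (not functions from test-framework.sh), and the framework's error() is replaced by a plain message on stderr.

```shell
#!/bin/bash
# Hypothetical stand-in for e2fsck: exits with the status passed as $1.
fake_e2fsck() {
        return "$1"
}

FSCK_MAX_ERR=4

# Mirrors the current run_e2fsck ending: the status is checked against
# FSCK_MAX_ERR and then discarded by the unconditional "return 0".
run_masked() {
        local rc=0
        fake_e2fsck "$1" || rc=$?
        [ $rc -le $FSCK_MAX_ERR ] ||
                echo "returned $rc, should be <= $FSCK_MAX_ERR" >&2
        return 0
}

# Same check, but the status is handed back to the caller.
run_propagating() {
        local rc=0
        fake_e2fsck "$1" || rc=$?
        [ $rc -le $FSCK_MAX_ERR ] ||
                echo "returned $rc, should be <= $FSCK_MAX_ERR" >&2
        return $rc
}

# Simulate e2fsck exiting with 4 (errors left uncorrected).
masked_rc=0
run_masked 4 || masked_rc=$?

propagated_rc=0
run_propagating 4 || propagated_rc=$?

echo "masked=$masked_rc propagated=$propagated_rc"
```

With the simulated status 4, the caller of run_masked sees 0, so a check such as the one in test_804 cannot trigger; the caller of run_propagating sees 4.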

FYI, e2fsck exit codes are:

        The exit code returned by e2fsck is the sum of the following conditions:
            0    - No errors
            1    - File system errors corrected
            2    - File system errors corrected, system should
                   be rebooted
            4    - File system errors left uncorrected
            8    - Operational error
            16   - Usage or syntax error
            32   - E2fsck canceled by user request
            128  - Shared library error
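Because the exit code is a sum of flags, a caller that does receive it can test individual bits. A minimal sketch, assuming bash; decode_e2fsck_rc is a hypothetical helper that does not exist in test-framework.sh, and the short labels are abbreviations of the list above:

```shell
#!/bin/bash
# Hypothetical helper: prints a comma-separated description of the
# conditions encoded in an e2fsck exit status given as $1.
decode_e2fsck_rc() {
        local rc=$1
        local msgs=()

        if [ "$rc" -eq 0 ]; then
                echo "no errors"
                return 0
        fi

        # Each condition is one bit of the exit status.
        (( rc & 1 ))   && msgs+=("errors corrected")
        (( rc & 2 ))   && msgs+=("errors corrected, reboot needed")
        (( rc & 4 ))   && msgs+=("errors left uncorrected")
        (( rc & 8 ))   && msgs+=("operational error")
        (( rc & 16 ))  && msgs+=("usage or syntax error")
        (( rc & 32 ))  && msgs+=("canceled by user request")
        (( rc & 128 )) && msgs+=("shared library error")

        # Join the collected messages with commas.
        local IFS=,
        echo "${msgs[*]}"
}

decode_e2fsck_rc 4
decode_e2fsck_rc 12
```

In this sketch, status 4 decodes to a single condition and status 12 to two (4 + 8). A read-only "-n" run that finds corruption is expected to set the 4 bit, which is exactly the value that falls under the default FSCK_MAX_ERR.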

 

A workaround is to set FSCK_MAX_ERR to 0 before calling run_e2fsck, but none of the tests do that. Or, if the variable is set globally in the Maloo setup, the default setting should be changed.
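The workaround can be illustrated with stubs. This is a sketch under stated assumptions: run_e2fsck_stub and the error() below are simplified, hypothetical replacements for the framework functions, and the default of 4 for FSCK_MAX_ERR is taken from the description above.

```shell
#!/bin/bash
FSCK_MAX_ERR=4  # default threshold, as described in this ticket

# Simplified stand-in for the framework's error(): print the message
# and abort the current (sub)shell.
error() {
        echo "ERROR: $*" >&2
        exit 1
}

# Stub with the same threshold check and final "return 0" as
# run_e2fsck; $1 is the pretend e2fsck exit status.
run_e2fsck_stub() {
        local rc=$1
        [ $rc -le $FSCK_MAX_ERR ] ||
                error "e2fsck returned $rc, should be <= $FSCK_MAX_ERR"
        return 0
}

# Default threshold: status 4 passes silently.
default_rc=0
( run_e2fsck_stub 4 ) || default_rc=$?

# Threshold lowered to 0 for this call only: status 4 now hits error().
strict_rc=0
( FSCK_MAX_ERR=0; run_e2fsck_stub 4 ) 2>/dev/null || strict_rc=$?

echo "default=$default_rc strict=$strict_rc"
```

The subshell keeps the lowered threshold local to a single call, so other callers that rely on the default are not affected.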

 



 Comments   
Comment by Gerrit Updater [ 11/Jul/18 ]

Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/32807
Subject: LU-11142 tests: run_e2fsck masks return code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b9bae9d1c7375256214a7c791c97380ce4e90c37

Comment by Alexander Zarochentsev [ 11/Jul/18 ]

I added the https://review.whamcloud.com/32807 patch as an attempt to fix the problem and to catch all regressions where the run_e2fsck exit code is non-zero and should be ignored. There are at least two test cases where e2fsck is used to move unconnected inodes to /lost+found.

Comment by Alexander Zarochentsev [ 18/Jul/18 ]

I filed LU-11155 for sanity test 804 failures.
