Lustre / LU-11142

test-framework.sh run_e2fsck masks return code

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Upstream, Lustre 2.12.0
    • 3
    • 9223372036854775807

    Description

      run_e2fsck ... "-n" does not return a non-zero exit status when e2fsck finds filesystem errors.

      This makes the filesystem consistency checks done through run_e2fsck almost useless.

      I see e2fsck checks of this kind in many tests in the sanity, sanity-lfsck and conf-sanity scripts:

      [zam@vm1 lustre-wc-rel]$ grep -e "run_e2fsck.*-n" lustre/tests/*.sh
      lustre/tests/conf-sanity.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
      lustre/tests/conf-sanity.sh: run_e2fsck $mds1host $mds1dev "-n"
      lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n" |
      lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n"
      lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds${mds_index}) $devname -n
      lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$mdt_index) $devname -n ||
      lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$idx) $dev -n ||
      [zam@vm1 lustre-wc-rel]$
       

       
      For example, test_804 in sanity.sh:

      for idx in $(seq $MDSCOUNT); do
              dev=$(mdsdevname $idx)
              rc=0

              stop mds${idx}
              run_e2fsck $(facet_active_host mds$idx) $dev -n ||
                      rc=$?
              start mds${idx} $dev $MDS_MOUNT_OPTS ||
                      error "mount mds$idx failed"
              df $MOUNT > /dev/null 2>&1

              # e2fsck should not return error
              [ $rc -eq 0 ] ||
                      error "e2fsck detected error on MDT${idx}: rc=$rc"
      done
      
      

      The rc check in this code can never fail, because the e2fsck exit code is lost inside the run_e2fsck function: it either calls error itself or returns 0, so rc stays 0 in the caller.
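
      The masking effect can be reproduced outside the test framework. The following is a minimal sketch, not code from test-framework.sh: fake_e2fsck and masking_wrapper are hypothetical names, and the wrapper only mimics the "record the code, then return 0" shape of run_e2fsck.

```shell
#!/bin/bash
# Stand-in for e2fsck: exits 4 ("errors left uncorrected").
fake_e2fsck() { return 4; }

# Hypothetical wrapper with the same shape as run_e2fsck: it sees the
# real exit code but always returns 0 to its caller.
masking_wrapper() {
	local rc=0

	fake_e2fsck || rc=$?
	echo "wrapper saw rc=$rc"
	return 0
}

# Same calling pattern as test_804: rc is only set if the wrapper fails.
rc=0
masking_wrapper || rc=$?
echo "caller saw rc=$rc"
```

      The wrapper reports rc=4 while the caller still sees rc=0, so a following [ $rc -eq 0 ] check always passes.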

      Another example is the test for LU-2634, which covers the following problem:

      Short symlinks on MDT filesystems formatted with the "extents" feature appear to be created with the EXT4_EXTENTS_FL in osd-ldiskfs, but that shouldn't be happening. e2fsck considers this a corruption and deletes the symlink.

      The test runs e2fsck at the end:

              #umount
              umount_client $MOUNT || error "umount client failed"
              stop_mds || error "stop mds failed"
              stop_ost || error "stop ost failed"
      
              #run e2fsck
              run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
      }
      

      The intention is to check the filesystem and fail the test if it is corrupted. The test makes no attempt to parse the e2fsck output; it relies only on the exit code.

       

      However, run_e2fsck converts every exit code less than or equal to 4 (FSCK_MAX_ERR) to 0:

      # Run e2fsck on MDT or OST device.
      run_e2fsck() {
              local node=$1
              local target_dev=$2
              local extra_opts=$3
              local cmd="$E2FSCK -d -v -t -t -f $extra_opts $target_dev"
              local log=$TMP/e2fsck.log
              local rc=0

              echo $cmd
              do_node $node $cmd 2>&1 | tee $log
              rc=${PIPESTATUS[0]}
              if [ -n "$(grep "DNE mode isn't supported" $log)" ]; then
                      rm -f $log
                      if [ $MDSCOUNT -gt 1 ]; then
                              skip "DNE mode isn't supported!"
                              cleanupall
                              exit_status
                      else
                              error "It's not DNE mode."
                      fi
              fi
              rm -f $log

              [ $rc -le $FSCK_MAX_ERR ] ||
                      error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"

              return 0
      }
      
      

      The function should end with return $rc instead of return 0.
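
      A rough sketch of that change, runnable on its own, might look like the following. It is not a patch: run_e2fsck_sketch, fake_e2fsck and the do_node and error stubs are illustrative stand-ins (the real helpers in test-framework.sh may behave differently), and the log file and DNE handling are left out for brevity.

```shell
#!/bin/bash
# Sketch only: the stubs below stand in for test-framework.sh helpers.
FSCK_MAX_ERR=4

# Stub: run the command locally, ignoring the node name.
do_node() { shift; "$@"; }

# Stub: only report the problem; the real helper may behave differently.
error() { echo "ERROR: $*" >&2; }

# Stand-in for e2fsck that always exits 4 ("errors left uncorrected").
fake_e2fsck() { return 4; }
E2FSCK=fake_e2fsck

run_e2fsck_sketch() {
	local node=$1
	local target_dev=$2
	local extra_opts=$3
	local cmd="$E2FSCK -d -v -t -t -f $extra_opts $target_dev"
	local rc=0

	echo $cmd
	do_node $node $cmd 2>&1
	rc=$?

	[ $rc -le $FSCK_MAX_ERR ] ||
		error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"

	# Propagate the e2fsck exit code instead of a fixed 0.
	return $rc
}

# Same calling pattern as test_804.
rc=0
run_e2fsck_sketch localhost /dev/fake -n || rc=$?
echo "caller saw rc=$rc"
```

      With the exit code propagated, the caller's rc is 4 here, so a check such as [ $rc -eq 0 ] in the test can actually fail.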

      FYI, e2fsck exit codes are:

              The exit code returned by e2fsck is the sum of the following conditions:
                  0    - No errors
                  1    - File system errors corrected
                  2    - File system errors corrected, system should
                         be rebooted
                  4    - File system errors left uncorrected
                  8    - Operational error
                  16   - Usage or syntax error
                  32   - E2fsck canceled by user request
                  128  - Shared library error
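
      Because the exit code is a sum of these flags, a single value can carry several conditions at once. A small helper along the following lines could decode it for log messages; decode_e2fsck_rc is a hypothetical name, not an existing function in test-framework.sh, and the sketch relies on bash arrays.

```shell
#!/bin/bash
# Hypothetical helper: turn an e2fsck exit code into the list of
# conditions from the table above.
decode_e2fsck_rc() {
	local rc=$1
	local bits=(1 2 4 8 16 32 128)
	local names=("errors corrected" "reboot needed"
		     "errors left uncorrected" "operational error"
		     "usage or syntax error" "canceled by user"
		     "shared library error")
	local i
	local out=()

	if [ "$rc" -eq 0 ]; then
		echo "no errors"
		return 0
	fi

	# Collect the name of every flag that is set in rc.
	for i in "${!bits[@]}"; do
		if (( rc & bits[i] )); then
			out+=("${names[i]}")
		fi
	done

	# Join the collected names with commas.
	local IFS=','
	echo "${out[*]}"
}

decode_e2fsck_rc 4
decode_e2fsck_rc 12
```

      Note that exit code 4, the value that matters for the read-only checks discussed here, sits exactly at the FSCK_MAX_ERR threshold and is therefore accepted by run_e2fsck.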

       

      A workaround is to set FSCK_MAX_ERR to 0 before calling run_e2fsck, but none of the tests do this. Alternatively, the variable may be set globally in the Maloo setup, in which case the default setting should be changed.
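
      To illustrate the workaround, the sketch below models only the threshold comparison from run_e2fsck. check_fsck_rc is a hypothetical stand-in, and lowering the threshold inside a subshell is just one possible way to keep the override local to a single call.

```shell
#!/bin/bash
FSCK_MAX_ERR=4

# Hypothetical stand-in for run_e2fsck: only the threshold check is modeled.
check_fsck_rc() {
	local rc=$1

	if [ "$rc" -gt "$FSCK_MAX_ERR" ]; then
		echo "fsck returned $rc, should be <= $FSCK_MAX_ERR"
		return 1
	fi
	return 0
}

# Default threshold: exit code 4 is silently accepted.
check_fsck_rc 4 && echo "default threshold: rc=4 accepted"

# Threshold lowered for one call only, inside a subshell.
( FSCK_MAX_ERR=0; check_fsck_rc 4 ) || echo "threshold 0: rc=4 rejected"

# The global value is unchanged afterwards.
echo "FSCK_MAX_ERR is still $FSCK_MAX_ERR"
```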

       

            People

              zam Alexander Zarochentsev