Lustre / LU-11142

test-framework.sh run_e2fsck masks return code

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Upstream, Lustre 2.12.0
    • Fix Version/s: None
    • Labels:
      None
    • Severity:
      3

      Description

      run_e2fsck ... ... "-n" does not return a non-zero exit status when filesystem errors are found.

      This makes the filesystem consistency checks done through run_e2fsck almost useless.

      I see e2fsck checks in many tests in the sanity and conf-sanity scripts:

      [zam@vm1 lustre-wc-rel]$ grep -e "run_e2fsck.*-n" lustre/tests/*.sh
      lustre/tests/conf-sanity.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
      lustre/tests/conf-sanity.sh: run_e2fsck $mds1host $mds1dev "-n"
      lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n" |
      lustre/tests/sanity-lfsck.sh: run_e2fsck $(facet_active_host $SINGLEMDS) $(mdsdevname 1) "-n"
      lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds${mds_index}) $devname -n
      lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$mdt_index) $devname -n ||
      lustre/tests/sanity.sh: run_e2fsck $(facet_active_host mds$idx) $dev -n ||
      [zam@vm1 lustre-wc-rel]$
       

       
      For example, test_804 in sanity.sh:

      for idx in $(seq $MDSCOUNT); do
              dev=$(mdsdevname $idx)
              rc=0

              stop mds${idx}
              run_e2fsck $(facet_active_host mds$idx) $dev -n ||
                      rc=$?
              start mds${idx} $dev $MDS_MOUNT_OPTS ||
                      error "mount mds$idx failed"
              df $MOUNT > /dev/null 2>&1

              # e2fsck should not return error
              [ $rc -eq 0 ] ||
                      error "e2fsck detected error on MDT${idx}: rc=$rc"
      done
      
      

      The [ $rc -eq 0 ] check in this code never triggers, because the e2fsck exit code is lost inside the run_e2fsck function.

      Another example is the test for LU-2634, which is about:

      Short symlinks on MDT filesystems formatted with the "extents" feature appear to be created with the EXT4_EXTENTS_FL in osd-ldiskfs, but that shouldn't be happening. e2fsck considers this a corruption and deletes the symlink.

      The test runs e2fsck at the end:

              #umount
              umount_client $MOUNT || error "umount client failed"
              stop_mds || error "stop mds failed"
              stop_ost || error "stop ost failed"
      
              #run e2fsck
              run_e2fsck $(facet_active_host $SINGLEMDS) $mdsdev "-n"
      }
      

      The intention is to check the filesystem and fail the test if it is corrupted. The test makes no attempt to parse the fsck output; it relies only on the exit code.

       

      However, run_e2fsck converts all exit codes less than or equal to 4 (FSCK_MAX_ERR) to 0:

      # Run e2fsck on MDT or OST device.
      run_e2fsck() {
              local node=$1
              local target_dev=$2
              local extra_opts=$3
              local cmd="$E2FSCK -d -v -t -t -f $extra_opts $target_dev"
              local log=$TMP/e2fsck.log
              local rc=0

              echo $cmd
              do_node $node $cmd 2>&1 | tee $log
              rc=${PIPESTATUS[0]}
              if [ -n "$(grep "DNE mode isn't supported" $log)" ]; then
                      rm -f $log
                      if [ $MDSCOUNT -gt 1 ]; then
                              skip "DNE mode isn't supported!"
                              cleanupall
                              exit_status
                      else
                              error "It's not DNE mode."
                      fi
              fi
              rm -f $log

              [ $rc -le $FSCK_MAX_ERR ] ||
                      error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"

              return 0
      }
      
      

      The function should end with return $rc instead of return 0.
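The proposed change can be tried outside the test framework with a reduced sketch. Everything below other than the body of the function is an assumption made for illustration: do_node and error are local stubs (the real ones live in test-framework.sh), FAKE_FSCK_RC is a hypothetical variable that simulates the e2fsck exit code, the function is renamed run_e2fsck_fixed, and the DNE check is left out for brevity.

```shell
#!/bin/bash
# Hypothetical stand-ins so the sketch runs without test-framework.sh.
FSCK_MAX_ERR=${FSCK_MAX_ERR:-4}
TMP=${TMP:-/tmp}
E2FSCK=${E2FSCK:-e2fsck}
FAKE_FSCK_RC=${FAKE_FSCK_RC:-0}	# simulated e2fsck exit code

# Stub: prints the command instead of running it on a node.
do_node() { echo "(stub) $*"; return $FAKE_FSCK_RC; }
# Stub: reports the message and exits with status 1.
error() { echo "ERROR: $*" >&2; exit 1; }

run_e2fsck_fixed() {
	local node=$1
	local target_dev=$2
	local extra_opts=$3
	local cmd="$E2FSCK -d -v -t -t -f $extra_opts $target_dev"
	local log=$TMP/e2fsck.log
	local rc=0

	echo $cmd
	do_node $node $cmd 2>&1 | tee $log
	rc=${PIPESTATUS[0]}
	rm -f $log

	[ $rc -le $FSCK_MAX_ERR ] ||
		error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"

	# Propagate the e2fsck exit code instead of masking it with 0.
	return $rc
}
```

With this variant, a caller written like test_804 (run_e2fsck ... || rc=$?) would see rc=4 when the simulated e2fsck reports uncorrected errors. Callers that invoke run_e2fsck without checking its status would need to be reviewed separately, since the function's return value would now become the status of that line.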

      FYI, e2fsck exit codes are:

              The exit code returned by e2fsck is the sum of the following conditions:
                  0    - No errors
                  1    - File system errors corrected
                  2    - File system errors corrected, system should
                         be rebooted
                  4    - File system errors left uncorrected
                  8    - Operational error
                  16   - Usage or syntax error
                  32   - E2fsck canceled by user request
                  128  - Shared library error
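Because the exit code is a sum of the conditions above, it can be decoded bit by bit. The helper below is a small sketch of that decoding; describe_e2fsck_rc is a hypothetical name and is not part of test-framework.sh.

```shell
#!/bin/bash
# Decode an e2fsck exit code into the conditions from the table above.
# describe_e2fsck_rc is a hypothetical helper used only for illustration.
describe_e2fsck_rc() {
	local rc=$1
	local out=""
	local bit

	if [ "$rc" -eq 0 ]; then
		echo "no errors"
		return 0
	fi

	# Test each documented bit and collect the matching descriptions.
	for bit in 1 2 4 8 16 32 128; do
		(( rc & bit )) || continue
		case $bit in
		1)   out+="errors corrected, " ;;
		2)   out+="errors corrected (reboot needed), " ;;
		4)   out+="errors left uncorrected, " ;;
		8)   out+="operational error, " ;;
		16)  out+="usage or syntax error, " ;;
		32)  out+="canceled by user request, " ;;
		128) out+="shared library error, " ;;
		esac
	done

	# Strip the trailing separator.
	echo "${out%, }"
}
```

For example, describe_e2fsck_rc 12 reports both "errors left uncorrected" and "operational error", which shows why a single threshold such as FSCK_MAX_ERR cannot distinguish between the individual conditions.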

       

      A workaround is to set FSCK_MAX_ERR to 0 before calling run_e2fsck, but none of the tests do this. Alternatively, the variable may be set globally in the Maloo setup, in which case the default setting should be changed.
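The effect of the workaround can be seen with a reduced model of the threshold check at the end of run_e2fsck. In this sketch, fsck_threshold_check and the error stub are hypothetical stand-ins, and the default of 4 for FSCK_MAX_ERR is taken from the description above.

```shell
#!/bin/bash
FSCK_MAX_ERR=4	# default assumed from the description above

# Stub: reports the message and exits with status 1.
error() { echo "ERROR: $*" >&2; exit 1; }

# Reduced model of the final check in run_e2fsck (hypothetical helper).
fsck_threshold_check() {
	local rc=$1

	[ $rc -le $FSCK_MAX_ERR ] ||
		error "e2fsck returned $rc, should be <= $FSCK_MAX_ERR"
	return 0
}

# Default threshold: exit code 4 (errors left uncorrected) is masked.
( fsck_threshold_check 4 ) &&
	echo "rc=4 masked with FSCK_MAX_ERR=4"

# Workaround: with FSCK_MAX_ERR=0 the same exit code triggers error().
( FSCK_MAX_ERR=0; fsck_threshold_check 4 ) 2>/dev/null ||
	echo "rc=4 detected with FSCK_MAX_ERR=0"
```

Each call runs in a subshell so that the exit inside the error stub does not terminate the calling script.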

       

              People

              • Assignee:
                zam Alexander Zarochentsev
                Reporter:
                zam Alexander Zarochentsev