[LU-5190] lfsck: FAIL: e2fsck returned 4, should be <= 1 Created: 13/Jun/14  Updated: 16/Jun/14  Resolved: 16/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre build: http://build.whamcloud.com/job/lustre-b2_5/63/ (2.5.2 RC1)


Issue Links:
Duplicate
duplicates LU-4169 lfsck: FAIL: e2fsck returned 4, shoul... Resolved
Severity: 3
Rank (Obsolete): 14508

 Description   

lfsck test failed as follows:

lustre-MDT0000: ********** WARNING: Filesystem still has errors **********


         751 inodes used (0.07%, out of 1048576)
          13 non-contiguous files (1.7%)
           0 non-contiguous directories (0.0%)
             # of inodes with ind/dind/tind blocks: 1/0/0
      154242 blocks used (29.42%, out of 524288)
           0 bad blocks
           1 large file

         560 regular files
         182 directories
           0 character device files
           0 block device files
           0 fifos
          10 links
           0 symbolic links (0 fast symbolic links)
           0 sockets
------------
         469 files
Memory used: 2756k/21184k (1093k/1664k), time:  4.83/ 0.11/ 0.06
I/O read: 9MB, write: 0MB, rate: 1.86MB/s
 lfsck : @@@@@@ FAIL: e2fsck -d -v -t -t -f -n --mdsdb /home/autotest/.autotest/2014-06-11/212151-69837947160380/mdsdb /dev/mapper/lvm--Role_MDS-P1 returned 4, should be <= 1 

Maloo report: https://maloo.whamcloud.com/test_logs/b31a15b8-f2fa-11e3-a3d9-52540035b04c/show_text



 Comments   
Comment by Jian Yu [ 13/Jun/14 ]

The failure occurred on all of the regression test sessions on Lustre 2.5.2 RC1:
https://maloo.whamcloud.com/test_sessions/f120d00c-f2f7-11e3-a3d9-52540035b04c
https://maloo.whamcloud.com/test_sessions/ee26bffa-f2ee-11e3-a3d9-52540035b04c
https://maloo.whamcloud.com/test_sessions/ef2b0eac-f2d4-11e3-86ca-52540035b04c
https://maloo.whamcloud.com/test_sessions/2751df84-f2cd-11e3-a3d9-52540035b04c

It's a regression in comparison with Lustre b2_5 build #61.

Comment by Peter Jones [ 13/Jun/14 ]

Fan Yong

Could you please advise on this one?

Thanks

Peter

Comment by nasf (Inactive) [ 14/Jun/14 ]

This "lfsck" failure is not the "LFSCK" that we are working on for the OpenSFS contract. From the test log, we can only say that e2fsck found some on-disk data corruption. Such corruption may have been left behind by earlier tests. I have checked the test history; when the failure occurred, the test order was:

1) sanity-lfsck.sh
2) sanityn.sh
3) sanity-hsm.sh
4) lfsck.sh

sanity-lfsck.sh reformats the system after testing, so the system must have been clean when sanityn.sh started. It is therefore quite possible that the data corruption was introduced by sanityn.sh or sanity-hsm.sh. Unfortunately, neither sanityn nor sanity-hsm can detect data corruption by itself, so both were marked as successful, and there are no logs that can be used for further analysis.
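
For reference, the "returned 4" in the failure message can be read against the exit status bitmask documented in e2fsck(8), where bit value 4 means errors were left uncorrected (expected here, since e2fsck was run with -n). The following is a minimal sketch of a decoder; the function name decode_e2fsck_rc is hypothetical and is not part of the Lustre test framework.

```shell
#!/bin/bash
# Hypothetical helper: print the meaning of each bit set in an e2fsck
# exit status, following the exit code table in the e2fsck(8) man page.
decode_e2fsck_rc() {
    local rc=$1

    if [ "$rc" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    if [ $((rc & 1)) -ne 0 ]; then echo "file system errors corrected"; fi
    if [ $((rc & 2)) -ne 0 ]; then echo "file system errors corrected, system should be rebooted"; fi
    if [ $((rc & 4)) -ne 0 ]; then echo "file system errors left uncorrected"; fi
    if [ $((rc & 8)) -ne 0 ]; then echo "operational error"; fi
    if [ $((rc & 16)) -ne 0 ]; then echo "usage or syntax error"; fi
    if [ $((rc & 32)) -ne 0 ]; then echo "e2fsck canceled by user request"; fi
    if [ $((rc & 128)) -ne 0 ]; then echo "shared library error"; fi
    return 0
}

# The value reported in this ticket:
decode_e2fsck_rc 4   # prints "file system errors left uncorrected"
```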

Comment by Oleg Drokin [ 14/Jun/14 ]

Yujian, can you please see if this is reproducible on a simple single-node cluster, and then perhaps let's try to isolate which of the following patches caused this:

LU-4852 osc: osc_extent_truncate()) ASSERTION( !ext->oe_urgent ) failed
LU-4676 hsm: Fix return value error of ct_run()
LU-4830 tests: only deactivate MDTs of Lustre FSNAME
LU-2524 test: Modify tdir to be single directory
LU-4573 tests: check all MDTs for open files
LU-4102 doc: recommend newer e2fsprogs version
LU-4780 lnet: NI shutdown may loop forever
LU-5100 llite: set dir LOV xattr length variable
LU-5133 tests: Add version check in sanity/238
LU-3386 lproc: improve osc/mdc "imports" connect data
LU-5132 tests: Add version check to sanity/160c
LU-5047 tests: correct cleanup files in sanity.sh
LU-4887 tests: sanity-scrub interoperability tests with master
LU-4569 hsm: Prevent copytool from importing existing file.
LU-2272 statahead: ll_intent_drop_lock() called in spinlock
LU-5116 ptlrpc: race at req processing

Most of those are test-only changes, though.

Comment by Jian Yu [ 14/Jun/14 ]

Sure, Oleg, will do.

Comment by Jian Yu [ 16/Jun/14 ]

Test results showed that this is a known issue on the Lustre b2_5 branch. The failure was not detected in previous builds because the Lustre filesystem was not empty when lfsck.sh ran:

if is_empty_fs $MOUNT; then
        # create test directory
        mkdir -p $TESTDIR || error "mkdir $TESTDIR failed"

        # create some dirs and files on the filesystem
        create_files $TESTDIR $NUMDIRS $NUMFILES

        # ......
else # is_empty_fs $MOUNT
        FSCK_MAX_ERR=4   # file system errors left uncorrected
        sync; sync; sleep 3 # make sure all data flush back
fi

When only lfsck.sh was run on previous builds, the same failure occurred. One of the changes in build #63 exposed the failure.
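
The masking effect of the else branch above can be sketched as follows. This is an illustration only, not code from lfsck.sh: the function name check_fsck_rc is hypothetical, and the default threshold of 1 is inferred from the "should be <= 1" text in the failure message.

```shell
#!/bin/bash
# Hypothetical helper: decide pass/fail for an e2fsck exit status,
# depending on whether the filesystem was empty when lfsck.sh started.
check_fsck_rc() {
    local rc=$1          # exit status returned by e2fsck
    local fs_empty=$2    # "yes" if the filesystem was empty, otherwise "no"
    local max_err=1      # assumed default, taken from "should be <= 1"

    if [ "$fs_empty" != "yes" ]; then
        max_err=4        # mirrors FSCK_MAX_ERR=4 in the else branch
    fi

    if [ "$rc" -le "$max_err" ]; then
        echo "PASS"
    else
        echo "FAIL: e2fsck returned $rc, should be <= $max_err"
    fi
}

check_fsck_rc 4 no    # non-empty filesystem: the error is tolerated
check_fsck_rc 4 yes   # empty filesystem: the same exit status fails
```

With a non-empty filesystem the relaxed threshold lets an exit status of 4 pass, which matches the observation that earlier builds did not report the failure.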

The focus of this ticket is to fix the real failure.

Comment by Jian Yu [ 16/Jun/14 ]

This is a duplicate of LU-4169.

Generated at Sat Feb 10 01:49:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.