[LU-5190] lfsck: FAIL: e2fsck returned 4, should be <= 1 Created: 13/Jun/14 Updated: 16/Jun/14 Resolved: 16/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jian Yu | Assignee: | Jian Yu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre build: http://build.whamcloud.com/job/lustre-b2_5/63/ (2.5.2 RC1) |
||
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 14508 |
| Description |
|
lfsck test failed as follows:
lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
751 inodes used (0.07%, out of 1048576)
13 non-contiguous files (1.7%)
0 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 1/0/0
154242 blocks used (29.42%, out of 524288)
0 bad blocks
1 large file
560 regular files
182 directories
0 character device files
0 block device files
0 fifos
10 links
0 symbolic links (0 fast symbolic links)
0 sockets
------------
469 files
Memory used: 2756k/21184k (1093k/1664k), time: 4.83/ 0.11/ 0.06
I/O read: 9MB, write: 0MB, rate: 1.86MB/s
lfsck : @@@@@@ FAIL: e2fsck -d -v -t -t -f -n --mdsdb /home/autotest/.autotest/2014-06-11/212151-69837947160380/mdsdb /dev/mapper/lvm--Role_MDS-P1 returned 4, should be <= 1
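The exit status comes from e2fsck itself: 0 means clean, 1 means errors were corrected, 2 means errors were corrected and a reboot is needed, 4 means errors were left uncorrected, and 8 is an operational error, so a status of 4 against a threshold of 1 correctly fails the check. A minimal sketch of that kind of check, using a hypothetical helper rather than the actual test-framework code:

    # Sketch only (hypothetical helper, not the real lustre test-framework code):
    # run e2fsck read-only against the MDT device and compare its exit status
    # with the allowed maximum, mirroring the check that produced the FAIL above.
    run_e2fsck_check() {
        local mdsdb=$1 dev=$2 max_err=${3:-1}
        e2fsck -d -v -t -t -f -n --mdsdb "$mdsdb" "$dev"
        local rc=$?
        # e2fsck: 0=clean, 1=errors corrected, 2=corrected+reboot needed,
        # 4=errors left uncorrected, 8=operational error
        if [ $rc -gt $max_err ]; then
            echo "FAIL: e2fsck returned $rc, should be <= $max_err"
            return 1
        fi
        return 0
    }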
Maloo report: https://maloo.whamcloud.com/test_logs/b31a15b8-f2fa-11e3-a3d9-52540035b04c/show_text |
| Comments |
| Comment by Jian Yu [ 13/Jun/14 ] |
|
The failure occurred in all of the regression test sessions on Lustre 2.5.2 RC1. It is a regression compared with Lustre b2_5 build #61. |
| Comment by Peter Jones [ 13/Jun/14 ] |
|
Fan Yong, could you please advise on this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 14/Jun/14 ] |
|
This "lfsck" failure is not the "LFSCK" that we are working on for OpenSFS contract. From the test log, we only can say that the e2fsck found some on-disk data corruption. Such data corruption may be left from former other tests. I have checked the test history, when the failure occurred, the tests order was: 1) sanity-lfsck.sh For sanity-lfsck.sh, it will reformat the system after the testing, so when sanityn.sh started, the system must be clean. So it is quite possible that the data corruption was introduced by sanityn.sh or sanity-hsm.sh. Unfortunately, neither sanityn nor sanity-hsm can detect data corruption be itself. So they were marked as success. So there is no logs can be used for further analysis. |
| Comment by Oleg Drokin [ 14/Jun/14 ] |
|
Yujian, can you please see if this is reproducible on a single-node simple cluster, and then perhaps let's try to isolate which of the following patches caused this:
LU-4852 osc: osc_extent_truncate()) ASSERTION( !ext->oe_urgent ) failed (detail / gitweb)
LU-4676 hsm: Fix return value error of ct_run() (detail / gitweb)
LU-4830 tests: only deactivate MDTs of Lustre FSNAME (detail / gitweb)
LU-2524 test: Modify tdir to be single directory (detail / gitweb)
LU-4573 tests: check all MDTs for open files (detail / gitweb)
LU-4102 doc: recommend newer e2fsprogs version (detail / gitweb)
LU-4780 lnet: NI shutdown may loop forever (detail / gitweb)
LU-5100 llite: set dir LOV xattr length variable (detail / gitweb)
LU-5133 tests: Add version check in sanity/238 (detail / gitweb)
LU-3386 lproc: improve osc/mdc "imports" connect data (detail / gitweb)
LU-5132 tests: Add version check to sanity/160c (detail / gitweb)
LU-5047 tests: correct cleanup files in sanity.sh (detail / gitweb)
LU-4887 tests: sanity-scrub interoperability tests with master (detail / gitweb)
LU-4569 hsm: Prevent copytool from importing existing file. (detail / gitweb)
LU-2272 statahead: ll_intent_drop_lock() called in spinlock (detail / gitweb)
LU-5116 ptlrpc: race at req processing (detail / gitweb)
Most of those are testing-only changes, though. |
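As a general illustration of the bisection suggested above (a sketch only, assuming the commits behind b2_5 builds #61 and #63 are known; the IDs below are placeholders, not taken from this ticket), the range could be narrowed with git bisect while rerunning lfsck.sh at each step:

    # Sketch: bisect the b2_5 range between the last good and first failing builds.
    GOOD="<commit-behind-b2_5-build-61>"   # placeholder: last build where lfsck.sh passed
    BAD="<commit-behind-b2_5-build-63>"    # placeholder: first build where lfsck.sh fails
    git bisect start "$BAD" "$GOOD"
    # At each step: rebuild Lustre, reformat the test filesystem, run lfsck.sh,
    # then mark the revision with "git bisect good" or "git bisect bad".
    git bisect reset                       # return to the original branch when done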
| Comment by Jian Yu [ 14/Jun/14 ] |
|
Sure, Oleg, will do. |
| Comment by Jian Yu [ 16/Jun/14 ] |
|
Test results showed that this was a known issue on the Lustre b2_5 branch. The reason the failure was not detected in previous builds was that, while lfsck.sh ran, the Lustre filesystem was not empty:

    if is_empty_fs $MOUNT; then
        # create test directory
        mkdir -p $TESTDIR || error "mkdir $TESTDIR failed"
        # create some dirs and files on the filesystem
        create_files $TESTDIR $NUMDIRS $NUMFILES
        # ......
    else # is_empty_fs $MOUNT
        FSCK_MAX_ERR=4 # file system errors left uncorrected
        sync; sync; sleep 3 # make sure all data flush back
    fi

If we ran only lfsck.sh on previous builds, the same failure also occurred. It was one of the changes in build #63 that disclosed the failure. The focus of this ticket is to fix the real failure. |
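To make the effect of the two branches concrete, a small illustration (mine, not part of lfsck.sh): with a non-empty filesystem the threshold is relaxed to 4, so the exit status of 4 reported in this ticket was tolerated, while the threshold of 1 seen in this run's failure message rejects it:

    rc=4                      # e2fsck exit status in this ticket: errors left uncorrected
    FSCK_MAX_ERR=4            # non-empty-filesystem branch above: rc=4 is tolerated
    [ $rc -le $FSCK_MAX_ERR ] && echo "previous builds: check passed, corruption masked"
    FSCK_MAX_ERR=1            # threshold in this run's failure message ("should be <= 1")
    [ $rc -le $FSCK_MAX_ERR ] || echo "build #63: FAIL, e2fsck returned $rc, should be <= $FSCK_MAX_ERR"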
| Comment by Jian Yu [ 16/Jun/14 ] |
|
This is a duplicate of |