[LU-5208] sanity-lfsck test_18c failure: Expect 3 fixed on mds1, but got: 2 Created: 16/Jun/14  Updated: 25/Aug/14  Resolved: 25/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: lfsck
Environment:

Lustre 2.5.60 on the OpenSFS cluster, CentOS 6.5, with one server (mds01) running the MGS and an MDS with two MDTs, a second server (mds02) running an MDS with two MDTs, four OSSs with two OSTs each, and four clients.


Issue Links:
Related
is related to LU-5209 sanity-lfsck test_18d failure: Expect... Resolved
is related to LU-5210 sanity-lfsck test 18e failure: Expect... Resolved
is related to LU-5211 sanity-lfsck test 19a failure: Read s... Resolved
Severity: 3
Rank (Obsolete): 14536

 Description   

Running sanity-lfsck with the stated environment, tests 18c, 18d, 18e, and 19a fail and test 19b hangs. Test results are at https://maloo.whamcloud.com/test_sessions/5ad54b54-f5a5-11e3-b29e-52540035b04c .

sanity-lfsck test 18c fails with the error:

sanity-lfsck test_18c: @@@@@@ FAIL: (4) Expect 3 fixed on mds1, but got: 2

Right before this test fails, the output of /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_layout (MDT0000 on mds01) is:

name: lfsck_layout
magic: 0xb173ae14
version: 2
status: completed
flags:
param: all_targets,orphan
time_since_last_completed: 2912 seconds
time_since_latest_start: 2912 seconds
time_since_last_checkpoint: 2912 seconds
latest_start_position: 0
last_checkpoint_position: 25098
first_failure_position: 0
success_count: 1
repaired_dangling: 0
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 2
repaired_inconsistent_owner: 0
repaired_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 8
checked_phase2: 2
run_time_phase1: 0 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 8 items/sec
average_speed_phase2: 2 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

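The lfsck_layout proc file above is a flat "key: value" listing, and the test's pass/fail decision comes down to the "repaired_*" counters it reports. As a minimal illustration (not part of the actual sanity-lfsck script), the following Python sketch parses a dump in that format and sums the repair counters, reproducing the "got: 2" side of the failure:

```python
# Hypothetical helper: parse the lfsck_layout dump shown above and sum
# the repaired_* counters that test_18c compares against its expectation.
# Field names are copied from the proc output in this ticket.

SAMPLE = """\
name: lfsck_layout
status: completed
repaired_dangling: 0
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 2
repaired_inconsistent_owner: 0
repaired_others: 0
"""

def parse_lfsck_layout(text):
    """Parse 'key: value' lines into a dict of stripped strings."""
    stats = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            stats[key.strip()] = val.strip()
    return stats

def total_repaired(stats):
    """Sum every counter whose name starts with 'repaired_'."""
    return sum(int(v) for k, v in stats.items() if k.startswith("repaired_"))

stats = parse_lfsck_layout(SAMPLE)
print(total_repaired(stats))  # 2 -- one short of the 3 fixes test_18c expects
```

With the dump above, only repaired_orphan is non-zero, so the total is 2 rather than the expected 3.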

 Comments   
Comment by nasf (Inactive) [ 08/Jul/14 ]

Please refer to the comment in LU-5209; we need the LFSCK log to analyze the LFSCK behaviour.

Comment by James Nunez (Inactive) [ 15/Jul/14 ]

sanity-lfsck test logs for 2.6.0-RC1 are at: https://testing.hpdd.intel.com/test_sessions/5e3c96b0-0c68-11e4-9892-5254006e85c2 .

Comment by nasf (Inactive) [ 30/Jul/14 ]

Here is the patch:
http://review.whamcloud.com/#/c/11275/

Comment by James Nunez (Inactive) [ 05/Aug/14 ]

I tried patch 11275 and the test passes, but the output and the comments don't match.

From the output of sanity-lfsck test 18c with this patch:

Trigger layout LFSCK on all devices to find out orphan OST-object
Started LFSCK on the device scratch-MDT0000: scrub layout
There should be some stub under .lustre/lost+found/MDT0001/
ls: cannot access /lustre/scratch/.lustre/lost+found/MDT0001/*-N-0: No such file or directory
There should be some stub under .lustre/lost+found/MDT0000/
216172799310430210 -r-------- 1 root root 2097152 Aug  4 16:42 /lustre/scratch/.lustre/lost+found/MDT0000/[0x300000401:0x2:0x0]-N-0
216172799310430211 -r-------- 1 root root 2097152 Aug  4 16:42 /lustre/scratch/.lustre/lost+found/MDT0000/[0x300000401:0x3:0x0]-N-0
Resetting fail_loc on all nodes...done.
PASS 18c (7s)

So, the comment expects something to be in $mount/.lustre/lost+found/scratch-MDT0001, but there is no scratch-MDT0001 subdirectory under lost+found. Maybe with this patch's change to using "$LFS setstripe -c 1", we should not expect an MDT0001 subdirectory at all?

I put some debug prints in the test and there is no scratch-MDT0001 subdirectory:

Trigger layout LFSCK on all devices to find out orphan OST-object
Started LFSCK on the device scratch-MDT0000: scrub layout
ls -ail /lustre/scratch/.lustre/lost+found/
total 8
144115188109410307 dr-x------ 3 root root 4096 Aug  4 17:37 .
216172799310430209 drwx------ 3 root root 4096 Aug  4 17:38 MDT0000
ls -ail /lustre/scratch/.lustre/lost+found/MDT*
total 4104
216172799310430209 drwx------ 3 root root    4096 Aug  4 17:38 .
144115188109410307 dr-x------ 3 root root    4096 Aug  4 17:37 ..
216172799310430210 -r-------- 1 root root 2097152 Aug  4 17:38 [0x300000401:0x2:0x0]-N-0
216172799310430211 -r-------- 1 root root 2097152 Aug  4 17:38 [0x300000401:0x3:0x0]-N-0
There should be some stub under .lustre/lost+found/MDT0001/
ls: cannot access /lustre/scratch/.lustre/lost+found/MDT0001/*-N-0: No such file or directory
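The test's "There should be some stub under ..." check amounts to globbing for "[fid]-N-0" entries in each MDT subdirectory of .lustre/lost+found/. This Python sketch (hypothetical paths, standing in for the real Lustre mount) reproduces the layout seen in the debug output: two orphan stubs under MDT0000 and no MDT0001 directory, so a check against MDT0001 fails just like the "No such file or directory" error above:

```python
# Hypothetical reproduction of the test's lost+found stub check.
# Directory names mirror the debug listing in this ticket; the temp
# directory stands in for the /lustre/scratch mount point.
import glob
import os
import tempfile

def has_stubs(lost_found, mdt):
    """Return True if any orphan stub ([fid]-N-0) exists under the MDT dir."""
    return bool(glob.glob(os.path.join(lost_found, mdt, "*-N-0")))

with tempfile.TemporaryDirectory() as root:
    lf = os.path.join(root, ".lustre", "lost+found")
    os.makedirs(os.path.join(lf, "MDT0000"))
    # Two recovered orphans, as shown in the debug output above
    for fid in ("[0x300000401:0x2:0x0]", "[0x300000401:0x3:0x0]"):
        open(os.path.join(lf, "MDT0000", fid + "-N-0"), "w").close()

    print(has_stubs(lf, "MDT0000"))  # True
    print(has_stubs(lf, "MDT0001"))  # False -- no MDT0001 directory exists
```

Note that the "*-N-0" pattern contains no glob metacharacters itself, so the square brackets in the FID-based file names do not interfere with the match.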
Comment by nasf (Inactive) [ 05/Aug/14 ]

It is a test script issue; the comment should be "There should NOT be some stub under .lustre/lost+found/MDT0001/". I will update the patch.

Comment by nasf (Inactive) [ 25/Aug/14 ]

The patch has been landed to master.

Generated at Sat Feb 10 01:49:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.