[LU-5209] sanity-lfsck test_18d failure: Expect file2 size 4, but got 0 Created: 16/Jun/14  Updated: 30/Jul/14  Resolved: 30/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: lfsck
Environment:

Lustre 2.5.60 on the OpenSFS cluster running CentOS 6.5: one server (mds01) hosting the MGS and an MDS with two MDTs, a second server (mds02) hosting an MDS with two MDTs, four OSSs with two OSTs each, and four clients.


Issue Links:
Related
is related to LU-5208 sanity-lfsck test_18c failure: Expect... Resolved
Severity: 3
Rank (Obsolete): 14538

 Description   

When sanity-lfsck is run in the environment described above, tests 18c, 18d, 18e, and 19a fail and test 19b hangs. Test results are at https://maloo.whamcloud.com/test_sessions/5ad54b54-f5a5-11e3-b29e-52540035b04c .

sanity-lfsck test 18d fails with the error:

sanity-lfsck test_18d: @@@@@@ FAIL: (6) Expect file2 size 4, but got 0

The reported file size of 0 is accurate; ls shows the same size:

ls -il /lustre/scratch/d18d.sanity-lfsck/a1/f2
216172799293652997 -rw-r--r-- 1 bin bin 0 Jun 16 16:00 /lustre/scratch/d18d.sanity-lfsck/a1/f2

Calling

do_facet mds1 /usr/sbin/lctl get_param -n mdd.scratch-MDT0000.lfsck_layout 

before the call to ls shows:

name: lfsck_layout
magic: 0xb173ae14
version: 2
status: completed
flags:
param: all_targets,orphan,create_ostobj
time_since_last_completed: 2 seconds
time_since_latest_start: 7 seconds
time_since_last_checkpoint: 2 seconds
latest_start_position: 0
last_checkpoint_position: 25097
first_failure_position: 0
success_count: 1
repaired_dangling: 1
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 1
repaired_inconsistent_owner: 0
repaired_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 9
checked_phase2: 1
run_time_phase1: 0 seconds
run_time_phase2: 5 seconds
average_speed_phase1: 9 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A
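
When comparing runs, it can help to pull individual counters out of this output instead of reading the whole block. The following is a minimal sketch, not part of sanity-lfsck: the helper name lfsck_field is hypothetical, and the sample text is a subset of the output above. On a live system the input would come from the lctl get_param command shown earlier.

```shell
# Hypothetical helper: print the value of one "key: value" field from
# lfsck_layout output supplied on stdin.
lfsck_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# Sample lines copied from the lfsck_layout output in this report.
sample='status: completed
repaired_dangling: 1
repaired_orphan: 1
failed_phase1: 0
failed_phase2: 0'

status=$(printf '%s\n' "$sample" | lfsck_field status)
orphans=$(printf '%s\n' "$sample" | lfsck_field repaired_orphan)
echo "status=$status repaired_orphan=$orphans"
```

A check like this only confirms that LFSCK completed and repaired one orphan. It cannot tell which file the orphan was attached to, which is the question in this failure.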


 Comments   
Comment by nasf (Inactive) [ 08/Jul/14 ]

According to the "do_facet mds1 /usr/sbin/lctl get_param -n mdd.scratch-MDT0000.lfsck_layout" output, everything looks OK. An orphan was found, but the original file "f2" was not recovered as expected. It is possible that the orphan found did not belong to "f2" and was instead one left over from test_18c. In test_18c we expect to find 3 orphans, but only 2 were found, so the orphan found in test_18d may be the missed one.

I need the LFSCK log (which is enabled by default on the latest master) to analyze how LFSCK repaired the inconsistency.

On the other hand, I suspect that the failures in LU-5208/LU-5209/LU-5210/LU-5211 are related. It is quite possible that an earlier failed LFSCK test case left stale state in the test environment and caused the subsequent LFSCK test cases to fail. So let's focus on the first failure, LU-5208, first.

Comment by nasf (Inactive) [ 30/Jul/14 ]

Here is part of the debug log from the OST:

00000001:02000400:4.0:1405109211.079298:0:12374:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: == sanity-lfsck test 18d: Find out orphan OST-object and repair it (4) == 13:06:50 (1405109210)
...
00000100:00100000:6.0:1405109211.397608:0:11561:0:(service.c:2094:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_003:6717840f-d1fd-6137-30d8-dbad9984cccd+6:31208:x1473357800770112:12345-192.168.2.119@o2ib:4
00002000:02000000:2.0:1405109211.397834:0:11561:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=1617, val=0***
00000100:00100000:2.0:1405109211.489455:0:11561:0:(service.c:2144:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_003:6717840f-d1fd-6137-30d8-dbad9984cccd+7:31208:x1473357800770112:12345-192.168.2.119@o2ib:4 Request procesed in 91846us (91906us total) trans 21474836902 rc 0/0
00000100:00100000:2.0:1405109211.489461:0:11561:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-192.168.2.119@o2ib, seq: 17
...

According to the log, when the test files $DIR/$tdir/a1/f1 and $DIR/$tdir/a1/f2 were created, the PFID xattr was not set because of the failure injection (cfs_fail_loc=1617). That injection was a side effect of the earlier failed test_18c, not part of the expected environment. As a result, the subsequent LFSCK used a new FID, rather than that of $DIR/$tdir/a1/f2, as the orphan OST-object's parent. That is why $DIR/$tdir/a1/f2 was not recovered as expected.
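
A leftover failure injection of this kind can be spotted by scanning the debug log for cfs_fail_loc markers. The following is a minimal sketch under stated assumptions: the helper name fail_locs_in_log is hypothetical, and the sample line is copied from the OST log above.

```shell
# Hypothetical helper: print each distinct cfs_fail_loc value found in a
# Lustre debug log read from stdin.
fail_locs_in_log() {
    grep -o 'cfs_fail_loc=[0-9a-fA-F]*' | sort -u
}

# Sample line copied from the OST debug log in this report.
log='00002000:02000000:2.0:1405109211.397834:0:11561:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=1617, val=0***'

found=$(printf '%s\n' "$log" | fail_locs_in_log)
if [ -n "$found" ]; then
    echo "failure injection seen: $found"
fi
```

If such a marker appears in a test that does not set it, the value is probably left over from an earlier test. Lustre test scripts commonly clear it with lctl set_param fail_loc=0 on the affected node.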

Comment by nasf (Inactive) [ 30/Jul/14 ]

This is a side effect of the failed test_18c tracked in LU-5208.
