[LU-5209] sanity-lfsck test_18d failure: Expect file2 size 4, but got 0 Created: 16/Jun/14 Updated: 30/Jul/14 Resolved: 30/Jul/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | lfsck | ||
| Environment: |
Lustre 2.5.60 on the OpenSFS cluster, CentOS 6.5. One server (mds01) hosts the MGS and an MDS with two MDTs, another server (mds02) hosts an MDS with two MDTs, and there are four OSSs with two OSTs each and four clients. |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 14538 | ||||||||
| Description |
|
Running sanity-lfsck in the stated environment, tests 18c, 18d, 18e, and 19a fail and test 19b hangs. Test results are at https://maloo.whamcloud.com/test_sessions/5ad54b54-f5a5-11e3-b29e-52540035b04c .

sanity-lfsck test 18d fails with the error:

    sanity-lfsck test_18d: @@@@@@ FAIL: (6) Expect file2 size 4, but got 0

The reported file size of 0 is correct; ls shows the same size:

    ls -il /lustre/scratch/d18d.sanity-lfsck/a1/f2
    216172799293652997 -rw-r--r-- 1 bin bin 0 Jun 16 16:00 /lustre/scratch/d18d.sanity-lfsck/a1/f2

Calling

    do_facet mds1 /usr/sbin/lctl get_param -n mdd.scratch-MDT0000.lfsck_layout

before the call to ls shows:

    name: lfsck_layout
    magic: 0xb173ae14
    version: 2
    status: completed
    flags:
    param: all_targets,orphan,create_ostobj
    time_since_last_completed: 2 seconds
    time_since_latest_start: 7 seconds
    time_since_last_checkpoint: 2 seconds
    latest_start_position: 0
    last_checkpoint_position: 25097
    first_failure_position: 0
    success_count: 1
    repaired_dangling: 1
    repaired_unmatched_pair: 0
    repaired_multiple_referenced: 0
    repaired_orphan: 1
    repaired_inconsistent_owner: 0
    repaired_others: 0
    skipped: 0
    failed_phase1: 0
    failed_phase2: 0
    checked_phase1: 9
    checked_phase2: 1
    run_time_phase1: 0 seconds
    run_time_phase2: 5 seconds
    average_speed_phase1: 9 items/sec
    average_speed_phase2: 0 objs/sec
    real-time_speed_phase1: N/A
    real-time_speed_phase2: N/A
    current_position: N/A |
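When comparing several runs, it can help to read the lfsck_layout status as structured data rather than by eye. The following is a minimal sketch, assuming the status text is available as "key: value" lines, one per line. The function name `parse_lfsck_status` is hypothetical and not part of Lustre or its test framework, and the sample is a subset of the values quoted above.

```python
def parse_lfsck_status(text):
    """Parse 'key: value' lines into a dict; a key with no value maps to ''."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        status[key.strip()] = value.strip()
    return status


# Subset of the output quoted in this ticket.
SAMPLE = """\
name: lfsck_layout
status: completed
flags:
param: all_targets,orphan,create_ostobj
repaired_dangling: 1
repaired_orphan: 1
failed_phase1: 0
failed_phase2: 0
"""

status = parse_lfsck_status(SAMPLE)
# Collect the repair and failure counters as integers.
repaired = {k: int(v) for k, v in status.items() if k.startswith("repaired_")}
failed = {k: int(v) for k, v in status.items() if k.startswith("failed_")}
print(status["status"], repaired, failed)
```

Counters like these only show that LFSCK completed and repaired one orphan. They do not say which file the orphan was attached to, so they cannot confirm on their own that f2 was restored.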
| Comments |
| Comment by nasf (Inactive) [ 08/Jul/14 ] |
|
According to the "do_facet mds1 /usr/sbin/lctl get_param -n mdd.scratch-MDT0000.lfsck_layout" output, everything looks OK. The orphan was found, but the original file "f2" was not recovered as expected. It is possible that the orphan found was not the one for "f2" but one left over from test_18c: in test_18c we expect to find 3 orphans, but only 2 were found, and the one found in test_18d may be the missed one. I need the LFSCK log (which is enabled by default on the latest master) to analyze how LFSCK repaired the inconsistency. On the other hand, I suspected that the failure |
| Comment by nasf (Inactive) [ 30/Jul/14 ] |
|
Here is some debug log from the OST:

    00000001:02000400:4.0:1405109211.079298:0:12374:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: == sanity-lfsck test 18d: Find out orphan OST-object and repair it (4) == 13:06:50 (1405109210)
    ...
    00000100:00100000:6.0:1405109211.397608:0:11561:0:(service.c:2094:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_003:6717840f-d1fd-6137-30d8-dbad9984cccd+6:31208:x1473357800770112:12345-192.168.2.119@o2ib:4
    00002000:02000000:2.0:1405109211.397834:0:11561:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=1617, val=0***
    00000100:00100000:2.0:1405109211.489455:0:11561:0:(service.c:2144:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_003:6717840f-d1fd-6137-30d8-dbad9984cccd+7:31208:x1473357800770112:12345-192.168.2.119@o2ib:4 Request procesed in 91846us (91906us total) trans 21474836902 rc 0/0
    00000100:00100000:2.0:1405109211.489461:0:11561:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-192.168.2.119@o2ib, seq: 17
    ...

According to the log, when the test files $DIR/$tdir/a1/f1 and $DIR/$tdir/a1/f2 were created, the PFID xattr was not set because of the failure injection (cfs_fail_loc=1617). That injection was a side effect left over from the earlier failed test_18c, not part of the expected environment for this test. So the subsequent LFSCK uses a new FID, rather than $DIR/$tdir/a1/f2, as the orphan OST-object's parent. That is why $DIR/$tdir/a1/f2 was not recovered as expected. |
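The check described in this comment, looking for a failure injection that fires inside a test that did not set it, can be sketched as a small log scan. This is an illustrative sketch only: the function name `fail_loc_hits` and both regular expressions are assumptions derived from the two log lines quoted above, not from any Lustre tool, and the sample lines are abbreviated from that log.

```python
import re

# Assumed patterns, based only on the log lines quoted in this ticket.
MARKER_RE = re.compile(r"DEBUG MARKER: == \S+ test (?P<test>\w+):")
FAIL_LOC_RE = re.compile(r"cfs_fail_loc=(?P<loc>[0-9a-fA-F]+)")


def fail_loc_hits(lines):
    """Return (test, fail_loc) for each fail_loc hit, tagged with the
    most recent test marker seen before it (None if there was none)."""
    current = None
    hits = []
    for line in lines:
        marker = MARKER_RE.search(line)
        if marker:
            current = marker.group("test")
            continue
        hit = FAIL_LOC_RE.search(line)
        if hit:
            hits.append((current, hit.group("loc")))
    return hits


# Abbreviated from the OST debug log in this comment.
SAMPLE_LOG = [
    "(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: == sanity-lfsck "
    "test 18d: Find out orphan OST-object and repair it (4) == 13:06:50 (1405109210)",
    "(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=1617, val=0***",
    "(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from "
    "12345-192.168.2.119@o2ib, seq: 17",
]

hits = fail_loc_hits(SAMPLE_LOG)
print(hits)
```

A hit reported under test 18d for a fail_loc that 18d itself does not set would point to state left behind by an earlier test, which is the situation described here.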
| Comment by nasf (Inactive) [ 30/Jul/14 ] |
|
Side effect of failed test_18c for |