[LU-5209] sanity-lfsck test_18d failure: Expect file2 size 4, but got 0

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.6.0
    • Lustre 2.5.60 on the OpenSFS cluster running CentOS 6.5: one server (mds01) hosting the MGS and an MDS with two MDTs, a second server (mds02) hosting an MDS with two MDTs, four OSSs with two OSTs each, and four clients.
    • 3
    • 14538

    Description

      Running sanity-lfsck with the stated environment, tests 18c, 18d, 18e, and 19a fail and test 19b hangs. Test results are at https://maloo.whamcloud.com/test_sessions/5ad54b54-f5a5-11e3-b29e-52540035b04c .

      sanity-lfsck test 18d fails with the error:

      sanity-lfsck test_18d: @@@@@@ FAIL: (6) Expect file2 size 4, but got 0
      

      The reported file size of 0 is accurate; ls shows the same:

      ls -il /lustre/scratch/d18d.sanity-lfsck/a1/f2
      216172799293652997 -rw-r--r-- 1 bin bin 0 Jun 16 16:00 /lustre/scratch/d18d.sanity-lfsck/a1/f2
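
The failing assertion compares the size of f2 with the expected 4 bytes. A minimal sketch of such a check follows. The helper name `check_size` and its message format are hypothetical; the actual test code in sanity-lfsck is not shown in this ticket.

```shell
# Hypothetical helper: compare a file's size in bytes ($1) with an expected
# value ($2) and print a message on mismatch.
check_size() {
    # wc -c may pad its output with spaces; arithmetic expansion strips them.
    actual=$(( $(wc -c < "$1") ))
    [ "$actual" -eq "$2" ] && return 0
    echo "Expect size $2, but got $actual"
    return 1
}
```

For the file in this report, `check_size /lustre/scratch/d18d.sanity-lfsck/a1/f2 4` would report a mismatch, since ls shows a size of 0.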
      

      Calling

      do_facet mds1 /usr/sbin/lctl get_param -n mdd.scratch-MDT0000.lfsck_layout 

      before the call to ls shows:

      name: lfsck_layout
      magic: 0xb173ae14
      version: 2
      status: completed
      flags:
      param: all_targets,orphan,create_ostobj
      time_since_last_completed: 2 seconds
      time_since_latest_start: 7 seconds
      time_since_last_checkpoint: 2 seconds
      latest_start_position: 0
      last_checkpoint_position: 25097
      first_failure_position: 0
      success_count: 1
      repaired_dangling: 1
      repaired_unmatched_pair: 0
      repaired_multiple_referenced: 0
      repaired_orphan: 1
      repaired_inconsistent_owner: 0
      repaired_others: 0
      skipped: 0
      failed_phase1: 0
      failed_phase2: 0
      checked_phase1: 9
      checked_phase2: 1
      run_time_phase1: 0 seconds
      run_time_phase2: 5 seconds
      average_speed_phase1: 9 items/sec
      average_speed_phase2: 0 objs/sec
      real-time_speed_phase1: N/A
      real-time_speed_phase2: N/A
      current_position: N/A
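
When comparing several runs, it can help to pull individual fields out of this output instead of reading it by eye. The following is a sketch only; `lfsck_field` is a hypothetical helper, not part of the Lustre test framework, and it assumes the plain "key: value" layout shown above.

```shell
# Hypothetical helper: print the value of one "key: value" field from
# lfsck_layout output read on stdin.
lfsck_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Example with a few of the fields reported in this ticket.
sample='status: completed
repaired_dangling: 1
repaired_orphan: 1'
printf '%s\n' "$sample" | lfsck_field status
printf '%s\n' "$sample" | lfsck_field repaired_orphan
```

On a live system the same helper could presumably be fed from `lctl get_param -n mdd.scratch-MDT0000.lfsck_layout` on the MDS.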
      

          Activity

            yong.fan nasf (Inactive) added a comment - This is a side effect of the test_18c failure tracked in LU-5208.

            yong.fan nasf (Inactive) added a comment - Here is some debug log on the OST:

            00000001:02000400:4.0:1405109211.079298:0:12374:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: == sanity-lfsck test 18d: Find out orphan OST-object and repair it (4) == 13:06:50 (1405109210)
            ...
            00000100:00100000:6.0:1405109211.397608:0:11561:0:(service.c:2094:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_003:6717840f-d1fd-6137-30d8-dbad9984cccd+6:31208:x1473357800770112:12345-192.168.2.119@o2ib:4
            00002000:02000000:2.0:1405109211.397834:0:11561:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=1617, val=0***
            00000100:00100000:2.0:1405109211.489455:0:11561:0:(service.c:2144:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_003:6717840f-d1fd-6137-30d8-dbad9984cccd+7:31208:x1473357800770112:12345-192.168.2.119@o2ib:4 Request procesed in 91846us (91906us total) trans 21474836902 rc 0/0
            00000100:00100000:2.0:1405109211.489461:0:11561:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-192.168.2.119@o2ib, seq: 17
            ...

            According to the log, when the test files $DIR/$tdir/a1/f1 and $DIR/$tdir/a1/f2 were created, the PFID xattr was not set because of the failure injection (cfs_fail_loc=1617). That injection was a side effect of the earlier failed test_18c, not part of the expected environment. So the subsequent LFSCK will use a new FID as the parent of the orphan OST-object ($DIR/$tdir/a1/f2). That is why $DIR/$tdir/a1/f2 was not recovered as expected.
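
The diagnosis above rests on spotting a `cfs_fail_loc` marker in the OST debug log. One way to scan a saved debug log for such markers is sketched below. `fail_locs_in_log` is a hypothetical helper, and the pattern assumes the `*** cfs_fail_loc=..., val=...***` message format seen in the excerpt.

```shell
# Hypothetical helper: list the distinct cfs_fail_loc values reported in a
# Lustre debug log read on stdin.
fail_locs_in_log() {
    sed -n 's/.*cfs_fail_loc=\([0-9a-fA-Fx]*\),.*/\1/p' | sort -u
}
```

Any value reported at the start of a test that does not set a fail_loc itself would suggest an injection left over from an earlier test. A leftover injection is normally cleared with `lctl set_param fail_loc=0` on the affected server.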

            yong.fan nasf (Inactive) added a comment - According to the "do_facet mds1 /usr/sbin/lctl get_param -n mdd.scratch-MDT0000.lfsck_layout" output, everything looks OK. The orphan was found, but the original file "f2" was not recovered as expected. It is possible that the orphan found was not the one for "f2" but one left over from test_18c. In test_18c we expect to find 3 orphans, but only 2 were found, so the one found in test_18d may be the missing one.

            I need the LFSCK log (which is enabled by default on the latest master) to analyze how LFSCK repaired the inconsistency.

            On the other hand, I suspect that the failures LU-5208, LU-5209, LU-5210, and LU-5211 are related. It is quite possible that an earlier failed LFSCK test case left stale state in the test environment and caused the subsequent LFSCK test cases to fail. So let's focus on the first failure, LU-5208, first.

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)