[LU-5208] sanity-lfsck test_18c failure: Expect 3 fixed on mds1, but got: 2

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.7.0
    • Lustre 2.6.0
    • Lustre 2.5.60 on the OpenSFS cluster, CentOS 6.5, with one server (mds01) hosting the MGS and an MDS with two MDTs, a second server (mds02) hosting an MDS with two MDTs, four OSSs with two OSTs each, and four clients.
    • 3
    • 14536

    Description

      Running sanity-lfsck with the stated environment, tests 18c, 18d, 18e, and 19a fail and test 19b hangs. Test results are at https://maloo.whamcloud.com/test_sessions/5ad54b54-f5a5-11e3-b29e-52540035b04c .

      sanity-lfsck test 18c fails with the error:

      sanity-lfsck test_18c: @@@@@@ FAIL: (4) Expect 3 fixed on mds1, but got: 2
      

      Right before this test fails, the output from /proc/fs/lustre/mdd/scratch-MDT0000/lfsck_layout on mds01, MDT0, is:

      name: lfsck_layout
      magic: 0xb173ae14
      version: 2
      status: completed
      flags:
      param: all_targets,orphan
      time_since_last_completed: 2912 seconds
      time_since_latest_start: 2912 seconds
      time_since_last_checkpoint: 2912 seconds
      latest_start_position: 0
      last_checkpoint_position: 25098
      first_failure_position: 0
      success_count: 1
      repaired_dangling: 0
      repaired_unmatched_pair: 0
      repaired_multiple_referenced: 0
      repaired_orphan: 2
      repaired_inconsistent_owner: 0
      repaired_others: 0
      skipped: 0
      failed_phase1: 0
      failed_phase2: 0
      checked_phase1: 8
      checked_phase2: 2
      run_time_phase1: 0 seconds
      run_time_phase2: 0 seconds
      average_speed_phase1: 8 items/sec
      average_speed_phase2: 2 objs/sec
      real-time_speed_phase1: N/A
      real-time_speed_phase2: N/A
      current_position: N/A
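
The status dump above is plain `key: value` text, so the counter behind the failure message can be extracted mechanically. The following is a minimal Python sketch, not part of the Lustre test framework: `parse_lfsck_status` and `repaired_count` are hypothetical helper names, and it assumes that the "fixed" number in the failure message corresponds to `repaired_orphan` (the only non-zero repair counter in the dump). The sample text is an abbreviated copy of the output shown above.

```python
def parse_lfsck_status(text):
    """Parse 'key: value' lines from an lfsck_layout status dump into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without a colon
        status[key.strip()] = value.strip()
    return status


def repaired_count(status, counter="repaired_orphan"):
    """Return one counter as an int (counter name is an assumption, see above)."""
    return int(status[counter])


# Abbreviated copy of the dump from mds01, MDT0.
SAMPLE = """\
name: lfsck_layout
status: completed
flags:
param: all_targets,orphan
repaired_dangling: 0
repaired_orphan: 2
checked_phase1: 8
checked_phase2: 2
"""

status = parse_lfsck_status(SAMPLE)
expected = 3  # value the test expected on mds1, per the failure message
actual = repaired_count(status)
if actual != expected:
    print(f"Expect {expected} fixed on mds1, but got: {actual}")
```

Keeping the values as strings and converting only the requested counter avoids having to interpret fields such as `N/A` or `2912 seconds`.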
      

          Activity

            yong.fan nasf (Inactive) added a comment - The patch has landed on master.

            yong.fan nasf (Inactive) added a comment - This is a test script issue: the comment should read "There should NOT be some stub under .lustre/lost+found/MDT0001/". I will update the patch.

            jamesanunez James Nunez (Inactive) added a comment - I tried patch 11275 and the test passes, but the output and the comments don't match.

            From the output of sanity-lfsck test 18c with this patch:

            Trigger layout LFSCK on all devices to find out orphan OST-object
            Started LFSCK on the device scratch-MDT0000: scrub layout
            There should be some stub under .lustre/lost+found/MDT0001/
            ls: cannot access /lustre/scratch/.lustre/lost+found/MDT0001/*-N-0: No such file or directory
            There should be some stub under .lustre/lost+found/MDT0000/
            216172799310430210 -r-------- 1 root root 2097152 Aug  4 16:42 /lustre/scratch/.lustre/lost+found/MDT0000/[0x300000401:0x2:0x0]-N-0
            216172799310430211 -r-------- 1 root root 2097152 Aug  4 16:42 /lustre/scratch/.lustre/lost+found/MDT0000/[0x300000401:0x3:0x0]-N-0
            Resetting fail_loc on all nodes...done.
            PASS 18c (7s)

            So, the comment expects something to be in $mount/.lustre/lost+found/scratch-MDT0001, but there is no scratch-MDT0001 subdirectory under lost+found. Maybe with the change in this patch to using "$LFS setstripe -c 1", we shouldn't expect there to be an MDT0001 subdirectory?

            I put some debug prints in the test and there is no scratch-MDT0001 subdirectory:

            Trigger layout LFSCK on all devices to find out orphan OST-object
            Started LFSCK on the device scratch-MDT0000: scrub layout
            ls -ail /lustre/scratch/.lustre/lost+found/
            total 8
            144115188109410307 dr-x------ 3 root root 4096 Aug  4 17:37 .
            216172799310430209 drwx------ 3 root root 4096 Aug  4 17:38 MDT0000
            ls -ail /lustre/scratch/.lustre/lost+found/MDT*
            total 4104
            216172799310430209 drwx------ 3 root root    4096 Aug  4 17:38 .
            144115188109410307 dr-x------ 3 root root    4096 Aug  4 17:37 ..
            216172799310430210 -r-------- 1 root root 2097152 Aug  4 17:38 [0x300000401:0x2:0x0]-N-0
            216172799310430211 -r-------- 1 root root 2097152 Aug  4 17:38 [0x300000401:0x3:0x0]-N-0
            There should be some stub under .lustre/lost+found/MDT0001/
            ls: cannot access /lustre/scratch/.lustre/lost+found/MDT0001/*-N-0: No such file or directory
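
The check discussed in the comment above amounts to globbing `*-N-0` under each per-MDT directory in `.lustre/lost+found/`. The following is a minimal Python sketch of that check, not the code used by sanity-lfsck: `find_orphan_stubs` is a hypothetical helper, and a temporary directory stands in for the real mount point (`/lustre/scratch` in the logs). The stub file names are copied from the output above.

```python
import glob
import os
import tempfile


def find_orphan_stubs(mount, mdt_name):
    """List '*-N-0' entries under <mount>/.lustre/lost+found/<mdt_name>/."""
    mdt_dir = os.path.join(mount, ".lustre", "lost+found", mdt_name)
    # Escape the directory part so only the trailing '*-N-0' acts as a pattern.
    pattern = os.path.join(glob.escape(mdt_dir), "*-N-0")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


# Stand-in for the mount point: recreate the layout seen in the debug output,
# where only the MDT0000 subdirectory exists.
with tempfile.TemporaryDirectory() as mount:
    mdt0 = os.path.join(mount, ".lustre", "lost+found", "MDT0000")
    os.makedirs(mdt0)
    for name in ("[0x300000401:0x2:0x0]-N-0", "[0x300000401:0x3:0x0]-N-0"):
        open(os.path.join(mdt0, name), "w").close()

    stubs_mdt0 = find_orphan_stubs(mount, "MDT0000")
    stubs_mdt1 = find_orphan_stubs(mount, "MDT0001")

print("MDT0000:", stubs_mdt0)
print("MDT0001:", stubs_mdt1)
```

A missing `MDT0001` directory simply yields an empty list here, which matches the situation where `ls` reports "No such file or directory".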
            yong.fan nasf (Inactive) added a comment - Here is the patch: http://review.whamcloud.com/#/c/11275/
            jamesanunez James Nunez (Inactive) added a comment - sanity-lfsck test logs for 2.6.0-RC1 are at: https://testing.hpdd.intel.com/test_sessions/5e3c96b0-0c68-11e4-9892-5254006e85c2 .

            yong.fan nasf (Inactive) added a comment - Please refer to the comment in LU-5209; we need the LFSCK log to analyze the LFSCK behaviour.

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)