[LU-6233] recovery-small test_10d failed with 'file contents differ'

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: OpenSFS Cluster with two MDSs each with one MDT, three OSSs each with two OSTs and three clients running lustre-master tag 2.6.93 build 2835
    • Severity: 3
    • Rank (Obsolete): 17460

    Description

      recovery-small test 10d failed with error message 'file contents differ'. Results and logs are at https://testing.hpdd.intel.com/test_sets/48de3eb8-ade9-11e4-a0b6-5254006e85c2 .

      From the client test log, the test output is as expected until:

      ...
      ldlm.namespaces.scratch-OST0005-osc-ffff8807dc5d1000.early_lock_cancel=1
      ldlm.namespaces.scratch-OST0005-osc-ffff88080bd5ac00.early_lock_cancel=1
      Connected clients:
      c13
      c12
      c11
      c13
      cmp: /lustre/scratch/f10d.recovery-small: Cannot send after transport endpoint shutdown
       recovery-small test_10d: @@@@@@ FAIL: file contents differ 
      

          Activity

            I did a check and recovery-small 10d has passed about 250 times in a row on master.

            adilger Andreas Dilger added the comment above.

            I've reproduced this issue with lustre-master tag 2.6.94 and captured logs with full debug from the two MDSs, test10d_mds01_log.txt and test10d_mds02_log.txt, and from the client running recovery-small, test10d_client_log.txt, attached here.

            I added a cat of both files at the point where this error is hit. As the output below shows, /lustre/scratch/f10d.recovery-small ($DIR/$tfile) cannot be read; the read fails with a "Cannot send after transport endpoint shutdown" error.

            ...
            Connected clients:
            c13
            c13
            c12
            c11
            cmp: /lustre/scratch/f10d.recovery-small: Cannot send after transport endpoint shutdown
            
            cat /lustre/scratch/f10d.recovery-small:
            cat: /lustre/scratch/f10d.recovery-small: Cannot send after transport endpoint shutdown
            end /lustre/scratch/f10d.recovery-small
            
            cat /lustre/scratch2/f10d.recovery-small:
            , worldend /lustre/scratch2/f10d.recovery-small
             recovery-small test_10d: @@@@@@ FAIL: file contents differ
            
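The extra debugging described above could look roughly like the following. This is only a sketch, not the actual change made to recovery-small.sh: the helper names `dump_file` and `compare_or_dump` are hypothetical, and the marker lines simply mirror the layout of the log above.

```shell
# Hypothetical helpers, not the actual recovery-small.sh code.

# Print a file between "cat <path>:" and "end <path>" markers.
# If the file cannot be read, the error from cat goes to stderr.
dump_file() {
    local path="$1"
    echo "cat $path:"
    cat "$path"
    echo "end $path"
}

# Compare two views of the same file. If cmp reports a difference or
# fails to read either file, dump both and return non-zero.
compare_or_dump() {
    local first="$1" second="$2"
    if cmp "$first" "$second"; then
        return 0
    fi
    dump_file "$first"
    echo
    dump_file "$second"
    return 1
}
```

In the test this would be called on the same file through both client mounts, for example `compare_or_dump "$DIR/$tfile" "$DIR2/$tfile"`, assuming `$DIR2` is the second mount point (/lustre/scratch2 in the log above). Since the markers do not add a newline after the file body, a file that ends without one produces a line like ", worldend ..." as seen in the log.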

            I can reproduce this error in about one out of every 10 runs of recovery-small.
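For an intermittent failure like this, a small loop that repeats the test and counts failures makes it easier to put a number on the reproduction rate. The sketch below is generic; `count_failures` is a hypothetical helper and not part of the Lustre test framework.

```shell
# Hypothetical helper: run a command a fixed number of times and
# report how many of the runs exited with a non-zero status.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    local runs="$1" failures=0 i=0
    shift
    while [ "$i" -lt "$runs" ]; do
        # Output of each run is discarded; only the exit status is counted.
        if ! "$@" > /dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

Assuming the test script is run from the Lustre tests directory and that the `ONLY` variable is used to select a single subtest, an invocation might be `count_failures 50 env ONLY=10d bash recovery-small.sh`. To keep the logs of failing runs, the redirection to /dev/null would need to be replaced with a per-run log file.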

            jamesanunez James Nunez (Inactive) added the comment above.

            This test was added in http://review.whamcloud.com/11752 "LU-5581 ldlm: evict clients returning errors on ASTs". We need a debug patch to find out what is going wrong, and whether this has turned a corner error case into a serious problem.

            adilger Andreas Dilger added the comment above.

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)