  1. Lustre
  2. LU-6238

replay-dual test 10 fails with "FAIL: test_10 failed with 1"


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0, Lustre 2.10.0
    • Labels: None
    • Environment: OpenSFS cluster with two MDSs each with one MDT, three OSSs each with two OSTs, and three clients. Running lustre-master tag 2.6.93.
    • Severity: 3
    • Rank (Obsolete): 17477

    Description

      replay-dual is failing with "FAIL: test_10 failed with 1". Results are at https://testing.hpdd.intel.com/test_sessions/fff27cc4-addd-11e4-a0b6-5254006e85c2 .

      From the client's test log:

      == replay-dual test 10: resending a replayed unlink == 06:45:24 (1423061124)
      Filesystem           1K-blocks    Used  Available Use% Mounted on
      mds01@o2ib:/scratch 1181102496 2829748 1116810332   1% /lustre/scratch
      c13: mcreate: cannot create `/lustre/scratch/fsa-c13' with mode 0100644: File exists
      c12: mcreate: cannot create `/lustre/scratch/fsa-c12' with mode 0100644: File exists
      c11: mcreate: cannot create `/lustre/scratch/fsa-c11' with mode 0100644: File exists
      fail_loc=0x80000119
      Failing mds1 on mds01
      Stopping /lustre/scratch/mdt0 (opts:) on mds01
      pdsh@c13: mds01: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to mds01
      06:45:44 (1423061144) waiting for mds01 network 900 secs ...
      06:45:44 (1423061144) network interface is UP
      mount facets: mds1
      Starting mds1:   /dev/lvm-sdc/MDT0 /lustre/scratch/mdt0
      Started scratch-MDT0000
      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 245 sec
      fail_loc=0
       replay-dual test_10: @@@@@@ FAIL: test_10 failed with 1 
      

      What's actually failing is the mcreate in the replay_barrier() function:

      replay_barrier() {
              local facet=$1
              do_facet $facet "sync; sync; sync"
              df $MOUNT

              # make sure there will be no seq change
              local clients=${CLIENTS:-$HOSTNAME}
              local f=fsa-\\\$\(hostname\)
              do_nodes $clients "mcreate $MOUNT/$f; rm $MOUNT/$f"
              do_nodes $clients "if [ -d $MOUNT2 ]; then mcreate $MOUNT2/$f; rm $MOUNT2/$f; fi"
      

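      In each of the roughly 10 test sessions I've checked, whenever test 10 fails with this error, it is preceded by a test 9 failure, 'post-failover df: 1' (LU-6057). Test 9 may not clean up when it fails, which would leave the fsa-<hostname> files behind and explain the "File exists" error from mcreate.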

      Test 10 has failed intermittently with the "mcreate: cannot create * with mode 0100644: File exists" error since December 2014.
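
      If leftover fsa-<hostname> files from a failed earlier test are indeed the cause, one possible mitigation is to remove any stale file before the exclusive create. The following is only a sketch of that idea, not a change taken from test-framework.sh: `exclusive_create` and `touch_marker` are hypothetical helper names, and `exclusive_create` uses the shell's noclobber option as a stand-in for mcreate so that the example runs without Lustre.

```shell
# Hypothetical helpers for illustration only; not part of test-framework.sh.

# Stand-in for mcreate: create the file, failing if it already exists.
# set -C is the POSIX noclobber option; the subshell keeps it from leaking
# into the caller's shell options.
exclusive_create() {
    ( set -C; : > "$1" ) 2>/dev/null
}

# Remove any stale marker left behind by an earlier failed test,
# then create and remove it as replay_barrier() does.
touch_marker() {
    rm -f "$1" && exclusive_create "$1" && rm "$1"
}
```

      With this shape, a stale marker no longer makes the create step fail, because `rm -f` succeeds whether or not the file exists. Whether masking the leftover file is preferable to fixing the cleanup in test 9 is a separate question.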

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez, Inactive)