[LU-6238] replay-dual test 10 fails with "FAIL: test_10 failed with 1 " Created: 12/Feb/15  Updated: 11/Sep/20  Resolved: 11/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

OpenSFS cluster with two MDSs each with one MDT, three OSSs each with two OSTs and three clients. Running lustre-master tag 2.6.93.


Issue Links:
Related
is related to LU-6057 replay-dual test_9 failed - post-fail... Resolved
is related to LU-6084 Tests are failed due to 'recovery is ... Resolved
Severity: 3
Rank (Obsolete): 17477

Description

replay-dual is failing with "FAIL: test_10 failed with 1". Results are at https://testing.hpdd.intel.com/test_sessions/fff27cc4-addd-11e4-a0b6-5254006e85c2 .

From the client's test log:

== replay-dual test 10: resending a replayed unlink == 06:45:24 (1423061124)
Filesystem           1K-blocks    Used  Available Use% Mounted on
mds01@o2ib:/scratch 1181102496 2829748 1116810332   1% /lustre/scratch
c13: mcreate: cannot create `/lustre/scratch/fsa-c13' with mode 0100644: File exists
c12: mcreate: cannot create `/lustre/scratch/fsa-c12' with mode 0100644: File exists
c11: mcreate: cannot create `/lustre/scratch/fsa-c11' with mode 0100644: File exists
fail_loc=0x80000119
Failing mds1 on mds01
Stopping /lustre/scratch/mdt0 (opts:) on mds01
pdsh@c13: mds01: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds01
06:45:44 (1423061144) waiting for mds01 network 900 secs ...
06:45:44 (1423061144) network interface is UP
mount facets: mds1
Starting mds1:   /dev/lvm-sdc/MDT0 /lustre/scratch/mdt0
Started scratch-MDT0000
c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 245 sec
fail_loc=0
 replay-dual test_10: @@@@@@ FAIL: test_10 failed with 1 

What's actually failing is the mcreate in the replay_barrier() function:

replay_barrier() {
        local facet=$1
        do_facet $facet "sync; sync; sync"
        df $MOUNT

        # make sure there will be no seq change
        local clients=${CLIENTS:-$HOSTNAME}
        local f=fsa-\\\$\(hostname\)
        do_nodes $clients "mcreate $MOUNT/$f; rm $MOUNT/$f"
        do_nodes $clients "if [ -d $MOUNT2 ]; then mcreate $MOUNT2/$f; rm $MOUNT2/$f; fi"

In every test session I've checked (about 10), whenever test 10 fails with this error, it is preceded by a test 9 failure, 'post-failover df: 1' (LU-6057). Test 9 may not clean up when it fails, which would leave the fsa-* files in place for test 10's mcreate to trip over.
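
To illustrate the suspected mechanism, here is a minimal sketch that can be run outside Lustre. It is not the test-framework code and not a proposed fix: create_exclusive is a hypothetical stand-in for mcreate that emulates "fail if the file exists" with the shell's noclobber option, and barrier_check is a hypothetical variant of the create/remove step that removes a leftover file first.

```shell
# Hypothetical stand-in for mcreate: succeed only if the file does not
# already exist (noclobber makes '>' refuse to overwrite a regular file).
create_exclusive() {
    ( set -o noclobber; : > "$1" ) 2>/dev/null
}

# Hypothetical variant of the create/remove check in replay_barrier().
# Assumption: removing a stale file first is acceptable for the test.
barrier_check() {
    local dir=$1
    local f="fsa-$(uname -n)"      # same idea as fsa-$(hostname) in the excerpt
    rm -f "$dir/$f"                # drop anything an earlier failed test left behind
    create_exclusive "$dir/$f" && rm "$dir/$f"
}
```

With a leftover file in place, a second create_exclusive on the same path fails in the same way as the mcreate calls in the log, while barrier_check succeeds because it clears the file before creating it.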

Test 10 has failed intermittently with the "mcreate: cannot create * with mode 0100644: File exists" error since December 2014.
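
When checking further sessions for the same pattern, downloaded test logs could be scanned with something like the sketch below. Both helper names are hypothetical, and the patterns are taken only from the messages quoted in this ticket; it assumes each session's log is available as a local text file.

```shell
# Count mcreate "File exists" errors in a test log (hypothetical helper).
count_mcreate_eexist() {
    grep -c 'mcreate: cannot create .* File exists' "$1"
}

# Succeed if the log also contains the test 9 error named in this ticket.
has_test9_df_failure() {
    grep -q 'post-failover df: 1' "$1"
}
```

A log where count_mcreate_eexist is non-zero and has_test9_df_failure succeeds matches the pattern described above.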

