[LU-6233] recovery-small test_10d failed with 'file contents differ' Created: 11/Feb/15  Updated: 20/Jan/17  Resolved: 20/Jan/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

OpenSFS cluster with two MDSs (one MDT each), three OSSs (two OSTs each) and three clients running lustre-master tag 2.6.93 build 2835


Attachments: test10d_client_log.txt, test10d_mds01_log.txt, test10d_mds02_log.txt
Issue Links:
Duplicate
duplicates LU-6359 recovery-small test_10d: FAIL: wrong ... Resolved
Related
is related to LU-5581 blocking ast error handling lack evic... Resolved
is related to LU-7759 umount hanging in modern distros when... Resolved
Severity: 3
Rank (Obsolete): 17460

 Description   

recovery-small test 10d failed with error message 'file contents differ'. Results and logs are at https://testing.hpdd.intel.com/test_sets/48de3eb8-ade9-11e4-a0b6-5254006e85c2 .

From the client test log, the test output is as expected until:

...
ldlm.namespaces.scratch-OST0005-osc-ffff8807dc5d1000.early_lock_cancel=1
ldlm.namespaces.scratch-OST0005-osc-ffff88080bd5ac00.early_lock_cancel=1
Connected clients:
c13
c12
c11
c13
cmp: /lustre/scratch/f10d.recovery-small: Cannot send after transport endpoint shutdown
 recovery-small test_10d: @@@@@@ FAIL: file contents differ 


 Comments   
Comment by Andreas Dilger [ 11/Feb/15 ]

This test was added in http://review.whamcloud.com/11752 "LU-5581 ldlm: evict clients returning errors on ASTs". We need a debug patch to find out what is going wrong, and whether that change has turned a corner-case error into a serious problem.

Comment by James Nunez (Inactive) [ 12/Feb/15 ]

I've reproduced this issue with lustre-master tag 2.6.94 and captured logs with full debug from the two MDSs, test10d_mds01_log.txt and test10d_mds02_log.txt, and from the client running recovery-small, test10d_client_log.txt, attached here.

I added a cat of both files at the point where the error is hit. As the output below shows, /lustre/scratch/f10d.recovery-small ($DIR/$tfile) cannot be read; the read fails with "Cannot send after transport endpoint shutdown".

...
Connected clients:
c13
c13
c12
c11
cmp: /lustre/scratch/f10d.recovery-small: Cannot send after transport endpoint shutdown

cat /lustre/scratch/f10d.recovery-small:
cat: /lustre/scratch/f10d.recovery-small: Cannot send after transport endpoint shutdown
end /lustre/scratch/f10d.recovery-small

cat /lustre/scratch2/f10d.recovery-small:
, worldend /lustre/scratch2/f10d.recovery-small
 recovery-small test_10d: @@@@@@ FAIL: file contents differ
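
The extra output above can be produced by a small wrapper around cmp. The following is only a sketch of that idea, not the debug change actually used for this ticket; the function name compare_and_dump is hypothetical, and the "cat <file>:" / "end <file>" markers are inferred from the log above.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not the actual debug patch):
# compare two files and, if cmp reports a difference or a read error,
# dump both files so the failing side is visible in the test log.
compare_and_dump() {
    local f1=$1 f2=$2 f

    # cmp exits 0 when the files are identical, 1 when they differ,
    # and >1 on trouble such as an I/O error.
    if cmp "$f1" "$f2"; then
        return 0
    fi

    for f in "$f1" "$f2"; do
        echo "cat $f:"
        cat "$f"        # any read error is reported by cat itself on stderr
        echo "end $f"
    done
    return 1
}
```

In a test it could be called as `compare_and_dump "$DIR/$tfile" "$OTHER_MOUNT/$tfile"`, where OTHER_MOUNT is a placeholder for the second mount point. Note that a file without a trailing newline makes the "end" marker run on from the last line of content, which is consistent with the ", worldend" line in the log above.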

I can reproduce this error in about one in 10 runs of recovery-small.
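
A failure rate like this can be measured by running the test in a loop and counting non-zero exits. The sketch below assumes only that the test is started by some command that exits non-zero on failure; count_failures and the script name in the usage note are hypothetical.

```shell
#!/bin/bash
# Run a command repeatedly and report how many runs failed.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    local runs=$1 failures=0 i
    shift

    for ((i = 1; i <= runs; i++)); do
        # Discard the command's output; only its exit status is counted.
        if ! "$@" > /dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done

    echo "$failures/$runs runs failed"
}
```

For example, `count_failures 50 ./run_test_10d.sh`, where run_test_10d.sh stands in for whatever wrapper launches only this test on the cluster.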

Comment by Andreas Dilger [ 20/Jan/17 ]

I checked, and recovery-small test_10d has passed about 250 times in a row on master.
