[LU-6545] MPIIO short reads Created: 29/Apr/15  Updated: 15/Oct/15  Resolved: 15/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Client: 2.4.3
server: 2.4.3


Issue Links:
Related
is related to LU-6389 read()/write() returning less than av... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

One of our filesystems is experiencing what we believe are short reads, which result in NaNs when using the MPI-IO function call 'MPI_FILE_READ_AT_ALL'.

This can be reproduced every time if the data is read from disk rather than from cache. Running echo 1 > /proc/sys/vm/drop_caches and then running the code produces the error every time, but running the code a second or third time (without dropping caches again) does not produce the error.
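
For illustration only, a short read at the POSIX level looks like a read call returning fewer bytes than requested. The sketch below is not part of the reporter's reproducer and is not how MPI-IO or Lustre handle the condition; it is a minimal Python example, with the hypothetical helper name pread_full, of a caller that loops until the full range is read and counts how many calls came back short.

```python
import os


def pread_full(fd, length, offset):
    """Read up to `length` bytes starting at `offset`, retrying after short reads.

    Hypothetical helper for illustration. Returns (data, short_reads), where
    short_reads counts pread() calls that returned some data but fewer bytes
    than were requested. A read truncated by end of file is counted as well.
    """
    chunks = []
    short_reads = 0
    remaining = length
    while remaining > 0:
        chunk = os.pread(fd, remaining, offset)
        if not chunk:
            # Zero bytes returned: end of file, nothing more to read.
            break
        if len(chunk) < remaining:
            short_reads += 1
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks), short_reads
```

A caller that issues a single read and assumes the buffer is full would instead process whatever uninitialized or stale bytes follow the returned data, which is one way garbage values such as NaNs could appear downstream.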

NOTE:
This occurs only when the file is striped across more than one OST.
In the debug logs, the data file has a FID of [0x2000b2ebc:0x358:0x0].
During debugging I disabled readahead.

I have captured a full Lustre debug trace on the client and will upload it to the tftp site.



 Comments   
Comment by Mahmoud Hanafi [ 29/Apr/15 ]

debug logs uploaded to /uploads/LU6545/r401i0n14.failure2.gz

Comment by Jinshan Xiong (Inactive) [ 29/Apr/15 ]

Is this LU-6389?

Comment by Mahmoud Hanafi [ 30/Apr/15 ]

It is the same, but we have a 100% reproducer.

Comment by Patrick Farrell (Inactive) [ 30/Apr/15 ]

Mahmoud - Are you able to share a reproducer publicly? It would be nice to have if so.

Comment by Peter Jones [ 01/May/15 ]

As per the discussion on today's call with NASA, they are going to try out the fix from LU-6389 to see whether it fixes the issues exposed by their reproducer.

Comment by Mahmoud Hanafi [ 01/May/15 ]

We tested the 2.4.3 build and it fixed the issue. Debugging showed that the read restarts are being triggered.

We will test the 2.5.3 client next week.

Comment by Mahmoud Hanafi [ 01/May/15 ]

I will upload the reproducer next week.

Comment by Peter Jones [ 15/Oct/15 ]

As per NASA, the fix worked.
