[LU-6545] MPIIO short reads Created: 29/Apr/15 Updated: 15/Oct/15 Resolved: 15/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Client: 2.4.3 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
One of our filesystems is experiencing what we believe are short reads, which result in NaNs when using the MPI-IO function call 'MPI_FILE_READ_AT_ALL'. This reproduces every time the data is read from disk rather than from cache: running echo 1 > /proc/sys/vm/drop_caches and then running the code produces the error every time, but running the code a second or third time does not. NOTE: I have captured a full Lustre debug trace on the client and will upload it to the tftp site. |
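The sketch below is not the reproducer from this ticket. It only illustrates, under stated assumptions, how a short read could surface as NaNs. It assumes the receive buffer holds NaN bit patterns before the read (here it is pre-filled on purpose) and that the caller does not compare the returned byte count with the requested size. The helper name `read_doubles` and the use of plain POSIX-style I/O instead of MPI-IO are illustrative choices. In an MPI program, the equivalent check would probably inspect the read's status with `MPI_Get_count`.

```python
import math
import struct

# Byte pattern of one little-endian float64 NaN, used as a sentinel.
NAN_BYTES = struct.pack("<d", math.nan)


def read_doubles(path, offset, count):
    """Read `count` float64 values at `offset` into a NaN-filled buffer.

    Returns (bytes_received, nan_count). If fewer bytes arrive than
    requested, the untouched tail of the buffer still holds NaNs.
    """
    buf = bytearray(NAN_BYTES * count)
    # buffering=0 gives a raw file object whose readinto() reports
    # the number of bytes actually read by a single read call.
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        received = f.readinto(buf)
    values = struct.unpack(f"<{count}d", buf)
    nan_count = sum(1 for v in values if math.isnan(v))
    return received, nan_count
```

Comparing `received` against `count * 8` after each read would flag a short read directly, rather than leaving it to show up later as NaNs in the results.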
| Comments |
| Comment by Mahmoud Hanafi [ 29/Apr/15 ] |
|
debug logs uploaded to /uploads/LU6545/r401i0n14.failure2.gz |
| Comment by Jinshan Xiong (Inactive) [ 29/Apr/15 ] |
|
Is this |
| Comment by Mahmoud Hanafi [ 30/Apr/15 ] |
|
It is the same, but we have a 100% reproducer. |
| Comment by Patrick Farrell (Inactive) [ 30/Apr/15 ] |
|
Mahmoud - Are you able to share a reproducer publicly? It would be nice to have if so. |
| Comment by Peter Jones [ 01/May/15 ] |
|
As per the discussion on today's call with NASA they are going to try out the fix from |
| Comment by Mahmoud Hanafi [ 01/May/15 ] |
|
We tested the 2.4.3 build and it fixed the issue. Debugging showed that the read restarts are getting triggered. We will test the 2.5.3 client next week. |
| Comment by Mahmoud Hanafi [ 01/May/15 ] |
|
I will upload the reproducer next week |
| Comment by Peter Jones [ 15/Oct/15 ] |
|
As per NASA, the fix worked.