Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.1.0
- Lustre 2.1.0-24chaos on both clients and servers. http://github.com/chaos/lustre
- 3
- 4643
Description
Some of our most important users are seeing read() return EIO quite frequently, which completely ruins their job run.
The application uses an IO library to write a file to Lustre. After writing, it closes the file. It then immediately reopens the file, reads the contents back, and calculates a checksum to verify that the data is correct.
During the read phase, it will more-or-less randomly get an EIO on read and abort the entire job.
Both the write and read are performed on the same client, by the same thread. There are usually 16 threads, all writing and reading their own files.
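The access pattern described above can be approximated with a small standalone reproducer. This is only a sketch under stated assumptions, not the users' actual IO library: the file size, chunk size, file names, and the use of SHA-256 are all hypothetical, since the report does not give them. Only the thread count of 16 and the write, close, reopen, read, checksum sequence come from the description.

```python
import hashlib
import os
import threading

NUM_THREADS = 16              # thread count from the report
FILE_SIZE = 4 * 1024 * 1024   # hypothetical: the report gives no file size
CHUNK = 1024 * 1024           # hypothetical read/write size


def write_then_verify(path, size=FILE_SIZE, chunk=CHUNK):
    """Write `size` random bytes to `path`, close, reopen, read back, compare.

    Returns None on success, the OSError if a read failed (for example EIO),
    or the string "checksum mismatch".
    """
    written = hashlib.sha256()
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            block = os.urandom(min(chunk, remaining))
            f.write(block)
            written.update(block)
            remaining -= len(block)
    # The file is closed here; reopen it immediately, as the application does.
    read_back = hashlib.sha256()
    try:
        with open(path, "rb", buffering=0) as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                read_back.update(block)
    except OSError as exc:
        return exc
    if read_back.digest() != written.digest():
        return "checksum mismatch"
    return None


def run(directory, threads=NUM_THREADS, size=FILE_SIZE):
    """Run one write/verify cycle per thread, each thread on its own file."""
    results = {}

    def worker(i):
        path = os.path.join(directory, "file.%d" % i)
        results[i] = write_then_verify(path, size)

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return results
```

Calling `run()` in a loop against a directory on the affected Lustre mount and checking for any non-`None` result would show whether the failure depends on the IO library at all.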
There are no console messages on the client that give any clues to where the problem might be in Lustre. There do not appear to be any evictions that correlate with the read error. A second read of the file will succeed and the checksum is correct, so this is a transient problem.
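To count how often the error is transient, the read can be wrapped in a retry that records each EIO before trying again. The following is a rough sketch: the function name, the default of two attempts, and the `opener` parameter (there only so a failure can be injected for testing) are hypothetical and not part of the application.

```python
import errno
import hashlib


def checksum_with_retry(path, attempts=2, chunk=1024 * 1024, opener=open):
    """Return (hexdigest, eio_count) for `path`.

    Each attempt re-reads the whole file. Only EIO is retried; any other
    error propagates immediately. If every attempt hits EIO, raise EIO.
    """
    eio_count = 0
    for _ in range(attempts):
        digest = hashlib.sha256()
        try:
            with opener(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    digest.update(block)
            return digest.hexdigest(), eio_count
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise
            eio_count += 1
    raise OSError(errno.EIO, "read failed with EIO on every attempt", path)
```

A nonzero `eio_count` together with a correct digest matches the behavior reported here: the first read fails and the second returns good data.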
I am diving into the CLIO code, but it is all new to me, so I could use some tips on where to start my debugging. Perhaps I should start with enabling vfstrace and rpctrace, and adding code to dump the Lustre debug log when vvp_io_read_page() returns EIO...
However, this is only reproducible on the secure network, so code changes are going to be difficult to implement.
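Since kernel-side changes are hard to deploy there, one alternative is to trigger the dump from userspace as soon as read() fails. The sketch below assumes `lctl` is on the path and is run with sufficient privileges; it relies on `lctl set_param debug=+<flag>` and `lctl dk <file>`. The function names, the output path, and the `runner` parameter are hypothetical. It is weaker than a dump inside vvp_io_read_page(), because the debug buffer may wrap before userspace reacts.

```python
import errno
import subprocess

DEBUG_FLAGS = ("vfstrace", "rpctrace")  # the flags mentioned in the report


def enable_cmds(flags=DEBUG_FLAGS):
    """Commands that add each flag to the Lustre debug mask."""
    return [["lctl", "set_param", "debug=+%s" % flag] for flag in flags]


def dump_cmd(out_path):
    """Command that dumps the kernel debug buffer to `out_path`."""
    return ["lctl", "dk", out_path]


def read_with_dump(path, dump_path, chunk=1024 * 1024,
                   runner=subprocess.check_call):
    """Read `path` to EOF; on EIO, dump the debug log, then re-raise."""
    try:
        with open(path, "rb") as f:
            while f.read(chunk):
                pass
    except OSError as exc:
        if exc.errno == errno.EIO:
            runner(dump_cmd(dump_path))
        raise
```

The commands from `enable_cmds()` would be run once before the job starts, for example with `subprocess.check_call(cmd)` for each entry.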