[LU-1320] EIO on read shortly after file written - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.3.0, Lustre 2.1.2
Affects Version/s: Lustre 2.1.0
Labels: paj
Environment: Lustre 2.1.0-24chaos on both clients and servers. http://github.com/chaos/lustre
Severity: 3
Rank (Obsolete): 4643

Description

Some of our most important users are seeing read() return EIO quite frequently, which completely ruins their job run.

The application uses an IO library to write a file to Lustre. After writing, it closes the file. It then immediately reopens the file, reads the contents back, and calculates a checksum to verify that the data is correct.

During the read phase, it will more-or-less randomly get an EIO on read and abort the entire job.

Both the write and read are performed on the same client, by the same thread. There are usually 16 threads, all writing and reading their own files.
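The access pattern described above can be sketched as a small user-space reproducer. This is only an illustration of the pattern, not the application's actual IO library: the chunk size, file size, file names, and the choice of SHA-256 as the checksum are all assumptions, since the report does not give them.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # I/O size in bytes (assumed; the report gives no sizes)


def write_then_verify(path, payload):
    """Write payload, close, reopen, read back, and compare checksums.

    Any OSError raised by os.read (including errno.EIO) propagates
    to the caller, which is where the reported failure would show up.
    """
    expected = hashlib.sha256(payload).hexdigest()

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            written = os.write(fd, view[:CHUNK])
            view = view[written:]
    finally:
        os.close(fd)

    # Reopen immediately after close, as the application does.
    actual = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            block = os.read(fd, CHUNK)
            if not block:
                break
            actual.update(block)
    finally:
        os.close(fd)
    return actual.hexdigest() == expected


def run_workers(directory, nthreads=16, size=4 * CHUNK):
    """Run nthreads workers, each writing and verifying its own file."""
    def worker(i):
        payload = os.urandom(size)
        path = os.path.join(directory, f"verify.{i}")  # hypothetical name
        return write_then_verify(path, payload)

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        return list(pool.map(worker, range(nthreads)))
```

Calling `run_workers()` with a directory on the Lustre mount would exercise the same write, close, reopen, read sequence with 16 threads. Whether this simplified pattern is enough to trigger the EIO is not known.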

There are no console messages on the client that give any clues as to where in Lustre the problem might be. There do not appear to be any evictions that correlate with the read error. A second read of the file will succeed and the checksum is correct, so this is a transient problem.
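Because the error is transient, an application could in principle mask it by retrying the read. The sketch below shows that idea only; it is not a fix for the underlying bug. The function name and the default of one extra attempt are hypothetical, and it assumes a retry on the same file descriptor behaves like the "second read of the file" described above, which the report does not confirm.

```python
import errno
import os


def read_with_retry(fd, length, offset, retries=1):
    """pread() that retries when the call fails with EIO.

    retries is the number of extra attempts (assumed default of 1).
    Errors other than EIO, or EIO after the last attempt, are re-raised.
    """
    attempt = 0
    while True:
        try:
            return os.pread(fd, length, offset)
        except OSError as exc:
            if exc.errno != errno.EIO or attempt >= retries:
                raise
            attempt += 1
```

Using `os.pread` with an explicit offset keeps the retry independent of the file position, so a failed attempt cannot leave the descriptor at an unexpected offset.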

I am diving into the CLIO code, but it is all new to me, so I could use some tips on where to start debugging. Perhaps I should start by enabling vfstrace and rpctrace, and adding code to dump the Lustre log when vvp_io_read_page() returns EIO.

However, this is only reproducible on the secure network, so code changes are going to be difficult to implement.
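One way to approximate the debugging plan above from user space is to enable the two debug flags with `lctl set_param` and dump the kernel debug buffer with `lctl dk` as soon as a read fails with EIO. The sketch below is a rough illustration under assumptions: the helper names and the log path are hypothetical, the commands need root, and it hooks the read in the test program rather than `vvp_io_read_page()` itself, so the relevant log entries may already have been overwritten by the time the dump runs.

```python
import errno
import os
import subprocess

# Debug flags named in the report.
DEBUG_FLAGS = ("vfstrace", "rpctrace")


def enable_cmds(flags=DEBUG_FLAGS):
    """Return argv lists that add each flag to the Lustre debug mask."""
    return [["lctl", "set_param", f"debug=+{flag}"] for flag in flags]


def dump_cmd(log_path):
    """Return the argv that dumps the kernel debug buffer to log_path."""
    return ["lctl", "dk", log_path]


def read_or_dump(fd, length, log_path, run=subprocess.run):
    """os.read that dumps the Lustre debug log before re-raising EIO."""
    try:
        return os.read(fd, length)
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Dump right away; the debug buffer is a ring and wraps.
            run(dump_cmd(log_path), check=False)
        raise
```

Before starting the job, each argv from `enable_cmds()` would be passed to `subprocess.run` once on the client. The `run` parameter exists only so the dump step can be replaced with a stub where `lctl` is not available.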

Trackbacks

Changelog 2.1: Changes from version 2.1.1 to version 2.1.2. Server support for kernels: 2.6.18-308.4.1.el5 (RHEL5), 2.6.32-220.17.1.el6 (RHEL6). Client support for unpatched kernels: 2.6.18-308.4.1.el5 (RHEL5), 2.6.32-220.17.1....

People

Assignee: Jinshan Xiong (Inactive)

Reporter: Christopher Morrone (Inactive)

Votes: 0

Watchers: 4

Dates

Created: 12/Apr/12 11:20 PM

Updated: 02/May/12 3:37 PM

Resolved: 30/Apr/12 12:04 PM