  1. Lustre
  2. LU-1320

EIO on read shortly after file written

    Description

      Some of our most important users are seeing read() return EIO quite frequently, which completely ruins their job run.

      The application uses an I/O library to write a file to Lustre. After writing, it closes the file, immediately reopens it, reads the contents back, and calculates a checksum to verify that the data is correct.

      During the read phase, read() intermittently returns EIO, with no obvious pattern, and the application aborts the entire job.

      Both the write and read are performed on the same client, by the same thread. There are usually 16 threads, all writing and reading their own files.

      There are no console messages on the client that give any clue as to where the problem might be in Lustre. There do not appear to be any evictions that correlate with the read error. A second read of the file succeeds and the checksum is correct, so this is a transient problem.
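      The access pattern described above can be approximated with a small stand-alone reproducer. This is only a sketch under assumptions: it does not use the application's real I/O library, and the names `write_then_verify` and `run`, the SHA-256 checksum, the 1 MiB chunk size, and the file naming are all made up for illustration. Each thread writes its own file, closes it, reopens it, and checksums the contents; if the first read fails with EIO, it reads a second time and records whether that read verifies.

```python
import errno
import hashlib
import os
import threading

CHUNK = 1 << 20  # 1 MiB per read call (arbitrary choice for this sketch)


def write_file(path, payload):
    """Write payload, close the file, and return the expected checksum."""
    with open(path, "wb") as f:
        f.write(payload)
    return hashlib.sha256(payload).hexdigest()


def read_checksum(path):
    """Reopen the file and checksum its contents; OSError propagates."""
    digest = hashlib.sha256()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def write_then_verify(path, payload):
    """Return (first_read_eio, checksum_ok) for one write/reread cycle."""
    expected = write_file(path, payload)
    first_read_eio = False
    try:
        actual = read_checksum(path)
    except OSError as e:
        if e.errno != errno.EIO:
            raise
        first_read_eio = True
        # Second read of the same file, which the report says succeeds.
        actual = read_checksum(path)
    return first_read_eio, actual == expected


def run(directory, threads=16, size=8 * CHUNK, iterations=1):
    """Run the cycle in `threads` threads, each on its own file."""
    results = []
    lock = threading.Lock()

    def worker(idx):
        path = os.path.join(directory, f"repro.{idx}")
        payload = os.urandom(size)
        for _ in range(iterations):
            outcome = write_then_verify(path, payload)
            with lock:
                results.append(outcome)
        os.unlink(path)

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return results
```

      Calling `run()` on a directory inside the Lustre mount and counting the entries whose first element is `True` would give a rough EIO rate without involving the real application. Whether such a simplified pattern triggers the problem at all is an open question.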

      I am diving into the CLIO code, but it is all new to me so I could use some tips for where to start my debugging. Perhaps I should start with enabling vfstrace and rpctrace, and adding code to dump the lustre log when vvp_io_read_page() returns EIO...

      However, this is only reproducible on the secure network, so code changes are going to be difficult to implement.
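      Since client code changes are hard to get onto that network, one possible alternative is to dump the debug log from user space as soon as a read fails. The sketch below assumes `lctl` is on the PATH and the caller has the privileges to run it; it uses `lctl set_param debug=+<flag>` to add the vfstrace and rpctrace flags and `lctl dk <file>` to dump the kernel debug buffer. The helper names (`enable_cmds`, `dump_cmd`, `read_with_dump`) and the `runner` parameter are hypothetical. Note that this is not equivalent to dumping from inside vvp_io_read_page(): the dump happens only after read() has returned to user space, so the relevant entries may already have been overwritten on a busy client.

```python
import errno
import subprocess

DEBUG_FLAGS = ("vfstrace", "rpctrace")


def enable_cmds(flags=DEBUG_FLAGS):
    """Build one `lctl set_param debug=+<flag>` command per flag."""
    return [["lctl", "set_param", f"debug=+{flag}"] for flag in flags]


def dump_cmd(log_path):
    """Build the `lctl dk <file>` command that dumps the debug buffer."""
    return ["lctl", "dk", log_path]


def read_with_dump(read_fn, log_path, runner=subprocess.run):
    """Call read_fn(); on EIO dump the debug log, then re-raise the error."""
    try:
        return read_fn()
    except OSError as e:
        if e.errno == errno.EIO:
            # Dump as early as possible, before the buffer wraps.
            runner(dump_cmd(log_path), check=False)
        raise
```

      The commands from `enable_cmds()` would be run once (for example with `subprocess.run`) before the job starts, and the read phase would then be wrapped in `read_with_dump()`. Passing a different `runner` makes it possible to exercise the logic on a machine without Lustre.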

            People

              jay Jinshan Xiong (Inactive)
              morrone Christopher Morrone (Inactive)