Lustre / LU-15300

mirror resync can cause EIO to unrelated applications

Details

    Description

      I noticed that sanity-flr/200 sometimes hits "checksum error". Here are some findings.

      First of all, the checksum error is caused by a preceding lfs mirror resync command that did not complete (and which, in some cases, doesn't return an error).

      In turn, the EIO that lfs hits is caused by the AS_EIO flag set on the corresponding mapping.

      AS_EIO is set because an OST_WRITE got ESTALE for carrying an incorrect layout version (the client's version is smaller than the one on the OST).

      So far I've traced all of this to a race between two processes:

      • lfs doing the resync and changing the layout generation
      • another process (say, multiop) doing a regular write

      I will cite the logs in a subsequent comment.
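
      To make the suspected sequence easier to follow, here is a toy model of the race in Python. It is only a sketch of the chain described above, not Lustre code: ToyOST, ToyMapping, writeback(), resync_io() and simulate() are hypothetical names, and the real version check and error propagation are certainly more involved.

```python
import errno


class ToyOST:
    """Hypothetical stand-in for an OST object that tracks a layout version."""

    def __init__(self):
        self.layout_version = 1

    def write(self, client_version):
        # Assumed rule from the description: a write stamped with a
        # smaller layout version than the OST's is refused with ESTALE.
        if client_version < self.layout_version:
            return -errno.ESTALE
        return 0


class ToyMapping:
    """Hypothetical stand-in for the file's mapping and its AS_EIO flag."""

    def __init__(self):
        self.as_eio = False


def writeback(ost, mapping, client_version):
    # Regular write (e.g. from multiop): a failed OST_WRITE marks the mapping.
    rc = ost.write(client_version)
    if rc < 0:
        mapping.as_eio = True
    return rc


def resync_io(mapping):
    # I/O done on behalf of lfs: it sees the error left on the shared mapping.
    return -errno.EIO if mapping.as_eio else 0


def simulate(resync_bumps_version):
    ost, mapping = ToyOST(), ToyMapping()
    writer_version = ost.layout_version   # writer samples the layout version
    if resync_bumps_version:
        ost.layout_version += 1           # resync changes the layout generation
    rc_write = writeback(ost, mapping, writer_version)
    rc_resync = resync_io(mapping)
    return rc_write, rc_resync
```

      In this model, simulate(True) reproduces the chain (stale write, ESTALE, flag on the mapping, EIO seen by the resync side), while simulate(False) completes without errors.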

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: bzzz Alex Zhuravlev