Lustre / LU-15300

mirror resync can cause EIO to unrelated applications


      I noticed that sanity-flr/200 sometimes hits "checksum error". Here are some findings.

      First of all, the checksum error is caused by an incomplete preceding lfs mirror resync command (which in some cases doesn't return an error).

      In turn, the EIO that lfs hits is caused by the AS_EIO flag on the corresponding mapping.

      AS_EIO is set because an OST_WRITE fails with ESTALE due to an incorrect layout version (the client's version is smaller than the one on the OST).

      So far I've traced all this to a race between two processes:

      • lfs doing resync and changing layout generation
      • another process (say, multiop) doing regular write
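
      To make the chain of events easier to follow, here is a toy model of the race in Python. It is only a sketch of the sequence described above, not Lustre code: the classes ToyOST and ToyMapping, the function simulate_race, and the version numbers are all hypothetical. It assumes that the OST rejects a write carrying an older layout version with ESTALE, and that the error is latched on a mapping shared by every process that has the file open on that client, so whoever checks the mapping next sees EIO.

```python
import errno


class ToyOST:
    """Hypothetical stand-in for an OST object tracking a layout version."""

    def __init__(self, layout_version=1):
        self.layout_version = layout_version

    def write(self, client_layout_version):
        # Assumed check, per the report: a write whose layout version is
        # smaller than the OST's is rejected with ESTALE.
        if client_layout_version < self.layout_version:
            return -errno.ESTALE
        return 0


class ToyMapping:
    """Hypothetical per-file mapping shared by all openers on one client."""

    def __init__(self):
        self.as_eio = False

    def set_error(self):
        # Latch the failure on the mapping (models the AS_EIO flag).
        self.as_eio = True

    def check_errors(self):
        # Test-and-clear: the first caller to check sees EIO.
        if self.as_eio:
            self.as_eio = False
            return -errno.EIO
        return 0


def simulate_race():
    ost = ToyOST(layout_version=1)
    mapping = ToyMapping()

    # The regular writer cached the layout before the resync started.
    writer_layout_version = 1

    # The resync changes the layout generation (modelled as a bump on the OST).
    ost.layout_version += 1

    # The writer's OST_WRITE now carries a stale version and fails.
    rc_write = ost.write(writer_layout_version)
    if rc_write == -errno.ESTALE:
        mapping.set_error()

    # A different process checks the same mapping and gets EIO.
    rc_other = mapping.check_errors()
    return rc_write, rc_other


if __name__ == "__main__":
    print(simulate_race())
```

      In this model the process that receives EIO is not the one whose write was rejected, which is the "unrelated application" effect from the title.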

      I will cite the logs in a subsequent comment.

            bzzz Alex Zhuravlev