
osc checksums caused client evictions during recovery

Details

    • Bug
    • Resolution: Unresolved
    • Blocker

    Description

      Some time ago, Cray found strange client evictions during recovery, following checksum error messages on the server side,
      such as:

      [   96.724020] LustreError: 168-f: lustre-OST0001: BAD WRITE CHECKSUM: from 12345-198.19.0.133@tcp inode [0x200000408:0x8:0x0] object 0x0:39 extent [187883520-230584319]: client csum 6d74f8c, server csum 718ce1df
      

      After some investigation I found that this is caused by the specific IOR parameters used by the Cray testers. Some pages were updated twice: the second half first, and the first half after it.
      Because replay bulk pages are not disconnected from the mapping and are not locked or marked writeback, the client is able to rewrite part of a page after the checksum has been calculated.
      This causes a checksum mismatch during recovery. The client disconnects and tries to replay again, and the error hits again and again, so the client is evicted after the hard recovery timeout.
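      The race described above can be illustrated with a small user-space sketch. This is a toy model in Python, not Lustre or kernel code: the 4096-byte page size, the use of zlib.crc32 as the checksum, and the function name simulate_replay are all assumptions made for illustration.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this toy model


def simulate_replay(rewrite_after_checksum):
    """Return (client_csum, server_csum) for one page sent with a checksum."""
    page = bytearray(PAGE_SIZE)
    half = PAGE_SIZE // 2

    # First update: the application fills the second half of the page.
    page[half:] = b"B" * half

    # The client calculates the checksum that travels with the request.
    client_csum = zlib.crc32(page)

    if rewrite_after_checksum:
        # Second update: the first half is rewritten after the checksum
        # was calculated, because nothing keeps the page stable.
        page[:half] = b"A" * half

    # The server checksums the bytes it actually receives.
    server_csum = zlib.crc32(page)
    return client_csum, server_csum


if __name__ == "__main__":
    for rewrite in (False, True):
        client, server = simulate_replay(rewrite)
        status = "OK" if client == server else "BAD WRITE CHECKSUM"
        print(f"rewrite={rewrite}: client {client:x}, server {server:x}: {status}")
```

      In this model the two checksums agree only when the page is left untouched between the checksum calculation and the transfer, which is the condition that replay currently does not guarantee.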

      I tried several ideas to solve this.
      1) Lock the pages for replay and recalculate the checksum. This caused a deadlock during recovery,
      because some pages may already be locked for the next I/O.

      2) PG_writeback is outside our control and might be released, so the page may become part of the next I/O.

      It would probably be good not to release PG_writeback until replay is done when checksums are enabled, but I am not sure about it.
      Perhaps someone at Whamcloud has a better idea of how to solve it.
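      As a rough illustration of the idea of holding PG_writeback until replay is done, the sketch below models a page whose writeback flag blocks rewrites. It is a toy model in Python under stated assumptions: the Page class, send_with_checksum and replay_done are hypothetical names, zlib.crc32 stands in for the real checksum, and a real writer would wait on the flag instead of being refused.

```python
import zlib


class Page:
    """Toy stand-in for a cached page; not a kernel or Lustre structure."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.writeback = False  # models PG_writeback

    def write(self, offset, payload):
        """Apply the write unless the page is under writeback.

        Returns True if the write was applied. A real writer would
        wait for the flag to clear rather than give up.
        """
        if self.writeback:
            return False
        self.data[offset:offset + len(payload)] = payload
        return True


def send_with_checksum(page):
    """Mark the page writeback and return the checksum sent with it."""
    page.writeback = True
    return zlib.crc32(page.data)


def replay_done(page):
    """Release the flag only once the request no longer needs replay."""
    page.writeback = False


if __name__ == "__main__":
    page = Page()
    page.write(2048, b"B" * 2048)
    csum = send_with_checksum(page)
    print("rewrite applied during replay window:", page.write(0, b"A" * 2048))
    print("checksum still valid:", zlib.crc32(page.data) == csum)
    replay_done(page)
    print("rewrite applied after replay:", page.write(0, b"A" * 2048))
```

      The sketch only shows why a held flag keeps the checksum valid. It says nothing about the cost of stalling writers for the whole replay window, which is the open question here.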

        Activity

          [LU-18052] osc checksums caused client evictions during recovery

          "Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57049
          Subject: LU-18052 osc: disable checksums on recovery
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 6804fe7e1c3c92f212f0b3b58b5ff0aa3ea14830

          Comment above added by Gerrit Updater (gerrit).
          shadow Alexey Lyashkov created issue -

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Alexey Lyashkov (shadow)