  1. Lustre
  2. LU-8376

Enhance debugging information available for Lustre checksum errors

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.10.0
    • Labels:
    • Severity: 3

      Description

      Add the ability to dump all of the data from bad RPCs on both the client and the server. This would be enabled by a /proc control (off by default), such as /proc/fs/lustre/osc/<target>/checksum_dump and /proc/fs/lustre/ost/<target>/checksum_dump, so that we get both sides of the transfer to compare.

      When a bad checksum is hit, it would (in a manner similar to how we dump logs on LBUG) write a file named like /tmp/[fid]:[offset-range]-clientcksum-servercksum on both the server and the client, but only if this file does not yet exist, so there will be only one file per node no matter how many retransmits there were.
      The file will contain the page content from the RPC. We can then compare the RPC data on the server and the client and see what changed in between, to perhaps gain better insight into what is going on.
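
      The dump-once behaviour described above can be illustrated with a minimal user-space sketch. This is not the Lustre kernel implementation: the function names, the hexadecimal checksum formatting, and the exact rendering of the offset range are assumptions made for illustration only. It relies on exclusive file creation so that retransmits of the same bad RPC do not rewrite the dump.

```python
import os
import tempfile


def dump_name(fid, start, end, client_cksum, server_cksum):
    # Hypothetical rendering of "[fid]:[offset-range]-clientcksum-servercksum";
    # the real naming details are not specified in the description.
    return f"{fid}:[{start}-{end}]-{client_cksum:08x}-{server_cksum:08x}"


def dump_rpc_pages(dump_dir, fid, start, end, client_cksum, server_cksum, pages):
    """Write the RPC page contents once; return the path, or None if it already exists."""
    path = os.path.join(
        dump_dir, dump_name(fid, start, end, client_cksum, server_cksum)
    )
    try:
        # O_EXCL makes creation fail if the file is already there, so a
        # retransmit of the same bad RPC leaves the first dump untouched.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return None
    with os.fdopen(fd, "wb") as f:
        for page in pages:
            f.write(page)
    return path


if __name__ == "__main__":
    # Demo in a scratch directory standing in for /tmp on one node.
    scratch = tempfile.mkdtemp()
    pages = [b"a" * 4096, b"b" * 4096]
    first = dump_rpc_pages(scratch, "0x200000401:0x1:0x0", 0, 8191, 0x1234, 0xABCD, pages)
    again = dump_rpc_pages(scratch, "0x200000401:0x1:0x0", 0, 8191, 0x1234, 0xABCD, pages)
    print(first is not None, again is None)
```

      Running the same sketch on the client-side and server-side data would yield two files with identical names, which can then be compared byte by byte.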

      Per-page intermediate (partial) checksums will also be printed during error breakdown, on both sides, to help determine where the drift starts.
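
      As a rough illustration of how per-page intermediate checksums help locate the drift, the sketch below computes a running CRC32 after each page and reports the first page at which the client and server values differ. The 4096-byte page size, the choice of CRC32 via Python's zlib, and the helper names are assumptions for this example; they are not taken from the Lustre code.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def running_checksums(data, page_size=PAGE_SIZE):
    """Return the cumulative CRC32 after each page of data."""
    sums = []
    crc = 0
    for off in range(0, len(data), page_size):
        # Passing the previous value continues the checksum across pages.
        crc = zlib.crc32(data[off:off + page_size], crc)
        sums.append(crc)
    return sums


def first_divergent_page(client_sums, server_sums):
    """Return the index of the first page whose running checksum differs, or None."""
    for i, (c, s) in enumerate(zip(client_sums, server_sums)):
        if c != s:
            return i
    return None


if __name__ == "__main__":
    client = bytes(3 * PAGE_SIZE)
    server = bytearray(client)
    server[PAGE_SIZE + 10] ^= 0x01  # corrupt one bit in the second page
    page = first_divergent_page(
        running_checksums(client), running_checksums(bytes(server))
    )
    print("first divergent page:", page)
```

      Because the checksum is cumulative, every page after the first corrupted one also shows a mismatch, so only the first differing index is meaningful.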


              People

              • Assignee:
                bfaccini Bruno Faccini (Inactive)
                Reporter:
                bfaccini Bruno Faccini (Inactive)
