Details
Bug
Resolution: Fixed
Major
None
3
Description
Add the ability to dump all of the data from bad RPCs on both the client and the server. This would be enabled by a /proc control (off by default), such as /proc/fs/lustre/osc/<target>/checksum_dump and /proc/fs/lustre/ost/<target>/checksum_dump, so that we get both sides of the transfer to compare.
When a bad checksum is hit, it would (in a manner similar to how we dump logs on LBUG) write a file named like /tmp/[fid]:[offset-range]-clientcksum-servercksum on both the server and the client, but only if this file does not yet exist, so there will be only one file per node no matter how many retransmits there were.
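The "write only if the file does not yet exist" behaviour can be sketched in user-space C as follows. This is only an illustration, not the actual Lustre code, which would run in the kernel: the helper names cksum_dump_name() and cksum_dump_write() are hypothetical, and the exact formatting of the fid, offset range and checksums in the file name is an assumption based on the pattern above. Opening with O_CREAT | O_EXCL is what makes retransmits of the same RPC skip the dump.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "<dir>/<fid>:[<start>-<end>]-<client>-<server>".
 * The field formatting is an assumption; the ticket only gives the pattern.
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int cksum_dump_name(char *buf, size_t len, const char *dir,
                           const char *fid, uint64_t start, uint64_t end,
                           uint32_t client_cksum, uint32_t server_cksum)
{
    int n = snprintf(buf, len, "%s/%s:[%llu-%llu]-%08x-%08x", dir, fid,
                     (unsigned long long)start, (unsigned long long)end,
                     (unsigned int)client_cksum, (unsigned int)server_cksum);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write the RPC page data to path, but only if the
 * file does not exist yet. O_EXCL makes the existence check and the
 * creation atomic, so a retransmit finds the file and skips the dump.
 * Returns 1 if the dump was written, 0 if it already existed, -1 on error.
 */
static int cksum_dump_write(const char *path, const void *data, size_t len)
{
    const char *p = data;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return errno == EEXIST ? 0 : -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the write */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return close(fd) == 0 ? 1 : -1;
}
```

Because both checksums are part of the name, a retransmit that fails with the same pair of checksums maps to the same file and is skipped.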
The file will contain the page contents from the RPC, so we can compare the RPC data on the server and the client, see what changed between them, and perhaps gain better insight into what is going on.
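Once both dumps have been read into memory, the comparison can be as simple as the sketch below. It is an offline illustration only: first_diff_page() is a hypothetical name, and the 4096-byte page size is an assumption that should match the page size actually used for the transfer.

```c
#include <stddef.h>
#include <string.h>

#define DUMP_PAGE_SIZE 4096     /* assumed page size */

/*
 * Hypothetical helper: compare the client and server dumps page by page.
 * Returns the index of the first page whose contents differ, or -1 if the
 * two buffers are identical over len bytes.
 */
static long first_diff_page(const unsigned char *client,
                            const unsigned char *server, size_t len)
{
    for (size_t off = 0; off < len; off += DUMP_PAGE_SIZE) {
        /* The last page may be shorter than a full page. */
        size_t chunk = len - off < DUMP_PAGE_SIZE ? len - off
                                                  : DUMP_PAGE_SIZE;

        if (memcmp(client + off, server + off, chunk) != 0)
            return (long)(off / DUMP_PAGE_SIZE);
    }
    return -1;
}
```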
Per-page intermediate (partial) checksums will also be printed during the error breakdown on both sides, to help determine where the drift starts.
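To illustrate how per-page partial checksums localize the drift, here is a small user-space sketch. It uses Adler-32 purely as a stand-in, since the ticket does not say which checksum algorithm is involved; the function names and the 4096-byte page size are likewise assumptions. Each side records the running checksum after every page, and the first index where the two lists disagree is the page where the data started to differ.

```c
#include <stddef.h>
#include <stdint.h>

#define CKSUM_PAGE_SIZE 4096    /* assumed page size */
#define ADLER_MOD 65521u        /* largest prime below 2^16 */

/* Continue an Adler-32 checksum over buf; start with adler = 1. */
static uint32_t adler32_update(uint32_t adler, const unsigned char *buf,
                               size_t len)
{
    uint32_t a = adler & 0xffffu;
    uint32_t b = adler >> 16;

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

/*
 * Store the running checksum after each page in partial[].
 * Returns the number of entries filled (at most max_pages).
 */
static size_t partial_cksums(const unsigned char *data, size_t len,
                             uint32_t *partial, size_t max_pages)
{
    uint32_t cksum = 1;         /* Adler-32 initial value */
    size_t n = 0;

    for (size_t off = 0; off < len && n < max_pages;
         off += CKSUM_PAGE_SIZE) {
        size_t chunk = len - off < CKSUM_PAGE_SIZE ? len - off
                                                   : CKSUM_PAGE_SIZE;

        cksum = adler32_update(cksum, data + off, chunk);
        partial[n++] = cksum;
    }
    return n;
}

/* First page index at which the two partial checksum lists disagree. */
static long first_drift_page(const uint32_t *client, const uint32_t *server,
                             size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (client[i] != server[i])
            return (long)i;
    return -1;
}
```

Because the checksum is cumulative, every partial value before the corrupted page still matches, so the first mismatch points at the page where the drift begins.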