Details
-
Improvement
-
Resolution: Unresolved
-
Minor
Description
Server-side corruption of pages in the read cache, particularly with T10-PI, does not appear to be handled properly. The client detects the corruption through the RPC checksum mismatch and resends the RPC, but every resend is served the same data from the server cache. If the server uses the (incorrect) GRD tags on the cached pages to generate the RPC checksum, the checksum will be wrong on every attempt:
nbp17-OST0065: BAD READ CHECKSUM: from [10.151.27.142@o2ib] inode [0x20000948a:0x3:0x0] object 0x0:141666 extent [503316480-1509953535], client 73006b, server 10500b1, cksum_type 80
nbp17-OST0065: BAD READ CHECKSUM: from [10.151.27.142@o2ib] inode [0x20000948a:0x3:0x0] object 0x0:141666 extent [503316480-1509953535], client 73006b, server 10500b1, cksum_type 80
nbp17-OST0065: BAD READ CHECKSUM: from [10.151.27.142@o2ib] inode [0x20000948a:0x3:0x0] object 0x0:141666 extent [503316480-1509953535], client 73006b, server 10500b1, cksum_type 80
nbp17-OST0065: BAD READ CHECKSUM: from [10.151.27.142@o2ib] inode [0x20000948a:0x3:0x0] object 0x0:141666 extent [503316480-1509953535], client 73006b, server 10500b1, cksum_type 80
nbp17-OST0065: BAD READ CHECKSUM: from [10.151.27.142@o2ib] inode [0x20000948a:0x3:0x0] object 0x0:141666 extent [503316480-1509953535], client 73006b, server 10500b1, cksum_type 80
nbp17-OST0065: BAD READ CHECKSUM: from [10.151.27.142@o2ib] inode [0x20000948a:0x3:0x0] object 0x0:141666 extent [503316480-1509953535], client 73006b, server 10500b1, cksum_type 80
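To illustrate why the resends above can never succeed, here is a toy Python model of the failure, not Lustre code. It assumes 512-byte sectors, CRC-16/T10-DIF guard tags, and an RPC checksum that is simply a CRC32 over the packed guard tags; `server_read`, `client_read`, and `rpc_checksum` are hypothetical names, and the real checksum construction in Lustre is more involved.

```python
import zlib

SECTOR_SIZE = 512  # assumed protection interval for this sketch


def crc16_t10dif(data):
    """CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def guard_tags(data):
    """One guard tag per sector, computed from the data itself."""
    return [crc16_t10dif(data[i:i + SECTOR_SIZE])
            for i in range(0, len(data), SECTOR_SIZE)]


def rpc_checksum(tags):
    """Stand-in RPC checksum (assumption): CRC32 over the packed guard tags."""
    return zlib.crc32(b"".join(t.to_bytes(2, "big") for t in tags))


def server_read(cache, key):
    """Serve from the cache and trust the cached tags without re-verifying."""
    data, tags = cache[key]
    return data, rpc_checksum(tags)


def client_read(cache, key, max_resends=5):
    """Return one pass/fail result per attempt, stopping at the first success."""
    results = []
    for _ in range(1 + max_resends):
        data, server_sum = server_read(cache, key)
        # The client recomputes the guard tags from the data it received.
        results.append(rpc_checksum(guard_tags(data)) == server_sum)
        if results[-1]:
            break
    return results


good = bytes(range(256)) * 4        # two 512-byte sectors
tags = guard_tags(good)             # tags that match the original data
corrupt = b"\xff" + good[1:]        # first byte corrupted in the cached copy
cache = {"obj": (corrupt, tags)}    # cached data and tags no longer agree
attempts = client_read(cache, "obj")
print(attempts)
```

Every attempt fails in this model because nothing changes between resends: the same cached page and the same cached tags are used each time.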
What should happen instead: if the client sets the OBD_FL_RECOV_RESEND flag in the OST_READ RPC, the server should discard any cached pages in that range, re-read the pages/sectors from the underlying storage (bypassing the cache), and verify the GRD tag of each sector locally by calculating it in osd-ldiskfs and comparing it to the GRD tag returned by the kernel. It should then immediately print an error identifying which sector(s) do not match, instead of depending on the client to detect the mismatch again.
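The proposed resend handling can be sketched roughly as follows. This is a toy Python model under stated assumptions, not the osd-ldiskfs implementation: `ToyOsd`, `RESEND`, and the `disk`/`cache` dictionaries are hypothetical stand-ins, sectors are assumed to be 512 bytes, and the guard tag is assumed to be CRC-16/T10-DIF.

```python
import logging

SECTOR_SIZE = 512   # assumed protection interval for this sketch
RESEND = 0x1        # hypothetical stand-in for OBD_FL_RECOV_RESEND

log = logging.getLogger("toy_osd")


def crc16_t10dif(data):
    """CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def guard_tags(data):
    """One guard tag per sector, computed from the data itself."""
    return [crc16_t10dif(data[i:i + SECTOR_SIZE])
            for i in range(0, len(data), SECTOR_SIZE)]


class ToyOsd:
    """'disk' maps key -> (data, stored tags); 'cache' may hold a bad copy."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}

    def read(self, key, flags=0):
        """Return (data, tags, bad_sectors); bad_sectors is only filled on resend."""
        if flags & RESEND:
            self.cache.pop(key, None)       # discard the cached pages
            data, tags = self.disk[key]     # re-read, bypassing the cache
            computed = guard_tags(data)     # verify each sector locally
            bad = [i for i, (t, c) in enumerate(zip(tags, computed)) if t != c]
            if bad:
                log.error("%s: guard tag mismatch in sector(s) %s", key, bad)
            return data, tags, bad
        if key not in self.cache:
            self.cache[key] = self.disk[key]
        data, tags = self.cache[key]
        return data, tags, []


good = bytes(range(256)) * 4                          # two 512-byte sectors
osd = ToyOsd({"obj": (good, guard_tags(good))})
osd.cache["obj"] = (b"\xff" + good[1:], guard_tags(good))  # corrupt cached copy

first = osd.read("obj")                  # normal read returns the corrupt page
retry = osd.read("obj", flags=RESEND)    # resend drops the cache and re-verifies

# Separate case: the data on storage itself does not match its stored tags.
bad_disk = ToyOsd({"obj": (good[:-1] + b"\x00", guard_tags(good))})
on_disk = bad_disk.read("obj", flags=RESEND)
```

In the first case the resend returns clean data with no bad sectors; in the second the server itself reports sector 1, which is the point of verifying locally rather than waiting for another client-side checksum failure.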
It would also be useful to be able to (somehow) send a block command (FUA?) that flushes the SFA cache in this case, but that would need some help from the SFA team, and it still depends on Lustre handling this case correctly.