Details
Type: Bug
Resolution: Unresolved
Priority: Blocker
Description
Some time ago, Cray found strange client evictions during recovery, following checksum error messages on the server side such as:
[ 96.724020] LustreError: 168-f: lustre-OST0001: BAD WRITE CHECKSUM: from 12345-198.19.0.133@tcp inode [0x200000408:0x8:0x0] object 0x0:39 extent [187883520-230584319]: client csum 6d74f8c, server csum 718ce1df
After some investigation I found this is caused by the particular IOR parameters used by Cray testers. Some pages were updated twice: the second half first, and the first half after it.
Since replay bulk pages are not disconnected from the mapping and are not locked or marked writeback, the client is able to rewrite some portion of a page after the checksum is calculated.
This caused checksum mismatches during recovery. The client is disconnected and tries to replay again, and the error hits again and again, so the client is evicted after the hard recovery timeout.
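The race described above can be illustrated with a small user-space sketch. This is only a toy model, not Lustre code: the page size, the use of CRC32 via `zlib`, and the function names `checksum` and `simulate_replay` are all assumptions chosen for illustration. It shows a page whose second half is written first, a checksum taken for the bulk, and the first half rewritten before the data reaches the server.

```python
import zlib

PAGE_SIZE = 4096          # assumed page size, for illustration only
HALF = PAGE_SIZE // 2


def checksum(buf) -> int:
    """CRC32 over the buffer contents (stand-in for the real bulk checksum)."""
    return zlib.crc32(bytes(buf)) & 0xFFFFFFFF


def simulate_replay(rewrite_after_checksum: bool):
    """Return (client_csum, server_csum) for one simulated bulk write."""
    page = bytearray(PAGE_SIZE)

    # First update: the second half of the page is written.
    page[HALF:] = b"B" * HALF

    # The client calculates the checksum for the bulk.
    client_csum = checksum(page)

    # Second update: the page is not locked, so the first half
    # can still be rewritten after the checksum was calculated.
    if rewrite_after_checksum:
        page[:HALF] = b"A" * HALF

    # The server checksums the data it actually receives.
    server_csum = checksum(page)
    return client_csum, server_csum


if __name__ == "__main__":
    client, server = simulate_replay(rewrite_after_checksum=True)
    print(f"client csum {client:x}, server csum {server:x}")
```

With `rewrite_after_checksum=True` the two values differ, which corresponds to the BAD WRITE CHECKSUM message; with `False` they match.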
I tried several ideas to solve this.
1) Lock the pages for replay and recalculate the checksum. This caused a deadlock during recovery,
because some pages may already be locked for the next IO.
2) PG_writeback is out of our control and might already be released, since the page may be part of the next IO.
It would probably be good not to release PG_writeback until replay is done when checksums are enabled, but I am not sure about it.
Probably someone at Whamcloud has a better idea of how to solve it.
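Idea 2 can be modelled in the same toy fashion. The sketch below is an assumption-laden illustration, not kernel or Lustre code: the `Page` class, its `writeback` flag, and the helpers `send_with_writeback_held` and `replay_done` are hypothetical names. It only shows that if the writeback state were held until replay completes, a later write could not change the data under the checksum. It says nothing about the deadlock or next-IO concerns raised above.

```python
import zlib


class Page:
    """Toy page with a writeback flag (hypothetical model, not a kernel page)."""

    def __init__(self, size: int = 4096):
        self.data = bytearray(size)
        self.writeback = False

    def write(self, offset: int, payload: bytes) -> bool:
        # A real kernel would make the writer wait; this toy model
        # simply refuses the write while writeback is held.
        if self.writeback:
            return False
        self.data[offset:offset + len(payload)] = payload
        return True


def send_with_writeback_held(page: Page) -> int:
    """Mark the page writeback and return the checksum sent with the bulk."""
    page.writeback = True
    return zlib.crc32(bytes(page.data)) & 0xFFFFFFFF


def replay_done(page: Page) -> None:
    """Release the writeback state once replay has completed."""
    page.writeback = False


if __name__ == "__main__":
    page = Page()
    page.write(2048, b"B" * 2048)            # second half first
    csum = send_with_writeback_held(page)
    blocked = not page.write(0, b"A" * 2048)  # first half, during replay
    print("write blocked during replay:", blocked)
    replay_done(page)
```

In this model the checksum stays valid for every replay attempt, at the cost of delaying the second write until replay is done.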
"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57049
Subject: LU-18052 osc: disable checksums on recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6804fe7e1c3c92f212f0b3b58b5ff0aa3ea14830