[LU-18052] osc checksums caused an client evictions during recovery - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Blocker
Fix Version/s: None
Affects Version/s: None
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Some time ago, Cray found strange client evictions during recovery, following checksum error messages on the server side such as:

[   96.724020] LustreError: 168-f: lustre-OST0001: BAD WRITE CHECKSUM: from 12345-198.19.0.133@tcp inode [0x200000408:0x8:0x0] object 0x0:39 extent [187883520-230584319]: client csum 6d74f8c, server csum 718ce1df

After some investigation I found that this is caused by special IOR parameters used by the Cray testers. Some pages were updated twice: the second half first, and the first half after it.
Since replay bulk pages are not disconnected from the mapping and are not locked or kept marked as writeback, the client is able to rewrite some portion of a page after the checksum has been calculated.
This causes a checksum mismatch during recovery. The client is disconnected and tries to replay again, and the error hits again and again, so the client is evicted after the hard recovery timeout.
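
A minimal sketch of the race described above, as a userspace toy model rather than Lustre code. The names simulate_replay and late_write are hypothetical, and CRC32 from zlib stands in for whichever checksum type the client and server negotiated. It only shows that a write landing between the checksum calculation and the bulk transfer makes the two checksums disagree.

```python
import zlib

PAGE_SIZE = 4096


def simulate_replay(page, late_write=None):
    """Return (client_csum, server_csum) for one replayed page.

    late_write is an optional (offset, data) tuple modelling an
    application write that lands after the client computed its
    checksum but before the page contents reach the server.
    """
    # Client side: the checksum is taken over the current page contents.
    client_csum = zlib.crc32(bytes(page))

    # The page is neither locked nor otherwise kept stable,
    # so a writer can still modify it here.
    if late_write is not None:
        offset, data = late_write
        page[offset:offset + len(data)] = data

    # Server side: the checksum is taken over the bytes actually received.
    server_csum = zlib.crc32(bytes(page))
    return client_csum, server_csum


# The second half of the page is written first, as in the IOR pattern.
page = bytearray(PAGE_SIZE)
half = PAGE_SIZE // 2
page[half:] = b"B" * half

# Replay with no concurrent writer: the checksums agree.
quiet = simulate_replay(bytearray(page))

# Replay while the start of the first half is being written: they differ.
racy = simulate_replay(bytearray(page), late_write=(0, b"AAAA"))

print("no late write:", quiet[0] == quiet[1])
print("late write   :", racy[0] == racy[1])
```

In this model every replay attempt that overlaps a late write fails the same way, which matches the repeated disconnect and replay cycle that ends in eviction.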

I tried several ideas to solve this.
1) Locking the page for replay and recalculating the checksum: this caused a deadlock during recovery,
because some pages may already be locked for the next IO.

2) PG_writeback is out of control: it might be released, and the page may then be part of the next IO.

It would probably be good not to release PG_writeback until replay is done when checksums are enabled, but I am not sure about it.
Perhaps someone at Whamcloud has a better idea of how to solve it.
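
The idea of holding the writeback state until replay is done can be sketched the same way. This is a single-threaded toy model under stated assumptions, not kernel or Lustre code: the Page class, its writeback flag, and the pending queue are all hypothetical, and a deferred write stands in for a writer that would sleep until writeback ends.

```python
import zlib


class Page:
    """Toy stand-in for a cached page; not the kernel's struct page."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.writeback = False
        self.pending = []  # writes that arrived while writeback was set

    def write(self, offset, payload):
        # While the page is under writeback, a writer has to wait.
        # The model queues the write instead of blocking.
        if self.writeback:
            self.pending.append((offset, payload))
            return False
        self.data[offset:offset + len(payload)] = payload
        return True

    def start_writeback(self):
        self.writeback = True

    def end_writeback(self):
        # Replay is done: release the flag and let queued writers proceed.
        self.writeback = False
        pending, self.pending = self.pending, []
        for offset, payload in pending:
            self.write(offset, payload)


def replay(page, concurrent_write):
    """Replay one page while a write arrives mid-flight."""
    page.start_writeback()
    client_csum = zlib.crc32(bytes(page.data))
    page.write(*concurrent_write)  # deferred, page contents stay stable
    server_csum = zlib.crc32(bytes(page.data))
    page.end_writeback()
    return client_csum, server_csum


page = Page()
page.write(2048, b"B" * 2048)  # second half written first
client_csum, server_csum = replay(page, (0, b"AAAA"))
print("checksums match:", client_csum == server_csum)
```

The sketch says nothing about the cost of the approach: writers stall for the whole replay, and whether that interacts badly with the page locking that caused the deadlock in idea 1 is left open here.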

Activity

Gerrit Updater added a comment - 16/Nov/24 5:07 AM

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57049
Subject: LU-18052 osc: disable checksums on recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6804fe7e1c3c92f212f0b3b58b5ff0aa3ea14830


People

Assignee: WC Triage

Reporter: Alexey Lyashkov

Votes: 0

Watchers: 2

Dates

Created: 19/Jul/24 2:14 PM

Updated: 16/Nov/24 5:07 AM