Details
Bug
Resolution: Fixed
Critical
Description
We ran IOR with LNet router failure simulation and encountered data corruption which seems to be reproducible on master.
The following scenario happens:
1. a client thread writes some data to page N of file X
2. page N is transferred to the OSS, where the processing thread sleeps somewhere before writing it
3. the original BRW request times out and the client resends page N
4. the resent page N is successfully written to disk; the client receives the reply and clears PG_writeback
5. a client thread writes different data to the same page N of file X
6. page N with the new data is successfully written to disk; the client receives the reply and clears PG_writeback
7. the OSS thread from step 2 wakes up and writes the stale data to disk [data corruption]
A reproducer will be uploaded shortly.
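
Separately from the reproducer, the ordering in the scenario above can be illustrated with a toy model. This is only a sketch of the sequence of events, not Lustre code: the `ToyOSS` class, the `simulate` function, and the "old"/"new" payloads are hypothetical names chosen for illustration, and the stalled thread is modelled simply by deferring its write until the end.

```python
# Toy model of the event ordering described above. Not Lustre code;
# every name here is hypothetical and exists only for illustration.
from dataclasses import dataclass, field


@dataclass
class ToyOSS:
    disk: dict = field(default_factory=dict)  # (file, page) -> data
    log: list = field(default_factory=list)   # order in which writes landed

    def write(self, who, file, page, data):
        # Last writer wins: nothing here detects that a request is stale.
        self.disk[(file, page)] = data
        self.log.append(f"{who}: wrote {data!r}")


def simulate():
    oss = ToyOSS()
    # The original BRW request carrying "old" reaches the OSS, but its
    # processing thread stalls before writing; keep the request pending.
    stalled = ("original BRW", "X", 0, "old")
    # The client times out and resends; another thread services the resend.
    oss.write("resent BRW", "X", 0, "old")
    # The client overwrites the same page and the new data reaches disk.
    oss.write("new BRW", "X", 0, "new")
    # The stalled thread wakes up and writes its stale copy of the page.
    oss.write(*stalled)
    return oss


oss = simulate()
print(oss.disk[("X", 0)])  # prints: old
```

In this model the disk ends up holding "old" even though the last write acknowledged to the client carried "new", which is the corruption described in the scenario.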