We ran IOR with LNet router failure simulation and encountered data corruption that appears to be reproducible on master.
The following scenario happens:
- a client thread writes some data to page N of file X
- page N is transferred to the OSS, the processing thread sleeps somewhere
- the original BRW request times out and the client resends page N
- page N is successfully written to disk, the client receives the reply and clears PG_writeback
- a client thread writes different data to the same page N of file X
- page N with the new data is successfully written to disk, the client receives the reply and clears PG_writeback
- the OSS thread from step 2 wakes up and writes stale data to disk [data corruption]
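The ordering above can be replayed in a small toy model. This is only a sketch of the event sequence, not Lustre code and not the reproducer: `ToyOSS`, `commit`, `replay_scenario`, the page index and the payloads are all hypothetical names and values chosen for illustration, and the "sleeping" OSS thread is modeled simply as a write that is applied last.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSS:
    """Hypothetical stand-in for an OSS: maps page index -> data on disk."""
    disk: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def commit(self, label, page, data):
        # Last writer wins; nothing here orders writes by request age.
        self.disk[page] = data
        self.log.append((label, page, data))


PAGE_N = 7  # arbitrary page index, assumption for the example


def replay_scenario():
    oss = ToyOSS()

    # Steps 1-2: the original BRW request reaches the OSS, but its
    # service thread stalls before writing. Keep it as a pending write.
    stalled = ("original BRW (delayed)", PAGE_N, b"old")

    # Steps 3-4: the client times out, resends page N, and the resend
    # is written; the client sees the reply and clears the writeback flag.
    oss.commit("resent BRW", PAGE_N, b"old")

    # Steps 5-6: the client writes different data to the same page,
    # and that write also completes.
    oss.commit("new BRW", PAGE_N, b"new")

    # Step 7: the stalled thread wakes up and writes its stale copy.
    oss.commit(*stalled)
    return oss


if __name__ == "__main__":
    oss = replay_scenario()
    for label, page, data in oss.log:
        print(f"{label}: page {page} <- {data!r}")
    print("client expects:", b"new", "disk holds:", oss.disk[PAGE_N])
```

In this model the client believes page N holds the new data, since both of its completed writes were acknowledged, while the disk ends up with the old contents.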
A reproducer will be uploaded shortly.
Hello,
Is a backport of the fix for this issue planned for the b2_12 branch?