Affects Version/s: None
Fix Version/s: Lustre 2.14.0
While running IOR with simulated LNet router failures, we encountered data corruption that appears to be reproducible on master.
The corruption occurs in the following sequence:
- a client thread writes some data to page N of file X
- page N is transferred to the OSS, but the thread processing the request sleeps somewhere before writing it
- the original BRW request times out and the client resends page N
- page N is successfully written to disk, the client receives the reply and clears PG_Writeback
- a client thread writes different data to the same page N of file X
- page N with the new data is successfully written to disk, the client receives the reply and clears PG_Writeback
- the OSS thread that stalled on the original request (second step above) wakes up and writes the stale data to disk [data corruption]
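
To make the ordering concrete, here is a minimal toy model of the sequence above. It is not Lustre code and it is not the reproducer: the `Oss` class, `brw_write`, and `replay_scenario` are hypothetical names invented for illustration, and the model assumes only that the server applies each write whenever its thread gets to it, with no check against writes committed in the meantime.

```python
# Toy model of the reordering described above; NOT Lustre code.
# All names here (Oss, brw_write, replay_scenario) are hypothetical.

class Oss:
    """Models the OSS as a page-indexed store with no ordering check."""

    def __init__(self):
        self.disk = {}

    def brw_write(self, page, data):
        # Applied whenever the service thread runs, regardless of what
        # has been written to the same page since the request arrived.
        self.disk[page] = data


def replay_scenario():
    oss = Oss()
    page_n = 0
    stalled = []  # requests whose service thread is asleep

    # The original BRW request arrives; its thread sleeps before writing.
    stalled.append((page_n, b"old"))

    # The request times out on the client; the resend is written to disk.
    oss.brw_write(page_n, b"old")

    # The client writes different data to the same page; it reaches disk.
    oss.brw_write(page_n, b"new")
    expected = b"new"

    # The stalled thread wakes up and applies the original request.
    for page, data in stalled:
        oss.brw_write(page, data)

    return oss.disk[page_n], expected


if __name__ == "__main__":
    on_disk, expected = replay_scenario()
    print("on disk:", on_disk, "expected:", expected)
```

In this model the final on-disk content is the old data even though the client saw both later writes complete, which matches the corruption described above.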
A reproducer will be uploaded shortly.