[LU-11169] Data corruption during IOR testing with network error simulation Created: 24/Jul/18 Updated: 19/Jul/19 Resolved: 19/Dec/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During large-scale IOR testing with injected network failures, Cray found data corruption issues. The second issue is related to the cleanup that landed as commit 49d8a7ccd73, where the "rc" parameter of the obd_commit function was replaced with local data, so any error that occurred before the commit is ignored. |
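
The error-masking pattern described above can be illustrated with a toy model. This is a minimal sketch in Python, not Lustre code: `commit_buggy`, `commit_fixed`, `BULK_ERROR`, and the list standing in for the disk are all hypothetical names chosen for illustration, and the "fixed" variant only assumes that the commit step should honor an incoming error. How the actual patches handle this is not shown here.

```python
import errno

# Negative errno, the convention Linux kernel code uses for failures.
BULK_ERROR = -errno.EIO


def commit_buggy(disk, pages, rc):
    """Hypothetical commit step that replaces the caller's rc with a local value."""
    rc = 0              # the earlier bulk error is forgotten here
    disk.extend(pages)  # pages were never filled by the failed transfer
    return rc


def commit_fixed(disk, pages, rc):
    """Hypothetical commit step that honors an error from an earlier stage."""
    if rc != 0:
        return rc       # propagate the failure and skip the write
    disk.extend(pages)
    return 0
```

Calling `commit_buggy(disk, stale_pages, BULK_ERROR)` reports success and stores whatever happened to be in the buffer, while `commit_fixed` returns the original error and leaves `disk` untouched.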
| Comments |
| Comment by Gerrit Updater [ 31/Jul/18 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32905 |
| Comment by Gerrit Updater [ 31/Jul/18 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32906 |
| Comment by Andreas Dilger [ 09/Aug/18 ] |
|
In the case of 4MB RPCs that drop some part of the RDMA, doesn't the bulk RPC checksum detect this case and cause the RPC to be resent? |
| Comment by Alexey Lyashkov [ 09/Aug/18 ] |
|
Andreas, you are right. The OSC checksum can detect the client-side problem and the "read" part of the server-side problem. |
| Comment by Shuichi Ihara (Inactive) [ 10/Aug/18 ] |
|
Interesting. How is the file corrupted? Could you share an example of a corrupted file? |
| Comment by Alexey Lyashkov [ 10/Aug/18 ] |
|
It depends on what you are asking. The server-side bug was found as zeroes being read from a file after a bulk transfer error. |
| Comment by Patrick Farrell (Inactive) [ 10/Aug/18 ] |
|
Ihara, basically, it sometimes fails to notice and resend when there's a transfer error. On reads, this shows up as zeroes read by the client, when the data on disk is correct. On writes, the result would be whatever data was present before the write (or, I believe, zeroes if the write is to a new region of the file). |
| Comment by Alexey Lyashkov [ 10/Aug/18 ] |
|
Patrick, in general the result is random data in the file, because the pages are left unchanged by a failed bulk transfer while the client and server think all is OK. Zeroes are just luck. |
| Comment by Andreas Dilger [ 10/Aug/18 ] |
It should also cause the client to resend a write if one of the RDMAs was missing data (up to the 10x retry limit). |
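
The checksum-and-resend behavior described in this comment can be sketched as a toy model. This is an illustration under stated assumptions, not the Lustre implementation: `send_with_checksum`, `make_flaky`, and `MAX_ATTEMPTS` are hypothetical names, CRC32 from Python's `zlib` stands in for whichever bulk checksum algorithm is configured, and the "10x retry limit" is modelled as at most 10 attempts in total.

```python
import zlib

MAX_ATTEMPTS = 10  # assumption: models the "10x retry limit" as 10 tries


def send_with_checksum(data, transfer, max_attempts=MAX_ATTEMPTS):
    """Resend until the receiver-side checksum matches the sender-side one.

    `transfer` is a hypothetical callable modelling the bulk transfer;
    it returns the bytes that actually arrived at the receiver.
    Returns (received_bytes, attempts_used) or raises OSError.
    """
    expected = zlib.crc32(data)
    for attempt in range(1, max_attempts + 1):
        received = transfer(data)
        if zlib.crc32(received) == expected:
            return received, attempt
    raise OSError("checksum mismatch after %d attempts" % max_attempts)


def make_flaky(failures):
    """Build a transfer that loses the last 4 bytes on the first `failures` calls."""
    state = {"left": failures}

    def transfer(data):
        if state["left"] > 0:
            state["left"] -= 1
            return data[:-4] + bytes(4)  # tail arrives as zeroes
        return data

    return transfer
```

For example, `send_with_checksum(payload, make_flaky(2))` succeeds on the third attempt. The model only helps when the mismatch is actually checked and reported, which is the point disputed in the next comment.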
| Comment by Alexey Lyashkov [ 11/Aug/18 ] |
|
No. The server-side code ignores any error on the server side, so the commit write (which actually performs the write) does not know about errors that occurred before it. |
| Comment by Gerrit Updater [ 01/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32906/ |
| Comment by Alexey Lyashkov [ 05/Dec/18 ] |
|
The last patch addresses a theoretical problem, which is impossible now. |
| Comment by Gerrit Updater [ 19/Jul/19 ] |
|
Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35571 |