[LU-11169] Data corruption during IOR testing with network error simulation Created: 24/Jul/18  Updated: 19/Jul/19  Resolved: 19/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During large-scale IOR testing with injected network failures, Cray found two data corruption issues.
The first issue is related to the 4MB BRW patch set and has existed for a long time. A bulk transfer is marked as failed only on a real network error; if part of the data is lost and the request times out, the transfer is still treated as done.
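
A minimal sketch of the first issue, assuming a bulk transfer is split into several fragments. The struct, its fields, and both function names are hypothetical and are not taken from the Lustre source; the sketch only contrasts a check that looks at the network error alone with one that also looks at timeouts and missing fragments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one bulk transfer split into fragments. */
struct bulk_state {
    size_t frags_expected;  /* fragments the transfer was split into */
    size_t frags_received;  /* fragments that actually arrived */
    bool   net_error;       /* an explicit network error was reported */
    bool   timed_out;       /* the request timed out */
};

/* Problematic shape: only an explicit network error fails the bulk. */
bool bulk_ok_buggy(const struct bulk_state *b)
{
    return !b->net_error;
}

/* Stricter shape: a timeout or a missing fragment also fails the bulk. */
bool bulk_ok_strict(const struct bulk_state *b)
{
    return !b->net_error && !b->timed_out &&
           b->frags_received == b->frags_expected;
}
```

With three of four fragments received and a timeout but no network error, bulk_ok_buggy() reports success while bulk_ok_strict() reports failure.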

The second issue is related to the cleanup landed as commit 49d8a7ccd73, where the "rc" parameter of the obd_commit function was replaced with a local value, which hides any errors that occurred before it.
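
The second issue can be illustrated with a small sketch. This is not the Lustre code: do_commit(), commit_buggy(), and commit_fixed() are hypothetical names, and the sketch assumes the commit step itself succeeds. It only shows how a local return code can hide an error passed in by the caller.

```c
#include <errno.h>

/* Hypothetical stand-in for the final commit step; always succeeds here. */
int do_commit(void)
{
    return 0;
}

/* Problematic shape: the earlier error is ignored and a local rc is used. */
int commit_buggy(int old_rc)
{
    int rc;

    (void)old_rc;       /* the error from the earlier phase is dropped */
    rc = do_commit();   /* the commit runs as if nothing went wrong */
    return rc;
}

/* Safer shape: an earlier error skips the commit and is returned as is. */
int commit_fixed(int old_rc)
{
    if (old_rc != 0)
        return old_rc;
    return do_commit();
}
```

Called with -EIO from a failed bulk phase, commit_buggy() returns 0 and the write is committed, while commit_fixed() returns -EIO.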



 Comments   
Comment by Gerrit Updater [ 31/Jul/18 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32905
Subject: LU-11169 ptlrpc: don't treat bulk is ok
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3c26d98e15582b9fc529e220c47e92be0acec5c8

Comment by Gerrit Updater [ 31/Jul/18 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32906
Subject: LU-11169 obdclass: fix old return code usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 028c5661270d922130d80074727c99b9292bd250

Comment by Andreas Dilger [ 09/Aug/18 ]

In the case of 4MB RPCs that drop some part of the RDMA, doesn't the bulk RPC checksum detect this case and cause the RPC to be resent?

Comment by Alexey Lyashkov [ 09/Aug/18 ]

Andreas,

You are right. The OSC checksum can detect the client-side problem and the "read" part of the server-side problem.

Comment by Shuichi Ihara (Inactive) [ 10/Aug/18 ]

Interesting. How is the file corrupted? Could you share an example of a corrupted file?

Comment by Alexey Lyashkov [ 10/Aug/18 ]

It depends on what you are asking. The server-side bug was found as zeroes being read from a file after a bulk transfer error.

Comment by Patrick Farrell (Inactive) [ 10/Aug/18 ]

Ihara,

Basically, it sometimes fails to notice and resend when there's a transfer error.  On reads, this shows up as zeroes read by the client, when the data on disk is correct.  On writes, the result would be whatever data was present before the write (or, I believe, zeroes if the write is to a new region of the file).

Comment by Alexey Lyashkov [ 10/Aug/18 ]

Patrick,

In general, the result is random data in the file: the pages are left unchanged on a failed bulk transfer, but the client and server think all is OK. Zeroes are just luck.

Comment by Andreas Dilger [ 10/Aug/18 ]

OSC checksum can detect client side problem, and "read" part of server side problem.

It should also cause the client to resend a write if one of the RDMAs was missing data (up to the 10x retry limit).
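
A rough sketch of the checksum-and-resend idea described here, under stated assumptions: toy_checksum() is a toy hash and not the checksum Lustre uses, xfer_fn and lossy_xfer() are hypothetical transport stand-ins, and MAX_RESENDS simply reuses the figure of 10 from the comment as the number of attempts. It does not model server-side error handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RESENDS 10  /* attempt limit assumed from the "10x" figure */

/* Toy checksum, not the algorithm Lustre uses. */
uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}

/* Hypothetical transport callback: copies src to dst, may drop data. */
typedef void (*xfer_fn)(unsigned char *dst, const unsigned char *src,
                        size_t len, int attempt);

/* Example lossy transport: drops the second half on attempts 1 and 2. */
void lossy_xfer(unsigned char *dst, const unsigned char *src,
                size_t len, int attempt)
{
    size_t n = attempt < 3 ? len / 2 : len;

    memcpy(dst, src, n);
}

/* Resend until the checksums match; returns attempts used, or -1. */
int send_with_checksum(unsigned char *dst, const unsigned char *src,
                       size_t len, xfer_fn xfer)
{
    uint32_t want = toy_checksum(src, len);

    for (int attempt = 1; attempt <= MAX_RESENDS; attempt++) {
        xfer(dst, src, len, attempt);
        if (toy_checksum(dst, len) == want)
            return attempt;
    }
    return -1;
}
```

With lossy_xfer() and a zero-filled destination, the first two attempts leave half of the buffer unwritten, so the checksums differ and the loop succeeds on the third attempt.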

Comment by Alexey Lyashkov [ 11/Aug/18 ]

No. The server-side part hides any error that occurred on the server side, so the commit write (which actually performs the write) does not know about the errors before it.

Comment by Gerrit Updater [ 01/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32906/
Subject: LU-11169 obdclass: fix old return code usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1db258b57e5669e07934fe848861817f88102475

Comment by Alexey Lyashkov [ 05/Dec/18 ]

The last patch addresses a theoretical problem, which is impossible now.

Comment by Gerrit Updater [ 19/Jul/19 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35571
Subject: LU-11169 ptlrpc: handle reply and resend reorder
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 88c2a9a0c840d01d50762a71f04319e58c9affef
