Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Lustre 2.13.0
Severity: 3
Description
While testing the SEL port to the 2.12 branch, Grev found a data corruption issue.
I checked it with the latest master on the client side, and the issue is still present.
Example:
[4] FAILED comparison of buffer containing 8-byte ints:
[4] File name = /mnt/lustre/SEL/IOR.o
[4] In transfer 392, 256 errors between buffer indices 212992 and 213247.
[4] File byte offset = 5454028800:
[4] Expected: 0x000000045d5e330f 00000000001a0008 000000045d5e330f 00000000001a0018
[4] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
ior ERROR: data check error, aborting execution, errno 0, Success (ior.c:414)
Current state of the investigation: the KMS is not valid in some cases, so ll_prepare_partial_page fills the full page with zeros even though part of that page has already been sent to the OST.
The quick and dirty fix below resolves the issue, but the KMS problem still needs to be investigated.
@@ -598,25 +597,30 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
 		GOTO(out, result);
 	}
 
+#if 0
 	/*
 	 * If are writing to a new page, no need to read old data.
 	 * The extent locking will have updated the KMS, and for our
 	 * purposes here we can treat it like i_size.
 	 */
-	if (attr->cat_kms <= offset) {
+	if (attr->cat_kms < offset) {
 		char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);
 
 		memset(kaddr, 0, cl_page_size(obj));
 		ll_kunmap_atomic(kaddr, KM_USER0);
+		CDEBUG(D_INFO, "kms-skip %llu <> %llu\n", attr->cat_kms, offset);
 		GOTO(out, result = 0);
 	}
+#endif
00000080:00200000:1.0:1566400964.212391:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891347456, len 2048, copied 2048
00000080:00000040:1.0:1566400964.407924:0:28647:0:(rw26.c:611:ll_prepare_partial_page()) kms-skip 3643416576 <> 3891347456
00000080:00200000:1.0:1566400964.407925:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891349504, len 2048, copied 2048
and brw sends a full page to the OST:
00000008:00000040:1.0:1566400964.653615:0:28647:0:(osc_request.c:1556:osc_brw_prep_request()) buff[1] [3891347456,966656,520]
It is not a problem with offsets within the same page. I tried to fix this check to compare against the real offset, but sometimes the KMS value is very strange, for example KMS = 3G while the offset is at 5G, so this looks like a more complex problem.