Lustre / LU-12681

Data corruption due to incorrect KMS with SEL files

Details


    Description

      In testing the SEL port to the 2.12 branch, Grev found a data corruption issue.
      I checked it with the latest master on the client side and the issue is still present.
      An example:

      [4] FAILED comparison of buffer containing 8-byte ints:
      [4]   File name = /mnt/lustre/SEL/IOR.o
      [4]   In transfer 392, 256 errors between buffer indices 212992 and 213247.
      [4]   File byte offset = 5454028800:
      [4]     Expected: 0x000000045d5e330f 00000000001a0008 000000045d5e330f 00000000001a0018
      [4]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
      ior ERROR: data check error, aborting execution, errno 0, Success (ior.c:414)
      

      Current state of the investigation: the KMS is not valid in some cases, and ll_prepare_partial_page fills the whole page with zeros even though part of that page has already been sent to the OST.
      This quick-and-dirty fix resolves the issue, but the underlying KMS problem still needs to be investigated.

      @@ -598,25 +597,30 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                      GOTO(out, result);
              }
      
      +#if 0
              /*
               * If are writing to a new page, no need to read old data.
               * The extent locking will have updated the KMS, and for our
               * purposes here we can treat it like i_size.
               */
      -       if (attr->cat_kms <= offset) {
      +       if (attr->cat_kms < offset) {
                      char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);
      
                      memset(kaddr, 0, cl_page_size(obj));
                      ll_kunmap_atomic(kaddr, KM_USER0);
      +               CDEBUG(D_INFO, "kms-skip %llu <> %llu\n", attr->cat_kms, offset);
                      GOTO(out, result = 0);
              }
      +#endif
      
      00000080:00200000:1.0:1566400964.212391:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891347456, len 2048, copied 2048
      00000080:00000040:1.0:1566400964.407924:0:28647:0:(rw26.c:611:ll_prepare_partial_page()) kms-skip 3643416576 <> 3891347456
      00000080:00200000:1.0:1566400964.407925:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891349504, len 2048, copied 2048
      and brw sends a full page to the OST
      00000008:00000040:1.0:1566400964.653615:0:28647:0:(osc_request.c:1556:osc_brw_prep_request()) buff[1] [3891347456,966656,520]
      

      It is not a problem with the offset within the same page; I tried to fix this check against the real offset, but sometimes the KMS value is very strange (e.g. KMS = 3G while the offset is around 5G), so it looks like a more complex problem. A simplified illustration of the failure follows.
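
      A standalone sketch of the effect (illustration only, not Lustre code; the helper name and the simplified page model below are assumptions):

      /*
       * Simplified model of the check in ll_prepare_partial_page(): if the
       * KMS is at or below the page offset, the page is zero-filled instead
       * of being read back.  With a stale (too low) KMS, data already
       * written into the page is replaced by zeros and later sent to the OST.
       */
      #include <stdio.h>
      #include <string.h>

      #define PAGE_SZ 4096ULL

      static void prepare_partial_page(char *page, unsigned long long page_off,
                                       unsigned long long kms,
                                       const char *backing)
      {
              if (kms <= page_off)
                      memset(page, 0, PAGE_SZ);        /* "new" page: just zero it */
              else
                      memcpy(page, backing, PAGE_SZ);  /* page has data: read it back */
      }

      int main(void)
      {
              char backing[PAGE_SZ];
              char page[PAGE_SZ];
              unsigned long long page_off = 3891347456ULL;    /* from the log above */

              memset(backing, 0xAB, 2048);                    /* first 2 KiB already written */
              memset(backing + 2048, 0, PAGE_SZ - 2048);

              /* Correct KMS covers the earlier 2 KiB write: the page is read back. */
              prepare_partial_page(page, page_off, page_off + 2048, backing);
              printf("good KMS : first byte = 0x%02x\n", (unsigned)(unsigned char)page[0]);

              /* Stale KMS (3643416576 in the log) is below the page offset:
               * the whole page is zeroed and the earlier 2 KiB are lost. */
              prepare_partial_page(page, page_off, 3643416576ULL, backing);
              printf("stale KMS: first byte = 0x%02x\n", (unsigned)(unsigned char)page[0]);
              return 0;
      }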

      Attachments

        Issue Links

          Activity

            [LU-12681] Data corruption due to incorrect KMS with SEL files

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40739
            Subject: LU-12681 osc: wrong cache of LVB attrs
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 5aa6767d63368e630531669e51b4c8e11e1788b8
            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36200/
            Subject: LU-12681 osc: wrong cache of LVB attrs, part2
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 40319db5bc649adaf3dad066e2c1bb49f7f1c04a

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36199/
            Subject: LU-12681 osc: wrong cache of LVB attrs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8ac020df4592fc6e85edd75d54cb3795a4e50f8e

            pfarrell Patrick Farrell (Inactive) added a comment -

            Sorry, I meant the read required to do a partial page write.  I know what you're talking about.

            The read is of the full page, but then the write overwrites part of it.  So that's why I said a "partial page read", but that is incorrect, as you pointed out.

            My point about the corruption being limited in this case to just zeroes, not garbage data, still stands.  That's why LU-12786 doesn't match here.

            shadow Alexey Lyashkov added a comment -

            Patrick,

            you are wrong. This bug is on the partial page write path. POSIX requires the page to be zeroed before it is modified, and Lustre does that in:

            static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                                               struct cl_page *pg, struct file *file)
            {
                    /*
                     * If are writing to a new page, no need to read old data.
                     * The extent locking will have updated the KMS, and for our
                     * purposes here we can treat it like i_size.
                     */
                    if (attr->cat_kms <= offset) {
                            char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);
            
                            memset(kaddr, 0, cl_page_size(obj));
                            ll_kunmap_atomic(kaddr, KM_USER0);
                            GOTO(out, result = 0);
                    }
            

            This causes zeros to be sent to the OST and written to disk.


            pfarrell Patrick Farrell (Inactive) added a comment -

            Ihara,

            I don't think so - this bug results mostly in zeroes in partial page reads.  In most scenarios this is transient corruption that stays on the client and doesn't go to disk (it can go to disk, but only under special scenarios).  You have much larger regions of bad data (most of a write in all cases I saw), the data is bad on disk, and the data is not zeroes but mostly random garbage.  (It is zeroes in one case, but it's possible for unwritten regions of disk to contain zeroes rather than garbage.)

            sihara Shuichi Ihara added a comment -

            OK, so the patch for LU-12681 would be worth trying for my workload, then?

            shadow Alexey Lyashkov added a comment -

            It's a very generic problem; the primary case is PFL or SEL, where the layout changes from time to time.
            But in some cases this error can hit any client with CLIO, so Lustre 2.0+ is affected.
            It requires the inode to be flushed from the inode cache while the LDLM locks stay cached, likely pinned by IO.
            I remember similar problems from a long time ago, when the inode info was moved from the lock to the LDLM resource.

            sihara Shuichi Ihara added a comment -

            I got a similar problem on an unaligned single-shared-file workload with IOR: LU-12786.
            I'm not using SEL, but is LU-12681 a general problem or is it specific to SEL?
            vitaly_fertman Vitaly Fertman added a comment - - edited

            The problem is regularly reproduced with IOR running with a transfer size not aligned to the page size on a SEL layout, e.g.:

            1st problem offset:
            redpill-client01: [27] At transfer buffer #32, index #212992 (file byte offset 2666450944):

            the IO that happened is:
            redpill-client01: api = POSIX
            redpill-client01: test filename = /mnt/fs1//ha.sh-94333/redpill-client01-ior/f.ior
            redpill-client01: access = single-shared-file
            redpill-client01: pattern = strided (33 segments)
            redpill-client01: ordering in a file = sequential offsets
            redpill-client01: ordering inter file=constant task offsets = 1
            redpill-client01: clients = 48 (8 per node)
            redpill-client01: repetitions = 2
            redpill-client01: xfersize = 1.63 MiB
            redpill-client01: blocksize = 27.66 MiB
            redpill-client01: aggregate filesize = 42.78 GiB
            redpill-client01:
            redpill-client01: Using Time Stamp 1565463864 (0x5d4f1538) for Data Signature
            redpill-client01: Commencing write performance test.
            redpill-client01: Sat Aug 10 19:04:24 2019
            redpill-client01:
            redpill-client01: access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
            redpill-client01: ------ --------- ---------- --------- -------- -------- -------- -------- ----
            redpill-client01: write 171.50 28322 1666.00 0.015037 255.36 0.116421 255.46 0 XXCEL

            i.e. each client writes 33 * 28322K, and each 28322K block is written in 1666K transfers

            the file itself:
            lfs getstripe -y /mnt/fs1//ha.sh-94333/redpill-client01-ior/f.ior
            lcm_layout_gen: 64
            lcm_mirror_count: 1
            lcm_entry_count: 3

            • component0:
              lcme_id: 1
              lcme_mirror_id: 0
              lcme_flags: init
              lcme_extent.e_start: 0
              lcme_extent.e_end: 67108864
              sub_layout:
              lmm_stripe_count: 1
              lmm_stripe_size: 1048576
              lmm_pattern: raid0
              lmm_layout_gen: 0
              lmm_stripe_offset: 0
              lmm_objects:
              l_ost_idx: 0
              l_fid: 0x100000000:0x23a551c4:0x0
            • component1:
              lcme_id: 3
              lcme_mirror_id: 0
              lcme_flags: init
              lcme_extent.e_start: 67108864
              lcme_extent.e_end: 45969571840
              sub_layout:
              lmm_stripe_count: 1
              lmm_stripe_size: 1048576
              lmm_pattern: raid0
              lmm_layout_gen: 0
              lmm_stripe_offset: 6
              lmm_objects:
              l_ost_idx: 6
              l_fid: 0x100060000:0x2368b93b:0x0
            • component2:
              lcme_id: 4
              lcme_mirror_id: 0
              lcme_flags: extension
              lcme_extent.e_start: 45969571840
              lcme_extent.e_end: EOF
              sub_layout:
              lmm_stripe_count: 0
              lmm_extension_size: 134217728
              lmm_pattern: raid0
              lmm_layout_gen: 0
              lmm_stripe_offset: -1

            The problem is that the first 2k of the block are zeros, while the next 2k of the block is the start of the next transfer-size chunk ((2666450944 + 2k) / 1666k = 1563); however, this is not a segment border, so it was written by the same client.
            It means that the last half page of one transfer-size chunk of data is lost when it is followed by another transfer-size chunk of data from the same client, as the check below confirms.
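
            A quick arithmetic check of the numbers above (illustration only):

            #include <stdio.h>

            int main(void)
            {
                    unsigned long long xfer = 1666ULL * 1024;       /* 1666 KiB transfer size */
                    unsigned long long page = 4096;
                    unsigned long long off  = 2666450944ULL;        /* first corrupted offset */

                    /* 1666 KiB is not a multiple of the page size: every transfer
                     * boundary lands exactly half a page (2 KiB) into a page. */
                    printf("xfer %% page = %llu\n", xfer % page);   /* prints 2048 */

                    /* The zeroed 2 KiB end exactly where transfer #1563 of the same
                     * client begins, i.e. (off + 2 KiB) is an exact transfer boundary. */
                    printf("(off + 2048) / xfer = %llu, remainder %llu\n",
                           (off + 2048) / xfer, (off + 2048) % xfer);   /* 1563, 0 */
                    return 0;
            }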

            SEL is involved only because it triggers layout changes regularly; the problem really belongs to the layout lock handling in CLIO. The osc objects do not survive a layout lock change, so the cached LVB attributes, including the information about the modifications that have already been made, are lost; the new osc objects are then re-populated from the LVB cached in the outdated lock. A later preparation of a partial page checks the KMS, which is now low enough that the page is not read in advance, and the untouched part of the page is simply left zeroed (see the sketch below).

            the patch descriptions have more detailed explanations of the particular scenarios being fixed.
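
            A minimal sketch of the sequence described above (the structures below are simplified assumptions, not the real CLIO/LDLM data types):

            #include <stdio.h>

            struct lvb     { unsigned long long kms; };  /* attrs cached in the DLM lock */
            struct osc_obj { unsigned long long kms; };  /* attrs cached in the osc object */

            int main(void)
            {
                    struct lvb lock_lvb = { .kms = 3643416576ULL };  /* filled at enqueue time */
                    struct osc_obj obj  = { .kms = lock_lvb.kms };

                    /* 1. The client writes past the old KMS: the osc object is
                     *    updated, but the lock's LVB copy is not. */
                    obj.kms = 3891349504ULL;

                    /* 2. SEL triggers a layout change: the osc object does not
                     *    survive it and the up-to-date KMS is lost; the new osc
                     *    object is re-populated from the outdated lock LVB. */
                    struct osc_obj new_obj = { .kms = lock_lvb.kms };
                    printf("kms before layout change %llu, after %llu\n",
                           obj.kms, new_obj.kms);

                    /* 3. A later partial-page write prepares the page: the stale
                     *    KMS is below the page offset, so the page is zero-filled
                     *    instead of being read, and the zeros reach the OST. */
                    unsigned long long page_off = 3891347456ULL;
                    if (new_obj.kms <= page_off)
                            printf("stale kms %llu <= offset %llu -> page zero-filled\n",
                                   new_obj.kms, page_off);
                    return 0;
            }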


            People

              vitaly_fertman Vitaly Fertman
              shadow Alexey Lyashkov
              Votes: 0
              Watchers: 8
