[LU-12681] Data corruption due to incorrect KMS with SEL files Created: 22/Aug/19  Updated: 19/Jul/21  Resolved: 01/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.7

Type: Bug Priority: Blocker
Reporter: Alexey Lyashkov Assignee: Vitaly Fertman
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-10070 PFL self-extending file layout Resolved
is related to LU-13645 Various data corruptions possible in ... Resolved
is related to LU-12786 io hard read fails due to data verifi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing the SEL port to the 2.12 branch, Grev found a data corruption issue.
I checked it with the latest master on the client side and the issue is still present.
Example:

[4] FAILED comparison of buffer containing 8-byte ints:
[4]   File name = /mnt/lustre/SEL/IOR.o
[4]   In transfer 392, 256 errors between buffer indices 212992 and 213247.
[4]   File byte offset = 5454028800:
[4]     Expected: 0x000000045d5e330f 00000000001a0008 000000045d5e330f 00000000001a0018
[4]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
ior ERROR: data check error, aborting execution, errno 0, Success (ior.c:414)

Current state of the investigation: the KMS is not valid in some cases, and ll_prepare_partial_page() fills the full page with zeros while part of it has already been sent to the OST.
The quick and dirty fix below resolves the issue, but the KMS problem itself needs to be investigated.

@@ -598,25 +597,30 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                GOTO(out, result);
        }

+#if 0
        /*
         * If are writing to a new page, no need to read old data.
         * The extent locking will have updated the KMS, and for our
         * purposes here we can treat it like i_size.
         */
-       if (attr->cat_kms <= offset) {
+       if (attr->cat_kms < offset) {
                char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);

                memset(kaddr, 0, cl_page_size(obj));
                ll_kunmap_atomic(kaddr, KM_USER0);
+               CDEBUG(D_INFO, "kms-skip %llu <> %llu\n", attr->cat_kms, offset);
                GOTO(out, result = 0);
        }
+#endif
With this debug in place, the log shows a bogus kms-skip between two partial-page writes:
00000080:00200000:1.0:1566400964.212391:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891347456, len 2048, copied 2048
00000080:00000040:1.0:1566400964.407924:0:28647:0:(rw26.c:611:ll_prepare_partial_page()) kms-skip 3643416576 <> 3891347456
00000080:00200000:1.0:1566400964.407925:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891349504, len 2048, copied 2048
and then brw sends the full page to the OST:
00000008:00000040:1.0:1566400964.653615:0:28647:0:(osc_request.c:1556:osc_brw_prep_request()) buff[1] [3891347456,966656,520]

It is not a problem with the offset within the same page. I tried to fix this check against the real offset, but sometimes the KMS value is very strange, e.g. KMS = 3G while the offset points at 5G, so it looks like a more complex problem.
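
For reference, a minimal sketch of the invariant the skipped read relies on (illustrative helper name, not actual Lustre code): zero-filling instead of reading from the OST is only safe when the whole page lies at or beyond the known minimum size.

/* Illustrative only: the condition under which ll_prepare_partial_page()
 * may zero-fill instead of reading; safe_to_skip_read is a made-up name. */
static inline bool safe_to_skip_read(__u64 kms, __u64 page_start)
{
        /* no byte of this page can already exist on the OST */
        return kms <= page_start;
}

The corruption happens exactly when the cached KMS is stale (too small), so this condition holds even for a page whose first half has already been written and flushed.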



 Comments   
Comment by Alexey Lyashkov [ 23/Aug/19 ]

It looks like one ldlm lock isn't cancelled on layout change as expected. The KMS is reset to zero in this event, but the new update either doesn't come or comes with wrong data, so the KMS ends up limited to the new lock's range.

Comment by Alexey Lyashkov [ 23/Aug/19 ]

It looks like we have several problems.
1) Layout change.
LOV/OSC stores the KMS info inside the layout, but this info is reallocated/zeroed during the layout change process.
The ldlm locks are detached so they can be reconnected and the lvb refreshed if needed, but the lock info is stale and doesn't account for local writes done under this lock. It looks like we need some other place (probably in the lock) where osc_page_touch_at() stores this info.
This problem affects any PFL file.

2) I have a few hits (specifically two) where the first thread sends a full page to the OST, but the second page update isn't sent. This is not investigated yet, work in progress.

3) DoM lock lost.
Losing the DoM lock resets the KMS to zero. That is fine for a DoM-only file, but for a PFL file with a DoM component first it destroys the KMS on the LOV/OSC part. OSC uses ldlm_extent_shift_kms() for the same purpose, which looks like a better approach (see the sketch after the quoted code below):

        if (lock->l_ast_data) {
                struct cl_attr attr;
                struct cl_object *obj = lock->l_ast_data;
                /* Losing the lock, set KMS to 0 */
                cl_object_attr_lock(obj);
                attr.cat_kms = 0;
                cl_object_attr_update(env, obj, &attr, CAT_KMS);
                cl_object_attr_unlock(obj);
        }

I don't have a reproducer for this case, but it looks like writing half a page at the end of the file while the DoM lock is flushed in parallel (e.g. by LRU) should corrupt data easily.
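
For comparison, a rough sketch of the OSC-style alternative mentioned above (an illustrative fragment, not the actual MDC/DoM code; old_kms stands for the KMS value currently cached on the object): shrink the KMS with ldlm_extent_shift_kms() instead of resetting it to zero.

        if (lock->l_ast_data) {
                struct cl_object *obj = lock->l_ast_data;
                struct cl_attr attr;

                cl_object_attr_lock(obj);
                /* recompute KMS from the locks still held on the resource
                 * instead of dropping it to 0 for the whole file */
                attr.cat_kms = ldlm_extent_shift_kms(lock, old_kms);
                cl_object_attr_update(env, obj, &attr, CAT_KMS);
                cl_object_attr_unlock(obj);
        }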

Comment by Alexey Lyashkov [ 27/Aug/19 ]

It looks like the story of this corruption started with CLIO: it disconnects the ldlm locks when the cl object is destroyed, so ldlm locks with stale lvb info remain in the cache and can be connected to a new cl object via ldlm_lock_match() in osc_enqueue_base().

It looks like loi_kms_valid was designed to solve this problem, but it doesn't work correctly because osc_lock_lvb_update() limits the KMS to the lock's policy region, which can be smaller than the maximal known offset.

Vitaly and I started looking at a solution that updates the lvb attributes in the lock on local updates.
Another approach is moving the stripe info from the lov object to the ldlm resource.
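
For illustration, a simplified sketch of the clamping described above (placeholder names, not the verbatim osc_lock_lvb_update() code): the KMS taken from the lock's LVB is capped at the end of the granted extent, so a lock covering only a low region of the file can leave the cached KMS below offsets that have already been written locally.

/* Simplified illustration; lvb_size and lock_end are placeholders for the
 * corresponding LVB and lock extent fields. */
static __u64 kms_from_lock_lvb(__u64 lvb_size, __u64 lock_end)
{
        __u64 kms = lvb_size;

        /* osc_lock_lvb_update()-style clamp: the KMS taken from the LVB
         * may not exceed the end of the granted lock extent */
        if (kms > lock_end + 1)
                kms = lock_end + 1;

        /* after a layout change the old osc object, and with it the KMS
         * reflecting local writes, is gone, so this value can end up
         * smaller than the maximal offset already written and flushed */
        return kms;
}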

Comment by Patrick Farrell (Inactive) [ 27/Aug/19 ]

Hmm...  But the LVB info can be stale the moment it is created, right?  Why is it a problem only if it's been disconnected first...?

Comment by Alexey Lyashkov [ 27/Aug/19 ]

@Patrick, the LVB is stale only for a PR lock. You are the KMS/size owner for a PW lock.

Lock enqueue stores the LVB info in the LOV object, which is then updated via osc_page_touch_at() / brw_interpret(); see the sketch below.
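
A rough sketch of that update path (assumed helper name, not the verbatim osc_page_touch_at()): a local write ending at file offset "to" grows the KMS cached on the cl/osc object.

/* Illustrative only: touch_kms is a made-up name for the concept. */
static void touch_kms(const struct lu_env *env, struct cl_object *obj, __u64 to)
{
        struct cl_attr attr;

        cl_object_attr_lock(obj);
        cl_object_attr_get(env, obj, &attr);
        if (to > attr.cat_kms) {
                /* grow, never shrink, the known minimum size */
                attr.cat_kms = to;
                cl_object_attr_update(env, obj, &attr, CAT_KMS);
        }
        cl_object_attr_unlock(obj);
}

The point is that this cached value lives in the object (and so in the layout), not in the lock, which is why it does not survive a layout change.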

Comment by Patrick Farrell (Inactive) [ 27/Aug/19 ]

Hmm, but a PR lock can protect size within a region, since you need a PW lock (which conflicts) to update size.  And a PW lock only protects size in the area covered by the lock...

Comment by Alexey Lyashkov [ 27/Aug/19 ]

Only in the case where your lock protects EOF; in other cases your size is stale. But the KMS is accurate for PW, just as the KMS is accurate within the region covered by any lock.

Comment by Patrick Farrell (Inactive) [ 10/Sep/19 ]

Do you guys have a reproducer on master for this?  I'd be curious to maybe take a closer look

Comment by Gerrit Updater [ 16/Sep/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/36198
Subject: LU-12681 mdc: kms_ignore cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3a90d0943a096e63ff06e4f13eb7f1a64face7fb

Comment by Gerrit Updater [ 16/Sep/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/36199
Subject: LU-12681 osc: wrong cache of LVB attrs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52c2dc7777924b47572ef70e04e87385243c836a

Comment by Gerrit Updater [ 16/Sep/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/36200
Subject: LU-12681 osc: wrong cache of LVB attrs, part2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ddd47605588677902d2ae04ec187f9ca90d2d83

Comment by Vitaly Fertman [ 17/Sep/19 ]

The problem is regularly reproduced by IOR running with a transfer size that is not aligned to the page size, on a SEL layout, e.g.:

1st problem offset:
redpill-client01: [27] At transfer buffer #32, index #212992 (file byte offset 2666450944):

The IO that happened is:
redpill-client01: api = POSIX
redpill-client01: test filename = /mnt/fs1//ha.sh-94333/redpill-client01-ior/f.ior
redpill-client01: access = single-shared-file
redpill-client01: pattern = strided (33 segments)
redpill-client01: ordering in a file = sequential offsets
redpill-client01: ordering inter file=constant task offsets = 1
redpill-client01: clients = 48 (8 per node)
redpill-client01: repetitions = 2
redpill-client01: xfersize = 1.63 MiB
redpill-client01: blocksize = 27.66 MiB
redpill-client01: aggregate filesize = 42.78 GiB
redpill-client01:
redpill-client01: Using Time Stamp 1565463864 (0x5d4f1538) for Data Signature
redpill-client01: Commencing write performance test.
redpill-client01: Sat Aug 10 19:04:24 2019
redpill-client01:
redpill-client01: access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
redpill-client01: ------ --------- ---------- --------- -------- -------- -------- -------- ----
redpill-client01: write 171.50 28322 1666.00 0.015037 255.36 0.116421 255.46 0 XXCEL

i.e. each client writes 33 * 28322K, and each 28322K block is written in 17 transfers of 1666K (28322 / 1666 = 17); 1666K is not a multiple of the 4K page size, so transfer boundaries fall in the middle of pages.

The file itself:
lfs getstripe -y /mnt/fs1//ha.sh-94333/redpill-client01-ior/f.ior
lcm_layout_gen: 64
lcm_mirror_count: 1
lcm_entry_count: 3

  • component0:
    lcme_id: 1
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 0
    lcme_extent.e_end: 67108864
    sub_layout:
    lmm_stripe_count: 1
    lmm_stripe_size: 1048576
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: 0
    lmm_objects:
    l_ost_idx: 0
    l_fid: 0x100000000:0x23a551c4:0x0
  • component1:
    lcme_id: 3
    lcme_mirror_id: 0
    lcme_flags: init
    lcme_extent.e_start: 67108864
    lcme_extent.e_end: 45969571840
    sub_layout:
    lmm_stripe_count: 1
    lmm_stripe_size: 1048576
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: 6
    lmm_objects:
    l_ost_idx: 6
    l_fid: 0x100060000:0x2368b93b:0x0
  • component2:
    lcme_id: 4
    lcme_mirror_id: 0
    lcme_flags: extension
    lcme_extent.e_start: 45969571840
    lcme_extent.e_end: EOF
    sub_layout:
    lmm_stripe_count: 0
    lmm_extension_size: 134217728
    lmm_pattern: raid0
    lmm_layout_gen: 0
    lmm_stripe_offset: -1

The problem is that the first 2K of the block are zeros, and the next 2K in the block is the start of the next transfer-size chunk ((2666450944 + 2048) / (1666 * 1024) = 1563, exactly on a transfer boundary); however this is not a segment border, so it was written by the same client.
It means that half a page at the end of one transfer-size chunk of data is lost when it is followed by another transfer-size chunk of data from the same client.

SEL is involved only because it triggers layout changes regularly; the problem actually belongs to the layout lock handling in CLIO:
1) osc objects do not survive a layout lock change, so the cached lvb attributes are lost, together with the info about the modifications that have been made;
2) the new osc objects are re-populated from the outdated lock's cached lvb;
3) a later preparation of a partial page checks the KMS, which is now low enough to skip reading the page in advance, so the untouched part of the page is simply left zeroed.

the patch descriptions have more detailed explanations of the particular scenarios being fixed.

Comment by Shuichi Ihara [ 19/Sep/19 ]

I got a similar problem with an unaligned single-shared-file workload with IOR: LU-12786.
I'm not using SEL, but is LU-12681 a general problem or specific to SEL?

Comment by Alexey Lyashkov [ 19/Sep/19 ]

It's a very generic problem; the primary case is PFL or SEL, where the layout changes from time to time.
But in some cases this error can hit any client with CLIO, so Lustre 2.0+ is affected.
That requires the inode to be flushed from the inode cache while the ldlm locks stay in cache, likely pinned by IO.
I remember similar problems from a long time ago, when the inode info was moved from the lock to the ldlm resource.

Comment by Shuichi Ihara [ 19/Sep/19 ]

OK, so the patches for LU-12681 would be worth trying for my workload, then?

Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ]

Ihara,

I don't think so - This bug results mostly in zeroes in partial page reads.  In most scenarios, this is transient corruption that stays on the client, and doesn't go to disk.  (It can go to disk, but only under special scenarios.)  You have much larger regions of bad data (most of a write in all cases I saw), the data is bad on disk, and the data is not zeroes, instead it's mostly random garbage.  (It is in one case, but it's possible for unwritten regions of disk to contain zeroes rather than garbage.)

Comment by Alexey Lyashkov [ 19/Sep/19 ]

Patrick,

You are wrong. This bug is about partial page writes: POSIX requires a zeroed page before modification,
and Lustre does it in:

static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                                   struct cl_page *pg, struct file *file)
{
        /*
         * If are writing to a new page, no need to read old data.
         * The extent locking will have updated the KMS, and for our
         * purposes here we can treat it like i_size.
         */
        if (attr->cat_kms <= offset) {
                char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);

                memset(kaddr, 0, cl_page_size(obj));
                ll_kunmap_atomic(kaddr, KM_USER0);
                GOTO(out, result = 0);
        }

This causes zeros to be sent to the OST and written to disk.

Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ]

Sorry, I meant the read required to do a partial page write.  I know what you're talking about.

The read is of the full page, but then the write overwrites part of it.  So that's why I said a "partial page read", but that is incorrect, as you pointed out.

My point about the corruption being limited in this case to just zeroes, not garbage data, still stands.  That's why LU-12786 doesn't match here.

Comment by Gerrit Updater [ 30/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36199/
Subject: LU-12681 osc: wrong cache of LVB attrs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8ac020df4592fc6e85edd75d54cb3795a4e50f8e

Comment by Gerrit Updater [ 30/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36200/
Subject: LU-12681 osc: wrong cache of LVB attrs, part2
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 40319db5bc649adaf3dad066e2c1bb49f7f1c04a

Comment by Peter Jones [ 01/Oct/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 24/Nov/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40739
Subject: LU-12681 osc: wrong cache of LVB attrs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 5aa6767d63368e630531669e51b4c8e11e1788b8

Comment by Gerrit Updater [ 24/Nov/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40740
Subject: LU-12681 osc: wrong cache of LVB attrs, part2
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: e9580cc6f49d185e080ff4139176aa9369a4e3ca

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40739/
Subject: LU-12681 osc: wrong cache of LVB attrs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: d9329ccb757f06f0ff9df2a2c1b627bb29015c81

Comment by Gerrit Updater [ 17/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40740/
Subject: LU-12681 osc: wrong cache of LVB attrs, part2
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 8c74fe2c6c3f8472ddf39f6ccc5e9c3744de344d
