[LU-12681] Data corruption due to incorrect KMS with SEL files Created: 22/Aug/19 Updated: 19/Jul/21 Resolved: 01/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alexey Lyashkov | Assignee: | Vitaly Fertman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While testing the SEL port to the 2.12 branch, Grev found a data corruption issue:

    [4] FAILED comparison of buffer containing 8-byte ints:
    [4] File name = /mnt/lustre/SEL/IOR.o
    [4] In transfer 392, 256 errors between buffer indices 212992 and 213247.
    [4] File byte offset = 5454028800:
    [4] Expected: 0x000000045d5e330f 00000000001a0008 000000045d5e330f 00000000001a0018
    [4] Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
    ior ERROR: data check error, aborting execution, errno 0, Success (ior.c:414)

At the current step of the investigation: the KMS is not valid in some cases, and ll_prepare_partial_page() fills the full page with zeros while part of it has already been sent to the OST. Debugging change used to confirm this:

    @@ -598,25 +597,30 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
             GOTO(out, result);
         }

    +#if 0
         /*
          * If are writing to a new page, no need to read old data.
          * The extent locking will have updated the KMS, and for our
          * purposes here we can treat it like i_size.
          */
    -    if (attr->cat_kms <= offset) {
    +    if (attr->cat_kms < offset) {
             char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);

             memset(kaddr, 0, cl_page_size(obj));
             ll_kunmap_atomic(kaddr, KM_USER0);
    +        CDEBUG(D_INFO, "kms-skip %llu <> %llu\n", attr->cat_kms, offset);
             GOTO(out, result = 0);
         }
    +#endif

The resulting debug log shows the skipped read:

    00000080:00200000:1.0:1566400964.212391:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891347456, len 2048, copied 2048
    00000080:00000040:1.0:1566400964.407924:0:28647:0:(rw26.c:611:ll_prepare_partial_page()) kms-skip 3643416576 <> 3891347456
    00000080:00200000:1.0:1566400964.407925:0:28647:0:(rw26.c:833:ll_write_end()) pos 3891349504, len 2048, copied 2048

and brw sends a full page to the OST:

    00000008:00000040:1.0:1566400964.653615:0:28647:0:(osc_request.c:1556:osc_brw_prep_request()) buff[1] [3891347456,966656,520]

It is not a problem with the offset within the same page; I tried to fix this check against the real offset, but sometimes the KMS result is very strange, e.g. KMS = 3G while the offset points at 5G, so this looks like a more complex problem. A minimal sketch of the decision this check makes follows below.
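For illustration, here is a minimal self-contained sketch (plain C; the types and function name are simplified stand-ins, not the real ll_prepare_partial_page(), and the values in the failing call are taken from the log above) of how a stale KMS turns a read-modify-write into a zero-fill:

    /* Sketch of the KMS decision in partial-page preparation.
     * Simplified stand-ins for the Lustre structures. */
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096ULL

    /* If KMS (known minimum size) says nothing was ever written at or
     * beyond this page, the client may zero-fill it locally instead of
     * reading the old contents back from the OST. */
    static void prepare_partial_page(unsigned long long kms,
                                     unsigned long long page_offset,
                                     char *page)
    {
            if (kms <= page_offset) {
                    /* Believed to be a brand new page: zero it. */
                    memset(page, 0, PAGE_SIZE);
                    printf("zero-fill page at %llu (kms=%llu)\n",
                           page_offset, kms);
            } else {
                    /* Page may hold older data: it must be read from
                     * the OST before the partial overwrite. */
                    printf("read-before-write at %llu (kms=%llu)\n",
                           page_offset, kms);
            }
    }

    int main(void)
    {
            char page[PAGE_SIZE];

            /* Correct KMS: offset below KMS forces a read. */
            prepare_partial_page(3891347456ULL + PAGE_SIZE, 3891347456ULL, page);
            /* Stale KMS, as in the "kms-skip" log line: the read is
             * skipped and the part of the page already on the OST is
             * later overwritten with zeros. */
            prepare_partial_page(3643416576ULL, 3891347456ULL, page);
            return 0;
    }

The bug is not in this check itself but in the stale KMS value it consumes. |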
| Comments |
| Comment by Alexey Lyashkov [ 23/Aug/19 ] |
|
It looks like one ldlm lock isn't cancelled on layout change as expected. KMS is reset to zero in this event, but the new update either doesn't come, or comes with wrong data, as the KMS setting is limited to the new lock's range. |
| Comment by Alexey Lyashkov [ 23/Aug/19 ] |
|
it looks like we have several problems. 2) I have a few hits (specifically two) where the first thread sends a full page to the OST, but the second page's update isn't sent. This isn't investigated yet; work in progress. 3) DoM lock lost:

    if (lock->l_ast_data) {
            struct cl_attr attr;
            struct cl_object *obj = lock->l_ast_data;

            /* Losing the lock, set KMS to 0 */
            cl_object_attr_lock(obj);
            attr.cat_kms = 0;
            cl_object_attr_update(env, obj, &attr, CAT_KMS);
            cl_object_attr_unlock(obj);
    }

I don't have a reproducer for this case, but it looks like writing half a page at the end of a file while a DoM lock is flushed in parallel (via LRU, for example) should corrupt data easily; see the race sketch below.
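A hypothetical two-thread sketch of that race (plain C with pthreads; the shared kms variable and both thread bodies are simplified stand-ins, not Lustre code): one side mirrors the blocking-AST path resetting KMS to 0, the other consults KMS to decide whether a partial page must be read first:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state standing in for the cl object attributes. */
    static unsigned long long kms = 8192;   /* two pages already written */
    static pthread_mutex_t attr_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stands in for the blocking-AST path above: losing the DoM lock
     * resets KMS to 0. */
    static void *cancel_dom_lock(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&attr_lock);
            kms = 0;
            pthread_mutex_unlock(&attr_lock);
            return NULL;
    }

    /* Stands in for a partial write to the second page: if KMS claims
     * the page was never written, it is zero-filled, not read back. */
    static void *partial_page_write(void *arg)
    {
            unsigned long long offset = 4096;

            (void)arg;
            pthread_mutex_lock(&attr_lock);
            if (kms <= offset)
                    printf("zero-fill at %llu (kms=%llu) - data-loss window\n",
                           offset, kms);
            else
                    printf("read-before-write at %llu (kms=%llu)\n",
                           offset, kms);
            pthread_mutex_unlock(&attr_lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, cancel_dom_lock, NULL);
            pthread_create(&b, NULL, partial_page_write, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
    }

Depending on scheduling, the write side either reads the page back or zero-fills it; the zero-fill branch is the corruption window described above. |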
| Comment by Alexey Lyashkov [ 27/Aug/19 ] |
|
It looks like the story of this corruption started with CLIO. It disconnects ldlm locks when the CL object is destroyed, so ldlm locks with stale LVB info remain in the cache and can be connected to a new cl object via ldlm_lock_match() in osc_enqueue_base(). It looks like loi_kms_valid was designed to solve this problem, but it doesn't work correctly because of osc_lock_lvb_update(), which limits KMS to just the lock policy region, which can be smaller than the maximal known offset. Vitaly and I have started looking for a solution that updates the LVB attributes in the lock with the local update. A sketch of the scenario follows below.
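A hypothetical sketch of that scenario (plain C; all types and the clipping rule are simplified stand-ins based on the description above, not the real osc_lock_lvb_update()): a fresh object created after a layout change inherits KMS from a cached lock's LVB, clipped to the lock's policy region:

    #include <stdio.h>

    struct lvb { unsigned long long kms; };
    struct cached_lock {
            struct lvb lvb;                 /* snapshot from enqueue time */
            unsigned long long pol_start;   /* lock policy (extent) region */
            unsigned long long pol_end;
    };
    struct osc_object { unsigned long long kms; int kms_valid; };

    /* Repopulate a fresh object from a matched cached lock, limiting
     * KMS to the lock's policy region -- the behaviour identified as
     * the bug above. */
    static void lvb_update(struct osc_object *obj, const struct cached_lock *lck)
    {
            unsigned long long kms = lck->lvb.kms;

            if (kms > lck->pol_end + 1)
                    kms = lck->pol_end + 1; /* clipped below the real max offset */
            obj->kms = kms;
            obj->kms_valid = 1;
    }

    int main(void)
    {
            /* Client had written out to ~5 GiB before the layout change,
             * but the matched lock only covers the first ~3 GiB. */
            struct cached_lock lck = {
                    .lvb = { .kms = 5368709120ULL },
                    .pol_start = 0,
                    .pol_end = 3221225471ULL,
            };
            struct osc_object fresh = { 0, 0 };

            lvb_update(&fresh, &lck);
            printf("new object kms=%llu (real max offset was %llu)\n",
                   fresh.kms, lck.lvb.kms);
            return 0;
    }

This also matches the oddity noted in the description, where KMS came back as ~3G while the offset pointed at ~5G. |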
| Comment by Patrick Farrell (Inactive) [ 27/Aug/19 ] |
|
Hmm... But the LVB info can be stale the moment it is created, right? Why is it a problem only if it's been disconnected first...? |
| Comment by Alexey Lyashkov [ 27/Aug/19 ] |
|
@Patrick, the LVB is stale just for a PR lock; you are the KMS/size owner for a PW lock. Lock enqueue stores the LVB info in the LOV object, which is then updated via osc_page_touch_at() / brw_interpret(); see the sketch below.
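A minimal sketch of the PW-side ownership being described (plain C; simplified stand-ins modelled on the role this comment assigns to osc_page_touch_at()/brw_interpret(), not the real code): every local write pushes KMS forward to cover the new end of data:

    #include <stdio.h>

    struct osc_object { unsigned long long kms; };

    /* A local modification advances KMS to cover the newly written
     * extent; for the PW holder this local value is authoritative. */
    static void page_touch_at(struct osc_object *obj,
                              unsigned long long offset, unsigned int count)
    {
            unsigned long long end = offset + count;

            if (obj->kms < end)
                    obj->kms = end;
    }

    int main(void)
    {
            struct osc_object obj = { .kms = 0 };

            page_touch_at(&obj, 0, 2048);
            page_touch_at(&obj, 2048, 2048);
            printf("kms=%llu\n", obj.kms);  /* 4096 */
            return 0;
    }

The point is that the PW holder's locally maintained KMS is the truth, so an outdated LVB snapshot must not be allowed to override it. |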
| Comment by Patrick Farrell (Inactive) [ 27/Aug/19 ] |
|
Hmm, but a PR lock can protect size within a region, since you need a PW lock (which conflicts) to update size. And a PW lock only protects size in the area covered by the lock... |
| Comment by Alexey Lyashkov [ 27/Aug/19 ] |
|
Just in the case where your lock protects EOF; in other cases your size is stale. But KMS is up to date for PW, as KMS is up to date for the region covered by any lock. |
| Comment by Patrick Farrell (Inactive) [ 10/Sep/19 ] |
|
Do you guys have a reproducer on master for this? I'd be curious to maybe take a closer look |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/36198 |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/36199 |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/36200 |
| Comment by Vitaly Fertman [ 17/Sep/19 ] |
|
The problem is regularly reproduced with IOR running with a transfer size not aligned to the page size, and also being run on a SEL layout. E.g., for the 1st problem offset, the IO that happened is: each client writes 33 * 28322K, and each 28322K is written in 1666K transfers; the file by itself:

The problem is that the first 2k of the block are zeros, while the next 2k of the block is the start of the next transfer ((2666450944 + 2k) / 1666k = 1563); however, this is not a segment border, so it was written by the same client (see the check below). SEL is involved just because it triggers layout changes regularly; the problem originally belongs to the layout lock handling in CLIO: osc objects do not survive a layout lock change, so the cached LVB attributes, including the info about the modifications that have been made, are lost, and new osc objects are re-populated with the LVB cache from the outdated lock's LVB. A later preparation of a partial page then checks a KMS which is low enough to skip reading the page in advance, and the untouched part of the page is just left zeroed. The patch descriptions have more detailed explanations of the particular scenarios being fixed.
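A quick check of the arithmetic above (a hypothetical helper, not part of any patch): with a 1666 KiB transfer size, the 2 KiB of zeros at offset 2666450944 end exactly on transfer boundary 1563, and since each 28322 KiB segment is 28322/1666 = 17 transfers, 1563 is not a segment border:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long xfer = 1666ULL * 1024;    /* transfer size */
            unsigned long long seg_xfers = 28322 / 1666; /* 17 transfers per segment */
            unsigned long long bad = 2666450944ULL;      /* start of the zeros */
            unsigned long long idx = (bad + 2048) / xfer;

            printf("transfer index: %llu (remainder %llu)\n",
                   idx, (bad + 2048) % xfer);
            printf("segment border: %s (idx %% %llu = %llu)\n",
                   idx % seg_xfers ? "no" : "yes", seg_xfers, idx % seg_xfers);
            return 0;
    }

It prints transfer index 1563 with remainder 0, and 1563 is not a multiple of 17, confirming the corrupted 2k sits inside a single client's segment. |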
| Comment by Shuichi Ihara [ 19/Sep/19 ] |
|
I got a similar problem on an unaligned single-shared-file workload with IOR. |
| Comment by Alexey Lyashkov [ 19/Sep/19 ] |
|
It's a very generic problem; the primary case is PFL or SEL, where the layout changes from time to time. |
| Comment by Shuichi Ihara [ 19/Sep/19 ] |
|
OK, so patch for |
| Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ] |
|
Ihara, I don't think so. This bug results mostly in zeroes in partial-page reads. In most scenarios this is transient corruption that stays on the client and doesn't go to disk (it can go to disk, but only under special scenarios). You have much larger regions of bad data (most of a write, in all cases I saw), the data is bad on disk, and the data is not zeroes; instead it's mostly random garbage. (It is zeroes in one case, but it's possible for unwritten regions of disk to contain zeroes rather than garbage.) |
| Comment by Alexey Lyashkov [ 19/Sep/19 ] |
|
Patrick, you are wrong. This bug is on the partial page write path; POSIX requires a zeroed page before modifying it:

    static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
                                       struct cl_page *pg, struct file *file)
    {
            ...
            /*
             * If are writing to a new page, no need to read old data.
             * The extent locking will have updated the KMS, and for our
             * purposes here we can treat it like i_size.
             */
            if (attr->cat_kms <= offset) {
                    char *kaddr = ll_kmap_atomic(vpg->vpg_page, KM_USER0);

                    memset(kaddr, 0, cl_page_size(obj));
                    ll_kunmap_atomic(kaddr, KM_USER0);
                    GOTO(out, result = 0);
            }

This causes zeros to be sent to the OST and on to the disk. |
| Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ] |
|
Sorry, I meant the read required to do a partial page write. I know what you're talking about. The read is of the full page, but then the write overwrites part of it. So that's why I said a "partial page read", but that is incorrect, as you pointed out. My point about the corruption being limited in this case to just zeroes, not garbage data, still stands. That's why |
| Comment by Gerrit Updater [ 30/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36199/ |
| Comment by Gerrit Updater [ 30/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36200/ |
| Comment by Peter Jones [ 01/Oct/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 24/Nov/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40739 |
| Comment by Gerrit Updater [ 24/Nov/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40740 |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40739/ |
| Comment by Gerrit Updater [ 17/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40740/ |