[LU-11426] 2/2 Olafs agree: changelog entries are emitted out of order Created: 25/Sep/18 Updated: 24/Oct/19 Resolved: 28/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Qian Yingjin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ORNL, changelog, hsm, llnl |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When storing changelogs we do not strictly order the entries with respect to index:

    int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                            struct llog_changelog_rec *rec, struct thandle *th)
    {
            ...
            /* llog_lvfs_write_rec sets the llog tail len */
            rec->cr_hdr.lrh_type = CHANGELOG_REC;
            rec->cr.cr_time = cl_time();

            spin_lock(&mdd->mdd_cl.mc_lock);
            /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
             * but as long as the MDD transactions are ordered correctly for e.g.
             * rename conflicts, I don't think this should matter. */
            rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
            spin_unlock(&mdd->mdd_cl.mc_lock);

            ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT);
            if (ctxt == NULL)
                    return -ENXIO;

            llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj);
            if (IS_ERR(llog_th))
                    GOTO(out_put, rc = PTR_ERR(llog_th));

            /* nested journal transaction */
            rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);

The cr_index is assigned under mc_lock, but llog_add() runs after the lock is dropped, so a record with a higher cr_index can be written to the llog before a record with a lower one. This may not be a bug in itself, but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries. |
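To make the race easier to see outside the MDD code path, here is a minimal userspace sketch. It is not Lustre code; the printf "log" and the sleep are illustrative only. It shows the same pattern as above: the index is assigned under a lock, but the append happens after the lock is dropped, so records can land in the log out of index order.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t idx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long next_index;

static void *changelog_store(void *arg)
{
	unsigned long long idx;

	(void)arg;

	/* Index assignment is serialized, as in mdd_changelog_store()... */
	pthread_mutex_lock(&idx_lock);
	idx = ++next_index;
	pthread_mutex_unlock(&idx_lock);

	/* ...but the append is not, so a thread holding a higher index can
	 * reach the log first.  The sleep just widens the window for the demo. */
	if ((idx & 1) == 0)
		usleep(1000);

	pthread_mutex_lock(&log_lock);
	printf("appended record cr_index=%llu\n", idx);
	pthread_mutex_unlock(&log_lock);
	return NULL;
}

int main(void)
{
	pthread_t tids[8];
	int i;

	for (i = 0; i < 8; i++)
		pthread_create(&tids[i], NULL, changelog_store, NULL);
	for (i = 0; i < 8; i++)
		pthread_join(tids[i], NULL);
	return 0;
}
```

Built with `cc -pthread`, a few runs will print cr_index values out of order, which is exactly the gap a changelog consumer observes.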
| Comments |
| Comment by Andreas Dilger [ 25/Sep/18 ] |
|
On the one hand, out-of-order records are not great and fixing that would be nice if it was possible without significant overhead. That said, this race can only happen for two records that are being generated within the same transaction group, since the transaction handles are already open, so it is not possible for the later (lower-numbered) record to commit in a different transaction group than the earlier (higher-numbered) record; the two will always commit together. This means it should always be possible to reorder the records within a transaction group to be sequential and contiguous. Whether we should try to do this at the llapi layer, in the llog code, or up in the consumer is an open question, but I'd think it makes sense to do it in llapi or llog so that we don't need to burden every consumer with handling this quirk. |
| Comment by Olaf Weber [ 26/Sep/18 ] |
|
"This may not be a bug in itself but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries." Looking at the code, the problem may be more at the restarting/reading end: clearing apparently stops at the first record with a higher cr_index, but the reader filters records by cr_index even if they come after records that are being transmitted to the reader. In any case, this issue greatly complicates changelog readers. |
| Comment by Olaf Weber [ 30/Sep/18 ] |
|
And the final complication is that every once in a blue moon, an entry is indeed just not there. So you get a sequence like 6423027 - 6423029, and 6423028 never shows up. If these numbers were always increasing, then a "missing" entry would be easy to spot and handle. But as the situation stands, how do I tell the difference between "missing" and "out of order"? Right now the best I can do is assume that if an expected entry doesn't show up within the next N entries, it never will. But what is a good value for N? |
| Comment by Olaf Weber [ 29/Oct/18 ] |
|
One more thing: I have now seen cases where two changelog entries affecting the same target have the same nanosecond timestamp. The implication is pretty much that if we cannot use the index to sort entries, because it turns out that entries may not be emitted in the correct order, then sorting by timestamp will not work either. |
| Comment by Olaf Weber [ 18/Apr/19 ] |
|
|
| Comment by James A Simmons [ 27/Jun/19 ] |
|
So I looked at the timestamps like you described and it's just handled in a strange way. The timestamp is created with cl_time(), which encodes the value as (secs << 30) + nsecs. Strange that nanoseconds since the epoch weren't just sent, since it's not hard math. |
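For reference, a minimal userspace sketch of the encoding James describes (the pack/unpack helpers and names here are illustrative, not the actual Lustre cl_time() from the kernel headers). Since nanoseconds are always below 10^9, they fit in 30 bits (2^30 = 1073741824), so the two fields can be recovered exactly by a shift and a mask:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Pack the current time the way the comment above describes cl_time(). */
static uint64_t cl_time_pack(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return ((uint64_t)ts.tv_sec << 30) + (uint64_t)ts.tv_nsec;
}

/* Split a packed value back into seconds and nanoseconds. */
static void cl_time_unpack(uint64_t t, uint64_t *sec, uint32_t *nsec)
{
	*sec = t >> 30;
	*nsec = (uint32_t)(t & ((1U << 30) - 1));
}

int main(void)
{
	uint64_t t = cl_time_pack(), sec;
	uint32_t nsec;

	cl_time_unpack(t, &sec, &nsec);
	printf("packed=%llu -> %llu.%09u\n",
	       (unsigned long long)t, (unsigned long long)sec, nsec);
	return 0;
}
```

So the format is lossless, just not directly comparable to a plain nanosecond count; and as Olaf notes above, even an exact timestamp does not give a total order when two records share the same nanosecond.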
| Comment by Olaf Weber [ 06/Aug/19 ] |
|
It turns out that |
| Comment by Qian Yingjin [ 07/Aug/19 ] |
|
Hi Olaf,

The out-of-order changelog records can easily be reproduced using the test_160j script (https://review.whamcloud.com/#/c/35650/) on a single machine. But I am curious how the HPE DMF solution generates the following message:

2019-06-21 15:23:51.045796+01:00 946781.704204 tesseract-dmf-core01 dmf-lustre-changelog-processor 39762 39847 INFO ---------- tessfs1-MDT0001 Lost 2276026

Although the current changelog mechanism cannot ensure that changelog records are strictly in index order, record 2276026 should not be reported as lost; it should eventually appear within N records (N is 1024*2 in the patch https://review.whamcloud.com/#/c/35650/) of its neighbours, as in the "6423027 - 6423029" example above.

As for the question "how do I tell the difference between 'missing' and 'out of order'?": I think you do not need to worry about cancelling records that are out of order; the initial version of the above patch already considers this case and can handle it correctly. |
| Comment by Olaf Weber [ 07/Aug/19 ] |
|
The DMF7 changelog processor tracks the indices of records it has read but not yet cleared. If there appears to be a gap in that sequence, it will not clear beyond that gap until at least 1000 newer records have been seen. At that point it assumes that the record in the gap will never turn up, reports it as "lost", and allows clearing the log beyond the gap. If, according to the patch, that window should be extended from 1000 to 1024*2 = 2048, we can certainly do that, as it is a simple tunable of the program. However, when I disable clearing the changelog and then read it using lfs, I can easily verify that most lost records do not show up within the next million records. In a handful of cases records show up inconsistently. When that happens, they always appear to be part of a small out-of-order sequence. For what it is worth, the DMF changelog processor has never, to my knowledge, encountered |
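To make the window policy discussed in this thread concrete, here is a small self-contained C sketch, not taken from DMF or from the Whamcloud patch (all names and the window size are illustrative): a consumer-side reorder buffer that re-sequences out-of-order records, as Andreas suggested, and declares an index "lost" only after a bounded number of newer records have been seen, similar to the 1000/2048-record window described above.

```c
#include <stdio.h>

#define WINDOW 8	/* toy window; a real consumer would use 1000-2048 */

struct reorder_buf {
	unsigned long long next;		/* next cr_index to deliver */
	unsigned long long held[WINDOW + 1];	/* records parked out of order */
	int nheld;
};

static void deliver(unsigned long long idx)
{
	printf("deliver cr_index=%llu\n", idx);
}

/* Release every parked record that is now in sequence. */
static void drain(struct reorder_buf *rb)
{
	int progress = 1;

	while (progress) {
		progress = 0;
		for (int i = 0; i < rb->nheld; i++) {
			if (rb->held[i] == rb->next) {
				deliver(rb->next++);
				rb->held[i] = rb->held[--rb->nheld];
				progress = 1;
				break;
			}
		}
	}
}

/* Feed one record, in the order it was read from the changelog. */
static void reorder_add(struct reorder_buf *rb, unsigned long long idx)
{
	if (idx < rb->next)
		return;			/* already delivered or given up on */
	rb->held[rb->nheld++] = idx;
	drain(rb);
	while (rb->nheld > WINDOW) {
		/* The window filled up and rb->next never arrived: report it
		 * lost so clearing can proceed past the gap. */
		printf("lost cr_index=%llu\n", rb->next);
		rb->next++;
		drain(rb);
	}
}

int main(void)
{
	struct reorder_buf rb = { .next = 1 };
	/* Arrival order as read from the log: 2 before 1, and 4 never arrives. */
	unsigned long long arrivals[] = { 2, 1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 };

	for (unsigned int i = 0; i < sizeof(arrivals) / sizeof(arrivals[0]); i++)
		reorder_add(&rb, arrivals[i]);
	return 0;
}
```

Running this delivers 1-3 in order despite the swapped arrival, and only reports index 4 as lost once more than WINDOW newer records have been seen; clearing the real changelog would then be allowed up to the highest delivered index.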
| Comment by Olaf Weber [ 09/Aug/19 ] |
|
HPE is also working on this issue with Cray, and we've been testing code that appears to improve matters. (Cray bug id may be LUS-7691) |
| Comment by Gerrit Updater [ 14/Sep/19 ] |
|
Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/36187 |
| Comment by Gerrit Updater [ 27/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36187/ |
| Comment by Peter Jones [ 28/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Bruno Travouillon (Inactive) [ 28/Sep/19 ] |
|
Hi Peter, |
| Comment by Peter Jones [ 28/Sep/19 ] |
|
Yes absolutely - a number of sites have hit this |
| Comment by James A Simmons [ 28/Sep/19 ] |
|
Backporting this fix will require a bunch of other changelog fixes to land first. One of them is the patch that adds changelog polling. Is that patch needed for 2.12 LTS? |
| Comment by Gerrit Updater [ 28/Sep/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36316 |
| Comment by Peter Jones [ 28/Sep/19 ] |
|
simmonsja could you please be more specific as to what is missing from b2_12 before the benefits of this patch could be realized? |
| Comment by James A Simmons [ 01/Oct/19 ] |
|
We also have Lustre commit d7bb6647cd4dd26949bceb6a099cd606623aff2b (" The other patch is
|
| Comment by Peter Jones [ 01/Oct/19 ] |
|
Thanks James. The second patch looks like a nice performance improvement but I am not sure how this would relate to changelog correctness. Are you just listing all the patches that you are currently applying to b2_12 regardless of their relevance to this particular issue? |
| Comment by James A Simmons [ 01/Oct/19 ] |
|
I'm also applying |
| Comment by Gerrit Updater [ 08/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36316/ |