[LU-11426] 2/2 Olafs agree: changelog entries are emitted out of order Created: 25/Sep/18  Updated: 24/Oct/19  Resolved: 28/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Qian Yingjin
Resolution: Fixed Votes: 0
Labels: ORNL, changelog, hsm, llnl

Issue Links:
Duplicate
is duplicated by LU-12869 interop: lustre-rsync-test test 2a fa... Open
Related
is related to LU-11205 Failure to clear the changelog for us... Resolved
is related to LU-11581 Not all changelog entries are returne... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When storing changelogs we do not strictly order the entries with respect to index.

int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                        struct llog_changelog_rec *rec, struct thandle *th)
{
        ...

        /* llog_lvfs_write_rec sets the llog tail len */
        rec->cr_hdr.lrh_type = CHANGELOG_REC;
        rec->cr.cr_time = cl_time();

        spin_lock(&mdd->mdd_cl.mc_lock);
        /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
         * but as long as the MDD transactions are ordered correctly for e.g.
         * rename conflicts, I don't think this should matter. */
        rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
        spin_unlock(&mdd->mdd_cl.mc_lock);

        ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT);
        if (ctxt == NULL)
                return -ENXIO;

        llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj);
        if (IS_ERR(llog_th))
                GOTO(out_put, rc = PTR_ERR(llog_th));

        /* nested journal transaction */
        rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);

This may not be a bug in itself but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.



 Comments   
Comment by Andreas Dilger [ 25/Sep/18 ]

On the one hand, out-of-order records are not great and fixing that would be nice if it was possible without significant overhead.  That said, this race can only happen for two records that are being generated within the same transaction group, since the transaction handles are already open, so it is not possible that the later (lower-numbered) record will always commit together with the earlier (higher-numbered) record.

This means it should always be possible to reorder the records within a transaction group to be sequential and contiguous. Whether we should try to do this at the llapi layer, in the llog code, or up in the consumer is an open question, but I'd think it makes sense to do it in llapi or llog so that we don't need to burden every consumer to handle this quirk.

Comment by Olaf Weber [ 26/Sep/18 ]

This may not be a bug in itself but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.

Looking at the code the problem may be more at the restarting reading end: clearing apparently stops at the first record with a higher cr_index. But the reader filters records by cr_index even if they come after records that are being transmitted to the reader.

In any case, this issue greatly complicates changelog readers.

Comment by Olaf Weber [ 30/Sep/18 ]

And the final complication is that every once in a blue moon, an entry is indeed just not there. So you get a sequence like 6423027 - 6423029 and 6423028 never shows up. If these numbers were always increasing, then a "missing" entry is easy to spot and handle. But as the situation stands how do I tell the difference between "missing" and "out of order"? Right now the best I can do is to assume that if an expected entry doesn't show up within the next N entries, I should assume it won't ever. But what is a good value for N?

Comment by Olaf Weber [ 29/Oct/18 ]

One more thing: I have now see cases where two changelog entries affecting the same target have the same nanosecond timestamp. The implication of that is pretty much that if we cannot use the Index to sort entries, and it turns out that entries may not be emitted in the correct order, then sorting by timestamp will not work either.

Comment by Olaf Weber [ 18/Apr/19 ]

LU-11581 was closed as a duplicate of this one. The replication instructions are of interest for this LU as well.

Comment by James A Simmons [ 27/Jun/19 ]

So I looked at the timestamps like you described and its just handled in a strange way. The time stamp is created with cl_time() which is in the format of secs << 30 + nsecs. Strange that nanoseconds since epoch wasn't just sent since its not hard math  

Comment by Olaf Weber [ 06/Aug/19 ]

It turns out that LU-11581 is a major problem for HPE DMF, and since it was marked a duplicate of this one, that makes this LU a major problem for us

Comment by Qian Yingjin [ 07/Aug/19 ]

Hi Olaf,

The out of order for changlog records can be easily reproduced by using the test_160j scripts (https://review.whamcloud.com/#/c/35650/) in a single machine.

But I am curious that how does HPE DMF solution generate the following message:

2019-06-21 15:23:51.045796+01:00 946781.704204 tesseract-dmf-core01 dmf-lustre-changelog-processor 39762 39847 INFO ---------- tessfs1-MDT0001 Lost 2276026

Although current changlog mechanism can not ensure the changelog records strictly in the index order, but it should not "Lost 2276026", it should be appeared in N (which is 1024*2 in the patch https://review.whamcloud.com/#/c/35650/) records around "6423027 - 6423029" finally.

 

For the question: how do I tell the difference between "missing" and "out of order"? 

I think you do not worry about cancellation the records that are out of order, the initial version of above patch has already consider this case, can handle it correctly.

Comment by Olaf Weber [ 07/Aug/19 ]

The DMF7 changelog processor tracks the indices of records it has read but not yet cleared. If there appears to be a gap in that sequence it will not clear beyond that hap until at least 1000 newer records have been seen. At that point it assumes that the record in the gap will never turn up, reports it as "lost", and allows clearing the log beyond the gap.

If, according to the patch, that window should be extended from 1000 to 1024*2 = 2048, we can certainly do that, as it is a simple tunable of the program.

However, when I disable clearing the changelog, and then read it using lfs, I can easily verify that most lost records do not show up within the next million records. In a handful of cases records show up inconsistently. When that happens, they always appear to be part of a small out-of-order sequence.

For what it is worth, the dmf changelog processor has never, to my knowledge, encountered LU-11205.

Comment by Olaf Weber [ 09/Aug/19 ]

HPE is also working on this issue with Cray, and we've been testing code that appears to improve matters. (Cray bug id may be LUS-7691)

Comment by Gerrit Updater [ 14/Sep/19 ]

Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/36187
Subject: LU-11426 llog: changelog records reordering
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4611acc55049002eb6736d8fc504aab23e4a71c

Comment by Gerrit Updater [ 27/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36187/
Subject: LU-11426 llog: changelog records reordering
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1fa0a984c5c3863d8f40b3b0d63c3d08cfa1a9f0

Comment by Peter Jones [ 28/Sep/19 ]

Landed for 2.13

Comment by Bruno Travouillon (Inactive) [ 28/Sep/19 ]

Hi Peter,
Can we consider a backport to LTS12? (2.12.4?)

Comment by Peter Jones [ 28/Sep/19 ]

Yes absolutely - a number of sites have hit this

Comment by James A Simmons [ 28/Sep/19 ]

To back port this fix will require a bunch of other change log fixes to land first. One of them is the add polling patch. Is that patch needed for 2.12 LTS?

Comment by Gerrit Updater [ 28/Sep/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36316
Subject: LU-11426 llog: changelog records reordering
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2b7ea4e3cf43f18b1e1354044d024e085701946e

Comment by Peter Jones [ 28/Sep/19 ]

simmonsja could you please be more specific as to what is missing from b2_12 before the benefits of this patch could be realized?

Comment by James A Simmons [ 01/Oct/19 ]

We have also

Lustre commit d7bb6647cd4dd26949bceb6a099cd606623aff2b ("LU-11626 mdc: hold obd while processing changelog")

The other patch is LU-12533 but I don't know if that is more a feature or a fix so its not clear in that case.

 

Comment by Peter Jones [ 01/Oct/19 ]

Thanks James. The second patch looks like a nice performance improvement but I am not sure how this would relate to changelog correctness. Are you just listing all the patches that you are currently applying to b2_12 regardless of their relevance to this particular issue?

Comment by James A Simmons [ 01/Oct/19 ]

I'm also applying LU-11617 which is a potential bug as well as the above patches. The reason for also including LU-12533 was so the patches applied cleanly to 2.12 LTS. If its not needed then thats fine

Comment by Gerrit Updater [ 08/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36316/
Subject: LU-11426 llog: changelog records reordering
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5c5a1e9b4839c3d6a70b3b7e768944f6dc237c2e

Generated at Sat Feb 10 02:43:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.