Lustre / LU-11426

2/2 Olafs agree: changelog entries are emitted out of order


    Description

      When storing changelog records, we do not strictly order the entries with respect to the index.

      int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                              struct llog_changelog_rec *rec, struct thandle *th)
      {
              ...
      
              /* llog_lvfs_write_rec sets the llog tail len */
              rec->cr_hdr.lrh_type = CHANGELOG_REC;
              rec->cr.cr_time = cl_time();
      
              spin_lock(&mdd->mdd_cl.mc_lock);
              /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
               * but as long as the MDD transactions are ordered correctly for e.g.
               * rename conflicts, I don't think this should matter. */
              rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
              spin_unlock(&mdd->mdd_cl.mc_lock);
      
              ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT);
              if (ctxt == NULL)
                      return -ENXIO;
      
              llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj);
              if (IS_ERR(llog_th))
                      GOTO(out_put, rc = PTR_ERR(llog_th));
      
              /* nested journal transaction */
              rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);
              ...
      }

      This may not be a bug in itself, but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.
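
      The window between taking cr_index under mc_lock and the later llog_add() call is what allows a higher-numbered record to land in the log first. A minimal stand-alone sketch of that interleaving (a hypothetical illustration, not Lustre code):

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
      static unsigned long next_index;
      static unsigned long log_order[2];
      static int log_pos;

      static unsigned long take_index(void)
      {
              unsigned long idx;

              /* serialized, like mc_lock in mdd_changelog_store() */
              pthread_mutex_lock(&index_lock);
              idx = ++next_index;
              pthread_mutex_unlock(&index_lock);
              return idx;
      }

      static void append_record(unsigned long idx)
      {
              /* models llog_add(): the write happens outside index_lock */
              pthread_mutex_lock(&log_lock);
              log_order[log_pos++] = idx;
              pthread_mutex_unlock(&log_lock);
      }

      static void *second_writer(void *arg)
      {
              append_record(take_index());    /* takes index 2, appends at once */
              return NULL;
      }

      int main(void)
      {
              pthread_t t;
              unsigned long first;

              first = take_index();           /* takes index 1 */
              pthread_create(&t, NULL, second_writer, NULL);
              pthread_join(t, NULL);          /* index 2 is already in the log */
              append_record(first);           /* index 1 is appended last */

              /* prints "2 1": on-disk order differs from cr_index order */
              printf("%lu %lu\n", log_order[0], log_order[1]);
              return 0;
      }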


          Activity

            qian_wc Qian Yingjin added a comment -

            Hi Olaf,

            The out-of-order changelog records can easily be reproduced by using the test_160j script (https://review.whamcloud.com/#/c/35650/) on a single machine.

            But I am curious how the HPE DMF solution generates the following message:

            2019-06-21 15:23:51.045796+01:00 946781.704204 tesseract-dmf-core01 dmf-lustre-changelog-processor 39762 39847 INFO ---------- tessfs1-MDT0001 Lost 2276026
            

            Although the current changelog mechanism cannot guarantee that changelog records appear strictly in index order, it should not report "Lost 2276026"; the record should eventually appear within N (which is 1024*2 in the patch https://review.whamcloud.com/#/c/35650/) records of "6423027 - 6423029".

            For the question of how to tell the difference between "missing" and "out of order": I do not think you need to worry about cancelling records that are out of order; the initial version of the above patch already considers this case and handles it correctly.


            olaf Olaf Weber (Inactive) added a comment -

            It turns out that LU-11581 is a major problem for HPE DMF, and since it was marked as a duplicate of this one, that makes this LU a major problem for us.

            simmonsja James A Simmons added a comment -

            So I looked at the timestamps like you described, and it's just handled in a strange way. The timestamp is created with cl_time(), which is in the format secs << 30 + nsecs. Strange that nanoseconds since the epoch wasn't just sent, since it's not hard math.
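
            If a consumer wants wall-clock values back out of cr_time, the packed value can be split apart again; a minimal sketch, assuming the (secs << 30) + nsecs packing described above (cr_time_unpack is a hypothetical helper, not an existing API):

            #include <stdint.h>
            #include <stdio.h>

            /* Split a changelog cr_time value into seconds and nanoseconds,
             * assuming the (secs << 30) + nsecs packing produced by cl_time(). */
            static void cr_time_unpack(uint64_t cr_time, uint64_t *sec, uint32_t *nsec)
            {
                    *sec  = cr_time >> 30;
                    *nsec = cr_time & ((1U << 30) - 1);
            }

            int main(void)
            {
                    uint64_t sec;
                    uint32_t nsec;

                    cr_time_unpack(((uint64_t)1561126957 << 30) + 123456789, &sec, &nsec);
                    printf("%llu.%09u\n", (unsigned long long)sec, nsec);
                    return 0;
            }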

            olaf Olaf Weber (Inactive) added a comment -

            LU-11581 was closed as a duplicate of this one. The reproduction instructions there are of interest for this LU as well.

            olaf Olaf Weber (Inactive) added a comment -

            One more thing: I have now seen cases where two changelog entries affecting the same target have the same nanosecond timestamp. The implication is pretty much that if we cannot use the index to sort entries, because entries may not be emitted in the correct order, then sorting by timestamp will not work either.

            olaf Olaf Weber (Inactive) added a comment -

            And the final complication is that every once in a blue moon, an entry is indeed just not there. So you get a sequence like 6423027 - 6423029 and 6423028 never shows up. If these numbers were always increasing, then a "missing" entry would be easy to spot and handle. But as the situation stands, how do I tell the difference between "missing" and "out of order"? Right now the best I can do is to assume that if an expected entry doesn't show up within the next N entries, it never will. But what is a good value for N?
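
            One workable shape for that heuristic is to remember which indices have been seen inside a sliding window and to declare an index lost only once the stream has run more than N records past it. A minimal sketch (gap_tracker, gap_tracker_feed and lost_after are hypothetical names, not an existing API), assuming indices never drift further apart than WINDOW:

            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            #define WINDOW 4096              /* must be larger than lost_after */

            struct gap_tracker {
                    bool     seen[WINDOW];   /* which pending indices have arrived */
                    uint64_t next_expected;  /* lowest index not yet accounted for */
                    uint64_t highest_seen;
                    uint64_t lost_after;     /* the "N" from the discussion */
            };

            static void gap_tracker_feed(struct gap_tracker *gt, uint64_t index)
            {
                    if (index >= gt->next_expected)
                            gt->seen[index % WINDOW] = true;
                    if (index > gt->highest_seen)
                            gt->highest_seen = index;

                    /* advance past indices that have arrived (possibly out of
                     * order) and give up on any index that is more than
                     * lost_after behind the newest one seen */
                    while (gt->seen[gt->next_expected % WINDOW] ||
                           gt->highest_seen > gt->next_expected + gt->lost_after) {
                            if (!gt->seen[gt->next_expected % WINDOW])
                                    printf("lost %llu\n",
                                           (unsigned long long)gt->next_expected);
                            gt->seen[gt->next_expected % WINDOW] = false;
                            gt->next_expected++;
                    }
            }

            int main(void)
            {
                    static struct gap_tracker gt = { .next_expected = 6423027,
                                                     .lost_after = 2048 };

                    gap_tracker_feed(&gt, 6423027);
                    gap_tracker_feed(&gt, 6423029);  /* 6423028 is still missing */
                    gap_tracker_feed(&gt, 6423028);  /* arrives late; nothing reported lost */
                    return 0;
            }

            The caller seeds next_expected with the first index it expects and picks lost_after (the patch above uses 1024*2); an index is reported lost only after the stream has moved that far beyond it.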

            olaf Olaf Weber (Inactive) added a comment -

            This may not be a bug in itself, but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.

            Looking at the code, the problem may be more at the restart/reading end: clearing apparently stops at the first record with a higher cr_index. But the reader filters records by cr_index even if they come after records that are already being transmitted to the reader.

            In any case, this issue greatly complicates changelog readers.
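
            One conservative way for a consumer to clear records in the face of such gaps is to acknowledge only up to the highest index below the first unresolved gap. A minimal sketch using llapi_changelog_clear() (clear_processed and highest_contiguous are hypothetical names; the tracking of the contiguous run would come from something like the sketch above):

            #include <lustre/lustreapi.h>

            /* Acknowledge changelog records only up to the highest index that is
             * part of a fully processed, contiguous run.  Clearing past a gap
             * risks discarding a record that was merely delivered out of order. */
            static int clear_processed(const char *mdtname, const char *clid,
                                       unsigned long long highest_contiguous)
            {
                    if (highest_contiguous == 0)
                            return 0;
                    return llapi_changelog_clear(mdtname, clid, highest_contiguous);
            }

            Here mdtname would be something like "tessfs1-MDT0001" and clid the registered changelog user id (e.g. "cl1").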

            adilger Andreas Dilger added a comment -

            On the one hand, out-of-order records are not great, and fixing that would be nice if it were possible without significant overhead. That said, this race can only happen for two records that are being generated within the same transaction group, since the transaction handles are already open, so it is not possible for the later (lower-numbered) record to commit separately from the earlier (higher-numbered) record.

            This means it should always be possible to reorder the records within a transaction group to be sequential and contiguous. Whether we should try to do this at the llapi layer, in the llog code, or up in the consumer is an open question, but I'd think it makes sense to do it in llapi or llog so that we don't need to burden every consumer with handling this quirk.
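
            If the reordering were done below the consumer, for example in llapi or in the llog reader, one possible shape is to hold records back until a contiguous run starting at the next expected index exists and only then deliver them. A minimal sketch with hypothetical names (reorder_buf, reorder_put, deliver), not an existing interface:

            #include <stdint.h>
            #include <stdio.h>

            #define REORDER_MAX 2048        /* e.g. the 1024*2 window from the patch above */

            /* Hold records back until the next expected index is present, then
             * deliver the pending records in index order. */
            struct reorder_buf {
                    uint64_t index[REORDER_MAX];    /* pending indices, unsorted */
                    int      count;
                    uint64_t next;                  /* next index to deliver */
            };

            static void deliver(uint64_t index)
            {
                    printf("deliver %llu\n", (unsigned long long)index);
            }

            static void reorder_put(struct reorder_buf *rb, uint64_t index)
            {
                    if (rb->count == REORDER_MAX)
                            return;                 /* a real version would report overflow */
                    rb->index[rb->count++] = index;

                    /* repeatedly pull out the next expected index while it is pending */
                    for (;;) {
                            int i, found = -1;

                            for (i = 0; i < rb->count; i++)
                                    if (rb->index[i] == rb->next)
                                            found = i;
                            if (found < 0)
                                    break;
                            deliver(rb->next);
                            rb->index[found] = rb->index[--rb->count];
                            rb->next++;
                    }
            }

            int main(void)
            {
                    static struct reorder_buf rb = { .next = 6423027 };

                    reorder_put(&rb, 6423027);      /* delivered immediately */
                    reorder_put(&rb, 6423029);      /* held back: 6423028 is missing */
                    reorder_put(&rb, 6423028);      /* delivers 6423028, then 6423029 */
                    return 0;
            }

            A record that genuinely never arrives would stall such a buffer, so it still needs the "give up after N records" escape hatch discussed earlier.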

            People

              qian_wc Qian Yingjin
              jhammond John Hammond
              Votes: 0
              Watchers: 18
