Lustre / LU-11426

2/2 Olafs agree: changelog entries are emitted out of order


    Description

      When storing changelog records, we do not strictly order the entries with respect to the index.

      int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                              struct llog_changelog_rec *rec, struct thandle *th)
      {
              ...
      
              /* llog_lvfs_write_rec sets the llog tail len */
              rec->cr_hdr.lrh_type = CHANGELOG_REC;
              rec->cr.cr_time = cl_time();
      
              spin_lock(&mdd->mdd_cl.mc_lock);
              /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
               * but as long as the MDD transactions are ordered correctly for e.g.
               * rename conflicts, I don't think this should matter. */
              rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
              spin_unlock(&mdd->mdd_cl.mc_lock);
      
              ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT);
              if (ctxt == NULL)
                      return -ENXIO;
      
              llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj);
              if (IS_ERR(llog_th))
                      GOTO(out_put, rc = PTR_ERR(llog_th));
      
              /* nested journal transaction */
              rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);
              ...
      }

      This may not be a bug in itself, but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.
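
      The window between taking cr_index under mc_lock and the later llog_add() call is what allows a higher-numbered record to land in the log first. A minimal stand-alone sketch of that interleaving (a hypothetical illustration, not Lustre code):

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
      static unsigned long next_index;
      static unsigned long log_order[2];
      static int log_pos;

      static unsigned long take_index(void)
      {
              unsigned long idx;

              /* serialized, like mc_lock in mdd_changelog_store() */
              pthread_mutex_lock(&index_lock);
              idx = ++next_index;
              pthread_mutex_unlock(&index_lock);
              return idx;
      }

      static void append_record(unsigned long idx)
      {
              /* models llog_add(): the write happens outside index_lock */
              pthread_mutex_lock(&log_lock);
              log_order[log_pos++] = idx;
              pthread_mutex_unlock(&log_lock);
      }

      static void *second_writer(void *arg)
      {
              append_record(take_index());    /* takes index 2, appends at once */
              return NULL;
      }

      int main(void)
      {
              pthread_t t;
              unsigned long first;

              first = take_index();           /* takes index 1 */
              pthread_create(&t, NULL, second_writer, NULL);
              pthread_join(t, NULL);          /* index 2 is already in the log */
              append_record(first);           /* index 1 is appended last */

              /* prints "2 1": on-disk order differs from cr_index order */
              printf("%lu %lu\n", log_order[0], log_order[1]);
              return 0;
      }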


          Activity

            qian_wc Qian Yingjin added a comment -

            Hi Olaf,

            The out-of-order changelog records can easily be reproduced by using the test_160j script (https://review.whamcloud.com/#/c/35650/) on a single machine.

            But I am curious how the HPE DMF solution generates the following message:

            2019-06-21 15:23:51.045796+01:00 946781.704204 tesseract-dmf-core01 dmf-lustre-changelog-processor 39762 39847 INFO ---------- tessfs1-MDT0001 Lost 2276026
            

            Although the current changelog mechanism cannot guarantee that changelog records appear strictly in index order, it should not report "Lost 2276026"; the record should eventually appear within N (which is 1024*2 in the patch https://review.whamcloud.com/#/c/35650/) records of "6423027 - 6423029".

            For the question of how to tell the difference between "missing" and "out of order": I do not think you need to worry about cancelling records that are out of order; the initial version of the above patch already considers this case and handles it correctly.


            olaf Olaf Weber (Inactive) added a comment -

            It turns out that LU-11581 is a major problem for HPE DMF, and since it was marked as a duplicate of this one, that makes this LU a major problem for us.

            simmonsja James A Simmons added a comment -

            So I looked at the timestamps like you described, and it's just handled in a strange way. The timestamp is created with cl_time(), which is in the format secs << 30 + nsecs. Strange that nanoseconds since the epoch wasn't just sent, since it's not hard math.
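
            If a consumer wants wall-clock values back out of cr_time, the packed value can be split apart again; a minimal sketch, assuming the (secs << 30) + nsecs packing described above (cr_time_unpack is a hypothetical helper, not an existing API):

            #include <stdint.h>
            #include <stdio.h>

            /* Split a changelog cr_time value into seconds and nanoseconds,
             * assuming the (secs << 30) + nsecs packing produced by cl_time(). */
            static void cr_time_unpack(uint64_t cr_time, uint64_t *sec, uint32_t *nsec)
            {
                    *sec  = cr_time >> 30;
                    *nsec = cr_time & ((1U << 30) - 1);
            }

            int main(void)
            {
                    uint64_t sec;
                    uint32_t nsec;

                    cr_time_unpack(((uint64_t)1561126957 << 30) + 123456789, &sec, &nsec);
                    printf("%llu.%09u\n", (unsigned long long)sec, nsec);
                    return 0;
            }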

            olaf Olaf Weber (Inactive) added a comment -

            LU-11581 was closed as a duplicate of this one. The reproduction instructions there are of interest for this LU as well.

            olaf Olaf Weber (Inactive) added a comment -

            One more thing: I have now seen cases where two changelog entries affecting the same target have the same nanosecond timestamp. The implication is pretty much that if we cannot use the index to sort entries, because entries may not be emitted in the correct order, then sorting by timestamp will not work either.

            olaf Olaf Weber (Inactive) added a comment -

            And the final complication is that every once in a blue moon, an entry is indeed just not there. So you get a sequence like 6423027 - 6423029 and 6423028 never shows up. If these numbers were always increasing, then a "missing" entry would be easy to spot and handle. But as the situation stands, how do I tell the difference between "missing" and "out of order"? Right now the best I can do is to assume that if an expected entry doesn't show up within the next N entries, it never will. But what is a good value for N?
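
            One workable shape for that heuristic is to remember which indices have been seen inside a sliding window and to declare an index lost only once the stream has run more than N records past it. A minimal sketch (gap_tracker, gap_tracker_feed and lost_after are hypothetical names, not an existing API), assuming indices never drift further apart than WINDOW:

            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            #define WINDOW 4096              /* must be larger than lost_after */

            struct gap_tracker {
                    bool     seen[WINDOW];   /* which pending indices have arrived */
                    uint64_t next_expected;  /* lowest index not yet accounted for */
                    uint64_t highest_seen;
                    uint64_t lost_after;     /* the "N" from the discussion */
            };

            static void gap_tracker_feed(struct gap_tracker *gt, uint64_t index)
            {
                    if (index >= gt->next_expected)
                            gt->seen[index % WINDOW] = true;
                    if (index > gt->highest_seen)
                            gt->highest_seen = index;

                    /* advance past indices that have arrived (possibly out of
                     * order) and give up on any index that is more than
                     * lost_after behind the newest one seen */
                    while (gt->seen[gt->next_expected % WINDOW] ||
                           gt->highest_seen > gt->next_expected + gt->lost_after) {
                            if (!gt->seen[gt->next_expected % WINDOW])
                                    printf("lost %llu\n",
                                           (unsigned long long)gt->next_expected);
                            gt->seen[gt->next_expected % WINDOW] = false;
                            gt->next_expected++;
                    }
            }

            int main(void)
            {
                    static struct gap_tracker gt = { .next_expected = 6423027,
                                                     .lost_after = 2048 };

                    gap_tracker_feed(&gt, 6423027);
                    gap_tracker_feed(&gt, 6423029);  /* 6423028 is still missing */
                    gap_tracker_feed(&gt, 6423028);  /* arrives late; nothing reported lost */
                    return 0;
            }

            The caller seeds next_expected with the first index it expects and picks lost_after (the patch above uses 1024*2); an index is reported lost only after the stream has moved that far beyond it.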

            olaf Olaf Weber (Inactive) added a comment -

            This may not be a bug in itself, but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.

            Looking at the code, the problem may be more at the restart/reading end: clearing apparently stops at the first record with a higher cr_index. But the reader filters records by cr_index even if they come after records that are already being transmitted to the reader.

            In any case, this issue greatly complicates changelog readers.
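
            One conservative way for a consumer to clear records in the face of such gaps is to acknowledge only up to the highest index below the first unresolved gap. A minimal sketch using llapi_changelog_clear() (clear_processed and highest_contiguous are hypothetical names; the tracking of the contiguous run would come from something like the sketch above):

            #include <lustre/lustreapi.h>

            /* Acknowledge changelog records only up to the highest index that is
             * part of a fully processed, contiguous run.  Clearing past a gap
             * risks discarding a record that was merely delivered out of order. */
            static int clear_processed(const char *mdtname, const char *clid,
                                       unsigned long long highest_contiguous)
            {
                    if (highest_contiguous == 0)
                            return 0;
                    return llapi_changelog_clear(mdtname, clid, highest_contiguous);
            }

            Here mdtname would be something like "tessfs1-MDT0001" and clid the registered changelog user id (e.g. "cl1").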

            adilger Andreas Dilger added a comment -

            On the one hand, out-of-order records are not great, and fixing that would be nice if it were possible without significant overhead. That said, this race can only happen for two records that are being generated within the same transaction group, since the transaction handles are already open, so it is not possible for the later (lower-numbered) record to commit separately from the earlier (higher-numbered) record.

            This means it should always be possible to reorder the records within a transaction group to be sequential and contiguous. Whether we should try to do this at the llapi layer, in the llog code, or up in the consumer is an open question, but I'd think it makes sense to do it in llapi or llog so that we don't need to burden every consumer with handling this quirk.
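
            If the reordering were done below the consumer, for example in llapi or in the llog reader, one possible shape is to hold records back until a contiguous run starting at the next expected index exists and only then deliver them. A minimal sketch with hypothetical names (reorder_buf, reorder_put, deliver), not an existing interface:

            #include <stdint.h>
            #include <stdio.h>

            #define REORDER_MAX 2048        /* e.g. the 1024*2 window from the patch above */

            /* Hold records back until the next expected index is present, then
             * deliver the pending records in index order. */
            struct reorder_buf {
                    uint64_t index[REORDER_MAX];    /* pending indices, unsorted */
                    int      count;
                    uint64_t next;                  /* next index to deliver */
            };

            static void deliver(uint64_t index)
            {
                    printf("deliver %llu\n", (unsigned long long)index);
            }

            static void reorder_put(struct reorder_buf *rb, uint64_t index)
            {
                    if (rb->count == REORDER_MAX)
                            return;                 /* a real version would report overflow */
                    rb->index[rb->count++] = index;

                    /* repeatedly pull out the next expected index while it is pending */
                    for (;;) {
                            int i, found = -1;

                            for (i = 0; i < rb->count; i++)
                                    if (rb->index[i] == rb->next)
                                            found = i;
                            if (found < 0)
                                    break;
                            deliver(rb->next);
                            rb->index[found] = rb->index[--rb->count];
                            rb->next++;
                    }
            }

            int main(void)
            {
                    static struct reorder_buf rb = { .next = 6423027 };

                    reorder_put(&rb, 6423027);      /* delivered immediately */
                    reorder_put(&rb, 6423029);      /* held back: 6423028 is missing */
                    reorder_put(&rb, 6423028);      /* delivers 6423028, then 6423029 */
                    return 0;
            }

            A record that genuinely never arrives would stall such a buffer, so it still needs the "give up after N records" escape hatch discussed earlier.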

            People

              qian_wc Qian Yingjin
              jhammond John Hammond
              Votes: 0
              Watchers: 18
