Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11426

2/2 Olafs agree: changelog entries are emitted out of order

Details

    • 3
    • 9223372036854775807

    Description

      When storing changelogs we do not strictly order the entries with respect to index.

      int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                              struct llog_changelog_rec *rec, struct thandle *th)
      {
              ...
      
              /* llog_lvfs_write_rec sets the llog tail len */
              rec->cr_hdr.lrh_type = CHANGELOG_REC;
              rec->cr.cr_time = cl_time();
      
              spin_lock(&mdd->mdd_cl.mc_lock);
              /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
               * but as long as the MDD transactions are ordered correctly for e.g.
               * rename conflicts, I don't think this should matter. */
              rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
              spin_unlock(&mdd->mdd_cl.mc_lock);
      
              ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT);
              if (ctxt == NULL)
                      return -ENXIO;
      
              llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj);
              if (IS_ERR(llog_th))
                      GOTO(out_put, rc = PTR_ERR(llog_th));
      
              /* nested journal transaction */
              rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);
      

      This may not be a bug in itself but it means that a changelog consumer must account for gaps in the sequence when clearing changelog entries.

      Attachments

        Issue Links

          Activity

            [LU-11426] 2/2 Olafs agree: changelog entries are emitted out of order

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36316
            Subject: LU-11426 llog: changelog records reordering
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2b7ea4e3cf43f18b1e1354044d024e085701946e

            gerrit Gerrit Updater added a comment - Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36316 Subject: LU-11426 llog: changelog records reordering Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 2b7ea4e3cf43f18b1e1354044d024e085701946e

            To back port this fix will require a bunch of other change log fixes to land first. One of them is the add polling patch. Is that patch needed for 2.12 LTS?

            simmonsja James A Simmons added a comment - To back port this fix will require a bunch of other change log fixes to land first. One of them is the add polling patch. Is that patch needed for 2.12 LTS?
            pjones Peter Jones added a comment -

            Yes absolutely - a number of sites have hit this

            pjones Peter Jones added a comment - Yes absolutely - a number of sites have hit this

            Hi Peter,
            Can we consider a backport to LTS12? (2.12.4?)

            bruno.travouillon Bruno Travouillon (Inactive) added a comment - Hi Peter, Can we consider a backport to LTS12? (2.12.4?)
            pjones Peter Jones added a comment -

            Landed for 2.13

            pjones Peter Jones added a comment - Landed for 2.13

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36187/
            Subject: LU-11426 llog: changelog records reordering
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1fa0a984c5c3863d8f40b3b0d63c3d08cfa1a9f0

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36187/ Subject: LU-11426 llog: changelog records reordering Project: fs/lustre-release Branch: master Current Patch Set: Commit: 1fa0a984c5c3863d8f40b3b0d63c3d08cfa1a9f0

            Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/36187
            Subject: LU-11426 llog: changelog records reordering
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a4611acc55049002eb6736d8fc504aab23e4a71c

            gerrit Gerrit Updater added a comment - Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/36187 Subject: LU-11426 llog: changelog records reordering Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a4611acc55049002eb6736d8fc504aab23e4a71c

            HPE is also working on this issue with Cray, and we've been testing code that appears to improve matters. (Cray bug id may be LUS-7691)

            olaf Olaf Weber (Inactive) added a comment - HPE is also working on this issue with Cray, and we've been testing code that appears to improve matters. (Cray bug id may be LUS-7691)
            olaf Olaf Weber (Inactive) added a comment - - edited

            The DMF7 changelog processor tracks the indices of records it has read but not yet cleared. If there appears to be a gap in that sequence it will not clear beyond that hap until at least 1000 newer records have been seen. At that point it assumes that the record in the gap will never turn up, reports it as "lost", and allows clearing the log beyond the gap.

            If, according to the patch, that window should be extended from 1000 to 1024*2 = 2048, we can certainly do that, as it is a simple tunable of the program.

            However, when I disable clearing the changelog, and then read it using lfs, I can easily verify that most lost records do not show up within the next million records. In a handful of cases records show up inconsistently. When that happens, they always appear to be part of a small out-of-order sequence.

            For what it is worth, the dmf changelog processor has never, to my knowledge, encountered LU-11205.

            olaf Olaf Weber (Inactive) added a comment - - edited The DMF7 changelog processor tracks the indices of records it has read but not yet cleared. If there appears to be a gap in that sequence it will not clear beyond that hap until at least 1000 newer records have been seen. At that point it assumes that the record in the gap will never turn up, reports it as "lost", and allows clearing the log beyond the gap. If, according to the patch, that window should be extended from 1000 to 1024*2 = 2048, we can certainly do that, as it is a simple tunable of the program. However, when I disable clearing the changelog, and then read it using lfs, I can easily verify that most lost records do not show up within the next million records. In a handful of cases records show up inconsistently. When that happens, they always appear to be part of a small out-of-order sequence. For what it is worth, the dmf changelog processor has never, to my knowledge, encountered  LU-11205 .
            qian_wc Qian Yingjin added a comment -

            Hi Olaf,

            The out of order for changlog records can be easily reproduced by using the test_160j scripts (https://review.whamcloud.com/#/c/35650/) in a single machine.

            But I am curious that how does HPE DMF solution generate the following message:

            2019-06-21 15:23:51.045796+01:00 946781.704204 tesseract-dmf-core01 dmf-lustre-changelog-processor 39762 39847 INFO ---------- tessfs1-MDT0001 Lost 2276026
            

            Although current changlog mechanism can not ensure the changelog records strictly in the index order, but it should not "Lost 2276026", it should be appeared in N (which is 1024*2 in the patch https://review.whamcloud.com/#/c/35650/) records around "6423027 - 6423029" finally.

             

            For the question: how do I tell the difference between "missing" and "out of order"? 

            I think you do not worry about cancellation the records that are out of order, the initial version of above patch has already consider this case, can handle it correctly.

            qian_wc Qian Yingjin added a comment - Hi Olaf, The out of order for changlog records can be easily reproduced by using the test_160j scripts ( https://review.whamcloud.com/#/c/35650/ ) in a single machine. But I am curious that how does HPE DMF solution generate the following message: 2019-06-21 15:23:51.045796+01:00 946781.704204 tesseract-dmf-core01 dmf-lustre-changelog-processor 39762 39847 INFO ---------- tessfs1-MDT0001 Lost 2276026 Although current changlog mechanism can not ensure the changelog records strictly in the index order, but it should not "Lost 2276026", it should be appeared in N (which is 1024*2 in the patch https://review.whamcloud.com/#/c/35650/ ) records around "6423027 - 6423029" finally.   For the question: how do I tell the difference between "missing" and "out of order"?  I think you do not worry about cancellation the records that are out of order, the initial version of above patch has already consider this case, can handle it correctly.

            It turns out that LU-11581 is a major problem for HPE DMF, and since it was marked a duplicate of this one, that makes this LU a major problem for us

            olaf Olaf Weber (Inactive) added a comment - It turns out that  LU-11581 is a major problem for HPE DMF, and since it was marked a duplicate of this one, that makes this LU a major problem for us

            People

              qian_wc Qian Yingjin
              jhammond John Hammond
              Votes:
              0 Vote for this issue
              Watchers:
              18 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: