Lustre / LU-17290

Don't deregister idle changelog consumers

Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      Some of our customers complain that restarting their changelog consumers is too "high touch": they have to interact with each MDT manually to register a new consumer ID, then change the configuration of their Kubernetes/Kafka (or similar) setup and redeploy. They might be OK with missing the records (which happens whether or not they have to re-register); they just don't like the additional ID reconfiguration hassle.

      Deregistration of an idle changelog consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. It would make more sense to mark this consumer internally as "stale" and simply ignore it during the lowest-unconsumed-record check. Then if the consumer does come back to life, we remove the "stale" flag and the consumer still has access to the (remaining) changelog records.
      This means less impact on users with an intermittently working consumer. Stale consumers can be reported in mdd.*.changelog_users.
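
To illustrate the proposed lowest-unconsumed-record check, here is a minimal sketch. It is not the MDD implementation: `struct consumer`, its fields, and `lowest_live_rec()` are hypothetical names invented for this example. It only shows the idea that a consumer marked stale stays registered but no longer holds back the purge point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a registered changelog consumer; this is an
 * illustration, not the actual MDD changelog user record. */
struct consumer {
	uint32_t id;        /* consumer ID, e.g. 1 for "cl1" */
	uint64_t last_rec;  /* last record index this consumer has cleared */
	bool     stale;     /* proposed flag: idle too long, but still registered */
};

/* Return the lowest cleared index among non-stale consumers, i.e. the
 * point up to which records could be purged. Stale consumers are
 * skipped instead of being deregistered. Returns UINT64_MAX when no
 * live consumer is holding records back. */
static uint64_t lowest_live_rec(const struct consumer *c, size_t n)
{
	uint64_t min = UINT64_MAX;

	assert(c != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (c[i].stale)
			continue;       /* ignored by the check, ID is kept */
		if (c[i].last_rec < min)
			min = c[i].last_rec;
	}
	return min;
}
```

With three consumers at records 100, 10 (stale) and 250, the check returns 100. If the stale flag on the second consumer is removed, the same check returns 10 again, so that consumer resumes holding back whatever records remain.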

      If a stale consumer is still alive and connected, it can continue consuming records. (An idle consumer on an idle system would feel no impact.)

      If a stale consumer is disconnected and restarted, it would start again with its old ID in llapi_changelog_start(), which would return -ESTALE in this case. Consumers that are aware of this feature can take whatever recovery action they need and then start a second time, which would succeed. Old consumers that don't understand ESTALE would presumably fail with the error and require manual intervention, just as with the current deregistration/re-registration (which would also still work).
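
A sketch of how an ESTALE-aware consumer might handle the restart described above. To keep it self-contained, it does not call llapi_changelog_start() directly: `start_fn` and `recover_fn` are hypothetical callbacks standing in for a wrapper around that call and for the consumer's own recovery step, and `struct fake_mdt` is a toy stand-in for the MDS side. The toy assumes the first start attempt both returns -ESTALE and clears the stale flag, which is how the proposal reads but is not confirmed behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callbacks: start_fn stands in for a wrapper around
 * llapi_changelog_start() using the old consumer ID; recover_fn is
 * whatever the consumer does after missing records (e.g. a rescan). */
typedef int (*start_fn)(void *ctx);
typedef int (*recover_fn)(void *ctx);

/* Start the consumer; on -ESTALE run recovery, then start once more. */
static int start_with_stale_recovery(start_fn start, recover_fn recover,
				     void *ctx)
{
	int rc;

	assert(start != NULL);
	rc = start(ctx);
	if (rc != -ESTALE)
		return rc;      /* success, or an unrelated error */

	/* Records may have been purged while this consumer was stale. */
	if (recover != NULL) {
		rc = recover(ctx);
		if (rc < 0)
			return rc;
	}
	/* Second start with the same ID is expected to succeed. */
	return start(ctx);
}

/* Toy stand-in for the MDS side, for illustration only. */
struct fake_mdt {
	int stale;      /* 1 if the consumer is currently marked stale */
	int starts;     /* number of start attempts seen */
	int recovered;  /* number of recovery runs */
};

static int fake_start(void *ctx)
{
	struct fake_mdt *m = ctx;

	m->starts++;
	if (m->stale) {
		m->stale = 0;   /* assumption: first attempt clears the flag */
		return -ESTALE;
	}
	return 0;
}

static int fake_recover(void *ctx)
{
	struct fake_mdt *m = ctx;

	m->recovered++;
	return 0;
}
```

Calling `start_with_stale_recovery(fake_start, fake_recover, &m)` with `m.stale = 1` makes two start attempts and one recovery run. A consumer that was never marked stale starts on the first attempt and skips recovery entirely.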

      The important part is that this way, modern consumers can automatically do their recovery without having to do anything special on the MDS itself.

      See also LU-14699

            People

              wc-triage WC Triage
              nrutman Nathan Rutman