Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14699

changelog garbage collection is too lax

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      The changelog garbage collection enabled by LU-12871 is too lazy. It will only purge an idle changelog user and its records if the changelog itself is nearly full:

              if (likely(mdd->mdd_changelog_gc &&
                           mdd->mdd_cl.mc_gc_task == MDD_CHLG_GC_NONE &&
                           ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time >
                              mdd->mdd_changelog_min_gc_interval)) {
                      if (unlikely(llog_cat_free_space(ctxt->loc_handle) <=
                                   mdd->mdd_changelog_min_free_cat_entries ||
                                   OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD))) {
                              CWARN("%s:%s low on changelog_catalog free entries, "
                                    "starting ChangeLog garbage collection thread\n",
                                    obd->obd_name,
                                    OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD) ?
                                      " simulate" : "");
      

      The default mdd_changelog_min_free_cat_entries=2 and mdd_changelog_min_gc_interval=3600 so it will only check every hour if the changelog is within 2x65000 = 130000 entries of overflowing (out of ~4B entries), even if the changelog has been idle for weeks (with reduced settings, just to verify it is not evicted):

      # lctl get_param mdd.*.changelog*                       |
      mdd.myth-MDT0000.changelog_deniednext=60                                        |
      mdd.myth-MDT0000.changelog_gc=1                                                 
      mdd.myth-MDT0000.changelog_max_idle_indexes=20800000                            
      mdd.myth-MDT0000.changelog_max_idle_time=2500000                                
      mdd.myth-MDT0000.changelog_min_free_cat_entries=2                               
      mdd.myth-MDT0000.changelog_min_gc_interval=3600                                 
      mdd.myth-MDT0000.changelog_size=3857464008                                      
      mdd.myth-MDT0000.changelog_mask=                                                
      MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SAT
      TR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC                                      
      mdd.myth-MDT0000.changelog_users=
      current index: 98130425
      ID    index (idle seconds)
      cl3   77315666 (2512315)
      

      It would be better to evict idle changelog users after a week or two, which is plenty of time to get a broken consumer working again, even if the log isn't totally full.

      Attachments

        Issue Links

          Activity

            [LU-14699] changelog garbage collection is too lax

            I realize this ticket is closed, but this seems an appropriate place to ask:

            Deregistration of an idle consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. Wouldn't it make more sense to mark this consumer internally as "idle" and simply ignore it during the lowest-unconsumed-record check? Then if consumer does come back to life, it still has access the (remaining) changelog records (and we remove the "idle" flag). Less impact on users for an intermittent consumer. Idle consumers can be reported in changelog_users. And @Mikhail_Pershin's concern about evicting an apparently idle consumer on an idle system.

            nrutman Nathan Rutman added a comment - I realize this ticket is closed, but this seems an appropriate place to ask: Deregistration of an idle consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. Wouldn't it make more sense to mark this consumer internally as "idle" and simply ignore it during the lowest-unconsumed-record check? Then if consumer does come back to life, it still has access the (remaining) changelog records (and we remove the "idle" flag). Less impact on users for an intermittent consumer. Idle consumers can be reported in changelog_users. And @Mikhail_Pershin's concern about evicting an apparently idle consumer on an idle system.

            I have backported the https://review.whamcloud.com/46203 for testing purpose.

            eaujames Etienne Aujames added a comment - I have backported the https://review.whamcloud.com/46203 for testing purpose.

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46203
            Subject: LU-14699 mdd: proactive changelog garbage collection
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: ec383a43b5d236c538bb24e8c22b95e577c74401

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46203 Subject: LU-14699 mdd: proactive changelog garbage collection Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: ec383a43b5d236c538bb24e8c22b95e577c74401
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45068/
            Subject: LU-14699 mdd: proactive changelog garbage collection
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f60b307c5001e1d9035af61d2344af33d3ea0f85

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45068/ Subject: LU-14699 mdd: proactive changelog garbage collection Project: fs/lustre-release Branch: master Current Patch Set: Commit: f60b307c5001e1d9035af61d2344af33d3ea0f85

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45068
            Subject: LU-14699 mdd: proactive changelog garbage collection
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 892a30f5ae6fc580afc20f3d6fdd5edfcf9e6fc0

            gerrit Gerrit Updater added a comment - "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45068 Subject: LU-14699 mdd: proactive changelog garbage collection Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 892a30f5ae6fc580afc20f3d6fdd5edfcf9e6fc0

            > Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

            cur_time is also set on changelog_clear. And an active changelog user should be calling changelog_clear periodically. So I think it's OK to consider this an idle metric.

            jhammond John Hammond added a comment - > Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age cur_time is also set on changelog_clear . And an active changelog user should be calling changelog_clear periodically. So I think it's OK to consider this an idle metric.
            tappro Mikhail Pershin added a comment - - edited

            the key question is how to determine that user is idle, that was what I meant. Can there be changelog user that is active for a long time? E.g. Robingood HSM engine, if it will work with the same registered user more than our max_idle_time then it will be deregistered, because we don't know if user is idle or is still needed. Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

            tappro Mikhail Pershin added a comment - - edited the key question is how to determine that user is idle, that was what I meant. Can there be changelog user that is active for a long time? E.g. Robingood HSM engine, if it will work with the same registered user more than our max_idle_time then it will be deregistered, because we don't know if user is idle or is still needed. Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age
            adilger Andreas Dilger added a comment - - edited

            The Changelog is not meant to be a lifelong history of all events that have happened to the filesystem, but rather an improvement over "inotify" and similar "watch for current/recent events in the filesystem" that only work if some process is actively watching the filesystem every second. Changelog persistence is mainly to avoid short-term problems if the consumer crashes, or the MDS crashes, but should not be expected to work after weeks or months of idle time. At that point, it is faster to just scan the whole MDT again instead of processing a billion Changelog records, and every Changelog consumer would have to be able to do full filesystem scans to start off anyway.

            We've had lots of issues with users registering changelog users, and then forgetting them, registering them multiple times and leaving the old ones idle, etc. This results in either the MDT becoming full, or the Changelog becoming full, and preventing all MDT operations from succeeding. The duplicate user registration will be helped by patch https://review.whamcloud.com/43380 "LU-13055 mdd: per-user changelog names and mask", but that isn't the only reason why idle Changelog users exist, and it will be a while before named users are created by all applications.

            I think there are few important changes needed. The current GC doesn't appear to work properly at all (current stats on 2.14.0 server):

            # # lctl get_param mdd.*.changelog*
            mdd.myth-MDT0000.changelog_deniednext=60
            mdd.myth-MDT0000.changelog_gc=1
            mdd.myth-MDT0000.changelog_max_idle_indexes=20971520
            mdd.myth-MDT0000.changelog_max_idle_time=2592000
            mdd.myth-MDT0000.changelog_min_free_cat_entries=2
            mdd.myth-MDT0000.changelog_min_gc_interval=30
            mdd.myth-MDT0000.changelog_size=3884936000
            mdd.myth-MDT0000.changelog_mask=
            MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC 
            mdd.myth-MDT0000.changelog_users=
            current index: 98287389
            ID    index (idle seconds)
            cl3   77315666 (12082313)
            # echo $((98287389 - 77315666))
            20971723
            

            This has been exceeding mdd.*.changelog_max_idle_time=2592000 for months without the user being removed. I also just reduced mdd.*.changelog_max_idle_indexes=20971520 and mdd.*.changelog_min_gc_interval=30 and created enough records to exceed this limit without any GC being triggered. The logic should definitely be changed that if either changelog_max_idle_indexes or changelog_max_idle_time is exceeded then the user should be deregistered, since the default values for these are very conservative (2B indexes, 30 days idle), and should be honored if they are explicitly set to lower values.

            Secondly, there is the question if we need better default values for the GC limits? We might also consider the case of setting the default:

                     mdd->mdd_changelog_max_idle_indexes = min(CHLOG_MAX_IDLE_INDEXES, total_inodes);
            

            since it doesn't make sense to keep 2B changelog records if the MDT only has 100M inodes, and this bounds the upper changelog size by the MDT size (avg. 180 bytes/record, so within the default free bytes per inode). Above that, it would be faster to just reprocess every file in the filesystem. This, combined with the previous change may be enough to avoid 99% of the current problems.

            In DDN-2174, I suggested "(current_index - idle_index) * idle_seconds / 86400 > 4B" as a reasonable heuristic for deciding if a user is "too idle". That balances cases where there are lots records being created quickly and need to be consumed (e.g. 4B in 2 days ~= 50000/sec continuously) against a user that is idle for a very long time (e.g. 4B in 30 days ~= 1700/sec), but is possibly too complex for users to understand clearly.

            I think the default has to be that Changelog users will be deregistered if they are idle. That is already supposed to be the case since patch https://review.whamcloud.com/36467 "LU-12871 mdd: enable Changelog garbage collection" was landed in 2.14 (was supposed to have been the default back in 2.11). In the vast majority of cases, the weeks-idle Changelog user isn't important to anyone (or there would have been a consumer), and having the full Changelog/MDT cause the filesystem to become unusable (see many linked bugs) doesn't make anyone happy.

            If there is a need for it, a later patch can add an option --gc-disable for a specific user if this really is a critical consumer. I could imagine that audit records on some very restricted systems may prefer to keep very old Changelog records over having a working filesystem, but really they should just make the audit consumer run properly. It might separately be useful to have an option like --gc-clear or similar that continually clears the oldest records (e.g. to the start of the next Changelog file each time) for the oldest user(s) instead of deregistering the user(s), but I'm not sure there is a specific need for that yet.

            adilger Andreas Dilger added a comment - - edited The Changelog is not meant to be a lifelong history of all events that have happened to the filesystem, but rather an improvement over "inotify" and similar "watch for current/recent events in the filesystem" that only work if some process is actively watching the filesystem every second. Changelog persistence is mainly to avoid short-term problems if the consumer crashes, or the MDS crashes, but should not be expected to work after weeks or months of idle time. At that point, it is faster to just scan the whole MDT again instead of processing a billion Changelog records, and every Changelog consumer would have to be able to do full filesystem scans to start off anyway. We've had lots of issues with users registering changelog users, and then forgetting them, registering them multiple times and leaving the old ones idle, etc. This results in either the MDT becoming full, or the Changelog becoming full, and preventing all MDT operations from succeeding. The duplicate user registration will be helped by patch https://review.whamcloud.com/43380 " LU-13055 mdd: per-user changelog names and mask ", but that isn't the only reason why idle Changelog users exist, and it will be a while before named users are created by all applications. I think there are few important changes needed. The current GC doesn't appear to work properly at all (current stats on 2.14.0 server): # # lctl get_param mdd.*.changelog* mdd.myth-MDT0000.changelog_deniednext=60 mdd.myth-MDT0000.changelog_gc=1 mdd.myth-MDT0000.changelog_max_idle_indexes=20971520 mdd.myth-MDT0000.changelog_max_idle_time=2592000 mdd.myth-MDT0000.changelog_min_free_cat_entries=2 mdd.myth-MDT0000.changelog_min_gc_interval=30 mdd.myth-MDT0000.changelog_size=3884936000 mdd.myth-MDT0000.changelog_mask= MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC mdd.myth-MDT0000.changelog_users= current index: 98287389 ID index (idle seconds) cl3 77315666 (12082313) # echo $((98287389 - 77315666)) 20971723 This has been exceeding mdd.*.changelog_max_idle_time=2592000 for months without the user being removed. I also just reduced mdd.*.changelog_max_idle_indexes=20971520 and mdd.*.changelog_min_gc_interval=30 and created enough records to exceed this limit without any GC being triggered. The logic should definitely be changed that if either changelog_max_idle_indexes or changelog_max_idle_time is exceeded then the user should be deregistered, since the default values for these are very conservative (2B indexes, 30 days idle), and should be honored if they are explicitly set to lower values. Secondly, there is the question if we need better default values for the GC limits? We might also consider the case of setting the default: mdd->mdd_changelog_max_idle_indexes = min(CHLOG_MAX_IDLE_INDEXES, total_inodes); since it doesn't make sense to keep 2B changelog records if the MDT only has 100M inodes, and this bounds the upper changelog size by the MDT size (avg. 180 bytes/record, so within the default free bytes per inode). Above that, it would be faster to just reprocess every file in the filesystem. This, combined with the previous change may be enough to avoid 99% of the current problems. In DDN-2174, I suggested " (current_index - idle_index) * idle_seconds / 86400 > 4B " as a reasonable heuristic for deciding if a user is "too idle". That balances cases where there are lots records being created quickly and need to be consumed (e.g. 4B in 2 days ~= 50000/sec continuously) against a user that is idle for a very long time (e.g. 4B in 30 days ~= 1700/sec), but is possibly too complex for users to understand clearly. I think the default has to be that Changelog users will be deregistered if they are idle. That is already supposed to be the case since patch https://review.whamcloud.com/36467 " LU-12871 mdd: enable Changelog garbage collection " was landed in 2.14 (was supposed to have been the default back in 2.11). In the vast majority of cases, the weeks-idle Changelog user isn't important to anyone (or there would have been a consumer), and having the full Changelog/MDT cause the filesystem to become unusable (see many linked bugs) doesn't make anyone happy. If there is a need for it, a later patch can add an option --gc-disable for a specific user if this really is a critical consumer. I could imagine that audit records on some very restricted systems may prefer to keep very old Changelog records over having a working filesystem, but really they should just make the audit consumer run properly. It might separately be useful to have an option like --gc-clear or similar that continually clears the oldest records (e.g. to the start of the next Changelog file each time) for the oldest user(s) instead of deregistering the user(s), but I'm not sure there is a specific need for that yet.
            tappro Mikhail Pershin added a comment - - edited

            I am not sure about the following thing - consider there is user supposed to stay long and doing changelog purging by itself, so changelog is not too big. Nowadays  it will stay forever doing its job. With new approach it will be dropped after changelog_max_idle_time being considered as 'idle'. I don't think this is expected behavior on our side. You said we could prevent users from aggressive GC when needed, but that would mean we have to consider all old users as such just because we don't know their status. That means such users must be deregistered manually after all.

            As for flag itself, for the same compatibility reason it should be flag which allows agressive CG for user, the question is what behavior is default while user registering, e.g. --aggressive_gc or --disable_gc option to be introduced

             

            tappro Mikhail Pershin added a comment - - edited I am not sure about the following thing - consider there is user supposed to stay long and doing changelog purging by itself, so changelog is not too big. Nowadays  it will stay forever doing its job. With new approach it will be dropped after changelog_max_idle_time being considered as 'idle'. I don't think this is expected behavior on our side. You said we could prevent users from aggressive GC when needed, but that would mean we have to consider all old users as such just because we don't know their status. That means such users must be deregistered manually after all. As for flag itself, for the same compatibility reason it should be flag which allows agressive CG for user, the question is what behavior is default while user registering, e.g. --aggressive_gc or --disable_gc option to be introduced  

            People

              tappro Mikhail Pershin
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: