[LU-14699] changelog garbage collection is too lax - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.15.0
Affects Version/s: None
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

The changelog garbage collection enabled by ~~LU-12871~~ is too lazy. It will only purge an idle changelog user and its records if the changelog itself is nearly full:

        if (likely(mdd->mdd_changelog_gc &&
                     mdd->mdd_cl.mc_gc_task == MDD_CHLG_GC_NONE &&
                     ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time >
                        mdd->mdd_changelog_min_gc_interval)) {
                if (unlikely(llog_cat_free_space(ctxt->loc_handle) <=
                             mdd->mdd_changelog_min_free_cat_entries ||
                             OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD))) {
                        CWARN("%s:%s low on changelog_catalog free entries, "
                              "starting ChangeLog garbage collection thread\n",
                              obd->obd_name,
                              OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD) ?
                                " simulate" : "");

The default mdd_changelog_min_free_cat_entries=2 and mdd_changelog_min_gc_interval=3600 so it will only check every hour if the changelog is within 2x65000 = 130000 entries of overflowing (out of ~4B entries), even if the changelog has been idle for weeks (with reduced settings, just to verify it is not evicted):

# lctl get_param mdd.*.changelog*                       |
mdd.myth-MDT0000.changelog_deniednext=60                                        |
mdd.myth-MDT0000.changelog_gc=1                                                 
mdd.myth-MDT0000.changelog_max_idle_indexes=20800000                            
mdd.myth-MDT0000.changelog_max_idle_time=2500000                                
mdd.myth-MDT0000.changelog_min_free_cat_entries=2                               
mdd.myth-MDT0000.changelog_min_gc_interval=3600                                 
mdd.myth-MDT0000.changelog_size=3857464008                                      
mdd.myth-MDT0000.changelog_mask=                                                
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SAT
TR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC                                      
mdd.myth-MDT0000.changelog_users=
current index: 98130425
ID    index (idle seconds)
cl3   77315666 (2512315)

It would be better to evict idle changelog users after a week or two, which is plenty of time to get a broken consumer working again, even if the log isn't totally full.

Attachments

Issue Links

is related to

LU-14626 Idle Changelog user not deregistered

Resolved

LU-17290 Don't deregister idle changelog consumers

Open

LU-13772 mdt: changelog_deregister takes too long

Resolved

LU-15524 initiate changelog GC by lack of free space

Resolved

is related to

LU-12871 enable changelog garbage collection by default

Resolved

LU-14688 Changelog cancel improvement

Resolved

LU-13055 add ability for named Changelog consumers

Closed

(2 is related to )

Activity

[LU-14699] changelog garbage collection is too lax

Nathan Rutman added a comment - 09/Nov/23 7:10 PM

I realize this ticket is closed, but this seems an appropriate place to ask:

Deregistration of an idle consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. Wouldn't it make more sense to mark this consumer internally as "idle" and simply ignore it during the lowest-unconsumed-record check? Then if consumer does come back to life, it still has access the (remaining) changelog records (and we remove the "idle" flag). Less impact on users for an intermittent consumer. Idle consumers can be reported in changelog_users. And @Mikhail_Pershin's concern about evicting an apparently idle consumer on an idle system.

Nathan Rutman added a comment - 09/Nov/23 7:10 PM I realize this ticket is closed, but this seems an appropriate place to ask: Deregistration of an idle consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. Wouldn't it make more sense to mark this consumer internally as "idle" and simply ignore it during the lowest-unconsumed-record check? Then if consumer does come back to life, it still has access the (remaining) changelog records (and we remove the "idle" flag). Less impact on users for an intermittent consumer. Idle consumers can be reported in changelog_users. And @Mikhail_Pershin's concern about evicting an apparently idle consumer on an idle system.

Etienne Aujames added a comment - 19/Jan/22 5:04 PM

I have backported the https://review.whamcloud.com/46203 for testing purpose.

Etienne Aujames added a comment - 19/Jan/22 5:04 PM I have backported the https://review.whamcloud.com/46203 for testing purpose.

Gerrit Updater added a comment - 19/Jan/22 4:59 PM

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46203
Subject: ~~LU-14699~~ mdd: proactive changelog garbage collection
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ec383a43b5d236c538bb24e8c22b95e577c74401

Gerrit Updater added a comment - 19/Jan/22 4:59 PM "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46203 Subject: LU-14699 mdd: proactive changelog garbage collection Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: ec383a43b5d236c538bb24e8c22b95e577c74401

Peter Jones added a comment - 30/Nov/21 1:56 PM

Landed for 2.15

Peter Jones added a comment - 30/Nov/21 1:56 PM Landed for 2.15

Gerrit Updater added a comment - 30/Nov/21 3:47 AM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45068/
Subject: ~~LU-14699~~ mdd: proactive changelog garbage collection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f60b307c5001e1d9035af61d2344af33d3ea0f85

Gerrit Updater added a comment - 30/Nov/21 3:47 AM "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45068/ Subject: LU-14699 mdd: proactive changelog garbage collection Project: fs/lustre-release Branch: master Current Patch Set: Commit: f60b307c5001e1d9035af61d2344af33d3ea0f85

Gerrit Updater added a comment - 27/Sep/21 10:34 PM

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45068
Subject: ~~LU-14699~~ mdd: proactive changelog garbage collection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 892a30f5ae6fc580afc20f3d6fdd5edfcf9e6fc0

Gerrit Updater added a comment - 27/Sep/21 10:34 PM "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45068 Subject: LU-14699 mdd: proactive changelog garbage collection Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 892a30f5ae6fc580afc20f3d6fdd5edfcf9e6fc0

John Hammond added a comment - 14/Sep/21 12:05 PM

> Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

cur_time is also set on changelog_clear. And an active changelog user should be calling changelog_clear periodically. So I think it's OK to consider this an idle metric.

John Hammond added a comment - 14/Sep/21 12:05 PM > Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age cur_time is also set on changelog_clear . And an active changelog user should be calling changelog_clear periodically. So I think it's OK to consider this an idle metric.

Mikhail Pershin added a comment - 09/Sep/21 8:52 PM - edited

the key question is how to determine that user is idle, that was what I meant. Can there be changelog user that is active for a long time? E.g. Robingood HSM engine, if it will work with the same registered user more than our max_idle_time then it will be deregistered, because we don't know if user is idle or is still needed. Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

Mikhail Pershin added a comment - 09/Sep/21 8:52 PM - edited the key question is how to determine that user is idle, that was what I meant. Can there be changelog user that is active for a long time? E.g. Robingood HSM engine, if it will work with the same registered user more than our max_idle_time then it will be deregistered, because we don't know if user is idle or is still needed. Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

Andreas Dilger added a comment - 09/Sep/21 8:14 PM - edited

The Changelog is not meant to be a lifelong history of all events that have happened to the filesystem, but rather an improvement over "inotify" and similar "watch for current/recent events in the filesystem" that only work if some process is actively watching the filesystem every second. Changelog persistence is mainly to avoid short-term problems if the consumer crashes, or the MDS crashes, but should not be expected to work after weeks or months of idle time. At that point, it is faster to just scan the whole MDT again instead of processing a billion Changelog records, and every Changelog consumer would have to be able to do full filesystem scans to start off anyway.

We've had lots of issues with users registering changelog users, and then forgetting them, registering them multiple times and leaving the old ones idle, etc. This results in either the MDT becoming full, or the Changelog becoming full, and preventing all MDT operations from succeeding. The duplicate user registration will be helped by patch https://review.whamcloud.com/43380 "LU-13055 mdd: per-user changelog names and mask", but that isn't the only reason why idle Changelog users exist, and it will be a while before named users are created by all applications.

I think there are few important changes needed. The current GC doesn't appear to work properly at all (current stats on 2.14.0 server):

# # lctl get_param mdd.*.changelog*
mdd.myth-MDT0000.changelog_deniednext=60
mdd.myth-MDT0000.changelog_gc=1
mdd.myth-MDT0000.changelog_max_idle_indexes=20971520
mdd.myth-MDT0000.changelog_max_idle_time=2592000
mdd.myth-MDT0000.changelog_min_free_cat_entries=2
mdd.myth-MDT0000.changelog_min_gc_interval=30
mdd.myth-MDT0000.changelog_size=3884936000
mdd.myth-MDT0000.changelog_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC 
mdd.myth-MDT0000.changelog_users=
current index: 98287389
ID    index (idle seconds)
cl3   77315666 (12082313)
# echo $((98287389 - 77315666))
20971723

This has been exceeding mdd.*.changelog_max_idle_time=2592000 for months without the user being removed. I also just reduced mdd.*.changelog_max_idle_indexes=20971520 and mdd.*.changelog_min_gc_interval=30 and created enough records to exceed this limit without any GC being triggered. The logic should definitely be changed that if either changelog_max_idle_indexes or changelog_max_idle_time is exceeded then the user should be deregistered, since the default values for these are very conservative (2B indexes, 30 days idle), and should be honored if they are explicitly set to lower values.

Secondly, there is the question if we need better default values for the GC limits? We might also consider the case of setting the default:

         mdd->mdd_changelog_max_idle_indexes = min(CHLOG_MAX_IDLE_INDEXES, total_inodes);

since it doesn't make sense to keep 2B changelog records if the MDT only has 100M inodes, and this bounds the upper changelog size by the MDT size (avg. 180 bytes/record, so within the default free bytes per inode). Above that, it would be faster to just reprocess every file in the filesystem. This, combined with the previous change may be enough to avoid 99% of the current problems.

In DDN-2174, I suggested "(current_index - idle_index) * idle_seconds / 86400 > 4B" as a reasonable heuristic for deciding if a user is "too idle". That balances cases where there are lots records being created quickly and need to be consumed (e.g. 4B in 2 days ~= 50000/sec continuously) against a user that is idle for a very long time (e.g. 4B in 30 days ~= 1700/sec), but is possibly too complex for users to understand clearly.

I think the default has to be that Changelog users will be deregistered if they are idle. That is already supposed to be the case since patch https://review.whamcloud.com/36467 "LU-12871 mdd: enable Changelog garbage collection" was landed in 2.14 (was supposed to have been the default back in 2.11). In the vast majority of cases, the weeks-idle Changelog user isn't important to anyone (or there would have been a consumer), and having the full Changelog/MDT cause the filesystem to become unusable (see many linked bugs) doesn't make anyone happy.

If there is a need for it, a later patch can add an option --gc-disable for a specific user if this really is a critical consumer. I could imagine that audit records on some very restricted systems may prefer to keep very old Changelog records over having a working filesystem, but really they should just make the audit consumer run properly. It might separately be useful to have an option like --gc-clear or similar that continually clears the oldest records (e.g. to the start of the next Changelog file each time) for the oldest user(s) instead of deregistering the user(s), but I'm not sure there is a specific need for that yet.

Andreas Dilger added a comment - 09/Sep/21 8:14 PM - edited The Changelog is not meant to be a lifelong history of all events that have happened to the filesystem, but rather an improvement over "inotify" and similar "watch for current/recent events in the filesystem" that only work if some process is actively watching the filesystem every second. Changelog persistence is mainly to avoid short-term problems if the consumer crashes, or the MDS crashes, but should not be expected to work after weeks or months of idle time. At that point, it is faster to just scan the whole MDT again instead of processing a billion Changelog records, and every Changelog consumer would have to be able to do full filesystem scans to start off anyway. We've had lots of issues with users registering changelog users, and then forgetting them, registering them multiple times and leaving the old ones idle, etc. This results in either the MDT becoming full, or the Changelog becoming full, and preventing all MDT operations from succeeding. The duplicate user registration will be helped by patch https://review.whamcloud.com/43380 " LU-13055 mdd: per-user changelog names and mask ", but that isn't the only reason why idle Changelog users exist, and it will be a while before named users are created by all applications. I think there are few important changes needed. The current GC doesn't appear to work properly at all (current stats on 2.14.0 server): # # lctl get_param mdd.*.changelog* mdd.myth-MDT0000.changelog_deniednext=60 mdd.myth-MDT0000.changelog_gc=1 mdd.myth-MDT0000.changelog_max_idle_indexes=20971520 mdd.myth-MDT0000.changelog_max_idle_time=2592000 mdd.myth-MDT0000.changelog_min_free_cat_entries=2 mdd.myth-MDT0000.changelog_min_gc_interval=30 mdd.myth-MDT0000.changelog_size=3884936000 mdd.myth-MDT0000.changelog_mask= MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC mdd.myth-MDT0000.changelog_users= current index: 98287389 ID index (idle seconds) cl3 77315666 (12082313) # echo $((98287389 - 77315666)) 20971723 This has been exceeding mdd.*.changelog_max_idle_time=2592000 for months without the user being removed. I also just reduced mdd.*.changelog_max_idle_indexes=20971520 and mdd.*.changelog_min_gc_interval=30 and created enough records to exceed this limit without any GC being triggered. The logic should definitely be changed that if either changelog_max_idle_indexes or changelog_max_idle_time is exceeded then the user should be deregistered, since the default values for these are very conservative (2B indexes, 30 days idle), and should be honored if they are explicitly set to lower values. Secondly, there is the question if we need better default values for the GC limits? We might also consider the case of setting the default: mdd->mdd_changelog_max_idle_indexes = min(CHLOG_MAX_IDLE_INDEXES, total_inodes); since it doesn't make sense to keep 2B changelog records if the MDT only has 100M inodes, and this bounds the upper changelog size by the MDT size (avg. 180 bytes/record, so within the default free bytes per inode). Above that, it would be faster to just reprocess every file in the filesystem. This, combined with the previous change may be enough to avoid 99% of the current problems. In DDN-2174, I suggested " (current_index - idle_index) * idle_seconds / 86400 > 4B " as a reasonable heuristic for deciding if a user is "too idle". That balances cases where there are lots records being created quickly and need to be consumed (e.g. 4B in 2 days ~= 50000/sec continuously) against a user that is idle for a very long time (e.g. 4B in 30 days ~= 1700/sec), but is possibly too complex for users to understand clearly. I think the default has to be that Changelog users will be deregistered if they are idle. That is already supposed to be the case since patch https://review.whamcloud.com/36467 " LU-12871 mdd: enable Changelog garbage collection " was landed in 2.14 (was supposed to have been the default back in 2.11). In the vast majority of cases, the weeks-idle Changelog user isn't important to anyone (or there would have been a consumer), and having the full Changelog/MDT cause the filesystem to become unusable (see many linked bugs) doesn't make anyone happy. If there is a need for it, a later patch can add an option --gc-disable for a specific user if this really is a critical consumer. I could imagine that audit records on some very restricted systems may prefer to keep very old Changelog records over having a working filesystem, but really they should just make the audit consumer run properly. It might separately be useful to have an option like --gc-clear or similar that continually clears the oldest records (e.g. to the start of the next Changelog file each time) for the oldest user(s) instead of deregistering the user(s), but I'm not sure there is a specific need for that yet.

Mikhail Pershin added a comment - 09/Sep/21 1:05 PM - edited

I am not sure about the following thing - consider there is user supposed to stay long and doing changelog purging by itself, so changelog is not too big. Nowadays it will stay forever doing its job. With new approach it will be dropped after changelog_max_idle_time being considered as 'idle'. I don't think this is expected behavior on our side. You said we could prevent users from aggressive GC when needed, but that would mean we have to consider all old users as such just because we don't know their status. That means such users must be deregistered manually after all.

As for flag itself, for the same compatibility reason it should be flag which allows agressive CG for user, the question is what behavior is default while user registering, e.g. --aggressive_gc or --disable_gc option to be introduced

Mikhail Pershin added a comment - 09/Sep/21 1:05 PM - edited I am not sure about the following thing - consider there is user supposed to stay long and doing changelog purging by itself, so changelog is not too big. Nowadays it will stay forever doing its job. With new approach it will be dropped after changelog_max_idle_time being considered as 'idle'. I don't think this is expected behavior on our side. You said we could prevent users from aggressive GC when needed, but that would mean we have to consider all old users as such just because we don't know their status. That means such users must be deregistered manually after all. As for flag itself, for the same compatibility reason it should be flag which allows agressive CG for user, the question is what behavior is default while user registering, e.g. --aggressive_gc or --disable_gc option to be introduced

People

Assignee:: Mikhail Pershin

Reporter:: Andreas Dilger

Votes:: 0 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 22/May/21 12:04 AM

Updated:: 05/Nov/24 11:23 PM

Resolved:: 30/Nov/21 1:56 PM