[LU-14699] changelog garbage collection is too lax Created: 22/May/21  Updated: 30/Nov/23  Resolved: 30/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12871 enable changelog garbage collection b... Resolved
is related to LU-14688 Changelog cancel improvement Resolved
is related to LU-13055 add ability for named Changelog consu... Closed
is related to LU-14626 Idle Changelog user not deregistered Resolved
is related to LU-17290 Don't deregister idle changelog consu... Open
is related to LU-13772 mdt: changelog_deregister takes too long Resolved
is related to LU-15524 initiate changelog GC by lack of free... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The changelog garbage collection enabled by LU-12871 is too lazy. It will only purge an idle changelog user and its records if the changelog itself is nearly full:

        if (likely(mdd->mdd_changelog_gc &&
                     mdd->mdd_cl.mc_gc_task == MDD_CHLG_GC_NONE &&
                     ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time >
                        mdd->mdd_changelog_min_gc_interval)) {
                if (unlikely(llog_cat_free_space(ctxt->loc_handle) <=
                             mdd->mdd_changelog_min_free_cat_entries ||
                             OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD))) {
                        CWARN("%s:%s low on changelog_catalog free entries, "
                              "starting ChangeLog garbage collection thread\n",
                              obd->obd_name,
                              OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD) ?
                                " simulate" : "");

The default mdd_changelog_min_free_cat_entries=2 and mdd_changelog_min_gc_interval=3600 so it will only check every hour if the changelog is within 2x65000 = 130000 entries of overflowing (out of ~4B entries), even if the changelog has been idle for weeks (with reduced settings, just to verify it is not evicted):

# lctl get_param mdd.*.changelog*                       |
mdd.myth-MDT0000.changelog_deniednext=60                                        |
mdd.myth-MDT0000.changelog_gc=1                                                 
mdd.myth-MDT0000.changelog_max_idle_indexes=20800000                            
mdd.myth-MDT0000.changelog_max_idle_time=2500000                                
mdd.myth-MDT0000.changelog_min_free_cat_entries=2                               
mdd.myth-MDT0000.changelog_min_gc_interval=3600                                 
mdd.myth-MDT0000.changelog_size=3857464008                                      
mdd.myth-MDT0000.changelog_mask=                                                
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SAT
TR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC                                      
mdd.myth-MDT0000.changelog_users=
current index: 98130425
ID    index (idle seconds)
cl3   77315666 (2512315)

It would be better to evict idle changelog users after a week or two, which is plenty of time to get a broken consumer working again, even if the log isn't totally full.



 Comments   
Comment by Andreas Dilger [ 27/May/21 ]

I think rather than having a hard age/record cutoff (both of which may go wrong on certain cases) that we instead check "(records x hours > (1<<32))" or similar (in addition to the current "out of space" check). That will handle cases with very old users on systems with few records (30 days x 6M records), or less idle (7 days x 25M records).

Comment by Andreas Dilger [ 27/May/21 ]

For users that cannot be evicted (eg. strict audit users) it would be desirable to allow a flag on a per-user basis (default off) that prevents that user from being garbage collected.

Comment by Mikhail Pershin [ 09/Sep/21 ]

I am not sure about the following thing - consider there is user supposed to stay long and doing changelog purging by itself, so changelog is not too big. Nowadays  it will stay forever doing its job. With new approach it will be dropped after changelog_max_idle_time being considered as 'idle'. I don't think this is expected behavior on our side. You said we could prevent users from aggressive GC when needed, but that would mean we have to consider all old users as such just because we don't know their status. That means such users must be deregistered manually after all.

As for flag itself, for the same compatibility reason it should be flag which allows agressive CG for user, the question is what behavior is default while user registering, e.g. --aggressive_gc or --disable_gc option to be introduced

 

Comment by Andreas Dilger [ 09/Sep/21 ]

The Changelog is not meant to be a lifelong history of all events that have happened to the filesystem, but rather an improvement over "inotify" and similar "watch for current/recent events in the filesystem" that only work if some process is actively watching the filesystem every second. Changelog persistence is mainly to avoid short-term problems if the consumer crashes, or the MDS crashes, but should not be expected to work after weeks or months of idle time. At that point, it is faster to just scan the whole MDT again instead of processing a billion Changelog records, and every Changelog consumer would have to be able to do full filesystem scans to start off anyway.

We've had lots of issues with users registering changelog users, and then forgetting them, registering them multiple times and leaving the old ones idle, etc. This results in either the MDT becoming full, or the Changelog becoming full, and preventing all MDT operations from succeeding. The duplicate user registration will be helped by patch https://review.whamcloud.com/43380 "LU-13055 mdd: per-user changelog names and mask", but that isn't the only reason why idle Changelog users exist, and it will be a while before named users are created by all applications.

I think there are few important changes needed. The current GC doesn't appear to work properly at all (current stats on 2.14.0 server):

# # lctl get_param mdd.*.changelog*
mdd.myth-MDT0000.changelog_deniednext=60
mdd.myth-MDT0000.changelog_gc=1
mdd.myth-MDT0000.changelog_max_idle_indexes=20971520
mdd.myth-MDT0000.changelog_max_idle_time=2592000
mdd.myth-MDT0000.changelog_min_free_cat_entries=2
mdd.myth-MDT0000.changelog_min_gc_interval=30
mdd.myth-MDT0000.changelog_size=3884936000
mdd.myth-MDT0000.changelog_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC 
mdd.myth-MDT0000.changelog_users=
current index: 98287389
ID    index (idle seconds)
cl3   77315666 (12082313)
# echo $((98287389 - 77315666))
20971723

This has been exceeding mdd.*.changelog_max_idle_time=2592000 for months without the user being removed. I also just reduced mdd.*.changelog_max_idle_indexes=20971520 and mdd.*.changelog_min_gc_interval=30 and created enough records to exceed this limit without any GC being triggered. The logic should definitely be changed that if either changelog_max_idle_indexes or changelog_max_idle_time is exceeded then the user should be deregistered, since the default values for these are very conservative (2B indexes, 30 days idle), and should be honored if they are explicitly set to lower values.

Secondly, there is the question if we need better default values for the GC limits? We might also consider the case of setting the default:

         mdd->mdd_changelog_max_idle_indexes = min(CHLOG_MAX_IDLE_INDEXES, total_inodes);

since it doesn't make sense to keep 2B changelog records if the MDT only has 100M inodes, and this bounds the upper changelog size by the MDT size (avg. 180 bytes/record, so within the default free bytes per inode). Above that, it would be faster to just reprocess every file in the filesystem. This, combined with the previous change may be enough to avoid 99% of the current problems.

In DDN-2174, I suggested "(current_index - idle_index) * idle_seconds / 86400 > 4B" as a reasonable heuristic for deciding if a user is "too idle". That balances cases where there are lots records being created quickly and need to be consumed (e.g. 4B in 2 days ~= 50000/sec continuously) against a user that is idle for a very long time (e.g. 4B in 30 days ~= 1700/sec), but is possibly too complex for users to understand clearly.

I think the default has to be that Changelog users will be deregistered if they are idle. That is already supposed to be the case since patch https://review.whamcloud.com/36467 "LU-12871 mdd: enable Changelog garbage collection" was landed in 2.14 (was supposed to have been the default back in 2.11). In the vast majority of cases, the weeks-idle Changelog user isn't important to anyone (or there would have been a consumer), and having the full Changelog/MDT cause the filesystem to become unusable (see many linked bugs) doesn't make anyone happy.

If there is a need for it, a later patch can add an option --gc-disable for a specific user if this really is a critical consumer. I could imagine that audit records on some very restricted systems may prefer to keep very old Changelog records over having a working filesystem, but really they should just make the audit consumer run properly. It might separately be useful to have an option like --gc-clear or similar that continually clears the oldest records (e.g. to the start of the next Changelog file each time) for the oldest user(s) instead of deregistering the user(s), but I'm not sure there is a specific need for that yet.

Comment by Mikhail Pershin [ 09/Sep/21 ]

the key question is how to determine that user is idle, that was what I meant. Can there be changelog user that is active for a long time? E.g. Robingood HSM engine, if it will work with the same registered user more than our max_idle_time then it will be deregistered, because we don't know if user is idle or is still needed. Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

Comment by John Hammond [ 14/Sep/21 ]

> Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age

cur_time is also set on changelog_clear. And an active changelog user should be calling changelog_clear periodically. So I think it's OK to consider this an idle metric.

Comment by Gerrit Updater [ 27/Sep/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45068
Subject: LU-14699 mdd: proactive changelog garbage collection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 892a30f5ae6fc580afc20f3d6fdd5edfcf9e6fc0

Comment by Gerrit Updater [ 30/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45068/
Subject: LU-14699 mdd: proactive changelog garbage collection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f60b307c5001e1d9035af61d2344af33d3ea0f85

Comment by Peter Jones [ 30/Nov/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 19/Jan/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46203
Subject: LU-14699 mdd: proactive changelog garbage collection
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ec383a43b5d236c538bb24e8c22b95e577c74401

Comment by Etienne Aujames [ 19/Jan/22 ]

I have backported the https://review.whamcloud.com/46203 for testing purpose.

Comment by Nathan Rutman [ 09/Nov/23 ]

I realize this ticket is closed, but this seems an appropriate place to ask:

Deregistration of an idle consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. Wouldn't it make more sense to mark this consumer internally as "idle" and simply ignore it during the lowest-unconsumed-record check? Then if consumer does come back to life, it still has access the (remaining) changelog records (and we remove the "idle" flag). Less impact on users for an intermittent consumer. Idle consumers can be reported in changelog_users. And @Mikhail_Pershin's concern about evicting an apparently idle consumer on an idle system.

Generated at Sat Feb 10 03:12:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.