[LU-14699] changelog garbage collection is too lax Created: 22/May/21 Updated: 30/Nov/23 Resolved: 30/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||||||||||
| Description |
|
The changelog garbage collection enabled by
if (likely(mdd->mdd_changelog_gc &&
mdd->mdd_cl.mc_gc_task == MDD_CHLG_GC_NONE &&
ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time >
mdd->mdd_changelog_min_gc_interval)) {
if (unlikely(llog_cat_free_space(ctxt->loc_handle) <=
mdd->mdd_changelog_min_free_cat_entries ||
OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD))) {
CWARN("%s:%s low on changelog_catalog free entries, "
"starting ChangeLog garbage collection thread\n",
obd->obd_name,
OBD_FAIL_CHECK(OBD_FAIL_FORCE_GC_THREAD) ?
" simulate" : "");
The default mdd_changelog_min_free_cat_entries=2 and mdd_changelog_min_gc_interval=3600 so it will only check every hour if the changelog is within 2x65000 = 130000 entries of overflowing (out of ~4B entries), even if the changelog has been idle for weeks (with reduced settings, just to verify it is not evicted): # lctl get_param mdd.*.changelog* | mdd.myth-MDT0000.changelog_deniednext=60 | mdd.myth-MDT0000.changelog_gc=1 mdd.myth-MDT0000.changelog_max_idle_indexes=20800000 mdd.myth-MDT0000.changelog_max_idle_time=2500000 mdd.myth-MDT0000.changelog_min_free_cat_entries=2 mdd.myth-MDT0000.changelog_min_gc_interval=3600 mdd.myth-MDT0000.changelog_size=3857464008 mdd.myth-MDT0000.changelog_mask= MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SAT TR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC mdd.myth-MDT0000.changelog_users= current index: 98130425 ID index (idle seconds) cl3 77315666 (2512315) It would be better to evict idle changelog users after a week or two, which is plenty of time to get a broken consumer working again, even if the log isn't totally full. |
| Comments |
| Comment by Andreas Dilger [ 27/May/21 ] |
|
I think rather than having a hard age/record cutoff (both of which may go wrong on certain cases) that we instead check "(records x hours > (1<<32))" or similar (in addition to the current "out of space" check). That will handle cases with very old users on systems with few records (30 days x 6M records), or less idle (7 days x 25M records). |
| Comment by Andreas Dilger [ 27/May/21 ] |
|
For users that cannot be evicted (eg. strict audit users) it would be desirable to allow a flag on a per-user basis (default off) that prevents that user from being garbage collected. |
| Comment by Mikhail Pershin [ 09/Sep/21 ] |
|
I am not sure about the following thing - consider there is user supposed to stay long and doing changelog purging by itself, so changelog is not too big. Nowadays it will stay forever doing its job. With new approach it will be dropped after changelog_max_idle_time being considered as 'idle'. I don't think this is expected behavior on our side. You said we could prevent users from aggressive GC when needed, but that would mean we have to consider all old users as such just because we don't know their status. That means such users must be deregistered manually after all. As for flag itself, for the same compatibility reason it should be flag which allows agressive CG for user, the question is what behavior is default while user registering, e.g. --aggressive_gc or --disable_gc option to be introduced
|
| Comment by Andreas Dilger [ 09/Sep/21 ] |
|
The Changelog is not meant to be a lifelong history of all events that have happened to the filesystem, but rather an improvement over "inotify" and similar "watch for current/recent events in the filesystem" that only work if some process is actively watching the filesystem every second. Changelog persistence is mainly to avoid short-term problems if the consumer crashes, or the MDS crashes, but should not be expected to work after weeks or months of idle time. At that point, it is faster to just scan the whole MDT again instead of processing a billion Changelog records, and every Changelog consumer would have to be able to do full filesystem scans to start off anyway. We've had lots of issues with users registering changelog users, and then forgetting them, registering them multiple times and leaving the old ones idle, etc. This results in either the MDT becoming full, or the Changelog becoming full, and preventing all MDT operations from succeeding. The duplicate user registration will be helped by patch https://review.whamcloud.com/43380 " I think there are few important changes needed. The current GC doesn't appear to work properly at all (current stats on 2.14.0 server): # # lctl get_param mdd.*.changelog* mdd.myth-MDT0000.changelog_deniednext=60 mdd.myth-MDT0000.changelog_gc=1 mdd.myth-MDT0000.changelog_max_idle_indexes=20971520 mdd.myth-MDT0000.changelog_max_idle_time=2592000 mdd.myth-MDT0000.changelog_min_free_cat_entries=2 mdd.myth-MDT0000.changelog_min_gc_interval=30 mdd.myth-MDT0000.changelog_size=3884936000 mdd.myth-MDT0000.changelog_mask= MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC mdd.myth-MDT0000.changelog_users= current index: 98287389 ID index (idle seconds) cl3 77315666 (12082313) # echo $((98287389 - 77315666)) 20971723 This has been exceeding mdd.*.changelog_max_idle_time=2592000 for months without the user being removed. I also just reduced mdd.*.changelog_max_idle_indexes=20971520 and mdd.*.changelog_min_gc_interval=30 and created enough records to exceed this limit without any GC being triggered. The logic should definitely be changed that if either changelog_max_idle_indexes or changelog_max_idle_time is exceeded then the user should be deregistered, since the default values for these are very conservative (2B indexes, 30 days idle), and should be honored if they are explicitly set to lower values. Secondly, there is the question if we need better default values for the GC limits? We might also consider the case of setting the default: mdd->mdd_changelog_max_idle_indexes = min(CHLOG_MAX_IDLE_INDEXES, total_inodes); since it doesn't make sense to keep 2B changelog records if the MDT only has 100M inodes, and this bounds the upper changelog size by the MDT size (avg. 180 bytes/record, so within the default free bytes per inode). Above that, it would be faster to just reprocess every file in the filesystem. This, combined with the previous change may be enough to avoid 99% of the current problems. In DDN-2174, I suggested "(current_index - idle_index) * idle_seconds / 86400 > 4B" as a reasonable heuristic for deciding if a user is "too idle". That balances cases where there are lots records being created quickly and need to be consumed (e.g. 4B in 2 days ~= 50000/sec continuously) against a user that is idle for a very long time (e.g. 4B in 30 days ~= 1700/sec), but is possibly too complex for users to understand clearly. I think the default has to be that Changelog users will be deregistered if they are idle. That is already supposed to be the case since patch https://review.whamcloud.com/36467 " If there is a need for it, a later patch can add an option --gc-disable for a specific user if this really is a critical consumer. I could imagine that audit records on some very restricted systems may prefer to keep very old Changelog records over having a working filesystem, but really they should just make the audit consumer run properly. It might separately be useful to have an option like --gc-clear or similar that continually clears the oldest records (e.g. to the start of the next Changelog file each time) for the oldest user(s) instead of deregistering the user(s), but I'm not sure there is a specific need for that yet. |
| Comment by Mikhail Pershin [ 09/Sep/21 ] |
|
the key question is how to determine that user is idle, that was what I meant. Can there be changelog user that is active for a long time? E.g. Robingood HSM engine, if it will work with the same registered user more than our max_idle_time then it will be deregistered, because we don't know if user is idle or is still needed. Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age |
| Comment by John Hammond [ 14/Sep/21 ] |
|
> Changelog user cur_time is set upon registration and just compared with current time to decide if changelog is idle or not. So basically it is not even idle metric but age cur_time is also set on changelog_clear. And an active changelog user should be calling changelog_clear periodically. So I think it's OK to consider this an idle metric. |
| Comment by Gerrit Updater [ 27/Sep/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45068 |
| Comment by Gerrit Updater [ 30/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45068/ |
| Comment by Peter Jones [ 30/Nov/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 19/Jan/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46203 |
| Comment by Etienne Aujames [ 19/Jan/22 ] |
|
I have backported the https://review.whamcloud.com/46203 for testing purpose. |
| Comment by Nathan Rutman [ 09/Nov/23 ] |
|
I realize this ticket is closed, but this seems an appropriate place to ask: Deregistration of an idle consumer is a heavy penalty, requiring a user to re-register and restart their consumer process with a new ID. Wouldn't it make more sense to mark this consumer internally as "idle" and simply ignore it during the lowest-unconsumed-record check? Then if consumer does come back to life, it still has access the (remaining) changelog records (and we remove the "idle" flag). Less impact on users for an intermittent consumer. Idle consumers can be reported in changelog_users. And @Mikhail_Pershin's concern about evicting an apparently idle consumer on an idle system. |