[LU-7340] ChangeLogs catalog full condition should be handled more gracefully Created: 26/Oct/15  Updated: 14/May/21  Resolved: 17/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.5.4
Fix Version/s: Lustre 2.11.0, Lustre 2.10.3

Type: Improvement Priority: Critical
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-1586 no free catalog slots for log Resolved
Related
is related to LU-9055 MDS crash due to changelog being full Open
is related to LU-8856 ZFS-MDT 100% full. Cannot delete files. Resolved
is related to LU-10527 LustreError: 7830:0:(llog_cat.c:313:l... Resolved
is related to LU-10680 MDT becoming unresponsive in 2.10.3 Resolved
is related to LU-12871 enable changelog garbage collection b... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Presently, when an LLOG Catalog wraps and its latest assigned index collides with its oldest, still-in-use index, ENOSPC is returned and the caller simply ignores the fact that the LLOG record could not be written.
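
To make the failure mode concrete, here is a minimal sketch of a wrapping catalog. It is a toy model only, not Lustre's llog code: the class name ToyCatalog, its methods, and the slot layout are all assumptions made for illustration.

```python
import errno


class ToyCatalog:
    """Toy fixed-size ring of slots (hypothetical, not the llog code)."""

    def __init__(self, nslots):
        self.slots = [None] * nslots  # None marks a free slot
        self.last = -1                # most recently assigned index

    def add(self, record):
        """Store a record in the next slot, or return -ENOSPC."""
        nxt = (self.last + 1) % len(self.slots)
        if self.slots[nxt] is not None:
            # The index wrapped onto a slot that is still in use.
            return -errno.ENOSPC
        self.slots[nxt] = record
        self.last = nxt
        return nxt

    def cancel(self, index):
        """Free a slot so that a later add() can reuse it."""
        self.slots[index] = None


cat = ToyCatalog(3)
# Three adds fill the catalog; the fourth wraps onto slot 0 and fails.
results = [cat.add(f"rec{i}") for i in range(4)]
# Cancelling the oldest slot makes room, so the retry succeeds.
cat.cancel(0)
retry = cat.add("rec3")
```

In this model a caller that discards the return value of add() silently loses the fourth record, which is the behavior this ticket asks to improve.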

For ChangeLogs-specific usage, some actions could be attempted to recover space/records. Some ideas have already been detailed in LU-6556, but it seems better to address them in this separate ticket.

Input from Andreas :
I think the other thing that is needed here is to automatically unregister ChangeLog watcher(s) if the changelog is full or the MDS runs out of space (by default), or block all MDS operations until the ChangeLog can be written (if /proc tunable is set to make ChangeLog updates mandatory). It should unregister starting with the oldest watcher on the assumption that the older watcher was forgotten and newer ones are still running, and that this will release the most space. The unregistration should cancel records up to the next watcher, or all remaining records if no other watchers are left.
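
The unregistration policy described above can be sketched as follows. This is an illustration under stated assumptions, not the landed patch: "oldest" is taken here to mean the watcher whose bookmark is furthest behind, and the function name, watcher ids, and dict layout are hypothetical.

```python
def unregister_oldest(watchers, last_record):
    """Drop the watcher that is furthest behind.

    watchers: dict mapping a (hypothetical) watcher id to the index of
              the last record that watcher has cleared.
    last_record: index of the newest record in the changelog.

    Returns (remaining_watchers, purge_up_to), where purge_up_to is the
    highest record index that may now be cancelled.
    """
    if not watchers:
        return {}, last_record
    oldest = min(watchers, key=watchers.get)
    remaining = {w: idx for w, idx in watchers.items() if w != oldest}
    # Cancel up to the next watcher's bookmark, or everything that is
    # left when no other watcher remains.
    purge_up_to = min(remaining.values()) if remaining else last_record
    return remaining, purge_up_to


remaining, purge = unregister_oldest({"cl1": 10, "cl2": 500, "cl3": 480}, 1000)
alone, purge_all = unregister_oldest({"cl1": 10}, 1000)
```

With three watchers, dropping cl1 allows records up to index 480 (the bookmark of cl3) to be cancelled. With a single watcher, all remaining records can go.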

Input from Robert :
I suggest going a step further and proactively removing stale watchers after a configurable period or when hitting a max watermark, to try to avoid running out of space. Also, being unregistered is a reasonable notification to the application that it has lost its changelog feed and needs to resync.
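
A proactive check along these lines might look like the sketch below. All names (stale_watchers, max_idle, high_watermark) and the exact selection rule are assumptions for illustration, and they do not describe the tunables of any actual Lustre release.

```python
def stale_watchers(watchers, now, max_idle, used_slots, total_slots,
                   high_watermark):
    """Return the ids of watchers that should be unregistered.

    watchers: dict mapping a (hypothetical) watcher id to the time, in
              seconds, of its last activity.
    max_idle: idle period, in seconds, after which a watcher is stale.
    high_watermark: fraction of catalog slots in use (0.0 to 1.0) that
              triggers eviction even if no watcher is idle for too long.
    """
    # Time-based rule: every watcher idle for longer than max_idle,
    # least recently active first.
    idle = sorted((w for w, t in watchers.items() if now - t > max_idle),
                  key=watchers.get)
    if idle:
        return idle
    # Watermark rule: under space pressure, evict the least recently
    # active watcher.
    if watchers and used_slots / total_slots >= high_watermark:
        return [min(watchers, key=watchers.get)]
    return []


watchers = {"cl1": 0, "cl2": 900}
by_time = stale_watchers(watchers, 1000, 600, 10, 100, 0.8)
by_watermark = stale_watchers(watchers, 1000, 2000, 90, 100, 0.8)
nothing = stale_watchers(watchers, 1000, 2000, 10, 100, 0.8)
```

Here cl1 is selected by the time rule in the first call and by the watermark rule in the second, while the third call finds nothing to remove.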



 Comments   
Comment by Bruno Faccini (Inactive) [ 23/Mar/17 ]

Sorry to be late on this, but I should be able to provide a fix/solution soon.

Comment by Gerrit Updater [ 12/May/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/27103
Subject: LU-7340 mdd: changelogs garbage collection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b7d9cc253398f8f1a07038c351a2ce1a8969d834

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27103/
Subject: LU-7340 mdd: changelogs garbage collection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3442db6faf685fbdbd092bdfdc8d273e4990a141

Comment by Peter Jones [ 17/Dec/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 18/Dec/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30573
Subject: LU-7340 mdd: changelogs garbage collection
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 12c8163c17149d2afed78e2ba84da624dd920b34

Comment by Gerrit Updater [ 20/Dec/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30573/
Subject: LU-7340 mdd: changelogs garbage collection
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: b87511a0578a03447c51a8495966d60c90fcee61

Generated at Sat Feb 10 02:08:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.