[LU-13772] mdt: changelog_deregister takes too long Created: 09/Jul/20  Updated: 09/Sep/21  Resolved: 09/Sep/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Quentin Bouget Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-14699 changelog garbage collection is too lax Resolved
is related to LU-14688 Changelog cancel improvement Resolved
Epic/Theme: changelog
Rank (Obsolete): 9223372036854775807

 Description   

We recently had the case of an MDT whose changelog records were not being processed and cleared as they should have been. We quickly reached a point where the whole catalog was full, and we had little choice but to deregister the changelog reader to resume production.

We used lctl --device lustre-MDT0000 changelog_deregister cl1 for that, and it took 3 days to complete. Considering we only had a single changelog reader registered, and our goal was to simply garbage collect every changelog record, it feels wasteful that we should wait 3 days for something that essentially deletes a few files on the MDT.

Would it be possible to speed up this process?

It would be nice if this worked by special-casing lctl changelog_deregister when only one reader is registered, but I think a new command (e.g. lctl changelog_delete_everything, lctl changelog_reset, ...) would also be satisfactory.



 Comments   
Comment by Quentin Bouget [ 09/Jul/20 ]

This is important to us, because we use robinhood, and during those 3 days, robinhood essentially works in the dark.

We could launch periodic scans on our filesystem, but scans take a while, and they can still miss things.

Comment by Peter Jones [ 10/Jul/20 ]

Quentin

Is this something that you plan to work on or just a suggestion for someone else to work on?

Peter

Comment by Quentin Bouget [ 15/Jul/20 ]

Hi Peter,

Just a suggestion. Although it may become a request one day.
We will sort this out with our support team when/if the time comes.

Comment by John Hammond [ 12/May/21 ]

tappro, could you look at this after LU-13055 and in LU-13338?

Comment by Andreas Dilger [ 14/May/21 ]

It makes sense to have a fast-path for deleting changelog records. If the last index in the changelog is less than the lowest user, then all of the records should be deleted, and the whole file can be removed immediately. That would speed up changelog removal by 50000x or so (a few writes to delete the file, instead of 64000 writes to cancel every record).
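
To make the fast-path idea concrete, here is a rough Python model of it. This is only a sketch under stated assumptions, not the Lustre implementation: PlainLlog, purge and writes_per_file_delete are hypothetical names, the cost of removing a file is assumed to be 2 writes (standing in for "a few"), and deregistering the last reader is modelled as lowest_user_index=None. The only thing taken from the comment above is the rule itself: when the last index in a file is below the lowest user, drop the whole file instead of cancelling each record.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PlainLlog:
    """Hypothetical stand-in for one plain llog file of changelog records."""
    first_index: int
    last_index: int


def purge(llogs: List[PlainLlog],
          lowest_user_index: Optional[int],
          fast_path: bool = True,
          writes_per_file_delete: int = 2) -> Tuple[List[PlainLlog], int]:
    """Return (remaining llogs, modelled number of writes).

    lowest_user_index=None models "no reader left" (assumption).
    writes_per_file_delete is an assumed cost, not a measured one.
    """
    remaining: List[PlainLlog] = []
    writes = 0
    for llog in llogs:
        # Condition from the comment: last index in the file < lowest user.
        fully_consumed = (lowest_user_index is None
                          or llog.last_index < lowest_user_index)
        if fully_consumed and fast_path:
            # Fast path: remove the whole file at once.
            writes += writes_per_file_delete
            continue
        # Slow path: cancel records one by one, one write per record.
        if fully_consumed:
            cancel_upto = llog.last_index
        else:
            cancel_upto = min(llog.last_index, lowest_user_index - 1)
        writes += max(0, cancel_upto - llog.first_index + 1)
        if cancel_upto < llog.last_index:
            # Some records are still needed by a reader; keep the tail.
            new_first = max(llog.first_index, cancel_upto + 1)
            remaining.append(PlainLlog(new_first, llog.last_index))
    return remaining, writes


if __name__ == "__main__":
    llogs = [PlainLlog(1, 64000), PlainLlog(64001, 128000),
             PlainLlog(128001, 130000)]
    print(purge(llogs, 128500))                   # fast path enabled
    print(purge(llogs, 128500, fast_path=False))  # per-record cancel only
```

With the three example files and a lowest user index of 128500, the model counts 503 writes with the fast path and 128499 without it; the two fully consumed files account for the whole difference. The model ignores any extra writes needed to remove a file after its last record has been cancelled.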

Comment by Andreas Dilger [ 21/May/21 ]

It looks like this will coincidentally be fixed by patch https://review.whamcloud.com/43719 "LU-14688 mdt: changelog purge deletes plain llog" that was just pushed a few days ago.

Comment by Andreas Dilger [ 09/Sep/21 ]

The patch https://review.whamcloud.com/43719 "LU-14688 mdt: changelog purge deletes plain llog" has been landed for 2.14.52.

Generated at Sat Feb 10 03:04:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.