[LU-13772] mdt: changelog_deregister takes too long Created: 09/Jul/20 Updated: 09/Sep/21 Resolved: 09/Sep/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Quentin Bouget | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Epic/Theme: | changelog |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We recently had an MDT whose changelog records were not being processed and cleared as they should have been. The whole catalog quickly filled up, and we had little choice but to deregister the changelog reader to resume production.

We used lctl --device lustre-MDT0000 changelog_deregister cl1 for that, and it took 3 days to complete.

Since only a single changelog reader was registered and our goal was simply to garbage-collect every changelog record, waiting 3 days for something that essentially deletes a few files on the MDT feels wasteful.

Would it be possible to speed up this process? Ideally, lctl changelog_deregister would special-case the situation where only one reader is registered, but a new command (e.g. lctl changelog_delete_everything, lctl changelog_reset, ...) would also be satisfactory. |
| Comments |
| Comment by Quentin Bouget [ 09/Jul/20 ] |
|
This is important to us, because we use robinhood, and during those 3 days, robinhood essentially works in the dark. We could launch periodic scans on our filesystem, but scans take a while, and they can still miss things. |
| Comment by Peter Jones [ 10/Jul/20 ] |
|
Quentin,

Is this something that you plan to work on, or just a suggestion for someone else to work on?

Peter |
| Comment by Quentin Bouget [ 15/Jul/20 ] |
|
Hi Peter, Just a suggestion. Although it may become a request one day. |
| Comment by John Hammond [ 12/May/21 ] |
|
tappro could you look at this after |
| Comment by Andreas Dilger [ 14/May/21 ] |
|
It makes sense to have a fast path for deleting changelog records. If the last index in a changelog file is lower than the lowest index still needed by any registered user, then all of its records should be deleted, and the whole file can be removed immediately. That would speed up changelog removal by 50000x or so (a few writes to delete the file, instead of 64000 writes to cancel every record). |
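The fast path described above can be illustrated with a small model. This is only a sketch under stated assumptions, not Lustre's llog code and not necessarily what any patch implements: PlainLog, lowest_needed_index and purge are hypothetical names, and each changelog file is reduced to the first and last record index it holds. It counts how many whole-file deletes versus per-record cancels a purge would need.

```python
from dataclasses import dataclass


@dataclass
class PlainLog:
    """Hypothetical stand-in for one on-disk changelog file."""
    first_idx: int  # index of the first record still in the file
    last_idx: int   # index of the last record in the file


def lowest_needed_index(user_cleared):
    """Return the lowest record index any registered reader still needs.

    user_cleared maps a reader id (e.g. "cl1") to the last index that
    reader has cleared. With no readers registered, nothing is needed
    and None is returned.
    """
    if not user_cleared:
        return None
    return min(user_cleared.values()) + 1


def purge(logs, user_cleared):
    """Return (remaining_logs, whole_file_deletes, per_record_cancels)."""
    needed = lowest_needed_index(user_cleared)
    remaining = []
    file_deletes = 0
    record_cancels = 0
    for log in logs:
        if needed is None or log.last_idx < needed:
            # Fast path: every record in this file is obsolete,
            # so the file is dropped in one step.
            file_deletes += 1
        elif log.first_idx < needed:
            # Slow path: only a prefix is obsolete, so those records
            # are cancelled one by one and the rest of the file stays.
            record_cancels += needed - log.first_idx
            remaining.append(PlainLog(needed, log.last_idx))
        else:
            # Nothing in this file can be discarded yet.
            remaining.append(log)
    return remaining, file_deletes, record_cancels


# Assumed layout: three files, the first two holding 64000 records each.
logs = [PlainLog(1, 64000), PlainLog(64001, 128000), PlainLog(128001, 130000)]

# The only reader has been deregistered: every file takes the fast path.
after_deregister = purge(logs, {})

# A reader that has cleared up to index 70000 still needs 70001 onwards.
with_reader = purge(logs, {"cl1": 70000})
```

In this model, deregistering the last reader costs one delete per file rather than one cancel per record, which is the saving the comment estimates. The per-file record count used here is taken from the 64000 figure quoted above and is otherwise an assumption.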
| Comment by Andreas Dilger [ 21/May/21 ] |
|
It looks like this will coincidentally be fixed by patch https://review.whamcloud.com/43719 " |
| Comment by Andreas Dilger [ 09/Sep/21 ] |
|
The patch https://review.whamcloud.com/43719 " |