Description
The jbd2 journal file/device may contain millions of revoke records when the journal is large (up to 4GB on Lustre MDTs) and hundreds of GB of files are deleted at once (freeing, journaling, and revoking millions of metadata blocks), and the server then crashes before the transactions holding the revoke records can be checkpointed.
The patch https://review.whamcloud.com/45122 "LU-14958 kernel: use rhashtable for revoke records in jbd2" changed the Lustre-patched kernel to use a dynamically sized hash table (not yet in upstream ext4), but the copy of the kernel code in e2fsck/revoke.c still uses a fixed-size hash table of 1024 entries (previously 256 in the kernel).
At a minimum, the e2fsck/revoke.c code should select a hash table size based on the number of blocks in the journal, so that the hash buckets do not overflow when the journal contains many revoke records. Using 1024 was reasonable when the journal itself was only a few thousand blocks (8-32MB = 2048-8192 blocks at 4KB), but with 4GB MDT journals (1M blocks) the table should be much larger. We don't need millions of buckets, but at least 100k for these large journals would be reasonable (a maximum of ~10 entries per bucket instead of ~1000 for 1M revoke records). Something like max(journal_blocks / 10, 1024) looks reasonable.
Each hash bucket is a struct list_head (two pointers), so at 16 bytes per bucket a 100k-entry hash table costs 1.6MB. The size of the hash table items is irrelevant, since they must be allocated regardless of whether they sit in a single list or in 100k lists. Compared to the 4GB journal (which must also be kept in RAM), this memory is minimal. The hash table memory usage could be cut in half by using singly-linked struct hlist_head instead of doubly-linked struct list_head for the bucket heads. That would require updating lib/ext2fs/kernel-list.h from a newer kernel include/linux/list.h to get struct hlist_head and its associated list-handling macros, but I don't think that is critical.
Ideally, e2fsck could also use a dynamic rhashtable implementation, but that is considerably more work, and the added complexity isn't clearly a benefit versus the relative effort of just making the hash table larger. This allocation is only needed for a short time and is freed after journal replay, unlike the kernel, where this memory is also used at runtime to track journal revoke records during operation.
Issue Links
- is related to: LU-14958 configurable hash table size for jbd2 (Resolved)