[LU-14958] configurable hash table size for jbd2 Created: 23/Aug/21 Updated: 15/Sep/23 Resolved: 29/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Alex Zhuravlev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
The revoke hash table can become enormous with a multi-GB journal: this may result in millions of revoke records being loaded and inserted into a fixed-size hash table during journal replay. A simple benchmark of that code, with 8192 buckets in the hash table: |
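The effect described above can be reproduced in user space with a minimal chained hash table of the same shape as the jbd2 revoke table. This is an illustrative sketch only: the struct names, the hash function, and `run_demo()` are assumptions for the demo, not the kernel implementation. With a fixed 8192 buckets and millions of records, every lookup has to walk a chain hundreds of entries long.

```c
#include <stdlib.h>

/* Minimal chained hash table mimicking the shape of the jbd2 revoke
 * table: a fixed bucket array with singly linked chains.  All names
 * and parameters here are illustrative, not the kernel code. */
struct rec {
	unsigned long long blocknr;
	struct rec *next;
};

struct revoke_table {
	struct rec **buckets;
	size_t nbuckets;
};

static size_t hash(const struct revoke_table *t, unsigned long long blocknr)
{
	/* Knuth multiplicative hash; jbd2 uses a similar mix/shift. */
	return (size_t)((blocknr * 2654435761ULL) % t->nbuckets);
}

static void insert(struct revoke_table *t, unsigned long long blocknr)
{
	struct rec *r = malloc(sizeof(*r));
	size_t h = hash(t, blocknr);

	r->blocknr = blocknr;
	r->next = t->buckets[h];
	t->buckets[h] = r;
}

static int lookup(const struct revoke_table *t, unsigned long long blocknr)
{
	struct rec *r;

	for (r = t->buckets[hash(t, blocknr)]; r; r = r->next)
		if (r->blocknr == blocknr)
			return 1;
	return 0;
}

/* Insert nrecords block numbers, look them all up, and return the
 * average chain length a lookup has to walk.  (Memory is not freed;
 * this is a throwaway benchmark sketch.) */
static double run_demo(size_t nrecords, size_t nbuckets)
{
	struct revoke_table t = {
		calloc(nbuckets, sizeof(struct rec *)),
		nbuckets,
	};
	size_t i;

	for (i = 0; i < nrecords; i++)
		insert(&t, i * 7ULL);	/* arbitrary block numbers */
	for (i = 0; i < nrecords; i++)
		lookup(&t, i * 7ULL);
	return (double)nrecords / nbuckets;
}
```

With 8192 buckets the average chain length grows linearly with the record count, so a replay touching tens of millions of revoke records degrades to long linear scans per lookup; sizing the table closer to the record count keeps chains near length one.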
| Comments |
| Comment by Alex Zhuravlev [ 26/Aug/21 ] |
|
same benchmark with rhashtable: 1048576 in 1854304 (2 s) |
| Comment by Alex Zhuravlev [ 07/Sep/21 ] |
|
pjones, I'm waiting for feedback from linux-ext4@ and then I plan to make patches for master |
| Comment by Gerrit Updater [ 04/Oct/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45122 |
| Comment by Alex Zhuravlev [ 05/Oct/21 ] |
|
I did some measurements and now tend to think the current approach should be changed a bit. I ran sanity-benchmark, which took 1724 seconds; all lookups in the revoke table were counted at runtime (in contrast with journal replay): |
| Comment by Etienne Aujames [ 10/Nov/21 ] |
|
Hello, the CEA hit this issue on a ClusterStore version "2.12.4.2_cray" with a 4G journal: Robinhood stopped dequeuing changelogs because of
To get the filesystem back into production quickly (accepting the current I/O losses):
There were >64M revoke entries inside the jbd2 journal. |
| Comment by Etienne Aujames [ 12/Nov/21 ] |
|
I successfully reproduced the issue on a 2.12.7 LTS with changelog_deregister and a 1228M journal: I executed changelog_deregister with 17794/64768 entries in changelog_catalog (27%, 17794 llog plain files). |
| Comment by Alex Zhuravlev [ 12/Nov/21 ] |
|
many thanks for the report, I'm going to try this locally. |
| Comment by Etienne Aujames [ 15/Nov/21 ] |
|
I retested the case with a 2.12.7 + I executed changelog_deregister with 17794/64768 entries in changelog_catalog (27%, 17794 llog plain files). So the |
| Comment by Alex Zhuravlev [ 15/Nov/21 ] |
|
In case you have the time/inclination, could you please try to collect more data? with NO |
| Comment by Etienne Aujames [ 16/Nov/21 ] |
|
Hello,

1858 transactions (1850 requested), each up to 262144 blocks
average:
  0ms waiting for transaction
  0ms request delay
  4988ms running transaction
  8ms transaction was being locked
  0ms flushing data (in ordered mode)
  32ms logging transaction
  43059us average transaction commit time
  392638 handles per transaction
  49 blocks per transaction
  50 logged blocks per transaction

Before the crash there were 30489150 revoke records in the on-disk journal (counted with "logdump -a", see jbd2_debugfs.gz).

I added a "perf probe" on insert_revoke_hash() and recorded some events during the deregister:

100.00% 100.00% (ffffffffc0d3ea10)
    ---__GI___ioctl
       system_call_fastpath+37
       sys_ioctl+161
       do_vfs_ioctl+928
       obd_class_ioctl+210
       class_handle_ioctl+6349
       mdt_iocontrol+1516
       mdd_iocontrol+966
       mdd_changelog_clear+1619
       mdd_changelog_llog_cancel+208
       llog_changelog_cancel+84
       llog_cat_process+46
       llog_cat_process_or_fork+481
       llog_process_or_fork+188
       llog_process_thread+2143
       llog_cat_process_cb+151
       llog_handle_put+72
       llog_osd_close+78
       lu_object_put+197
       lu_object_free+104
       osd_object_delete+468
       iput+252
       evict+180
       ldiskfs_evict_inode+1418
       ldiskfs_truncate+784
       |
       |--60.87%--ldiskfs_ind_truncate+689
       |          |
       |           --60.83%--ldiskfs_free_branches
       |                     |
       |                      --60.75%--ldiskfs_free_branches+247
       |                                ldiskfs_free_branches+653
       |                                |
       |                                |--36.95%--ldiskfs_free_data+323
       |                                |          ldiskfs_clear_blocks+401
       |                                |          ldiskfs_free_blocks+411
       |                                |          __ldiskfs_forget+290
       |                                |          insert_revoke_hash+1
       |                                |
       |                                 --23.81%--ldiskfs_free_data+275
       |                                           ldiskfs_clear_blocks+401
       |                                           ldiskfs_free_blocks+411
       |                                           __ldiskfs_forget+290
       |                                           insert_revoke_hash+1
       |
        --38.68%--ldiskfs_ind_truncate+785
                  ldiskfs_free_branches+247
                  ldiskfs_free_branches+653
                  ldiskfs_free_data+323
                  ldiskfs_clear_blocks+401
                  ldiskfs_free_blocks+411
                  __ldiskfs_forget+290
                  insert_revoke_hash+1
Something I noticed is that the number of on-disk revoke records is stable after the deregister (even if we do some operations on the MDT). It looks like some kind of leak to me, but I don't understand when these records are supposed to be cancelled. |
| Comment by Etienne Aujames [ 16/Nov/21 ] |
|
Concerning the difference between the tests with and without the |
| Comment by Etienne Aujames [ 16/Nov/21 ] |
|
Here is the "dumpe2fs -h" output after recovery: dumpe2fs.txt |
| Comment by Alex Zhuravlev [ 16/Nov/21 ] |
thanks a lot! and, if possible, the same with I think this confirms the theory that basically "deregistiring" is CPU bound and produces a lot of tiny transaction which aren't checkpointed (I guess there is no need to do so - no memory pressure as number of modified blocks is tiny). given they aren't checkpointed by the crash, JBD has to replay them all and need to skip revoked blocks so fill/lookup big revoke table. the important question here is whether we still need to fix JBD.. I tend to think so as there are another use cases when external processes (like llsom_sync) may consume and clear changelog at high rate and this would result in tiny transactions as before |
| Comment by Etienne Aujames [ 16/Nov/21 ] |
|
Here the "/proc/fs/jbd2/info" with the

131 transactions (127 requested), each up to 262144 blocks
average:
  0ms waiting for transaction
  0ms request delay
  4770ms running transaction
  13ms transaction was being locked
  0ms flushing data (in ordered mode)
  60ms logging transaction
  91593us average transaction commit time
  826 handles per transaction
  372 blocks per transaction
  373 logged blocks per transaction |
| Comment by Gerrit Updater [ 08/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45122/ |
| Comment by Gerrit Updater [ 24/Apr/23 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50730 |
| Comment by Gerrit Updater [ 29/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50730/ |
| Comment by Peter Jones [ 29/Apr/23 ] |
|
Seems to be merged for 2.15.3 and 2.16 |
| Comment by Andreas Dilger [ 15/Sep/23 ] |
|
It looks like this patch solved the problem with the kernel journal revoke record handling, but there is still a similar problem in the e2fsprogs handling of journal replay. We hit an issue with slow journal recovery while deleting a large number of changelogs, and "tune2fs" and "e2fsck" were hung in journal recovery for hours before being interrupted. The problem was eventually fixed by updating the server to include this fix in ldiskfs and then mounting the filesystem so that journal recovery was done in the kernel. |