[LU-14958] configurable hash table size for jbd2 Created: 23/Aug/21  Updated: 15/Sep/23  Resolved: 29/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Improvement Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File dumpe2fs.txt     File jbd2_debugfs.gz    
Issue Links:
Related
is related to LU-17117 debugfs should be able to disable MMP... Open

 Description   

The revoke hash table can be enormous with a multi-GB journal: journal replay may have to load and insert millions of revoke records into a dedicated hash table.
Currently the revoke hash table's size is hard-coded as 256 buckets, so every bucket may accumulate far too many records.

A simple benchmark of that code, with the hard-coded 256 buckets (number of revoke records vs. time to find/insert them all):
1048576 - 95 seconds
2097152 - 580 seconds
In the field it can be up to 30M records to find/insert.

With 8192 buckets in the hash table:
4194304 - 59 seconds
8388608 - 247 seconds
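
To illustrate why the bucket count dominates, here is a minimal user-space model of a fixed-size chained hash (a sketch only; the real structures live in fs/jbd2/revoke.c, and the hash function here is a simplified stand-in):

#include <stdlib.h>

#define HASH_SIZE 256	/* mirrors the hard-coded bucket count */

struct revoke_record {
	struct revoke_record *next;
	unsigned long long blocknr;
};

static struct revoke_record *table[HASH_SIZE];

static unsigned int hash(unsigned long long blocknr)
{
	return blocknr % HASH_SIZE;	/* simplified stand-in hash */
}

static struct revoke_record *find_record(unsigned long long blocknr)
{
	struct revoke_record *r;

	/* average chain length is nr_records / HASH_SIZE: with 30M
	 * records every lookup walks ~117k entries on average */
	for (r = table[hash(blocknr)]; r != NULL; r = r->next)
		if (r->blocknr == blocknr)
			return r;
	return NULL;
}

static int insert_record(unsigned long long blocknr)
{
	struct revoke_record *r;

	if (find_record(blocknr))	/* replay looks up before inserting */
		return 0;
	r = malloc(sizeof(*r));
	if (!r)
		return -1;
	r->blocknr = blocknr;
	r->next = table[hash(blocknr)];
	table[hash(blocknr)] = r;
	return 0;
}

Since every insert first walks its chain, total replay time grows roughly quadratically with the record count when the bucket count is fixed, consistent with the timings above.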



 Comments   
Comment by Alex Zhuravlev [ 26/Aug/21 ]

same benchmark with rhashtable:

1048576 in 1854304 (2 s)
2097152 in 3750792 (3 s)
4194304 in 7594669 (7 s)
8388608 in 15385228 (15 s)
16777216 in 31040219 (31 s)
33554432 in 62284580 (62 s)
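
rhashtable resizes itself as the element count grows, which is why the times above scale roughly linearly with record count. A minimal kernel-style sketch of hashing revoke records by block number (structure and parameter names here are illustrative assumptions, not the actual patch in change 45122):

#include <linux/rhashtable.h>
#include <linux/slab.h>

/* Illustrative only: names and parameters are assumptions, not the patch. */
struct revoke_record {
	struct rhash_head node;
	unsigned long long blocknr;	/* hash key */
};

static const struct rhashtable_params revoke_params = {
	.key_len	= sizeof(unsigned long long),
	.key_offset	= offsetof(struct revoke_record, blocknr),
	.head_offset	= offsetof(struct revoke_record, node),
	.automatic_shrinking = true,
};

static struct rhashtable revoke_table;	/* rhashtable_init(&revoke_table, &revoke_params) at setup */

static int revoke_insert(unsigned long long blocknr)
{
	struct revoke_record *rec;

	rec = kmalloc(sizeof(*rec), GFP_KERNEL);
	if (!rec)
		return -ENOMEM;
	rec->blocknr = blocknr;
	/* the table resizes itself as it fills, keeping chains short */
	return rhashtable_insert_fast(&revoke_table, &rec->node, revoke_params);
}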

Comment by Alex Zhuravlev [ 07/Sep/21 ]

pjones, I'm waiting for feedback from linux-ext4@ and then I plan to make patches for master.

Comment by Gerrit Updater [ 04/Oct/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45122
Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cfdf255409758332bb0fc406c8b86578bbf6070b

Comment by Alex Zhuravlev [ 05/Oct/21 ]

I did some measurements and tend to think the current approach should be changed a bit. I ran sanity-benchmark, it took 1724 seconds, and all revoke-table lookups during runtime were counted (in contrast with journal replay):
dm-0-8: 4666 lookups 7637 insertions in revoke table
Only 4666 lookups and 7637 insertions over 1700 seconds on the MDS. Currently insertion is as simple as:
spin_lock(&journal->j_revoke_lock);
list_add(&record->hash, hash_list);	/* O(1) push onto the bucket's chain, no lookup */
spin_unlock(&journal->j_revoke_lock);
and this must be cheaper than an insertion into an rhashtable, which performs a lookup internally. I think there will be a runtime performance penalty if we switch to rhashtable. Probably we need a more flexible scheme where an rhashtable is used during journal replay and a simple list during regular operations, as sketched below.
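
A sketch of that hybrid idea, under stated assumptions (the j_in_replay flag and the revoke_rhash_insert() helper are hypothetical names for illustration; this is not the code that was eventually merged):

/*
 * Hypothetical hybrid, for illustration only (the j_in_replay flag and
 * revoke_rhash_insert() helper do not exist; this is not the merged patch):
 * pay for a resizable hash table only during journal replay, where
 * millions of find/insert operations happen, and keep the cheap
 * locked-list insert for regular runtime.
 */
static int insert_revoke(journal_t *journal,
			 struct jbd2_revoke_record_s *record,
			 struct list_head *hash_list)
{
	if (journal->j_in_replay)	/* hypothetical flag */
		return revoke_rhash_insert(journal, record);

	/* regular operation: only a handful of inserts per transaction */
	spin_lock(&journal->j_revoke_lock);
	list_add(&record->hash, hash_list);
	spin_unlock(&journal->j_revoke_lock);
	return 0;
}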

Comment by Etienne Aujames [ 10/Nov/21 ]

Hello,

The CEA hit that issue on a ClusterStore version "2.12.4.2_cray" with a 4G journal:

Robinhood stopped dequeuing changelogs because of LU-14158; after one night changelog_catalog was 50% full.
We tried to deregister the changelog but the cancel process was too long (LU-14688), so we decided to crash the MDT.
We were then unable to mount the target: with "perf" we identified that the mount process was stuck in journal recovery (find_revoke_record).
We tried e2fsck and it got stuck in journal recovery too. We identified (with strace) that the recovery process was not actually stuck, just far too slow (several days).

To get the filesystem back into production quickly (losing the in-flight IOs):

  • save the raw journal into a file
  • remove the needs_recovery and has_journal flags
  • run e2fsck -f to finish the journal removal and correct inconsistencies
  • reapply the has_journal ext4 flag to recreate the journal
  • kill/crash all the current jobs (their in-flight IOs will be lost)
  • mount Lustre
  • run LFSCK on the MDT to correct Lustre inconsistencies

There were >64M revoke entries inside the jbd2 journal.
For now we think it is the changelog_deregister that causes the creation of so many revoke records.

Comment by Etienne Aujames [ 12/Nov/21 ]

I successfully reproduced the issue on a 2.12.7 LTS with changelog_deregister and with a 1228M journal:

I executed changelog_deregister with 17794/64768 entries in changelog_catalog (27%, 17794 llog plain files).
The revoke record count appeared to increase linearly while removing the plain llogs: 2150 revoke_entries/s at a removal rate of 56 llogs/min.
When I crashed the MDS there were ~20M revoke entries in the journal, and the journal recovery took about 40 min.

Comment by Alex Zhuravlev [ 12/Nov/21 ]

many thanks for the report, I'm going to try this locally.

Comment by Etienne Aujames [ 15/Nov/21 ]

I retested the case with 2.12.7 + the LU-14688 patch:

I executed changelog_deregister with 17794/64768 entries in changelog_catalog (27%, 17794 llog plain files).
The revoke record count grew linearly while removing the plain llogs: 109 revoke_entries/s at a removal rate of 2160 llogs/min.
When I crashed the MDS there were ~50k revoke entries in the journal, and the journal recovery took about 1 s.

So the LU-14688 patch seems to resolve the issue for "changelog_deregister".

Comment by Alex Zhuravlev [ 15/Nov/21 ]

in case you have the time and inclination, could you please try to collect more data? With NO LU-14688 patch, run changelog_deregister with a very high number of changelog records (like in your example) and dump cat /proc/fs/jbd2/<mds device>/info just before crashing the MDS?
The revoke records are generated when metadata blocks are being freed, so there should be the same number of revoke records, but the time distribution is very different, I guess.

Comment by Etienne Aujames [ 16/Nov/21 ]

Hello,
I have retested without the LU-14688 patch (it is the same set of changelogs each time: I untar an archive to restore the changelog state).
This time, I waited for the end of the deregister (like in the test with LU-14688). It took 154 min to complete.
Here is the "/proc/fs/jbd2/info" before the crash:

1858 transactions (1850 requested), each up to 262144 blocks
average: 
  0ms waiting for transaction
  0ms request delay
  4988ms running transaction
  8ms transaction was being locked
  0ms flushing data (in ordered mode)
  32ms logging transaction
  43059us average transaction commit time
  392638 handles per transaction
  49 blocks per transaction
  50 logged blocks per transaction

Before the crash there were 30489150 revoke records in the on-disk journal (counted with "logdump -a", see jbd2_debugfs.gz).
The recovery took 98 min.

I added a "perf probe" on insert_revoke_hash() and recorded some events during the deregister:

   100.00%   100.00%  (ffffffffc0d3ea10)
            |
            ---__GI___ioctl
               system_call_fastpath+37
               sys_ioctl+161
               do_vfs_ioctl+928
               obd_class_ioctl+210
               class_handle_ioctl+6349
               mdt_iocontrol+1516
               mdd_iocontrol+966
               mdd_changelog_clear+1619
               mdd_changelog_llog_cancel+208
               llog_changelog_cancel+84
               llog_cat_process+46
               llog_cat_process_or_fork+481
               llog_process_or_fork+188
               llog_process_thread+2143
               llog_cat_process_cb+151
               llog_handle_put+72
               llog_osd_close+78
               lu_object_put+197
               lu_object_free+104
               osd_object_delete+468
               iput+252
               evict+180
               ldiskfs_evict_inode+1418
               ldiskfs_truncate+784
               |
               |--60.87%--ldiskfs_ind_truncate+689
               |          |
               |           --60.83%--ldiskfs_free_branches
               |                     |
               |                      --60.75%--ldiskfs_free_branches+247
               |                                ldiskfs_free_branches+653
               |                                |
               |                                |--36.95%--ldiskfs_free_data+323
               |                                |          ldiskfs_clear_blocks+401
               |                                |          ldiskfs_free_blocks+411
               |                                |          __ldiskfs_forget+290
               |                                |          insert_revoke_hash+1
               |                                |
               |                                 --23.81%--ldiskfs_free_data+275
               |                                           ldiskfs_clear_blocks+401
               |                                           ldiskfs_free_blocks+411
               |                                           __ldiskfs_forget+290
               |                                           insert_revoke_hash+1
               |
                --38.68%--ldiskfs_ind_truncate+785
                          ldiskfs_free_branches+247
                          ldiskfs_free_branches+653
                          ldiskfs_free_data+323
                          ldiskfs_clear_blocks+401
                          ldiskfs_free_blocks+411
                          __ldiskfs_forget+290
                          insert_revoke_hash+1

Something I noticed is that the number of on-disk revoke records stays stable after the deregister (even if we do some operations on the MDT). It looks like some kind of leak to me, but I don't understand when these records are supposed to be cancelled.

Comment by Etienne Aujames [ 16/Nov/21 ]

Concerning the difference between the tests with and without LU-14688, at the end of the changelog_deregister:

  • with LU-14688: I have 50k on-disk revoke records (~3 revoke records per llog plain)
  • without LU-14688: I have 30M on-disk revoke records (~1700 revoke records per llog plain)

Comment by Etienne Aujames [ 16/Nov/21 ]

Here the "dumpe2fs -h" after recovery: dumpe2fs.txt

Comment by Alex Zhuravlev [ 16/Nov/21 ]
1858 transactions (1850 requested), each up to 262144 blocks
average: 
  0ms waiting for transaction
  0ms request delay
  4988ms running transaction
  8ms transaction was being locked
  0ms flushing data (in ordered mode)
  32ms logging transaction
  43059us average transaction commit time
  392638 handles per transaction
  49 blocks per transaction
  50 logged blocks per transaction

thanks a lot! and, if possible, the same with the LU-14688 patch applied, if you have a few spare cycles. thanks in advance!

I think this confirms the theory that "deregistering" is basically CPU-bound and produces a lot of tiny transactions which aren't checkpointed (I guess there is no need to do so: no memory pressure, as the number of modified blocks is tiny). Given they weren't checkpointed before the crash, JBD has to replay them all, and it needs to skip revoked blocks, so it fills and looks up a big revoke table.
The situation changes with LU-14688, as the process is less CPU-bound and generates bigger transactions, and (somehow) transactions get checkpointed more frequently, leaving less to replay.

The important question here is whether we still need to fix JBD. I tend to think so, as there are other use cases where external processes (like llsom_sync) may consume and clear changelogs at a high rate, and this would result in tiny transactions just as before the LU-14688 patch.

Comment by Etienne Aujames [ 16/Nov/21 ]

Here the "/proc/fs/jbd2/info" with the LU-14688:

131 transactions (127 requested), each up to 262144 blocks
average: 
  0ms waiting for transaction
  0ms request delay
  4770ms running transaction
  13ms transaction was being locked
  0ms flushing data (in ordered mode)
  60ms logging transaction
  91593us average transaction commit time
  826 handles per transaction
  372 blocks per transaction
  373 logged blocks per transaction
Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45122/
Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c3bb2b778d6b40a5cecb01993b55fcc107305b4a

Comment by Gerrit Updater [ 24/Apr/23 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50730
Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 53e110a884423b137b44ea44b0f1327b1535bfa1

Comment by Gerrit Updater [ 29/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50730/
Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 52bc546e91e415315e6cf9a46608264122e64ef3

Comment by Peter Jones [ 29/Apr/23 ]

Seems to be merged for 2.15.3 and 2.16

Comment by Andreas Dilger [ 15/Sep/23 ]

It looks like this patch solved the problem with the kernel journal revoke record handling, but there is still a similar problem in the e2fsprogs handling of journal replay. We hit an issue with slow journal recovery while deleting a large number of changelogs, and "tune2fs" and "e2fsck" were hung in journal recovery for hours before being interrupted. The problem was eventually fixed by updating the server to include this fix in ldiskfs and then mounting the filesystem to do the journal recovery in the kernel.
