[LU-16169] parallel e2fsck pass1 balanced group distribution Created: 19/Sep/22  Updated: 07/Dec/23  Resolved: 07/Dec/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: e2fsck

Issue Links:
Related
is related to LU-14894 Parallel pass2 support for e2fsck Open
is related to LU-14213 enable parallel e2fsck by default Open
is related to LU-16170 parallel e2fsck summary inode count i... Open

 Description   

When running e2fsck with multiple threads (e.g. "-m 32"), each thread is currently assigned an equal number of groups (groups_count / num_threads). However, since the number of inodes in each group is uneven, some threads end up doing far more work during pass1 and take much longer to complete:

Pass 1: Checking inodes, blocks, and sizes
[Thread 0] Scan group range [0, 1328)
[Thread 1] Scan group range [1328, 2656)
[Thread 2] Scan group range [2656, 3984)
:
:
[Thread 30] Scan group range [39840, 41168)
[Thread 31] Scan group range [41168, 42615)
[Thread 20] Pass 1: Memory used: 17224k/237268k (16059k/1165k), time: 107.31/120.13/345.32
[Thread 20] Pass 1: I/O read: 2265MB, write: 0MB, rate: 21.11MB/s
[Thread 20] Scanned group range [26560, 27888), inodes 2318941
[Thread 12] Pass 1: Memory used: 17224k/237268k (15959k/1266k), time: 107.69/120.49/346.50
[Thread 12] Pass 1: I/O read: 2248MB, write: 0MB, rate: 20.88MB/s
[Thread 12] Scanned group range [15936, 17264), inodes 2300847
:
:
[Thread 0] Pass 1: Memory used: 22404k/249936k (18332k/4073k), time: 955.69/318.00/1483.58
[Thread 0] Pass 1: I/O read: 22356MB, write: 0MB, rate: 23.39MB/s
[Thread 0] Scanned group range [0, 1328), inodes 22856885
[Thread 22] Pass 1: Memory used: 23388k/249936k (19317k/4072k), time: 1189.31/359.09/1751.43
[Thread 22] Pass 1: I/O read: 29900MB, write: 0MB, rate: 25.14MB/s
[Thread 22] Scanned group range [29216, 30544), inodes 30342690
[Thread 27] Pass 1: Memory used: 23388k/258768k (19226k/4163k), time: 1567.00/417.52/2140.94
[Thread 27] Pass 1: I/O read: 36898MB, write: 0MB, rate: 23.55MB/s
[Thread 27] Scanned group range [35856, 37184), inodes 37782784
:
:
[Thread 26] Pass 1: Memory used: 41720k/53936k (16911k/24810k), time: 1788.72/445.44/2332.17
[Thread 26] Pass 1: I/O read: 42476MB, write: 0MB, rate: 23.75MB/s
[Thread 26] Scanned group range [34528, 35856), inodes 43494656
[Thread 31] Pass 1: Memory used: 42360k/15692k (15264k/27097k), time: 1907.30/446.44/2342.45
[Thread 31] Pass 1: I/O read: 45931MB, write: 0MB, rate: 24.08MB/s
[Thread 31] Scanned group range [41168, 42615), inodes 47032901

In the above example, each thread is assigned 1329 groups, but some threads only process ~2.5M inodes and complete in ~100s, while others are assigned over 40M inodes and take ~1800s. Each thread processes roughly 24k inodes/sec regardless of how many inodes it is assigned, so if the 545M inodes had been evenly distributed across the threads (about 17M each), pass1 could have finished in about 705s instead of 1907s.
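
For reference, the current split is essentially a fixed-width partition of the group range that ignores how many used inodes each slice contains. A minimal standalone sketch of that behaviour (illustrative names only, not the actual e2fsck code, and ignoring any rounding the real code may apply to the range boundaries):

#include <stdio.h>

/*
 * Sketch of the current assignment: each thread gets
 * groups_count / num_threads consecutive groups, regardless of how
 * many used inodes those groups contain.  Illustrative only; the
 * real e2fsck code may round the boundaries differently, so the
 * exact endpoints in the log above will not match.
 */
static void assign_equal_ranges(unsigned int groups_count,
				unsigned int num_threads)
{
	unsigned int per_thread = groups_count / num_threads;
	unsigned int start = 0;

	for (unsigned int t = 0; t < num_threads; t++) {
		/* the last thread also picks up the remainder groups */
		unsigned int end = (t == num_threads - 1) ?
			groups_count : start + per_thread;

		printf("[Thread %u] Scan group range [%u, %u)\n",
		       t, start, end);
		start = end;
	}
}

int main(void)
{
	assign_equal_ranges(42615, 32);	/* group/thread counts from the log above */
	return 0;
}
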

Groups currently have to be allocated to each thread as a consecutive range in order to keep the in-memory state manageable, so a producer-consumer model where threads pick up one group at a time on an as-available basis would not be easy to implement.

To more evenly distribute inodes across the pass1 threads, one option would be to calculate the average number of inodes per thread (about 545M/32 = 17M in this case), then walk the groups consecutively and accumulate the used-inode count from the group descriptors until approximately that average has been assigned to a thread (cutting the range once the running total is within average_inodes_per_group / 2 below the average, or exceeds it). Some maximum number of groups per thread, like 5x total_groups / num_threads, would limit the damage if the group descriptors are corrupted, possibly reverting to the current "equal" group subdivision if the inode-based split doesn't work out.
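
A rough standalone sketch of this inode-weighted assignment (hypothetical names and sample data, a simplification of the idea above rather than the actual patch):

#include <stdio.h>

/*
 * Sketch of the proposed balancing: walk the groups consecutively and
 * cut a new thread range once the accumulated used-inode count comes
 * within avg_per_group / 2 of the per-thread average (or exceeds it),
 * with a cap on groups per thread in case the group descriptors are
 * corrupted.  All names and numbers here are illustrative.
 */

#define NUM_GROUPS	12
#define NUM_THREADS	4

/* stand-in for the used-inode count from each group descriptor */
static unsigned long long used_inodes[NUM_GROUPS] = {
	900000, 850000, 20000, 15000, 30000, 700000,
	650000, 10000, 5000, 40000, 800000, 30000,
};

int main(void)
{
	unsigned long long total = 0, avg_per_group, avg_per_thread;
	unsigned long long max_groups, acc = 0;
	unsigned int start = 0, thread = 0;

	for (unsigned int g = 0; g < NUM_GROUPS; g++)
		total += used_inodes[g];

	avg_per_group = total / NUM_GROUPS;
	avg_per_thread = total / NUM_THREADS;
	/* cap in case group descriptors report garbage counts */
	max_groups = 5ULL * NUM_GROUPS / NUM_THREADS;

	for (unsigned int g = 0; g < NUM_GROUPS; g++) {
		acc += used_inodes[g];

		/* close the range when the running total is within half
		 * a group's average below the per-thread average, when
		 * it exceeds the average, or when the group cap is hit;
		 * the last thread takes whatever remains. */
		int last_thread = (thread == NUM_THREADS - 1);
		int enough = acc + avg_per_group / 2 >= avg_per_thread;
		int capped = g + 1 - start >= max_groups;

		if (!last_thread && (enough || capped)) {
			printf("[Thread %u] groups [%u, %u), inodes %llu\n",
			       thread, start, g + 1, acc);
			thread++;
			start = g + 1;
			acc = 0;
		}
	}
	printf("[Thread %u] groups [%u, %u), inodes %llu\n",
	       thread, start, (unsigned int)NUM_GROUPS, acc);
	return 0;
}
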

Such an assignment would more evenly distribute the inodes, and hence the runtime, across the threads and should reduce the overall pass1 execution time.



 Comments   
Comment by Gerrit Updater [ 08/Oct/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/48806
Subject: LU-16169 e2fsck: improve parallel thread balance
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 2c6ea4f08e99dc13d30052bae21837720b88bd47

Comment by Gerrit Updater [ 29/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/48806/
Subject: LU-16169 e2fsck: improve parallel thread balance
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 4e82819edcafbdd3bb21fde9d86b0a6a80dfcf3d

Comment by Andreas Dilger [ 30/Nov/23 ]

There is something wrong with the balancing of the groups in e2fsck: with 8 threads, the last two threads end up with empty group ranges while the earlier threads each take close to a full inode quota:

# e2fsck -fn -m 8 /dev/vgmyth/lvmythmdt0.ssd
e2fsck 1.47.0-wc5 (27-Sep-2023)
Warning!  /dev/vgmyth/lvmythmdt0.ssd is in use.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
[Thread 0] Scan group range [0, 20), inode_count = 655358/655360
[Thread 1] Scan group range [20, 40), inode_count = 655360/655360
[Thread 2] Scan group range [40, 66), inode_count = 647340/655360
[Thread 3] Scan group range [66, 92), inode_count = 651555/655360
[Thread 4] Scan group range [92, 112), inode_count = 655360/655360
[Thread 5] Scan group range [112, 160), inode_count = 524320/655360
[Thread 6] Scan group range [160, 160), inode_count = 0/655360
[Thread 6] Scanned group range [160, 160), inodes 0/0
[Thread 7] Scan group range [160, 160), inode_count = 0/655360
Comment by Gerrit Updater [ 30/Nov/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/tools/e2fsprogs/+/53292
Subject: LU-16169 e2fsck: fix parallel thread balance
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 470e360de92c82129637a59c2daff40fd6af0430

Comment by Gerrit Updater [ 06/Dec/23 ]

"Li Dongyang <dongyangli@ddn.com>" merged in patch https://review.whamcloud.com/c/tools/e2fsprogs/+/53292/
Subject: LU-16169 e2fsck: fix parallel thread balance
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: f4ba833854aceb430f4ded14789d55eb4a30b4f6

Comment by Andreas Dilger [ 07/Dec/23 ]

Another example of very imbalanced e2fsck distribution with e2fsck-1.47.0-wc4 before the patch was applied:

Pass 1: Checking inodes, blocks, and sizes
[Thread 0] Scan group range [0, 1595904)
[Thread 1] Scan group range [1595904, 3191808)
[Thread 2] Scan group range [3191808, 4787712)
[Thread 3] Scan group range [4787712, 6383616)
[Thread 1] Pass 1: Memory used: 42928k/724200k (28658k/14271k), time:  0.52/ 1.91/ 0.08
[Thread 1] Pass 1: I/O read: 1MB, write: 0MB, rate: 1.92MB/s
[Thread 1] Scanned group range [1595904, 3191808), inodes 1
[Thread 3] Pass 1: Memory used: 42928k/855876k (28647k/14282k), time:  0.68/ 2.23/ 0.11
[Thread 3] Pass 1: I/O read: 1MB, write: 0MB, rate: 1.47MB/s
[Thread 3] Scanned group range [4787712, 6383616), inodes 1
[Thread 2] Pass 1: Memory used: 54884k/864088k (47076k/7809k), time: 19.02/ 4.05/ 0.41
[Thread 2] Pass 1: I/O read: 9MB, write: 0MB, rate: 0.47MB/s
[Thread 2] Scanned group range [3191808, 4787712), inodes 1058
[Thread 0] Pass 1: Memory used: 422932k/864088k (421572k/1361k), time: 1743.93/98.29/11.38
[Thread 0] Pass 1: I/O read: 24030MB, write: 1MB, rate: 13.78MB/s
[Thread 0] Scanned group range [0, 1595904), inodes 46070748

This had three threads take a total of about 20s to complete their (very few) assigned groups, while thread 0 took 1744s (88x longer). With proper balancing, all of the threads could have finished in about 441s or less (1/4 of the time). The number of inodes processed by each thread is proportional to the amount of data read, and thread 0 in this case read about 2184x as much data as the other three threads combined. The speedup from balancing may even be more than 4x, since the threads would have been overlapping IO and compute instead of sitting idle waiting for IO completion.
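
As a back-of-envelope check of the 441s figure, using only the per-thread pass1 wall times from the log above:

#include <stdio.h>

/* Sum the per-thread pass1 times from the log above and spread the
 * work evenly over the 4 threads; reproduces the ~441s estimate. */
int main(void)
{
	double times[] = { 1743.93, 0.52, 19.02, 0.68 };	/* threads 0-3 */
	double total = 0;
	int nthreads = sizeof(times) / sizeof(times[0]);

	for (int i = 0; i < nthreads; i++)
		total += times[i];

	printf("total %.2fs, balanced estimate %.2fs per thread\n",
	       total, total / nthreads);	/* ~1764.15s -> ~441.0s */
	return 0;
}
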
