[LU-11416] Improve readahead for random read of small/medium files Created: 21/Sep/18  Updated: 11/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: perf_optimization, readahead

Issue Links:
Related
is related to LU-2032 small random read i/o performance reg... Open
is related to LU-5561 Lustre random reads: 80% performance ... Resolved
is related to LU-8709 parallel asynchronous readahead Resolved
is related to LU-11657 Prefetch whole ZFS block into client ... Open
is related to LU-15100 Add ability to tune definition of loo... Open
Rank (Obsolete): 9223372036854775807

 Description   

For files that fit comfortably within the client RAM it would be possible to tune the readahead algorithm to fetch the entire file into the client RAM even with a totally random read workload (see e.g. LU-5561).

This behaviour can be simulated to some effect by setting the following parameters:

lctl set_param llite.*.max_read_ahead_mb=16384 \
        llite.*.max_read_ahead_per_file_mb=1024 \
        llite.*.max_read_ahead_whole_mb=512

or similar. This causes the Lustre client to read all file data into cache if the file is <= 512 MiB in size and is accessed more than once. Performance testing shows that this can improve random 4KB reads dramatically compared to the default settings (64 MiB, 64 MiB, and 2 MiB, respectively): 7,574,016 IOPS vs. 175,616 IOPS.

On the one hand, this might be considered "cheating" to read the whole file into RAM before doing random IOPS, but on the other hand if this behaviour can be generalized and is automatic for such workloads it is no more of a "cheat" than readahead or write-behind optimizations for a sequential workload.

The main drawback of simply increasing max_read_ahead_whole_mb significantly is that it can hurt workloads that read only a small amount of data from many large files, such as reading an index/header from output files to find parameters, or tools like file(1)/Finder/etc. that read only the head/tail of a file to determine its content type.

A reasonable compromise is to implement the "read ahead whole file" heuristic for files that see a significant number of random accesses but can easily fit within the client RAM. As a starting point, we could add max_read_ahead_random_whole_mb and max_read_ahead_random_ratio parameters that encode this behaviour specifically. The default file-size limit could be something reasonable like totalram_pages / num_online_cpus / 2 (e.g. 1820MB for a 36-core 128GB client), with a default ratio of 1/4096 of the file's pages read before triggering whole-file readahead.

That would mean a 1GB file would be fully prefetched after 64 random pages were read, and the maximum-sized 1820MB file would be fully prefetched after 113 pages are read.
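The proposed defaults and trigger thresholds above can be sketched as follows. This is only an illustration of the arithmetic in this ticket: max_read_ahead_random_whole_mb and max_read_ahead_random_ratio are proposed parameters, not an existing Lustre API, and the helper names here are hypothetical.

```python
# Sketch of the proposed heuristic defaults for whole-file readahead
# on random-read workloads. Nothing here is existing Lustre code.

PAGE_SIZE = 4096  # bytes per page, typical for x86_64 clients

def default_random_whole_mb(total_ram_mb, num_online_cpus):
    """Proposed default file-size cap: totalram / num_online_cpus / 2."""
    return total_ram_mb // num_online_cpus // 2

def random_read_trigger_pages(file_size_bytes, ratio_denominator=4096):
    """Random page reads (1/ratio of the file's pages) after which the
    whole file would be prefetched."""
    file_pages = file_size_bytes // PAGE_SIZE
    return file_pages // ratio_denominator

# 36-core, 128GB client -> 1820MB default cap
print(default_random_whole_mb(128 * 1024, 36))        # 1820
# 1GB file -> fully prefetched after 64 random pages
print(random_read_trigger_pages(1024 * 1024 * 1024))  # 64
# maximum-sized 1820MB file -> prefetched after 113 pages
print(random_read_trigger_pages(1820 * 1024 * 1024))  # 113
```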


Generated at Sat Feb 10 02:43:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.