Details
-
Technical task
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.16.0
-
3
Description
When selecting a mirror to read from, the client will examine all of the mirrors:
- any mirror component marked LCME_FL_STALE should be skipped
- any mirror component marked LCME_FL_PREF_RD should be selected first (LU-10282)
- any mirror component on OSTs marked OS_STATE_NONROT should be preferred (LU-14996)
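The filtering rules above can be sketched roughly as follows. This is only an illustration, not the client code: `struct ex_mirror`, `ex_mirror_rank()` and `ex_best_mirrors()` are hypothetical names, the boolean fields merely stand in for LCME_FL_STALE, LCME_FL_PREF_RD and OS_STATE_NONROT, and ranking LCME_FL_PREF_RD above OS_STATE_NONROT is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one mirror; not the real layout structs. */
struct ex_mirror {
	bool stale;   /* stands in for LCME_FL_STALE on a component */
	bool pref_rd; /* stands in for LCME_FL_PREF_RD */
	bool nonrot;  /* stands in for OS_STATE_NONROT on the mirror's OSTs */
};

/* Rank a mirror: -1 means unusable, higher is better.
 * Assumption: PREF_RD outranks NONROT. */
static int ex_mirror_rank(const struct ex_mirror *m)
{
	if (m->stale)
		return -1;
	return (m->pref_rd ? 2 : 0) + (m->nonrot ? 1 : 0);
}

/* Store the indices of all best-ranked, non-stale mirrors in out[]
 * (which must hold at least 'count' entries) and return how many
 * equivalent candidates were found. */
static size_t ex_best_mirrors(const struct ex_mirror *m, size_t count,
			      size_t *out)
{
	int best = -1;
	size_t n = 0;

	assert(m != NULL && out != NULL);
	for (size_t i = 0; i < count; i++) {
		int rank = ex_mirror_rank(&m[i]);

		if (rank < 0)
			continue;          /* stale: skip */
		if (rank > best) {
			best = rank;       /* better class found: restart list */
			n = 0;
		}
		if (rank == best)
			out[n++] = i;
	}
	return n;
}
```

When `ex_best_mirrors()` returns more than one index, the mirrors are equivalent under these criteria and a further tie-break is needed.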
If there are multiple mirrors matching all of these criteria, then it is likely that the mirrors were created for performance or availability reasons, rather than tiered OST storage (which would likely be excluded by OS_STATE_NONROT and/or LCME_FL_PREF_RD).
In this case, it is desirable to maximize usage of page cache on the OSS nodes. For small files (e.g. <= 128MiB), the clients should deterministically pick the same replica (e.g. the first one) and always read from that mirror copy, on the assumption that other "small" files are being accessed concurrently and the aggregate system performance is maximized by caching different files in each OSS node's RAM.
For larger files, it is desirable to spread the read workload across multiple OSS nodes to better utilize the RAM and network bandwidth. Clients should deterministically round-robin reads for large files across replicas (e.g. every 1GiB). If there are a large number of equivalent replicas of a file (e.g. more than 2 or 3?), clients should deterministically and evenly distribute their selection of the mirror, e.g. by (client NID + offset in GiB) modulo mirror count (see the implementation of lmv_select_statfs_mdt() in patch https://review.whamcloud.com/29136 for how clients distribute MDT_STATFS RPCs across MDS nodes).
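A minimal sketch of the selection among equivalent mirrors, assuming the example thresholds from the description (128MiB small-file limit, 1GiB round-robin unit). `ex_pick_mirror()` and the `EX_*` constants are hypothetical names, and the client NID is reduced to a plain integer for illustration. The sketch applies the (client NID + offset in GiB) formula to all large files; with a NID of 0 it degenerates to plain per-GiB round-robin.

```c
#include <assert.h>
#include <stdint.h>

#define EX_SMALL_FILE_MAX (128ULL << 20) /* 128MiB, example threshold */
#define EX_CHUNK_SHIFT    30             /* 1GiB round-robin unit */

/* Return the index (0 .. mirror_count - 1) of the equivalent mirror
 * that a read at 'offset' should use. */
static unsigned int ex_pick_mirror(uint64_t file_size, uint64_t offset,
				   uint64_t client_nid,
				   unsigned int mirror_count)
{
	assert(mirror_count > 0);

	/* Small file: every client reads the same (first) replica so
	 * that only one OSS needs to cache the file. */
	if (file_size <= EX_SMALL_FILE_MAX)
		return 0;

	/* Large file: spread reads by client and by 1GiB chunk. */
	return (unsigned int)((client_nid + (offset >> EX_CHUNK_SHIFT)) %
			      mirror_count);
}
```

Because the result depends only on the file size, offset, NID and mirror count, a given client always reads a given chunk from the same mirror, while different clients and different chunks land on different OSS nodes.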