|
Some thoughts on read policy, decided by the clients:
- any replica marked LCME_FL_STALE should be avoided
- any replica marked LCME_FL_PREF_RD should be selected first
- reads from OSTs marked OS_STATE_NONROT should be preferred (LU-14996)
- reads from the local NID and/or local LNet network should be preferred
- reads from OSTs with higher performance/lower load should be preferred (LU-7880)
- it is desirable to maximize re-use of page cache on the OST, so small files should deterministically pick the same replica (e.g. the first one)
- it is desirable to spread the cache for large files across OSTs to share the RAM and network bandwidth, so clients should deterministically round-robin reads for large files across replicas (e.g. every 1GB)
- if a file has a large number of equivalent replicas (e.g. more than 3?), clients should deterministically and evenly distribute their mirror selection, e.g. by client NID modulo mirror count (see lmv_select_statfs_mdt() in patch https://review.whamcloud.com/29136)
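
The read rules above can be sketched roughly as follows. This is only an illustration, not Lustre code: the `Replica` fields, the `rank()` ordering, the integer `client_nid`, the 1GB small-file limit, and the precedence of the NID-modulo rule over the small-file rule are all assumptions made for the example; the notes above do not fix a strict order between the criteria.

```python
from dataclasses import dataclass

GIB = 1 << 30  # round-robin granularity for large files ("every 1GB")


@dataclass(frozen=True)
class Replica:
    # Hypothetical model of one mirror as seen by the client.
    mirror_id: int
    stale: bool = False      # stands in for LCME_FL_STALE
    pref_rd: bool = False    # stands in for LCME_FL_PREF_RD
    nonrot: bool = False     # OST reports non-rotational (flash) storage
    local_net: bool = False  # OST is on the client's NID or LNet network
    load: float = 0.0        # hypothetical load metric, lower is better


def rank(r):
    # Tuples compare element by element and False sorts before True,
    # so each "good" flag is negated. The order is an assumption.
    return (not r.pref_rd, not r.nonrot, not r.local_net, r.load)


def select_read_replica(replicas, file_size, offset, client_nid,
                        small_file_limit=GIB, many_replicas=3):
    # Stale replicas are never candidates.
    candidates = [r for r in replicas if not r.stale]
    if not candidates:
        raise ValueError("no non-stale replica available")

    # Keep only the replicas that tie for the best rank.
    best = min(rank(r) for r in candidates)
    equivalent = sorted((r for r in candidates if rank(r) == best),
                        key=lambda r: r.mirror_id)

    if len(equivalent) > many_replicas:
        # Many equivalent replicas: spread clients by NID modulo count.
        return equivalent[client_nid % len(equivalent)]
    if file_size <= small_file_limit:
        # Small file: every client picks the same (first) replica so the
        # OST page cache is reused.
        return equivalent[0]
    # Large file: switch replica every 1GB of file offset.
    return equivalent[(offset // GIB) % len(equivalent)]
```

Because the choice depends only on the layout, the file offset, and the client NID, every client computes the same answer without coordination, which is what keeps the OST cache reuse and the load spreading deterministic.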
For selection of which mirror to use for writes, decided by the MDS:
- any mirror marked LCME_FL_STALE must be avoided
- any replica(s) marked LCME_FL_PREF_WR should be selected first
- replicas on flash OSTs (OS_STATFS_NONROT) should be preferred
- prefer OSTs that are close network-wise to the client (same LNet network, in case of local/remote mirrors)
- writes to OSTs with higher performance/lower load should be preferred (LU-7880)
- writes to the local NID are currently not entirely safe (until we get a "no recovery sync write export" for local clients), so a remote OST should typically be used (LU-12722)
- allow specifying a "domain" number for each OST (e.g. rack, PSU, network switch), and avoid allocating replicas of the same file in the same domain. A "domain" is just an arbitrary integer assigned by the admin to identify a group of OSTs that share a common point of failure
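
A minimal sketch of the write-side rules, again only as an illustration and not MDS code. The `Target` fields, the `write_rank()` ordering, and the greedy one-replica-per-domain allocation are assumptions for the example; in particular the notes above do not say how PREF_WR, flash, network locality, and load should be weighed against each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    # Hypothetical model of a mirror (or a candidate OST for a new mirror).
    ident: int
    domain: int              # admin-assigned failure domain number
    stale: bool = False      # stands in for LCME_FL_STALE
    pref_wr: bool = False    # stands in for LCME_FL_PREF_WR
    nonrot: bool = False     # flash OST
    same_net: bool = False   # same LNet network as the client
    local_nid: bool = False  # OST shares the client's NID
    load: float = 0.0        # hypothetical load metric, lower is better


def write_rank(t):
    # False sorts before True: preferred-for-write first, then remote NID
    # over local NID, then flash, then network locality, then load.
    # The ident is a final deterministic tie-break.
    return (not t.pref_wr, t.local_nid, not t.nonrot,
            not t.same_net, t.load, t.ident)


def select_write_mirror(mirrors):
    # Stale mirrors must never be chosen for writes.
    candidates = [m for m in mirrors if not m.stale]
    if not candidates:
        raise ValueError("no non-stale mirror available")
    return min(candidates, key=write_rank)


def allocate_replica_targets(targets, count):
    # Greedy allocation: walk targets best-first and take at most one
    # per failure domain.
    chosen, used_domains = [], set()
    for t in sorted(targets, key=write_rank):
        if t.domain in used_domains:
            continue
        chosen.append(t)
        used_domains.add(t.domain)
        if len(chosen) == count:
            return chosen
    raise ValueError("not enough distinct failure domains")
```

A real allocator would probably fall back to reusing a domain when too few are available instead of failing; the exception here just makes the constraint visible.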
|