[LU-10158] FLR: Define a replica choosing policy function Created: 25/Oct/17  Updated: 19/May/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Jinshan Xiong (Inactive) Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: FLR2

Issue Links:
Related
is related to LU-15834 "lfs mirror extend" should take curre... Open
is related to LU-7880 add performance statistics to obd_statfs Open
is related to LU-11963 Add nonrotational flag to obd_statfs Resolved
is related to LU-12722 exclude local client mounted on MDS/O... Resolved
is related to LU-14996 select preferred mirror using non-rot... Resolved
is related to LU-12649 Tracker for ongoing FLR improvements Open
is related to LU-10448 policy to pick a primary for mirrored... Resolved
is related to LU-9007 Improved object allocator for FLR com... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

For a mirrored file, the policy for picking a replica as the primary when the file is first written is primitive in the current implementation: it chooses either the first replica or a random one. A policy function should be defined for choosing the replica. At this stage, it should at least avoid replicas on unavailable OSTs.
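
As a rough illustration of the minimum requirement above (skip replicas that touch an unavailable OST), here is a small Python sketch. It is not Lustre code: the `Replica` type, the `choose_primary()` name, and the idea of passing in a set of unavailable OST indices are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Replica:
    """Hypothetical stand-in for one mirror of a file."""
    mirror_id: int
    ost_indices: Tuple[int, ...]  # OSTs holding this replica's objects


def choose_primary(replicas: Sequence[Replica],
                   unavailable_osts: Set[int]) -> Optional[Replica]:
    """Return the first replica whose OSTs are all available.

    Keeps the existing "first replica" behaviour when everything is
    healthy, and returns None when no replica is fully available.
    """
    for replica in replicas:
        if not any(ost in unavailable_osts for ost in replica.ost_indices):
            return replica
    return None
```

For example, with replicas on OSTs (0, 1) and (2, 3) and OST 1 marked unavailable, the sketch picks the second replica instead of the first.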



 Comments   
Comment by Jinshan Xiong (Inactive) [ 25/Oct/17 ]

This should be implemented in phase 2 of FLR.

Comment by Andreas Dilger [ 25/Oct/17 ]

Some thoughts on read policy, decided by the clients:

  • any replica marked LCME_FL_STALE should be avoided
  • any replica marked LCME_FL_PREF_RD should be selected first
  • reads from OSTs marked OS_STATE_NONROT should be preferred (LU-14996)
  • reads from the local NID and/or local LNet network should be preferred
  • reads from OSTs with higher performance/lower load should be preferred (LU-7880)
  • it is desirable to maximize re-use of page cache on the OST, so small files should deterministically pick the same replica (e.g. the first one)
  • it is desirable to spread the cache for large files across OSTs to share the RAM and network bandwidth, so clients should deterministically round-robin reads for large files across replicas (e.g. every 1GB)
  • if there are a large number of equivalent replicas of a file (e.g. more than 3?), clients should deterministically and evenly distribute their mirror selection, e.g. by client NID modulo mirror count (see e.g. lmv_select_statfs_mdt() in patch https://review.whamcloud.com/29136)
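
One possible way to combine the read-side preferences above into a single client-side function is sketched below in Python. This is only an illustration under stated assumptions: the `Mirror` type, its fields, the flag values, the 1 GiB chunk size, the "more than 3" threshold, and the precedence order of the criteria are placeholders chosen for the example, not the Lustre implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Placeholder flag values for this sketch; the real definitions live in
# the Lustre headers.
LCME_FL_STALE = 0x1
LCME_FL_PREF_RD = 0x2


@dataclass(frozen=True)
class Mirror:
    """Hypothetical per-mirror view as seen by one client."""
    mirror_id: int
    flags: int = 0
    nonrot: bool = False     # all OSTs report the non-rotational flag
    local_net: bool = False  # reachable on the client's own LNet network
    load: float = 0.0        # lower is better (assumed metric)


def choose_read_mirror(mirrors: Sequence[Mirror], client_nid: int,
                       file_size: int, offset: int,
                       chunk: int = 1 << 30,
                       many: int = 3) -> Optional[Mirror]:
    """Pick a mirror for a read at `offset`, or None if all are stale."""
    usable = [m for m in mirrors if not m.flags & LCME_FL_STALE]
    if not usable:
        return None

    def rank(m: Mirror):
        # Smaller tuples sort first, so each preferred property maps to False.
        return (not (m.flags & LCME_FL_PREF_RD), not m.nonrot,
                not m.local_net, m.load)

    best_rank = min(rank(m) for m in usable)
    best = [m for m in usable if rank(m) == best_rank]

    # Many equivalent mirrors: spread clients by NID modulo mirror count.
    start = client_nid % len(best) if len(best) > many else 0
    # Large files: rotate across equivalent mirrors once per chunk.
    step = offset // chunk if file_size > chunk else 0
    return best[(start + step) % len(best)]
```

With this ordering, a small file always resolves to the same mirror for a given client, while a large file moves to the next equivalent mirror at each chunk boundary, which matches the cache re-use and cache spreading goals in the list.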

For selection of which mirror to use for writes, decided by the MDS:

  • any mirror marked LCME_FL_STALE must be avoided
  • any replica(s) marked LCME_FL_PREF_WR should be selected first
  • replicas on flash OSTs (OS_STATFS_NONROT) should be preferred
  • prefer OSTs that are close network-wise to the client (same LNet network, in case of local/remote mirrors)
  • writes to OSTs with higher performance/lower load should be preferred (LU-7880)
  • currently it is not totally safe for writes to the local NID (until we get a "no recovery sync write export" for local clients), so a remote OST should typically be used (LU-12722)
  • allow specifying a "domain" number for each OST (e.g. rack, PSU, network switch), and avoid allocating replicas from the same domain. "domain" is just an arbitrary integer assigned by the admin to indicate groups of OSTs that share a common point of failure
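
The write-side preferences and the failure "domain" idea above could be prototyped along these lines. This Python sketch is an assumption-laden illustration, not MDS code: `WriteMirror`, `Ost`, their fields, the flag values, and the relative order of the criteria (in particular ranking "not on the client's own node" directly after the LCME_FL_PREF_WR hint) are choices made only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Placeholder flag values for this sketch; the real definitions live in
# the Lustre headers.
LCME_FL_STALE = 0x1
LCME_FL_PREF_WR = 0x4


@dataclass(frozen=True)
class WriteMirror:
    """Hypothetical per-mirror view on the MDS for one writing client."""
    mirror_id: int
    flags: int = 0
    nonrot: bool = False          # mirror lives on flash OSTs
    same_net: bool = False        # same LNet network as the client
    on_client_node: bool = False  # OST is on the same node as the client
    load: float = 0.0             # lower is better (assumed metric)


@dataclass(frozen=True)
class Ost:
    """Hypothetical OST with an admin-assigned failure domain number."""
    index: int
    domain: int


def choose_write_mirror(mirrors: Sequence[WriteMirror]) -> Optional[WriteMirror]:
    """Pick the mirror to write to; stale mirrors are never candidates."""
    usable = [m for m in mirrors if not m.flags & LCME_FL_STALE]
    if not usable:
        return None
    # min() keeps the earliest mirror on ties, so the result is deterministic.
    return min(usable, key=lambda m: (not (m.flags & LCME_FL_PREF_WR),
                                      m.on_client_node,
                                      not m.nonrot,
                                      not m.same_net,
                                      m.load))


def osts_in_new_domains(candidates: Sequence[Ost],
                        existing: Sequence[Ost]) -> List[Ost]:
    """Keep only candidate OSTs whose domain is not used by existing replicas."""
    used = {ost.domain for ost in existing}
    return [ost for ost in candidates if ost.domain not in used]
```

The domain filter is kept separate from mirror selection because it applies when allocating a new replica, whereas `choose_write_mirror()` only ranks mirrors that already exist.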

Comment by Joseph Gmitter (Inactive) [ 19/Dec/17 ]

Converted to an issue as it is not a subtask for phase 1 of FLR (LU-9771).
