Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: Lustre 2.16.0
- Severity: 3
Description
While fixing LU-16501, the inode contribution to LOD QOS OST selection was disabled because it was causing the (more important) space balance to be incorrect. However, part of the fix for that issue was left incomplete.
The weight and penalty calculations need to be changed to reduce or exclude the contribution of blocks or inodes, depending on which one is currently "unimportant", but to include them again when they become the dominating factor. For example, OSTs typically have far more free inodes than free space, so the free inodes should not affect the weight calculation while more than 50% of OST inodes are free. Conversely, MDTs usually have more free space than free inodes, so the free space should not affect the weight while more than 50% of MDT blocks are free. However, in some situations (e.g. DoM files or ChangeLogs filling MDT space, or very small objects on OSTs) these values may become important and cannot be ignored completely.
We cannot simply add or remove the inodes/blocks terms from the weight calculation, since that would change the "units" the weight is calculated in, and each term may be more or less important for different OSTs depending on their usage. I was thinking of something along the following lines:
- for each statfs update, the following metrics can be calculated once per OBD_STATFS call (a rough sketch follows this list):
  - calculate "filesystem bytes per inode" as "tot_bpi = bytes_total / inodes_total" (this would match the "inode ratio" used when an ldiskfs MDT or OST is formatted). I'm not totally convinced this is needed; it depends on how the algorithm is implemented.
  - calculate "current bytes per inode" as "cur_bpi = bytes_used / inodes_used" to determine how the filesystem is actually being used. For osd-zfs this will always match the "filesystem bytes per inode", since the inodes_total value is itself calculated from cur_bpi and bytes_total.
  - limit the contribution of the free inodes OR free bytes to the weight/penalty calculation based on how the current average file size (cur_bpi) compares to the filesystem limits (tot_bpi).
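A minimal sketch of computing those per-statfs metrics, assuming the standard struct obd_statfs fields (os_blocks, os_bfree, os_files, os_ffree, os_bsize); the lod_bpi structure and the helper name are hypothetical, not from an existing patch:

static void lod_statfs_update_bpi(const struct obd_statfs *sfs,
                                  struct lod_bpi *bpi)
{
        /* struct lod_bpi { __u64 lb_tot_bpi; __u64 lb_cur_bpi; };
         * hypothetical per-target state, updated on each OBD_STATFS reply */
        __u64 bytes_total = sfs->os_blocks * sfs->os_bsize;
        __u64 bytes_used  = (sfs->os_blocks - sfs->os_bfree) * sfs->os_bsize;
        __u64 inodes_used = sfs->os_files - sfs->os_ffree;

        /* matches the mkfs "inode ratio" for an ldiskfs MDT/OST */
        bpi->lb_tot_bpi = bytes_total / (sfs->os_files ?: 1);

        /* how the target is actually being used; avoid a zero divisor on a
         * freshly formatted target with no objects allocated yet */
        bpi->lb_cur_bpi = bytes_used / (inodes_used ?: 1);
}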
It may be that cur_bpi has to be adjusted when the filesystem is initially empty (e.g. because the only files in use are internal config files and maybe the journal), but this may not be important in the long run, unless it significantly reduces the relative weight of new/empty OSTs compared to old/full OSTs (where cur_bpi can accurately predict the expected object size). As soon as OST objects start being allocated on the OST, the cur_bpi value will quickly approach the actual long-term usage of the filesystem.
For example, the inode weight could be limited to ia = min(2 * bytes_avail / cur_bpi, inodes_free) >> 8 and the bytes weight could be limited to ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16 (possibly with other scaling factors depending on OST count/size). These values represent how many inodes or bytes can be expected to be consumed by new objects, based on the historical average bytes-per-inode usage of the filesystem. If a target has mostly large objects, then cur_bpi would be large, so ia would be limited by the 2 * bytes_avail / cur_bpi part, and it doesn't matter how many free inodes there actually are. Conversely, if cur_bpi is small (below tot_bpi, meaning the inodes would run out first), then 2 * bytes_avail / cur_bpi would be large and inodes_free would be the limiting factor for allocations. In the middle, if the average object size is close to the mkfs limits, then both the free inodes and bytes would be taken into account.
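A sketch of that limiting step, assuming cur_bpi has been computed as above; the helper name is hypothetical, and the ">> 8" / ">> 16" scaling is just the one from the example formula:

/* hypothetical helper: cap the free-inode and free-byte contributions to the
 * weight by how many of each the current average object size (cur_bpi) could
 * actually consume before the other resource runs out */
static void lod_qos_limited_avail(__u64 bytes_avail, __u64 inodes_free,
                                  __u64 cur_bpi, __u64 *ia, __u64 *ba)
{
        /* only about 2 * bytes_avail / cur_bpi more objects fit before space
         * runs out, so free inodes beyond that add no weight */
        *ia = min_t(__u64, 2 * bytes_avail / (cur_bpi ?: 1), inodes_free) >> 8;

        /* only about 2 * inodes_free * cur_bpi more bytes can be written
         * before inodes run out, so free bytes beyond that add no weight */
        *ba = min_t(__u64, 2 * inodes_free * cur_bpi, bytes_avail) >> 16;
}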
Finally, make a separate patch to add debugfs parameters to print the weight/penalty/per-obj/per-OSS values for each OST/OSS in the LOV. It probably makes sense for this to be in lod.*.qos_<something> for the default pool, and "lod.*.pool.<pool>.qos_<something>" for each pool; "<something>" might be "qos_tgt_weights" or similar. It could be a YAML-formatted file containing one line per target, and somehow also the per-OSS stats, but I don't have great ideas for this yet. Maybe the per-OSS info (accumulated server penalty and per-obj penalty) could be duplicated on each target line for that server?
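Purely as a strawman for discussion (the parameter name, field names, and values below are all hypothetical), the per-target YAML output could look something like:

$ lctl get_param lod.*.qos_tgt_weights
lod.testfs-MDT0000-mdtlov.qos_tgt_weights=
- { ost_idx: 0, weight: 123456789, penalty: 2048, penalty_per_obj: 64,
    oss_penalty: 4096, oss_penalty_per_obj: 128 }
- { ost_idx: 1, weight: 98765432, penalty: 1024, penalty_per_obj: 32,
    oss_penalty: 4096, oss_penalty_per_obj: 128 }

with the per-OSS penalty and per-obj values repeated on each target line for that server, as suggested above.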
Issue Links
- is related to LU-16501 QOS allocator not balancing space enough (Resolved)