Details
Story
Resolution: Unresolved
Minor
Description
From Andreas' comment in LU-16501.
Patch the weight and penalty calculation to reduce/exclude the blocks or inodes, depending on which one is currently "unimportant". For example, on OSTs there are typically far more free inodes than space, so the free inodes should not affect the result when calculating the weight. Conversely, on the MDTs there is usually more free space than inodes, so the free space should not affect the weight. However, in some situations (e.g. DoM or Changelogs filling MDT space, or very small objects on OSTs) these values may become important and cannot be ignored completely as in my 49890 patch.
We cannot change the weight calculation to selectively add/remove the inodes/blocks completely, since that would change the "units" they are calculated in, and it may be more or less important for different OSTs depending on their free space usage. I was thinking of something along the following lines:
- for each statfs update the following metrics can be calculated once per OBD_STATFS call:
- calculate "filesystem bytes per inode" based on "tot_bpi = bytes_total / inodes_total" (this would match the "inode ratio" when an ldiskfs MDT or OST is formatted). I'm not totally convinced this is needed; it depends on how the algorithm is implemented.
- calculate "current bytes per inode" based on "cur_bpi = bytes_used / inodes_used" to determine how the filesystem is actually being used. For osd-zfs the
- limit the contribution of the free inodes OR free bytes to the weight/penalty calculation based on how the current average file size (cur_bpi) compares to the filesystem limits (tot_bpi).
- it may be that the cur_bpi has to be adjusted when the filesystem is initially empty (e.g. because the only files in use are internal config files and maybe the journal), but it may not be important in the long run unless this significantly reduces the relative weight of new/empty OSTs compared to old/full OSTs (where cur_bpi could accurately predict the expected object size). As soon as OST objects start being allocated on the OST, the cur_bpi value will quickly start to approach the actual usage of the filesystem over the long term.
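The two per-statfs metrics above could be sketched roughly as follows. This is only an illustration and not code from any patch: struct tgt_usage, tgt_tot_bpi(), tgt_cur_bpi() and the min_inodes threshold are hypothetical names, and falling back to tot_bpi on a nearly empty target is just one possible way to do the "adjustment" mentioned in the last bullet.

```c
#include <stdint.h>

/* Hypothetical per-target usage snapshot, filled once per statfs update.
 * Field names are illustrative and do not match any Lustre structure. */
struct tgt_usage {
	uint64_t bytes_total;
	uint64_t bytes_used;
	uint64_t inodes_total;
	uint64_t inodes_used;
};

/* tot_bpi = bytes_total / inodes_total (the format-time "inode ratio"). */
static uint64_t tgt_tot_bpi(const struct tgt_usage *u)
{
	uint64_t bpi;

	if (u->inodes_total == 0)
		return 1;
	bpi = u->bytes_total / u->inodes_total;
	return bpi ? bpi : 1;	/* never return 0, callers divide by it */
}

/* cur_bpi = bytes_used / inodes_used (how the target is actually used).
 * ASSUMPTION: with fewer than min_inodes in use, the average mostly
 * reflects internal files, so fall back to tot_bpi instead. */
static uint64_t tgt_cur_bpi(const struct tgt_usage *u, uint64_t min_inodes)
{
	uint64_t bpi;

	if (u->inodes_used == 0 || u->inodes_used < min_inodes)
		return tgt_tot_bpi(u);
	bpi = u->bytes_used / u->inodes_used;
	return bpi ? bpi : 1;
}
```

With 4 GiB used by 1024 inodes on a 1 TiB target formatted with 2^20 inodes, tgt_cur_bpi() returns 4 MiB per inode when min_inodes is 100, and falls back to the 1 MiB tot_bpi when min_inodes is 2048.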
For example, the inode weight could be limited to ia = min(2 * bytes_avail / cur_bpi, inodes_free) >> 8 and the bytes weight should be limited to ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16 (possibly with other scaling factors depending on OST count/size). These values represent how many inodes or bytes can be expected to be allocated by new objects, based on the historical average bytes-per-inode usage of the filesystem. If a target has mostly large objects, then cur_bpi would be large, so ia would be limited by the 2 * bytes_avail / cur_bpi part and it doesn't matter how many free inodes there actually are. Conversely, if cur_bpi is small (below tot_bpi means that the inodes would run out first), then 2 * bytes_avail / cur_bpi would be large and inodes_free would be the limiting factor for allocations. In the middle, if the average object size is close to the mkfs limits, then both the free inodes and bytes would be taken into account.
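A minimal sketch of the two limits, using the ia and ba formulas exactly as written above. The function names and the saturating multiply are assumptions added for illustration (the multiply only guards against 64-bit overflow on very large targets), and the >> 8 and >> 16 shifts are the example scaling factors, not tuned values.

```c
#include <stdint.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Multiply that saturates at UINT64_MAX instead of wrapping (added guard). */
static uint64_t sat_mul(uint64_t a, uint64_t b)
{
	if (a != 0 && b > UINT64_MAX / a)
		return UINT64_MAX;
	return a * b;
}

/* ia = min(2 * bytes_avail / cur_bpi, inodes_free) >> 8 */
static uint64_t inode_weight(uint64_t bytes_avail, uint64_t inodes_free,
			     uint64_t cur_bpi)
{
	if (cur_bpi == 0)
		cur_bpi = 1;	/* avoid division by zero */
	return min_u64(sat_mul(2, bytes_avail) / cur_bpi, inodes_free) >> 8;
}

/* ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16 */
static uint64_t bytes_weight(uint64_t bytes_avail, uint64_t inodes_free,
			     uint64_t cur_bpi)
{
	uint64_t by_inodes = sat_mul(sat_mul(2, inodes_free), cur_bpi);

	return min_u64(by_inodes, bytes_avail) >> 16;
}
```

For a target with 1 TiB available, 2^30 free inodes and a 16 MiB average object size, inode_weight() is capped by the space term (2^17 expected objects, giving 512) even though far more inodes are free, while bytes_weight() uses the full bytes_avail. With 1 KiB objects and 2^20 free inodes the roles swap: inode_weight() uses inodes_free, and bytes_weight() is capped at 2 * 2^20 * 1 KiB = 2 GiB rather than the full 1 TiB.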