
[LU-16501] QOS allocator not balancing space enough

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.2

    Description

      The LOD QOS object allocator is not giving enough priority to free space on the OSTs. Even when the OST free space is significantly imbalanced (in the example below OST0001 has only 1/4 to 1/5 of the free space of the other OSTs), it is still being used for allocations far more often than it should:

      # lfs df
      UUID                   1K-blocks        Used   Available Use% Mounted on
      testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
      testfs-OST0000_UUID       313104       86656      199288  31% /mnt/testfs[OST:0]
      testfs-OST0001_UUID       313104      242304       43640  85% /mnt/testfs[OST:1]
      testfs-OST0002_UUID       313104       77444      208500  28% /mnt/testfs[OST:2]
      testfs-OST0003_UUID       313104      118404      167540  42% /mnt/testfs[OST:3]
      
      filesystem_summary:      1252416      524808      618968  46% /mnt/testfs
      
      # for F in /mnt/testfs/junk3.{21..30}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
      # lfs getstripe /mnt/testfs/junk3.{21..30} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
            8 0,
            6 1,
            8 2,
            8 3,
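
      As a rough check (a minimal Python sketch; the numbers are copied from the Available column of the lfs df output above), a purely free-space-proportional allocator would have been expected to distribute the 30 new objects roughly as follows:

      avail = {0: 199288, 1: 43640, 2: 208500, 3: 167540}   # KB available per OST
      total_objects = 30      # 10 files x 3 PFL components, one object each
      total_avail = sum(avail.values())
      for ost, kb in avail.items():
          print(f"OST{ost:04}: expected ~{total_objects * kb / total_avail:.1f} objects")

      # Expected roughly 9.7 / 2.1 / 10.1 / 8.1 objects on OST0000..OST0003,
      # but lfs getstripe shows 8 / 6 / 8 / 8, so OST0001 received about 3x
      # its proportional share.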
      

      When the free space is even more imbalanced (OST0001 has only 1/11, 1/17, and 1/20 of the free space of the other OSTs, and at least qos_maxage seconds have passed), the OST selection continues to pick the nearly full OST0001 and eventually runs out of space, even though the 3-stripe layout could have been satisfied from the other OSTs:

      # lfs df
      UUID                   1K-blocks        Used   Available Use% Mounted on
      testfs-MDT0000_UUID       125056        2276      111544   2% /mnt/testfs[MDT:0]
      testfs-OST0000_UUID       313104      109184      176760  39% /mnt/testfs[OST:0]
      testfs-OST0001_UUID       313104      277120        8824  97% /mnt/testfs[OST:1]
      testfs-OST0002_UUID       313104      128644      157300  45% /mnt/testfs[OST:2]
      testfs-OST0003_UUID       313104      183940      102004  65% /mnt/testfs[OST:3]
      
      filesystem_summary:      1252416      698888      444888  62% /mnt/testfs
      
      # for F in /mnt/testfs/junk3.{31..40}; do lfs setstripe -E 1M -c 1 -E 16M -c 1 -E eof -c 1 $F; fallocate -l 17M $F; done
      fallocate: fallocate failed: No space left on device
      # lfs getstripe /mnt/testfs/junk3.{31..40} | awk '/l_ost_idx/ { print $5 }' | sort | uniq -c
           11 0,
            5 1,
            6 2,
            8 3,
      

      I think there are a few contributing factors here that need to be fixed:

      • the inode usage on all of the OSTs is virtually identical and very low (only 1%), so the patch https://review.whamcloud.com/35219 "LU-11213 lod: share object alloc QoS code with LMV" may be confusing the proper OST selection when the average object size is much larger than what is needed to fill the OST. However, having both inode and space weighting makes sense for DoM, and for ldiskfs OSTs where the average file size is at or below the bytes/inode ratio (where selecting OSTs with free inodes becomes more important than free blocks). For ZFS the free blocks and free inodes are directly proportional to each other, so this has no effect.
      • having multiple OSTs on the same OSS (as in this case) means that part of the QOS weighting comes from the total OSS free space, which artificially boosts the weight of the full OST when it should not. It might make sense to change the QOS algorithm to keep the OSS weight separate from the OST weight, and then select first the OSS and then the OST within that OSS, as in the sketch below. That would maximize the per-OSS bandwidth while avoiding OSTs on that OSS that are full.
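
      A minimal sketch of that two-level selection (illustrative only: the free-space numbers are taken from the first lfs df output above, while the OSS grouping and helper names like pick_weighted are made up, and this is not the actual lod_qos.c code):

      import random

      # Per-OST free space (KB) grouped by the OSS assumed to serve them.
      oss_map = {
          "oss1": {0: 199288, 1: 43640},   # the nearly-full OST shares this OSS
          "oss2": {2: 208500, 3: 167540},
      }

      def pick_weighted(weights):
          # Pick one key with probability proportional to its weight.
          keys = list(weights)
          return random.choices(keys, weights=[weights[k] for k in keys])[0]

      def select_ost(oss_map):
          # Stage 1: weight each OSS by its total free space (spreads load and
          # bandwidth across servers). Stage 2: weight OSTs only by their own
          # free space, so a full OST no longer benefits from a sibling's space.
          oss_weights = {oss: sum(osts.values()) for oss, osts in oss_map.items()}
          return pick_weighted(oss_map[pick_weighted(oss_weights)])

      counts = {i: 0 for i in range(4)}
      for _ in range(10000):
          counts[select_ost(oss_map)] += 1
      print(counts)   # OST0001 ends up with only ~7% of the allocations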

      Attachments

        Issue Links

          Activity

            [LU-16501] QOS allocator not balancing space enough

            adilger Andreas Dilger added a comment -

            I filed LU-17614 to complete the work to fix the OST allocator, which currently only takes free blocks into account and does not consider free inodes at all.

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50074/
            Subject: LU-16501 lod: add qos_ost_weights to debugfs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a44956f0d57d45109959fc83a32764628adf4446

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50074/ Subject: LU-16501 lod: add qos_ost_weights to debugfs Project: fs/lustre-release Branch: master Current Patch Set: Commit: a44956f0d57d45109959fc83a32764628adf4446

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49977/
            Subject: LU-16501 tgt: add qos debug
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5fe45f0ff98064561be2ea584879440c26dd0334

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49977/ Subject: LU-16501 tgt: add qos debug Project: fs/lustre-release Branch: master Current Patch Set: Commit: 5fe45f0ff98064561be2ea584879440c26dd0334

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50074
            Subject: LU-16501 lod: add qos_ost_weights to debugfs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4cec0e123674c5d1eb7c902343aa3f95bf7053bf

            gerrit Gerrit Updater added a comment - "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50074 Subject: LU-16501 lod: add qos_ost_weights to debugfs Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4cec0e123674c5d1eb7c902343aa3f95bf7053bf

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49890/
            Subject: LU-16501 tgt: skip free inodes in OST weights
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 511bf2f4ccd1482d6f2380942d43cc3e08b8e25b

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49890/ Subject: LU-16501 tgt: skip free inodes in OST weights Project: fs/lustre-release Branch: master Current Patch Set: Commit: 511bf2f4ccd1482d6f2380942d43cc3e08b8e25b

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49977
            Subject: LU-16501 tgt: add qos debug
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ccc4eaeaed4b22effc0655debd6dc71e0618f97f

            gerrit Gerrit Updater added a comment - "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49977 Subject: LU-16501 tgt: add qos debug Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ccc4eaeaed4b22effc0655debd6dc71e0618f97f

            adilger Andreas Dilger added a comment -

            Sergey, based on our previous discussion I think the next steps here are as follows.

            Push your patch to add debug lines for the QOS allocator to print the weights, and do minor cleanups of the other debugging (e.g. set DEBUG_SUBSYSTEM to S_LOV in all relevant code, in particular lustre/obdclass/lu_tgt_*.c) so that it can be enabled to capture only QOS debugging.

            Patch the weight and penalty calculation to reduce/exclude the blocks or inodes, depending on which one is currently "unimportant". For example, on OSTs there are typically far more free inodes than space, so the free inodes should not affect the result when calculating the weight. Conversely, on the MDTs there is usually more free space than inodes, so the free space should not affect the weight. However, in some situations (e.g. DoM or Changelogs filling MDT space, or very small objects on OSTs) these values may become important and cannot be ignored completely as in my 49890 patch.

            We cannot change the weight calculation to selectively add/remove the inodes/blocks completely, since that will change the "units" they are calculated in, and it may be more or less important for different OSTs depending on their free usage. I was thinking something along the following lines:

            • for each statfs update the following metrics can be calculated once per OBD_STATFS call:
            • calculate the "filesystem bytes per inode" as "tot_bpi = bytes_total / inodes_total" (this would match the "inode ratio" when an ldiskfs MDT or OST is formatted). I'm not totally convinced this is needed; it depends on how the algorithm is implemented.
            • calculate the "current bytes per inode" as "cur_bpi = bytes_used / inodes_used" to determine how the filesystem is actually being used. For osd-zfs the
            • limit the contribution of the free inodes OR free bytes to the weight/penalty calculation based on how the current average file size (cur_bpi) compares to the filesystem limits (tot_bpi).
            • it may be that cur_bpi has to be adjusted when the filesystem is initially empty (e.g. because the only files in use are internal config files and maybe the journal), but this may not be important in the long run unless it significantly reduces the relative weight of new/empty OSTs compared to old/full OSTs (where cur_bpi could accurately predict the expected object size). As soon as OST objects start being allocated on the OST, the cur_bpi value will quickly approach the actual usage of the filesystem over the long term.

            For example, the inode weight could be limited to ia = min(2 * bytes_avail / cur_bpi, inodes_free) >> 8 and the bytes weight could be limited to ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16 (possibly with other scaling factors depending on OST count/size). These values represent how many inodes or bytes can be expected to be consumed by new objects based on the historical average bytes-per-inode usage of the filesystem. If a target has mostly large objects, then cur_bpi would be large, so ia would be limited by the 2 * bytes_avail / cur_bpi part and it doesn't matter how many free inodes there actually are. Conversely, if cur_bpi is small (below tot_bpi means that the inodes would run out first), then 2 * bytes_avail / cur_bpi would be large and inodes_free would be the limiting factor for allocations. In the middle, if the average object size is close to the mkfs limits, then both the free inodes and bytes would be taken into account.
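
            A minimal numeric sketch of those limits (the statfs numbers below are made up, and the >>8 / >>16 shifts are only the placeholder scaling factors mentioned above, not settled values):

            # Strawman only - not actual Lustre code.
            def qos_weight_inputs(bytes_total, bytes_used, inodes_total, inodes_used):
                bytes_avail = bytes_total - bytes_used
                inodes_free = inodes_total - inodes_used

                tot_bpi = bytes_total // inodes_total                # mkfs-time bytes/inode ratio
                cur_bpi = max(bytes_used // max(inodes_used, 1), 1)  # observed average object size

                # Limit each component by how much of it could realistically be
                # consumed, given the other resource and the observed cur_bpi.
                ia = min(2 * bytes_avail // cur_bpi, inodes_free) >> 8
                ba = min(2 * inodes_free * cur_bpi, bytes_avail) >> 16
                return tot_bpi, cur_bpi, ia, ba

            # Large-object OST: 8 TiB total, half used, 16M inodes, 40k objects so far.
            print(qos_weight_inputs(8 << 40, 4 << 40, 16 << 20, 40000))
            # cur_bpi comes out around 105 MiB, so ia is capped by the
            # 2 * bytes_avail / cur_bpi term and the millions of free inodes no
            # longer dominate the weight.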

            Finally, make a separate patch to add debugfs parameters to print weight/penalty/per-obj/per-oss for each OST/OSS in the LOV. It probably makes sense for this to be in lod.*.qos_<something> for the default pool, and "lod.*.pool.<pool>.qos_<something>" for each pool. "<something>" might be "qos_tgt_weights" or similar? It could be a YAML-formatted file containing one line per target, and somehow also the per-OSS stats, but I don't have great ideas for this yet. Maybe the per-OSS info (accumulated server penalty and per-obj) could be duplicated on each target line for that server?


            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49890
            Subject: LU-16501 tgt: skip free inodes in OST weights
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cb46a35534548a3fe64763d217e426ab70a06ef4

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49890 Subject: LU-16501 tgt: skip free inodes in OST weights Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: cb46a35534548a3fe64763d217e426ab70a06ef4

            adilger Andreas Dilger added a comment -

            On a related note, I've wondered for some time whether the QOS space balancing shouldn't be a bit more aggressive than it currently is. The algorithm essentially prioritizes OST selection by the ratio of free space between the OSTs (e.g. if OST0001 has 2x the free space of OST0003, then OST0001 would get approximately 2x the allocations over time). However, this essentially means that the space will only be fully balanced when all of the OSTs hit 100% full (modulo the fact that QOS is disabled when the OSTs report free space within qos_threshold_rr percent of each other).
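
            A toy model of that behaviour (purely illustrative: two equal-size OSTs, one starting half full, with each round's allocation split in proportion to free space):

            # Toy model - not the lod_qos.c algorithm.
            size = 1000.0
            free = [1000.0, 500.0]                 # OST A empty, OST B half full

            for round_no in range(1, 1401):
                total = sum(free)
                for i in range(2):
                    free[i] -= free[i] / total     # proportional share of this round's unit
                if round_no % 350 == 0:
                    print(round_no, ["%.0f%% full" % (100 * (1 - f / size)) for f in free])

            # The initial 50-point usage gap only narrows to a few points once
            # both OSTs are over ~90% full.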

            One option would be to change the weighting affected by qos_prio_free (prio_wide in the code) so that qos_prio_free=100 is 100% weighted by OST free space (and nothing related to OSS space), while still keeping qos_prio_free=0 unaffected by OST/OSS free space at all.

            It may also be that the "penalty" values are far too large and prevent allocations on less-full OSTs too quickly, preventing QOS from effectively balancing the space when OST objects are being allocated quickly.

            It may also make sense to emphasize the free-space balancing more aggressively when QOS is active, to target space equilibrium at about 80% full. That would have the dual purpose of reducing (though not eliminating) allocations on OSTs that are over 80% full, while putting more emphasis on less-full OSTs. Otherwise, the QOS balancing may never bring the OSTs into equilibrium under normal usage.


            People

              Assignee: scherementsev Sergey Cheremencev
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 5
