Lustre / LU-13363

unbalanced round-robin for object allocation in OST pool

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.16.0
    • Affects Version: Lustre 2.14.0
    • Components: None
    • Environment: Two OST pools with two different sizes of OSTs within the same filesystem
    • Severity: 3

    Description

      Here is an example. Create two OST pools on a filesystem with 12 OSTs: pool 'nvme' consists of OST index [0-7], and pool 'hdd' of OST index [8-b].

      lctl pool_new scratch.nvme
      lctl pool_new scratch.hdd
      lctl pool_add scratch.nvme OST[0-7]
      lctl pool_add scratch.hdd OST[8-b]
      

      If a client creates 48 new files in a directory that is associated with 8 OSTs via an OST pool, one would expect 6 OST objects per OST, but the result was totally unbalanced.
      The test was repeated 5 times; here is how many OST objects were allocated to each OST in each run.

      Used 8 of 12 OSTs with an OST pool

            ost index
           0   1   2   3   4   5   6   7
      t1.  4  10   3   8   5   6   8   4
      t2.  6   5   6   7   8   4  10   2
      t3.  3  10   8   6   5   9   6   1
      t4.  4  10   6   5   4   6   8   5
      t5.  6   6   7   4   6   5   8   6
      

      If the filesystem is created on just 8 OSTs with no OST pool, OST objects are allocated across the 8 OSTs in a balanced manner and round-robin works perfectly.

      Just 8 OSTs without an OST pool

            ost index
           0   1   2   3   4   5   6   7
      t1.  6   6   6   6   6   6   6   6
      t2.  6   6   6   6   6   6   6   6
      t3.  6   6   6   6   6   6   6   6
      t4.  6   6   6   6   6   6   6   6
      t5.  6   6   6   6   6   6   6   6
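      For reference, the balanced case above is exactly what strict round-robin yields by construction. A minimal Python sketch (illustrative only, not Lustre code; `round_robin_alloc` is a made-up helper):

```python
# Illustrative sketch, not Lustre code: strict round-robin object
# allocation of 48 new files over a pool of 8 OSTs.
from collections import Counter

def round_robin_alloc(n_files, ost_indices, start=0):
    """Assign each new file's object to the next OST in sequence."""
    counts = Counter({ost: 0 for ost in ost_indices})
    for i in range(n_files):
        counts[ost_indices[(start + i) % len(ost_indices)]] += 1
    return counts

counts = round_robin_alloc(48, list(range(8)))
print(dict(counts))  # every OST gets exactly 48 / 8 = 6 objects
```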
      


          Activity


            bzzz Alex Zhuravlev added a comment -

            I think rebalancing on every allocation is too expensive.
            emoly.liu Emoly Liu added a comment -

            I made a patch to calculate penalties per-OST in a pool. At first I tried to add a qos structure to pool_desc, a similar idea to Alex's, but finally I found we don't need that, because what we want is just to rebalance data within this pool each time.

            Here is my test on 6 OSTs. pool1 is on OST[0-3], and OST[0-3] have similar available space, as shown below. Then I created 48 files on them.

            [root@centos7-3 tests]# lfs df
            UUID                   1K-blocks        Used   Available Use% Mounted on
            lustre-OST0000_UUID       325368      115908      182300  39% /mnt/lustre[OST:0]
            lustre-OST0001_UUID       325368      126152      172056  43% /mnt/lustre[OST:1]
            lustre-OST0002_UUID       325368      136388      161820  46% /mnt/lustre[OST:2]
            lustre-OST0003_UUID       325368      131276      166932  45% /mnt/lustre[OST:3]
            lustre-OST0004_UUID       325368       13512      284696   5% /mnt/lustre[OST:4]
            lustre-OST0005_UUID       325368       13516      284692   5% /mnt/lustre[OST:5]
            

            Without the patch, the file distribution is

            OST0  OST1  OST2  OST3
            13    11    14    10
            

            With the patch,

            OST0  OST1  OST2  OST3
            12    12    12    12
            

            I will submit this tentative patch later.
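            How a per-pool penalty scheme evens things out can be sketched roughly like this (illustrative Python, not the actual patch; `alloc_with_penalty` and its penalty unit are arbitrary assumptions), reusing the pool1 free-space numbers from the `lfs df` output above:

```python
# Illustrative sketch of per-pool penalties (assumption, not the real
# patch): after an OST is chosen, charge it a fixed penalty so choices
# rotate across the pool's members even when free space differs a bit.

def alloc_with_penalty(n_files, free_kb):
    """free_kb: {ost_index: available KB}, restricted to one pool."""
    penalty = sum(free_kb.values()) // (len(free_kb) * 4)  # arbitrary unit
    credit = dict(free_kb)
    counts = {ost: 0 for ost in free_kb}
    for _ in range(n_files):
        ost = max(credit, key=lambda o: (credit[o], -o))  # most credit wins
        counts[ost] += 1
        credit[ost] -= penalty
    return counts

# Available space of pool1's OSTs (OST0000-OST0003) from `lfs df`:
free = {0: 182300, 1: 172056, 2: 161820, 3: 166932}
print(alloc_with_penalty(48, free))  # {0: 12, 1: 12, 2: 12, 3: 12}
```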


            adilger Andreas Dilger added a comment -

            There are definitely going to be OSTs in multiple pools, and allocations that are outside pools. I think there should be common data fields, like OST fullness, that are shared across pools, and other per-pool information that is not shared.

            I don't think we need to have totally perfect coordination between allocations in two different pools, or in a pool and outside the pool. However, simple decisions like "is this pool within qos_threshold_rr" can easily be checked for all of the OSTs in the pool, regardless of whether the OST is in another pool as well. If the pool is balanced, then it should just do round-robin allocations within that pool.
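            The "is this pool within qos_threshold_rr" decision can be sketched as follows (illustrative Python; `pool_uses_round_robin` is a hypothetical helper, the free-space figures reuse the `lfs df` numbers from Emoly's test, and 17% is assumed as the usual qos_threshold_rr default):

```python
# Illustrative sketch: decide RR vs QOS per pool by comparing the
# free-space spread among the pool's *own* OSTs against a threshold,
# ignoring OSTs outside the pool. Names and numbers are assumptions.

def pool_uses_round_robin(free_kb_by_ost, pool_members, threshold_pct=17):
    """True if this pool is balanced enough for round-robin."""
    free = [free_kb_by_ost[o] for o in pool_members]
    spread_pct = 100.0 * (max(free) - min(free)) / max(free)
    return spread_pct <= threshold_pct

free_kb = {0: 182300, 1: 172056, 2: 161820, 3: 166932,
           4: 284696, 5: 284692}
print(pool_uses_round_robin(free_kb, [0, 1, 2, 3]))   # True: ~11% spread -> RR
print(pool_uses_round_robin(free_kb, list(free_kb)))  # False: ~43% spread -> QOS
```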

            bzzz Alex Zhuravlev added a comment -

            It sounds like each pool needs its own lu_qos and all logic should be built around that per-pool structure? What if some OST is a member of several pools, or a pool-less allocation hits OSTs that belong to some pool?

            adilger Andreas Dilger added a comment -

            Notes for fixing this issue from LU-13066:

            The ltd->ltd_qos.lq_same_space boolean that decides whether the LOD QOS allocator is active for an allocation is tracked for the entire LOV, when it should actually be tracked on a per-pool basis.

            Consider the case where there are SSD OSTs of 1TB in size (in an ssd pool) and HDD OSTs of 100TB in size (in an hdd pool). In a newly-formatted filesystem, it is clear that the SSD OSTs would have 1% of the free space of the HDD OSTs, and lq_same_space=0 is set in ltd_qos_penalties_calc(). As a result, QOS would always be active and the SSDs would be skipped for virtually all normal (default pool) allocations, unless the ssd pool is specifically requested. That is fine (even desirable) for the default all-OST pool.

            Now, if an allocation is using either the ssd or hdd pool, lod_ost_alloc_qos() will find the global lq_same_space=0 and not use RR allocation, but the less-optimal QOS space-weighted allocation, even though the space of OSTs in either pool may be well balanced. Instead, the lq_same_space flag should be kept on struct lu_tgt_pool so that allocations within a given pool can decide between RR and QOS allocation independently of the global pool.
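            A toy illustration of that scenario (assumed numbers: 1TB SSD OSTs, 100TB HDD OSTs, freshly formatted; `same_space` is a hypothetical stand-in for the real ltd_qos_penalties_calc() logic, not Lustre code):

```python
# Toy model: the global "same space" test fails across the whole LOV,
# but would pass within each pool, which is why a per-pool flag helps.
TB = 1 << 40

def same_space(free_bytes, tolerance=0.17):
    """True if all targets' free space is within tolerance of the max."""
    return (max(free_bytes) - min(free_bytes)) <= tolerance * max(free_bytes)

ssd = [1 * TB] * 8    # ssd pool: 1TB OSTs
hdd = [100 * TB] * 4  # hdd pool: 100TB OSTs

print(same_space(ssd + hdd))  # False: global view forces QOS everywhere
print(same_space(ssd))        # True: ssd pool alone could use RR
print(same_space(hdd))        # True: hdd pool alone could use RR
```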
            sihara Shuichi Ihara added a comment - edited

            Yes, if all OSTs are the same capacity and an OST pool is created from a few of them, allocation balances very well. If OSTs of different capacities are mixed in the filesystem, the problem occurs even when the OST pool is created on devices of the same capacity.

            adilger Andreas Dilger added a comment -

            It looks like this may be a duplicate of LU-13066.

            adilger Andreas Dilger added a comment -

            Presumably the OST0008-OST000B size is much different than OST0000-OST0007? It might be that the pool allocation is incorrectly using QOS because of the global OST imbalance, even though the OSTs within the pool are still balanced. If you configure with only the NVMe OST0000-OST0007, but create the pool on only 6 of them, is the allocation balanced?

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 10