Lustre / LU-13363

unbalanced round-robin for object allocation in OST pool

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.16.0
    • Affects Version: Lustre 2.14.0
    • Components: None
    • Environment: Two OST pools with two different sizes of OSTs within the same filesystem
    • Severity: 3

    Description

      Here is an example. Create two OST pools on a filesystem with 12 OSTs: pool 'nvme' consists of OST index [0-7], and pool 'hdd' of OST index [8-b].

      lctl pool_new scratch.nvme
      lctl pool_new scratch.hdd
      lctl pool_add scratch.nvme OST[0-7]
      lctl pool_add scratch.hdd OST[8-b]
      

      If a client creates 48 new files in a directory that is associated with 8 OSTs via an OST pool, one would expect 6 OST objects per OST, but the result was totally unbalanced.
      The test was repeated 5 times; here is how many OST objects were allocated to each OST in each run.

      Used 8 of 12 OSTs with an OST pool

            ost index
           0   1   2   3   4   5   6   7
      t1.  4  10   3   8   5   6   8   4
      t2.  6   5   6   7   8   4  10   2
      t3.  3  10   8   6   5   9   6   1
      t4.  4  10   6   5   4   6   8   5
      t5.  6   6   7   4   6   5   8   6
      

      If the filesystem is created on just 8 OSTs with no OST pool, OST objects are allocated across the 8 OSTs in a balanced manner and round-robin works perfectly.

      Just 8 OSTs without an OST pool

            ost index
           0   1   2   3   4   5   6   7
      t1.  6   6   6   6   6   6   6   6
      t2.  6   6   6   6   6   6   6   6
      t3.  6   6   6   6   6   6   6   6
      t4.  6   6   6   6   6   6   6   6
      t5.  6   6   6   6   6   6   6   6
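      For reference, the balanced case above is exactly what strict round-robin yields by construction. A minimal Python sketch (illustrative only, not Lustre code; `round_robin_alloc` is a made-up helper):

```python
# Illustrative sketch, not Lustre code: strict round-robin object
# allocation of 48 new files over a pool of 8 OSTs.
from collections import Counter

def round_robin_alloc(n_files, ost_indices, start=0):
    """Assign each new file's object to the next OST in sequence."""
    counts = Counter({ost: 0 for ost in ost_indices})
    for i in range(n_files):
        counts[ost_indices[(start + i) % len(ost_indices)]] += 1
    return counts

counts = round_robin_alloc(48, list(range(8)))
print(dict(counts))  # every OST gets exactly 48 / 8 = 6 objects
```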
      


          Activity


            bzzz Alex Zhuravlev added a comment -

            I think rebalancing on every allocation is too expensive.
            emoly.liu Emoly Liu added a comment -

            I made a patch to calculate penalties per-OST in a pool. At first I tried to add a qos structure to pool_desc, a similar idea to Alex's, but finally I found we don't need that, because what we want is just to rebalance data within this pool each time.

            Here is my test on 6 OSTs. pool1 is on OST[0-3], and OST[0-3] have similar available space, as shown below. Then I created 48 files on them.

            [root@centos7-3 tests]# lfs df
            UUID                   1K-blocks        Used   Available Use% Mounted on
            lustre-OST0000_UUID       325368      115908      182300  39% /mnt/lustre[OST:0]
            lustre-OST0001_UUID       325368      126152      172056  43% /mnt/lustre[OST:1]
            lustre-OST0002_UUID       325368      136388      161820  46% /mnt/lustre[OST:2]
            lustre-OST0003_UUID       325368      131276      166932  45% /mnt/lustre[OST:3]
            lustre-OST0004_UUID       325368       13512      284696   5% /mnt/lustre[OST:4]
            lustre-OST0005_UUID       325368       13516      284692   5% /mnt/lustre[OST:5]
            

            Without the patch, the file distribution is

            OST0  OST1  OST2  OST3
            13    11    14    10
            

            With the patch,

            OST0  OST1  OST2  OST3
            12    12    12    12
            

            I will submit this tentative patch later.
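            How a per-pool penalty scheme evens things out can be sketched roughly like this (illustrative Python, not the actual patch; `alloc_with_penalty` and its penalty unit are arbitrary assumptions), reusing the pool1 free-space numbers from the `lfs df` output above:

```python
# Illustrative sketch of per-pool penalties (assumption, not the real
# patch): after an OST is chosen, charge it a fixed penalty so choices
# rotate across the pool's members even when free space differs a bit.

def alloc_with_penalty(n_files, free_kb):
    """free_kb: {ost_index: available KB}, restricted to one pool."""
    penalty = sum(free_kb.values()) // (len(free_kb) * 4)  # arbitrary unit
    credit = dict(free_kb)
    counts = {ost: 0 for ost in free_kb}
    for _ in range(n_files):
        ost = max(credit, key=lambda o: (credit[o], -o))  # most credit wins
        counts[ost] += 1
        credit[ost] -= penalty
    return counts

# Available space of pool1's OSTs (OST0000-OST0003) from `lfs df`:
free = {0: 182300, 1: 172056, 2: 161820, 3: 166932}
print(alloc_with_penalty(48, free))  # {0: 12, 1: 12, 2: 12, 3: 12}
```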


            adilger Andreas Dilger added a comment -

            There are definitely going to be OSTs in multiple pools, and allocations that are outside pools. I think there should be common data fields, like OST fullness, that are shared across pools, and other per-pool information that is not shared.

            I don't think we need to have totally perfect coordination between allocations in two different pools, or in a pool and outside the pool. However, simple decisions like "is this pool within qos_threshold_rr" can easily be checked for all of the OSTs in the pool, regardless of whether the OST is in another pool as well. If the pool is balanced, then it should just do round-robin allocations within that pool.
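            The "is this pool within qos_threshold_rr" decision can be sketched as follows (illustrative Python; `pool_uses_round_robin` is a hypothetical helper, the free-space figures reuse the `lfs df` numbers from Emoly's test, and 17% is assumed as the usual qos_threshold_rr default):

```python
# Illustrative sketch: decide RR vs QOS per pool by comparing the
# free-space spread among the pool's *own* OSTs against a threshold,
# ignoring OSTs outside the pool. Names and numbers are assumptions.

def pool_uses_round_robin(free_kb_by_ost, pool_members, threshold_pct=17):
    """True if this pool is balanced enough for round-robin."""
    free = [free_kb_by_ost[o] for o in pool_members]
    spread_pct = 100.0 * (max(free) - min(free)) / max(free)
    return spread_pct <= threshold_pct

free_kb = {0: 182300, 1: 172056, 2: 161820, 3: 166932,
           4: 284696, 5: 284692}
print(pool_uses_round_robin(free_kb, [0, 1, 2, 3]))   # True: ~11% spread -> RR
print(pool_uses_round_robin(free_kb, list(free_kb)))  # False: ~43% spread -> QOS
```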

            bzzz Alex Zhuravlev added a comment -

            It sounds like each pool needs its own lu_qos and all logic should be built around that per-pool structure? What if some OST is a member of several pools, or a pool-less allocation hits OSTs that belong to some pool?

            adilger Andreas Dilger added a comment -

            Notes for fixing this issue from LU-13066:

            The ltd->ltd_qos.lq_same_space boolean that decides whether the LOD QOS allocator is active for an allocation is tracked for the entire LOV, when it should actually be tracked on a per-pool basis.

            Consider the case where there are SSD OSTs of 1TB in size (in an ssd pool) and HDD OSTs of 100TB in size (in an hdd pool). In a newly-formatted filesystem, it is clear that the SSD OSTs would have 1% of the free space of the HDD OSTs, and lq_same_space=0 is set in ltd_qos_penalties_calc(). As a result, QOS would always be active and the SSDs would be skipped for virtually all normal (default pool) allocations, unless the ssd pool is specifically requested. That is fine (even desirable) for the default all-OST pool.

            Now, if an allocation is using either the ssd or hdd pool, lod_ost_alloc_qos() will find the global lq_same_space=0 and not use RR allocation, but the less-optimal QOS space-weighted allocation, even though the space of OSTs in either pool may be well balanced. Instead, the lq_same_space flag should be kept on struct lu_tgt_pool so that allocations within a given pool can decide between RR and QOS allocation independently of the global pool.
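            A toy illustration of that scenario (assumed numbers: 1TB SSD OSTs, 100TB HDD OSTs, freshly formatted; `same_space` is a hypothetical stand-in for the real ltd_qos_penalties_calc() logic, not Lustre code):

```python
# Toy model: the global "same space" test fails across the whole LOV,
# but would pass within each pool, which is why a per-pool flag helps.
TB = 1 << 40

def same_space(free_bytes, tolerance=0.17):
    """True if all targets' free space is within tolerance of the max."""
    return (max(free_bytes) - min(free_bytes)) <= tolerance * max(free_bytes)

ssd = [1 * TB] * 8    # ssd pool: 1TB OSTs
hdd = [100 * TB] * 4  # hdd pool: 100TB OSTs

print(same_space(ssd + hdd))  # False: global view forces QOS everywhere
print(same_space(ssd))        # True: ssd pool alone could use RR
print(same_space(hdd))        # True: hdd pool alone could use RR
```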
            sihara Shuichi Ihara added a comment - edited

            Yes, if all OSTs are the same capacity and an OST pool is created from a few of them, allocation balances very well. If OSTs of different capacities are mixed in the filesystem, the problem occurs even when the OST pool is created on devices of the same capacity.

            adilger Andreas Dilger added a comment -

            It looks like this may be a duplicate of LU-13066.

            adilger Andreas Dilger added a comment -

            Presumably the OST0008-OST000B size is much different than OST0000-OST0007? It might be that the pool allocation is incorrectly using QOS because of the global OST imbalance, even though the OSTs within the pool are still balanced. If you configure with only the NVMe OST0000-OST0007, but create the pool on only 6 of them, is the allocation balanced?

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 10