[LU-13363] unbalanced round-robin for object allocation in OST pool Created: 17/Mar/20  Updated: 16/Jun/22  Resolved: 06/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

Two OST pools with two different sizes of OSTs within the same filesystem


Issue Links:
Related
is related to LU-9 Optimize weighted QOS Round-Robin all... Open
is related to LU-9392 lfs migrate -o and lfs setstripe -o s... Open
is related to LU-13066 RR vs. QOS allocator should be tracke... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Here is an example: create two OST pools on a filesystem with 12 OSTs. Pool 'nvme' consists of OST indices [0-7], pool 'hdd' of OST indices [8-b].

lctl pool_new scratch.nvme
lctl pool_new scratch.hdd
lctl pool_add scratch.nvme OST[0-7]
lctl pool_add scratch.hdd OST[8-b]

If a client creates 48 new files in a directory associated with 8 OSTs via an OST pool, one would expect 6 OST objects per OST, but the results were totally unbalanced.
The test was repeated 5 times; here are the numbers of OST objects allocated to each OST in each test.

Used 8 of 12 OSTs with an OST pool

      ost index 
    0  1 2 3 4 5  6  7
t1. 4 10 3 8 5 6  8  4
t2. 6  5 6 7 8 4 10  2
t3. 3 10 8 6 5 9  6  1
t4. 4 10 6 5 4 6  8  5
t5. 6  6 7 4 6 5  8  6

If the filesystem was created with just 8 OSTs and no OST pool, OST objects were allocated across the 8 OSTs in a balanced way and round-robin worked perfectly.

Just 8 OSTs without an OST pool

      ost index 
    0 1 2 3 4 5 6 7
t1. 6 6 6 6 6 6 6 6
t2. 6 6 6 6 6 6 6 6
t3. 6 6 6 6 6 6 6 6
t4. 6 6 6 6 6 6 6 6
t5. 6 6 6 6 6 6 6 6
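The contrast between the two tables can be modeled in a few lines. This is a hypothetical sketch, not Lustre code: strict round-robin places exactly 48/8 = 6 objects on each OST, while a QOS-style weighted-random pick scatters the counts much like the t1-t5 rows above, even when all weights are equal.

```python
import random
from collections import Counter

def rr_allocate(osts, nfiles):
    """Strict round-robin: object i goes to OST i mod len(osts)."""
    return Counter(osts[i % len(osts)] for i in range(nfiles))

def qos_allocate(osts, weights, nfiles, seed=0):
    """QOS-like allocation modeled as a weighted-random pick
    (weights stand in for free space); uneven for small nfiles
    even when the weights are identical."""
    rng = random.Random(seed)
    return Counter(rng.choices(osts, weights=weights, k=nfiles))

osts = list(range(8))                         # OST indices 0-7
print(dict(rr_allocate(osts, 48)))            # every OST gets exactly 6
print(dict(qos_allocate(osts, [1] * 8, 48)))  # scattered, like the first table
```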


 Comments   
Comment by Andreas Dilger [ 17/Mar/20 ]

Presumably the OST0008-OST000B size is much different from OST0000-OST0007? It might be that the pool allocation is incorrectly using QOS because of the global OST imbalance, even though the OSTs within the pool are still balanced. If you configure with only the NVMe OST0000-OST0007, but create the pool on only 6 of them, is the allocation balanced?

Comment by Andreas Dilger [ 17/Mar/20 ]

It looks like this may be a duplicate of LU-13066.

Comment by Shuichi Ihara [ 17/Mar/20 ]

Yes, if all OSTs have the same capacity and an OST pool is created from a few of them, allocation is balanced very well. If OSTs of different capacities are mixed in the filesystem, the problem occurs even when the OST pool is created on devices of the same capacity.

Comment by Andreas Dilger [ 17/Mar/20 ]

Notes for fixing this issue from LU-13066:

The ltd->ltd_qos.lq_same_space boolean that decides whether the LOD QOS allocator is active for an allocation or not is tracked for the entire LOV, when it should actually be tracked on a per-pool basis.

Consider the case where there are SSD of 1TB in size (in an ssd pool), and HDD OSTs of 100TB in size (in an hdd pool). In a newly-formatted filesystem, it is clear that the SSD OSTs would have 1% of the free space of the HDD OSTs, and lq_same_space=0 is set in ltd_qos_penalties_calc(). As a result, QOS would always be active and the SSDs would be skipped for virtually all normal (default pool) allocations, unless the ssd pool is specifically requested. That is fine (even desirable) for the default all-OST pool.

Now, if an allocation is using either the ssd or hdd pools, lod_ost_alloc_qos() will find the global lq_same_space=0 and not use RR allocation, but less-optimal QOS space weighted allocation, even though the space of OSTs in either pool may be well balanced. Instead, the lq_same_space flag should be kept on struct lu_tgt_pool so that allocations within a given pool can decide for RR or QOS allocation independently of the global pool.
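The per-pool decision described above can be sketched as a small model. This is an illustrative assumption, not the Lustre implementation: `same_space()` stands in for the `lq_same_space` computation, and the 17% tolerance mirrors the default `qos_threshold_rr` but is hard-coded here for the example. With 1 TB SSD OSTs and 100 TB HDD OSTs, the global check fails while each pool's own check passes.

```python
def same_space(free_spaces, threshold_pct=17):
    """Hypothetical analogue of lq_same_space: True if every OST's free
    space is within threshold_pct percent of the mean, i.e. the set is
    balanced enough for round-robin allocation."""
    mean = sum(free_spaces) / len(free_spaces)
    tol = mean * threshold_pct / 100
    return all(abs(f - mean) <= tol for f in free_spaces)

# 1 TB SSD OSTs vs 100 TB HDD OSTs (free space in GB), as in the example above
ssd_pool = [1000] * 8
hdd_pool = [100000] * 4
all_osts = ssd_pool + hdd_pool

print(same_space(all_osts))  # False: global imbalance forces QOS everywhere
print(same_space(ssd_pool))  # True: the ssd pool alone is balanced -> RR
print(same_space(hdd_pool))  # True: the hdd pool alone is balanced -> RR
```

Tracking this flag per pool rather than per LOV is exactly what lets the `ssd` and `hdd` pools fall back to round-robin while the default all-OST pool keeps using QOS.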

Comment by Alex Zhuravlev [ 31/Mar/20 ]

It sounds like each pool needs its own lu_qos and all the logic should be built around that per-pool structure? What if some OST is a member of several pools? Or a pool-less allocation hits OSTs that are also in some pool?

Comment by Andreas Dilger [ 31/Mar/20 ]

There are definitely going to be OSTs in multiple pools, and allocations that are outside pools. I think there should be common data fields, like OST fullness, that are shared across pools, and other per-pool information that is not shared.

I don't think we need to have totally perfect coordination between allocations in two different pools or in a pool and outside the pool. However, simple decisions like "is this pool within qos_threshold_rr" can be easily checked for all of the OSTs in the pool, regardless of whether the OST is in another pool as well. If the pool is balanced, then it should just do round-robin allocations within that pool.
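The split suggested here (shared per-OST data such as fullness, plus a per-pool balanced/RR decision) can be sketched as a toy data model. The names `OstState` and `Pool` are illustrative only, and the 17% threshold is an assumption standing in for `qos_threshold_rr`; the same `OstState` objects can be members of any number of pools.

```python
class OstState:
    """Per-OST data shared by every pool that contains this OST."""
    def __init__(self, idx, free_mb):
        self.idx = idx
        self.free_mb = free_mb

class Pool:
    """A pool holds references to (possibly shared) OstState objects and
    makes its own RR-vs-QOS decision from its members alone."""
    def __init__(self, osts, threshold_pct=17):
        self.osts = osts
        self.threshold_pct = threshold_pct

    def use_rr(self):
        """Per-pool analogue of lq_same_space: RR if members are balanced."""
        mean = sum(o.free_mb for o in self.osts) / len(self.osts)
        tol = mean * self.threshold_pct / 100
        return all(abs(o.free_mb - mean) <= tol for o in self.osts)

osts = [OstState(i, 1000) for i in range(8)] + \
       [OstState(i, 100000) for i in range(8, 12)]
nvme, hdd = Pool(osts[:8]), Pool(osts[8:])
everything = Pool(osts)   # the implicit all-OST "pool"

print(nvme.use_rr(), hdd.use_rr(), everything.use_rr())  # True True False
```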

Comment by Emoly Liu [ 03/Apr/20 ]

I made a patch to calculate penalties per-OST in a pool. At first, I tried to add a qos structure to pool_desc, similar to Alex's idea, but finally I found we don't need that, because what we want is just to rebalance data in this pool each time.

Here is my test on 6 OSTs. pool1 is on OST[0-3], and OST[0-3] have similar available space, as shown below. Then I created 48 files on them.

[root@centos7-3 tests]# lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-OST0000_UUID       325368      115908      182300  39% /mnt/lustre[OST:0]
lustre-OST0001_UUID       325368      126152      172056  43% /mnt/lustre[OST:1]
lustre-OST0002_UUID       325368      136388      161820  46% /mnt/lustre[OST:2]
lustre-OST0003_UUID       325368      131276      166932  45% /mnt/lustre[OST:3]
lustre-OST0004_UUID       325368       13512      284696   5% /mnt/lustre[OST:4]
lustre-OST0005_UUID       325368       13516      284692   5% /mnt/lustre[OST:5]

Without the patch, the file distribution is

OST0  OST1  OST2  OST3
13    11    14    10

With the patch,

OST0  OST1  OST2  OST3
12    12    12    12

I will submit this tentative patch later.

Comment by Alex Zhuravlev [ 03/Apr/20 ]

I think rebalancing on every allocation is too expensive.

Comment by Gerrit Updater [ 03/Apr/20 ]

Emoly Liu (emoly@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38136
Subject: LU-13363 lod: do object allocation in OST pool
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 882ae1e39b68ab0cdee78f7bb4e9152f4778e5b9

Comment by Gerrit Updater [ 06/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38136/
Subject: LU-13363 lod: do object allocation in OST pool
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e642e75cde0248eee30ca94aaeb81653db7f8d03

Comment by Peter Jones [ 06/Jun/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:00:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.