[LU-11023] OST Pool Quotas Created: 16/May/18 Updated: 30/May/23 Resolved: 14/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Sergey Cheremencev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | DoM2, FLR2 |
| Attachments: |
|
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The OST (or MDT) pool feature lets users group OSTs together to make object placement more flexible, which is a very useful mechanism for system management. However, quota support for pools is not yet implemented, which limits the usefulness of pools. Fortunately, the current quota framework is powerful and flexible enough that adding such an extension is feasible. |
| Comments |
| Comment by Andreas Dilger [ 16/May/18 ] |
|
I've created a new issue for tracking pool quotas, or possibly per-OST/MDT quotas. Having the ability to put separate quotas on OSTs/MDTs (either directly or via pools) is important for production deployment of both Data-on-MDT (to limit space usage on MDTs) and FLR burst-buffer implementations (to limit usage on flash OSTs). I'm not fixed on linking this quota to OST pools, since there are some complexities there, and we'd also want to have MDT pools for that to be useful for DoM, but I think some kind of limit is needed for these use cases. |
| Comment by Nathan Rutman [ 29/Aug/18 ] |
|
The design doc seems to have a problematic concept: a new EA with the "pool the object belongs to". The OST belongs to a pool (or pools), but the object does not belong to a pool itself. Put another way, all objects on an OST belong to the same set of pools that the OST belongs to. I guess the original idea in the doc was to try to make OST pools "look like" directory quotas by setting a new pool per directory, which is why it can't handle a single object in more than one pool. If we drop this idea, then we can drastically simplify the pool quotas design.
Why all that? Because all the handling of pool quotas can then be confined to the quota master on the MDS. The OSTs just continue to request a single user or group quota from the quota master. The master knows which pool(s) the OST is a member of, checks the quota for each pool, and returns the minimum remaining amount for that OST to the slave. E.g. for this case there would be 4 quota files created (2 per pool): admin_quotafile.usr, admin_quotafile.grp, admin_quotafile.usr.flash, admin_quotafile.grp.flash. For a quota acquire request from an OST in the flash pool, the MDS would check all four files and return the minimum amount remaining. For an OST not in the flash pool, it would not check the .flash quota files. The more pools we have, the more quota files, so quota checks will get incrementally slower, but I think this is acceptable. So after two hours of looking into this, I think this should be relatively easy to do. Am I missing something? |
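For illustration, here is roughly how the scheme above would be driven from the admin side (a sketch assuming the "--pool" setquota option discussed later in this ticket; the user name, pool name and limits are made up):

  # filesystem-wide block limit for user bob
  lfs setquota -u bob -B 10T /mnt/lustre
  # tighter limit for bob on the OSTs in the "flash" pool
  lfs setquota -u bob --pool flash -B 1T /mnt/lustre

An OST in the flash pool would then be granted no more than the minimum remaining amount across the global and flash-pool limits, exactly as described above.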
| Comment by Andreas Dilger [ 30/Aug/18 ] |
|
It would be great if you are going to implement this feature. This is one of the major gaps for tiered storage being really usable within Lustre. If we have e.g. flash OSTs in the filesystem, there is currently no way to exclude them from regular usage (e.g. if someone creates files without specifying a pool). Typically, the flash OSTs will also be smaller than the disk OSTs, so they will fill up more quickly. Having the object allocator tied into pool quotas will ensure that the flash OSTs are skipped when a user doesn't have any remaining quota there, or was never given any in the first place. While the MDS allocation-time decision is not going to prevent all abuses (e.g. a user creates a million small files in the flash pool, then tries to write lots of data into each one), it will at least avoid the majority of such issues. Conveniently, there is already "default quota" functionality (…). Nathan, would you be able to write up a revised design doc that explains your proposed solution? It should include some reasonable use cases (in particular the tiered storage case with a flash OST pool and a disk OST pool that allows users some limited amount of space in the flash pool, possibly time-limited to a short period like 24h). There also needs to be consideration of how the quota tools will be able to specify the quota limits and how this will integrate into the allocator on the MDS. |
| Comment by Nathan Rutman [ 30/Aug/18 ] |
|
It's on Cray's short list for implementation (Cray ticket LUS-5801). We considered including pool quotas in allocator decisions, but came to the conclusion that we should not: it was the user's decision to use this pool, and it's not really the MDS's role to second-guess that and use a different pool than what it was told. In any case, I'd prefer to get the pool quota restrictions done first (in this ticket), then consider the allocator changes as a follow-on. (Frankly, I think the allocator is in bad need of a complete rewrite in any case.) |
| Comment by Cory Spitz [ 09/Nov/18 ] |
|
I've added this LU to http://wiki.lustre.org/Projects. @sergey from Cray will be picking this up within the next month. |
| Comment by Nathan Rutman [ 03/Dec/18 ] |
|
A design question: When destroying a pool, does it make more sense to destroy all associated pool quota settings, or retain them in case the pool is recreated?
|
| Comment by Cory Spitz [ 03/Dec/18 ] |
|
What's the case for keeping them? If it is to make the user's life easier if/when a pool is recreated, just give the user a tool to save the config. Remember, the system will still have to do some re-accounting when the quotas are respecified.
|
| Comment by Andreas Dilger [ 04/Dec/18 ] |
|
Cory, I think there are two separate issues here. There is the pool usage accounting, which is just based on the per-OST quota accounting, and is not accounted on a per-pool basis. There is a separate pool limit file, which is what the administrator specifies for each user (e.g. adilger gets 1TB in the flash pool), which contains potentially a lot of custom information and does not necessarily become stale when the pool name is removed. Given that the pool quota files are probably going to be relatively small, I'm not against keeping them if the pool is removed, so long as there is some way to actually delete the quota limits. Otherwise, I foresee that some admin will have a problem with their pool quota file, try to remove the pool and recreate it, and not be able to resolve their problem. |
| Comment by Sergey Cheremencev [ 21/Dec/18 ] |
|
Hello! There are 2 ways:
|
| Comment by Andreas Dilger [ 21/Dec/18 ] |
|
IMHO it would be confusing/annoying to have to configure OST pools separately from pool quotas. People are already using OST pools for allocating files on specific OSTs, so having to define and configure pool quotas separately (and possibly differently, by accident) would cause a lot of support issues/questions. Even though the quota master (MDT0000) is on a different device from the MGS, MDT0000 should have all of the pool information because it is using the pools to do allocation. I think the biggest effort would be to allow MDTs to be added to pools, and have this affect inode allocation if the MDTs are added to a pool. |
| Comment by Sergey Cheremencev [ 25/Dec/18 ] |
|
Thanks for the answer, Andreas. Have one more item for discussion.

├── changelog_catalog
├── ...
├── quota_master
│   ├── dt-0x0
│   │   ├── 0x1020000
│   │   ├── 0x1020000-OST0000_UUID
│   │   ├── 0x1020000-OST0001_UUID

Instead we can have something like:

├── quota_master
│   ├── dt-pool1
│   │   ...
│   ├── dt-poolN

However I guess pool_id could be useful later. Possibly to group disks directly on the OST. |
| Comment by Sergey Cheremencev [ 29/Dec/18 ] |
|
Please ignore my previous comment - it is no longer relevant. |
| Comment by Andreas Dilger [ 29/Dec/18 ] |
|
According to Nathan's proposal, which I agree with, the concept of a pool quota would be something only understood by the MDT, basically adding the per-OST quotas together based on which OSTs belong in a pool. This would be similar to how "lfs df -p" works on the client, only adding the free space from OSTs that are part of a pool. |
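For reference, the client-side analogue mentioned above looks like this (a sketch; "lustre" and "flash" are placeholder filesystem and pool names):

  # show free/used space only for the OSTs that belong to the "flash" pool
  lfs df --pool lustre.flash /mnt/lustre
  # equivalent short form
  lfs df -p lustre.flash /mnt/lustre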
| Comment by Nathan Rutman [ 14/Feb/19 ] |
|
I've attached our HLD for review (Sergey and I both worked on it). Please let us know any comments or concerns; we'll be implementing this shortly. |
| Comment by Patrick Farrell (Inactive) [ 14/Feb/19 ] |
|
Referring to the example about qunit calculations in "5.2.2. Qunit changes" - I won't quote the whole example. In the case described in the doc, of two overlapping pools, the user is close to running out of quota on one of those OSTs because of that pool. So it is affecting performance, but also, if they've got anything using that OST, then they're almost out of quota. And the striping policy doesn't take quota into account, so files will get striped to that OST, so they will use it... So unless you're making special efforts to avoid it, you'll run out of quota there while using the other pool. So I don't think this is worse than today in ways that matter, and I think "do nothing" would be acceptable...? "Do nothing", with advice to avoid overlapping pools where possible?
|
| Comment by Sergey Cheremencev [ 15/Feb/19 ] |
As we can't reuse the already existing "quota pools", the new feature should have a different name. I propose to name this "quota *s*pools" (slave pools).

C02TM06XHTD6:quota c17829$ grep -R pool . | wc -l
383

What does the community think about the rename? |
| Comment by Andreas Dilger [ 26/Feb/19 ] |
|
Will you be implementing MDT pools as part of this effort, or is that not in the current plans? |
| Comment by Sergey Cheremencev [ 26/Feb/19 ] |
|
The MDT pools part is not in the current plans. I guess that work should start with implementing MDT pools themselves. Possibly we need some independent pools layer, including both MDTs and MDT pools, that is available from LOD, quota and MDD. |
| Comment by Andreas Dilger [ 26/Feb/19 ] |
|
Just reviewing the HLD, some general comments:
Hongchao, can you please take a look at the HLD and provide your input on the pool ID issue, as well as any other thoughts? |
| Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ] |
|
"I saw in the comments at one point that destroying a pool would preserve the quota limits, so they are available if the pool is recreated (which may happen if e.g. there is some problem with the config logs or similar), but this is not reflected in the HLD. IMHO, this behaviour makes sense, since assigning user/group/project quotas for pools is typically cumbersome work, and admins may not have scripts to do this or backups. My understanding is that setting quotas for user/group/project is already a bit of a chore, so we don't want to make it harder. If there is no pool definition, then the left-over pool quota limits could just be ignored completely (e.g. not loaded into memory)? Is there a reason this was removed?" I'm not sure of the details on its removal, but I said (at some point, possibly in discussions at Cray) that I thought this would potentially be confusing and have relatively little utility. Basically, what if the creator of a pool doesn't want quotas and doesn't realize they're re-using a name? It seems quite unpleasant to have "surprise" quotas. Then, also, when do we get rid of the pool quota info for pools that are gone? One of the ideas for pool quotas is that a workload manager or similar is creating them dynamically on a per-job basis, potentially both pools and pool quotas. So the old quotas could really pile up. Maybe they're so tiny it doesn't really matter... (Obviously, we could age them out or something, but it just adds complexity.) |
| Comment by Patrick Farrell (Inactive) [ 26/Feb/19 ] |
|
And, yeah, pool ID is totally unused. I'm confused about how discussion of it crept back into the design doc. Are you guys planning to implement pool IDs? They effectively don't exist today, despite a little bit of old code for them still being present. |
| Comment by Andreas Dilger [ 27/Feb/19 ] |
This would be a bad implementation from a configuration point of view. Pools are stored in the Lustre config log, so that they are visible on the clients, but dynamically creating and removing pools would quickly exhaust the available config llog space. I could see that the quotas might be changed on a per-job basis, but it seems unlikely that the hardware composing a pool would change frequently? If they really wanted per-OST quotas, then just configure one OST per pool and grant quota to the subset of OSTs that are desired. IMHO, in the longer term, it would be desirable to allow users to make their own "pseudo pools". One option is to allow something like "lfs setstripe -o X,Y,Z -c1" to work in the same way as a pool (select 1 OST out of the list of OSTs "X, Y, Z"); see the sketch below. Another is to leverage https://review.whamcloud.com/28972 "LU-9982 lustre: Clients striping from mapped FID in nodemap" to allow creating "template" layout files in e.g. ~/.lustre/ostXYZ_1stripe (possibly using the above "pseudo pool"), which could then be used like "lfs setstripe -F ostXYZ_1stripe /path/to/new/file" so that users have named default layouts of their choosing. |
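As a sketch of the first "pseudo pool" idea (the -o/--ost-list option itself already exists; picking only one OST out of the list via -c 1 is the behaviour being proposed here, not necessarily what current releases do):

  # proposal: pick exactly 1 OST out of OSTs 11, 12 and 13, like a one-shot pool
  lfs setstripe -o 11,12,13 -c 1 /mnt/lustre/scratchdir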
| Comment by Sergey Cheremencev [ 27/Feb/19 ] |
Yes, it was the main reason why we decided it is better to remove the quota pool files together with the corresponding pool.
No, we are not planning to. Furthermore, I've already started to write the code, placing the "new quota pools" alongside the "old quota ID pools". If we decide that the existing quota pools can be removed, I will stop that work and start with a patch that removes the existing quota pools. That would save a lot of time, because removing the existing pools at the final stage will take more effort. |
| Comment by Nathan Rutman [ 28/Feb/19 ] |
Right - this pool quotas work is in no way related to hypothetical MDT pools
Right again. Although we agree this would be a nice feature, we are not lumping a big effort like changing the allocator into this ticket. The allocator needs to get some attention, but not only related to this:
So we are not going to mess with it for this ticket.
We removed this, as we felt that lingering settings would just be confusing. We don't really expect people to be destroying and then recreating pools, but normally if I destroy something I want it to be dead and gone, and part of the reason I am destroying and re-creating is to clear out something that was confusing/broken/unknown.
Yes, and we are actually thinking about another feature: default quotas. Right now, an unset quota means that there are no limits. Instead, we are thinking about defining a "default" quota user, such that if a user has no explicit quota setting, she gets the default quota. This could of course be set for a pool quota as well. But we will be working on this in a separate ticket; not here.
It is unused. It's going to take us some significant effort to remove, and will interrupt Sergey's current progress. We are willing to do this, and will include this as a first patch here. If anyone objects to this, PLEASE SPEAK UP NOW since we will shift to working on this immediately.
|
| Comment by Andreas Dilger [ 28/Feb/19 ] |
|
Note that there is already a mechanism in newer Lustre releases to have a default quota user. This was added in patch https://review.whamcloud.com/32306. What is missing is a good way to back up and restore quota settings. |
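A minimal sketch of that default-quota mechanism (the -U option as I understand it from that patch; verify the exact flags with "lfs help setquota" on your release):

  # set filesystem-wide default block/inode limits for users without explicit quotas
  lfs setquota -U -b 10G -B 11G -i 100000 -I 110000 /mnt/lustre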
| Comment by Sergey Cheremencev [ 04/Mar/19 ] |
|
I started thinking about removing the existing quota pools.
So, what if I just try to reuse the existing quota pools for the new feature's purposes? qmt_pool_info (used to describe the current quota pools) includes everything needed for LOD quota pools (just several fields need to be added), so there is no reason to remove it and add the same structure under another name. I believe the main part of the functions from qmt_pool.c can also be used without big changes. If the suggested way is acceptable, we will finally have the following hierarchy:

├── quota_master
│   ├── dt-0x0
│   │   ├── 0x1020000
│   │   ├── 0x1020000-OST0000_UUID
│   │   ├── 0x1020000-OST0001_UUID
│   │   ├── 0x20000
│   │   ├── ....
│   ├── md-0x0
│   │   ├── 0x10000
│   │   ├── 0x1010000
│   │   ├── ...
│   └── pools
│       ├── pool1_usr
│       ├── pool1_grp
│       ├── pool1_prj
│       ├── pool2_usr
│       ├── ...
├── quota_slave
│   ├── 0x10000
│   ├── 0x10000-MDT0000
│   ├── ... |
| Comment by Gerrit Updater [ 11/Mar/19 ] |
|
Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/34389 |
| Comment by Gerrit Updater [ 15/Apr/19 ] |
|
Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/34667 |
| Comment by Sergey Cheremencev [ 21/Jun/19 ] |
|
Hello! A small update about my work on quota pools: the work is still in progress, but the main part is finished. I have a question about "lfs quota".
|
| Comment by Andreas Dilger [ 21/Jun/19 ] |
|
You could just stick with the long option "--pool". |
| Comment by Sergey Cheremencev [ 24/Jun/19 ] |
|
What about the 1st question? Should "lfs quota" without the "--pool" option show information for all existing pools? |
| Comment by Andreas Dilger [ 25/Jun/19 ] |
|
I'm not really an expert in the quota code, but as long as this does not repeatedly list the OSTs, I think this would be OK. My concern would be if the output becomes too verbose, or if there isn't a way to limit the information to a specific quota type (maybe "--pool=none" to avoid the pool output)? |
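For context, the per-pool query form that eventually landed looks roughly like this (a sketch; "bob" and "flash" are placeholders):

  # report usage and limits for user bob, restricted to the "flash" pool
  lfs quota -u bob --pool flash /mnt/lustre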
| Comment by Gerrit Updater [ 25/Jul/19 ] |
|
Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/35615 |
| Comment by Gerrit Updater [ 09/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34389/ |
| Comment by Andreas Dilger [ 12/Dec/19 ] |
|
Since this work is already nearing completion, I'm wondering if there are additional developments in this area that you will pursue:
|
| Comment by Sergey Cheremencev [ 16/Dec/19 ] |
From my side I did everything to make the process of implementing MDT quota pools as simple as possible. MDT pools look like a distinct feature; I suggest discussing it in another ticket. Possibly we can implement MDT pools only for DoM. Anyway, I believe Cray is interested in having pool quotas on MDT pools, and I will have the opportunity (I need to get approval from management) to be involved in this development process. Let's start discussing!
The key thing here is to provide the quota pool state for usr/grp/prj to the LOD layer. If an OST belongs to a pool, LOD could ask the QMT whether this user has quota in that pool. It looks like we just need to find the lqe from the global pool (qmt_pool_lqe_lookup(env, qmt, pooltype, qtype, id, NULL)) and check each entry in the lqe global array for edquot. So if it is a simple patch, I can help to implement this. |
| Comment by Sergey Cheremencev [ 30/Mar/20 ] |
|
I am stuck with the investigation of a sanity-quota_69 failure. It fails only on the configuration with 8 OSTs, 4 MDTs and 2 clients (review-dne-part-4). |
| Comment by Cory Spitz [ 30/Mar/20 ] |
|
mdiep, I heard that you might be able to assist with Sergey's request. Can you? |
| Comment by Andreas Dilger [ 30/Mar/20 ] |
|
sergey you could add debugging to the test script in your patch to dump the debug logs sooner (e.g. a background thread that calls "lctl dk /tmp/lustre-log-$(date +%s).log" every 5s for some time). I believe that Maloo will attach all "/tmp/*.log" files to the test results. |
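A minimal sketch of such a background dumper, using the path and interval from the comment above (the function name and duration are arbitrary):

  dump_debug_logs() {
      # dump the Lustre debug buffer every 5s for ~2 minutes
      for i in $(seq 24); do
          lctl dk /tmp/lustre-log-$(date +%s).log
          sleep 5
      done
  }
  dump_debug_logs &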
| Comment by Minh Diep [ 30/Mar/20 ] |
|
spitzcor, I am not sure what you're talking about. |
| Comment by Sergey Cheremencev [ 01/Apr/20 ] |
I tried this approach but didn't have success. The latest sanity-quota_69 failure doesn't contain any of the debug logs I saved in tmp with the name "$TMP/lustre-log-client-$(date +%s).log". Probably the name should be similar to "sanity-quota.test_69.test_log.onyx-49vm1.log"? If not, please advise another way. Thanks. |
| Comment by Andreas Dilger [ 01/Apr/20 ] |
|
Poking around a bit further, I see that lustre/tests/auster is uploading all of the logs from its $LOGDIR, and within test-framework.sh the generate_logname() function is using $LOGDIR/$TESTSUITE.$TESTNAME.$1.<hostname>.log for the individual logfiles. It looks like you could use "lctl dk $(generate_logname $(date +%s))" to dump the logs (similar to what gather_logs() does if an error is hit) and then they will be uploaded. James, Minh, Charlie, please correct me if the above is not correct for log files to be included into the Maloo report for a test session. |
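Putting this together, a per-test dump of the MDS debug buffer might look like the following sketch (do_facet and generate_logname come from test-framework.sh, which sanity-quota.sh already sources; $LCTL is the usual test-framework variable):

  # dump the MDS debug buffer into a $LOGDIR file that auster uploads to Maloo
  do_facet mds1 $LCTL dk > $(generate_logname $(date +%s))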
| Comment by Sergey Cheremencev [ 02/Apr/20 ] |
|
adilger, thank you for the advice. However, my last attempt, where I used generate_logname, also failed. The reason is not completely clear to me. At first look it doesn't relate to my patch - the crash dump doesn't contain the reason for the panic:

crash> dmesg | tail -n 2
[ 1593.570869] Lustre: lustre-OST0001-osc-ffff8800a60bc800: disconnect after 21s idle
[ 1593.573338] Lustre: Skipped 19 previous similar messages
crash> sys | grep PANIC
PANIC: ""

On the other hand, it occurred in sanity-quota_69 when it calls lctl dk - https://testing-archive.whamcloud.com/gerrit-janitor/7821/results.html
Can someone assist me here? |
| Comment by Andreas Dilger [ 02/Apr/20 ] |
|
Looking earlier in the test logs, I see a few other stack traces in the oleg308-server-console.txt from a special test run for this patch:

[ 4326.625102] WARNING: CPU: 2 PID: 3431 at fs/proc/generic.c:399 proc_register+0x94/0xb0
[ 4326.627740] proc_dir_entry 'lustre-QMT0000/dt-qpool1' already registered
[ 4326.640806] CPU: 2 PID: 3431 Comm: llog_process_th Kdump: loaded Tainted: P W OE ------------ 3.10.0-7.7-debug #1
[ 4326.644194] Call Trace:
[ 4326.644610] [<ffffffff817d1711>] dump_stack+0x19/0x1b
[ 4326.645525] [<ffffffff8108ba58>] __warn+0xd8/0x100
[ 4326.646338] [<ffffffff8108badf>] warn_slowpath_fmt+0x5f/0x80
[ 4326.649833] [<ffffffff812c2434>] proc_register+0x94/0xb0
[ 4326.650741] [<ffffffff812c2576>] proc_mkdir_data+0x66/0xa0
[ 4326.651683] [<ffffffff812c25e5>] proc_mkdir+0x15/0x20
[ 4326.652710] [<ffffffffa0315374>] lprocfs_register+0x24/0x80 [obdclass]
[ 4326.653941] [<ffffffffa0aa2385>] qmt_pool_alloc+0x175/0x570 [lquota]
[ 4326.655347] [<ffffffffa0aa74a4>] qmt_pool_new+0x224/0x4d0 [lquota]
[ 4326.656901] [<ffffffffa032c83b>] class_process_config+0x22eb/0x2ee0 [obdclass]
[ 4326.660700] [<ffffffffa032eec9>] class_config_llog_handler+0x819/0x14b0 [obdclass]
[ 4326.662767] [<ffffffffa02f2582>] llog_process_thread+0x7d2/0x1a20 [obdclass]
[ 4326.665703] [<ffffffffa02f4292>] llog_process_thread_daemonize+0xa2/0xe0 [obdclass]
[ 4326.676370] LustreError: 3431:0:(qmt_pool.c:208:qmt_pool_alloc()) lustre-QMT0000: failed to create proc entry for pool dt-qpool1 (-12)
[ 4326.680007] LustreError: 3431:0:(qmt_pool.c:935:qmt_pool_new()) Can't alloc pool qpool1
[ 4336.217899] LustreError: 3774:0:(qmt_pool.c:1343:qmt_pool_add_rem()) Can't add to lustre-OST0001_UUID pool qpool1, err -17
[ 4336.223934] LustreError: 3774:0:(qmt_pool.c:1343:qmt_pool_add_rem()) Skipped 5 previous similar messages

so it may be that the code tries to register this same proc entry multiple times, and then crashes during cleanup when it is freed multiple times? |
| Comment by Sergey Cheremencev [ 03/Apr/20 ] |
In such a case I'd expect to see the reason for the failure in the crash dump, something like "BUG: unable to handle kernel NULL pointer". Anyway, the reason is clear - I lost "dk" in my script, causing a timeout error:

do_facet mds1 $LCTL > $(generate_logname $(date +%s))
| Comment by Gerrit Updater [ 14/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35615/ |
| Comment by Peter Jones [ 14/May/20 ] |
|
Landed for 2.14 |
| Comment by Cory Spitz [ 14/May/20 ] |
|
pjones and adilger, can we rename this ticket from "Add OST/MDT pool quota feature" to "OST Quota Pools"? The landed code doesn't include MDT pools and it is probably better to say OST pool quotas because we have user quotas, project quotas and pool quotas, not quota pools. |
| Comment by Peter Jones [ 14/May/20 ] |
|
I agree that this is more clear as to what is being provided in 2.14. Thanks for your attention to detail on this! |
| Comment by Cory Spitz [ 04/Sep/20 ] |
|
pjones, I'm afraid I didn't have the proper attention to detail after all! |
| Comment by Gerrit Updater [ 08/Oct/20 ] |
|
Sergey Cheremencev (sergey.cheremencev@hpe.com) uploaded a new patch: https://review.whamcloud.com/40175 |
| Comment by Sergey Cheremencev [ 14/Oct/20 ] |
|
There is no special ticket about Pool Quotas testing results. |