[LU-10070] PFL self-extending file layout
Created: 04/Oct/17  Updated: 21/Dec/23  Resolved: 25/Sep/19

| Status:            | Resolved       |
| Project:           | Lustre         |
| Component/s:       | None           |
| Affects Version/s: | Lustre 2.10.0  |
| Fix Version/s:     | Lustre 2.13.0  |
| Type:              | New Feature    | Priority: | Minor          |
| Reporter:          | Andreas Dilger | Assignee: | Vitaly Fertman |
| Resolution:        | Fixed          | Votes:    | 0              |
| Labels:            | FLR2           |
Description
One interesting idea discussed at LAD was to have a PFL layout that is "self-extending". For several use cases, such as HSM partial-file release/restore, partial-file migration to/from burst buffers, and partial-file FLR resync, it is advantageous to avoid the need to restore/migrate/resync the entire file at once, and instead process only the required chunks of the file. Essentially, a PFL file would have the normal few components that define the start of the file (e.g. [0-32MB), [32MB-1GB), [1GB-16GB)), and they would be instantiated as with normal PFL today. What is new for "self-extending layouts" is that the last component becomes the template for additional components (as needed), rather than having a component to "EOF" that freezes the layout of the rest of the file. This avoids the overhead of explicitly specifying many identical components for the file merely to limit the size of the components that need to be processed.
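For illustration, such a layout might be created along these lines, using the extension-component option syntax proposed later in this ticket (the file name, sizes, and stripe counts are only examples):

lfs setstripe -E 32M -c 1 -E 1G -c 4 -E 16G -c 8 -E -1 -e 16g --component-flag extension /mnt/lustre/bigfile

Here the first three components behave as in a normal PFL file, and the final component, instead of freezing a layout out to EOF, is grown (or repeated) in 16GB steps as the file grows.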
Comments
Comment by Jinshan Xiong (Inactive) [ 12/Oct/17 ]

If the new component is added with the same layout as the last component, what's the point of having a new component? Would it be better to just extend the extent of the last component?
Comment by Andreas Dilger [ 13/Oct/17 ]

As described above, there are several reasons for having separate components for the file. One would be to allow HSM files to be archived and restored incrementally. Another is to limit the amount of data that needs to be resync'd if a replicated file is modified, since only the stale component would need to be resync'd.
Comment by Nathan Rutman [ 30/Oct/17 ]

Please see an alternate idea in LU-10169.
Comment by Nathan Rutman [ 25/Jan/18 ]

Are you thinking this would add an explicit new component each time? Example layout: [0-32MB), [32MB-1GB), [1GB-16GB), [SE +16GB)

So after the file grows to 1TB, there will be ~64 components? Maybe the component definition could include a multiplier instead, and we just increment that? That would save space in the layout description, and also re-use the OST list (so we're not further spreading the stripes around to new OSTs. Hmm, maybe we want this selectable, to rebalance or not?): [0-32MB), [32MB-1GB), [1GB-16GB), [SE +16GB)*64
Comment by Andreas Dilger [ 26/Jan/18 ]

My thought is that yes, it would add a new component each time, which also means new OST objects. That allows better space balance, if the last component is not fully striped. It also allows independent handling of each component (migrate, resync, archive/restore, etc.), which is the goal of this functionality.
Comment by Nathan Rutman [ 15/Feb/18 ]

I compared this to the alternative in LU-10169.

Not handled as well by 10070:

Not handled as well by 10169:
Comment by Patrick Farrell (Inactive) [ 15/Feb/18 ]

Nathan,

He and I both had a thought; I'll quote (lightly edited for clarity):

"Paf: Agreed. Vitaly noted that it should be possible to extend the existing final component without violating the rules that require "no changing of existing layout components", because it should not impact any ongoing I/O on any clients. The idea would be to combine these two things - when the client reached the end of the current layout, it would go to the MDS and either get an extended version of the current component or a new component. We would need to make sure somehow the client knew to go look for more layout. It might be sufficient to just revoke the layout lock; then the client, when it needed layout information, would have to go get new layout information. That new information could either be an extension OR new components."

"Paf: The other issue Ben and I discussed - which I'm wondering if you've got more on, Ben - is how to decide when to do spillover? Ben and I discussed two ideas, which I hope I'll describe decently here:

Nathan, do you have a good sense of whether or not the allocator could be retrofitted to serve that purpose, and whether or not that's a good idea? Or just a reaction in general?"
Comment by Patrick Farrell (Inactive) [ 16/Feb/18 ]

Cory pointed out a flaw in this one, which I hadn't thought through (but I think the other folks here had). Self-extending layouts do not play very nicely with PFL layouts, for certain values of "nicely". A self-extending (or repeating) component must (more or less...) necessarily be the last component of a PFL file while it is being extended. That means that, in your usual model of a PFL file, the last component would have to be the one that was repeated or extended. This means that (at least as this is currently imagined) we would only change striping on files which were in their final component of the PFL layout.

Just for reference, this is how I imagined implementing this: when we get to the end of 8 stripes, we either extend the component (rather than moving to another pool) or repeat it. The magic component could specify a pool to spill over to as well.
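A rough sketch of the extend-vs-repeat distinction, in the extent notation used elsewhere in this ticket (the 8GB boundaries are assumed, not from the original example):

initial:   [0-8GB) c=8                    <- writing reaches the component end
extended:  [0-16GB) c=8                   <- same component and objects, extent grown
repeated:  [0-8GB) c=8, [8GB-16GB) c=8    <- new component on new OST objects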
Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ]

A further thought. We (Cray) have toyed with the idea of extent-level tracked tiering, something to do caching with at a small granularity. This is very similar to the idea Andreas describes, where layout components could be moved around independently; the problem is that reasonable extent sizes for automated caching on a faster tier do not match up with reasonable component lengths. We can't have too many components, so they have to be large, but if we want to do extent-based caching, extents need to be pretty small. The goals are not entirely dissimilar, but they don't lend themselves to the same implementation.

Andreas - are you really not worried about the implications for MDT storage space of having a layout component for every, say, 10 GiB? Just wondering what you're thinking there.
Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ]

I believe the current behavior of O_APPEND and PFL files makes self-extending layouts impossible (assuming O_APPEND support is required, which seems logical).
Comment by Andreas Dilger [ 21/Feb/18 ]

To comment on the several different issues that were raised in the last few comments: in general, I don't see this proposal as the best possible solution to all issues, but it definitely can address a number of common issues, and wouldn't need a huge amount of effort to implement. Existing PFL clients should "just work" with the proposed self-extending layout.

Yes, definitely this is something to think about. We can't efficiently handle arbitrarily large file layouts, so there is some tradeoff between the extent size and the number of components that can be added. This is also partly true of PFL files in general, but there the expectation is that the last component extends to EOF, which puts an upper limit on the layout size. With self-extending layouts, this would no longer be true, but we could probably handle this in a few ways:

This might need a small change on the client, but I don't think it is hard to handle. The client just needs to hold the layout lock, and whether there is a component that goes to layout EOF or not is irrelevant. It just needs to know where the actual file's end of data is and lock all of the objects at or beyond there, whether it is inside a fixed-length component or not. While it is holding the layout lock, no other client can instantiate a component with objects after the end of the file.

That is already implemented with the osp.*.reserved_mb_high and osp.*.reserved_mb_low tunables. The MDS will avoid allocation on the OSTs once they exceed the high watermark, until they drop back below the low watermark again.

This has been discussed for a long time already. Li Xi implemented a proposal in LU-9809 to allow userspace to periodically supply the scores for each OST, so that the weights can be managed in an arbitrarily complex manner. The actual OST selection would still need to be handled by the MDS in the kernel, since we can't depend on userspace to keep the weights updated for each file creation, but that would be enough for balancing space, performance, and maintenance (RAID rebuild, etc.). I think that is a wholly separate project, and would welcome feedback/assistance in LU-9809, but it shouldn't be mixed in here.
Comment by Ben Evans (Inactive) [ 28/Feb/18 ]

I think I've been thinking of this slightly differently. For any component, you'd have a "chunk" size where you'd check to see if you need to switch layouts to a new set of OSTs. Most of the time, the answer to this is "no", so you'd extend the current extent one more chunk, up to the end of the current component. The benefit of this is that for systems with MDT, SSD, and HDD pools, you should stay within the performance pool as long as specified, without getting kicked to the final (slowest) component, though that probably needs to be an option if the current pool is full. In order to do this, the scoring for any chunk needs to check if it's bad enough to require a change, with the default of "no change".
Comment by Patrick Farrell (Inactive) [ 28/Feb/18 ]

Ben,

That's basically how I've been thinking of it too, but Andreas and, I believe, Nathan (in an internal ticket or conversation?) have suggested that we just use the thresholds for making that choice. So no scoring system, though something to do that could be integrated later.
Comment by Patrick Farrell (Inactive) [ 05/Mar/18 ]

Posting some comments from a Cray ticket. Iterating on a few aspects of this, I wanted to post an example of what a setstripe command could look like, as a way to explain what I'm thinking and get feedback.

Here's an example of a simple PFL command, which I'll break down first: the first component is 1 stripe, with a stripe size of 1 megabyte and an end of 1 megabyte, on pool "pool1".

Here's an imagined setstripe command:

lfs setstripe -E1g -c2 -S1m --pool="pool1" -E 100g -e10g --component-flag extension -E -1 -c4 -S1m --pool="pool2"

The first component ends at 1 GB, has two stripes w/1 MB stripe size, and uses pool1. The second component is an "extension" component, which will never be instantiated. It ends at 100G (so the maximum size on pool1 would be 100G), and it will give extensions of 10g at a time. This is followed by the last component, which ends at infinity and uses 4 stripes (1 MB stripe size) on pool2.

The idea is that when we get to the end of component 1, we hit component 2, and the client sees the extension flag. It uses this to make an extension request to the server. The server can either return a new component 1 with an end at 11G, or it can refuse - then it would extend component 3 (the one on pool2) forward to fill the space previously used by the "extension" component, removing that component. Eventually, when/if we reached the end of the extension space (component 2), we would also remove component 2 at that time, so component 1 (on pool1, now 100 GB in size) would be adjacent to the old component 3 (the component on pool2).

Sparse files raise some interesting questions, which I'll talk about shortly. There are a few different choices to make.

Suppose the user is writing into the "extension" space. Probably the simplest thing to do is just do the space check on the OSTs once, and extend the first component to 30GB, the next multiple of the extension size. This means that sparse files can kind of mess up this feature, by getting a lot of space allocated... But sparse files are already hard.

Now consider one more example, where the user writes at an offset of 120 GB to that same file. Then, later, if we tried to extend component 1 and the space check failed (so we cannot extend it), we would destroy component 2 and extend component 3 to the end of component 1. So it would look like this, if component 1 were 30G at that time:

component1[pool1: 0-30G] component2[extension: 30G-100G] component3[pool2: 100G-EOF]

Then if we failed to extend component 1, we would change things to look like this:

component1[pool1: 0-30G] component3[pool2: 30G-EOF]
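To make the extension-request flow concrete, here is a sketch of the layout states implied above, in the same bracket notation (a first write just past 1G triggers a 10g grant):

at creation:     comp1[pool1: 0-1G]   comp2[extension: 1G-100G]   comp3[pool2: 100G-EOF]
after one grant: comp1[pool1: 0-11G]  comp2[extension: 11G-100G]  comp3[pool2: 100G-EOF]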
Comment by Patrick Farrell (Inactive) [ 05/Mar/18 ]

One note is that, as proposed, this doesn't use all the space in pool1 per se. Each component will only use the space on its own OSTs. We could imagine a more complex re-layout policy that, when it ran out of space/hit the threshold on the current OSTs, would create a new component in the "extension" space, using other OSTs from pool1.

By the way, one interesting detail of the layout implementation as it works today is that it would probably be possible for clients that do not understand the extension flag to use self-extending layouts anyway. Almost everything is server-side; clients basically bail and ask the server when they need new layout, and restart the I/O entirely after getting a new layout. This gives a lot of power to the server, letting it do clever things without the client needing to know. Older clients wouldn't be able to create these layouts with setstripe, so it may not really matter, but... yeah.
Comment by Nathan Rutman [ 09/Mar/18 ]

lfs setstripe -E1g -c2 -S1m --pool="pool1" -E 100g -e10g --component-flag extension -E -1 -c4 -S1m --pool="pool2"

I don't like that we have a pre-set maximum size for component 2. Or maybe that's OK for some, but I want to be able to do this as well:

lfs setstripe -E1g --pool="MDS" -E -1 -e10g --pool="flash" -E+400g -e100g --pool="10Krpm" -E -1 --pool="5Krpm"

I want this to mean:

Contrary to a previous comment, the extending component is not the last one. The use case here is that we want to use all of flash, but we don't want to ENOSPC when we do run out. We need another component on a different pool, to be used only when we can no longer extend. The "+400g" business is new, but if we don't cap the size of component 2, it means we can't know an absolute end point for any subsequent components. So it's a relative max size instead of an absolute one. If we don't implement this, then it seems that any following components need to have a "-1" end.
Comment by Patrick Farrell (Inactive) [ 09/Mar/18 ]

Ah, sure, that makes sense. Specifying the endpoint for the extension space would definitely not be mandatory, and, sure, we could use -1 for "no explicit limit". If we did that, then nothing beyond it would be reachable for instantiation - it would "cover up" all later components until we did something to explicitly change that. That makes sense to me. The -E+400g component would have to be handled somewhat differently; it might be easiest to tag that component with a flag saying it has a "relative end", and adjust processing accordingly... In fact, what you're describing would need a few more components, at least, as I imagined it - we could make "-e10g" shorthand for another component, but... Here's how I would write out the layout you described:

lfs setstripe -E1g --pool="MDS" [<- component 0] -E 10g --pool="flash" [<- component 1] -E -1 -e10g --component-flag extension [<- component 2, extension space] -E 100g --pool="10Krpm" [<- component 3] -E+300g -e100g --component-flag extension [<- component 4, extension space] -E -1 --pool="5Krpm" [<- component 5]

Component 0 is the MDS component, component 1 is the flash component, component 2 describes the extension behavior of the flash component, component 3 is the 10K RPM component, component 4 is the extension space component for that, and component 5 is the final component. That's obviously a lot more verbose, but it describes the extension space components as distinct components with their own limits, which is (I think) the easiest way to implement this. (The -E+300g is deliberate, since it would be a 100g component with 300g of extension space. That may not be the best way to describe this.)

We'd have to tweak things so we could add components after a -1 component, presumably only when such a component is tagged as extension space. That's also fine. Agreed that multiple extending components shouldn't add any particular complexity. (At least, no more than allowing any component to follow an extending one.)
Comment by Patrick Farrell (Inactive) [ 09/Mar/18 ]

Quick other thought:

That's not ideal, but I don't see any way to do any better with extent-based layouts and sparse files.
Comment by Nathan Rutman [ 12/Mar/18 ]

-E -1 -e10g --component-flag extension

Isn't --component-flag extension implied by -e10g?

Your comment sounds like it mandates a minimum flash component of at least one 10g segment in our example above, by doing this: -E10g --pool="flash" -E -1 -e10g --pool "flash". But I don't see why that can't just be -E-1 -e10g --pool "flash" (i.e. just the second half), which would be "zero or more 10g extendable segments". The extension behavior of a component is specified with the -e option on the component itself. There might be zero instances of the extension (or component) if there is no space in the flash pool. I think this is both simpler to write, as well as more flexible behavior. (If you require one segment on flash, make two components, one fixed and one extensible. If not, make just the one extensible component.)

I am imagining we can normally ignore any components after a -1, and that if a -1 extendable runs out of space at the next extension request, then that -1 turns into a hard limit, and the next component is used. Hopefully this makes sense.

Sparse files - if someone writes to offset X, we have to generate all extendable components up to and including X. Sadly, none of these will consume space in the components, meaning future writes at earlier points in the file may well run out of space - we can overallocate the pool with this. I'm not sure that
Comment by Patrick Farrell (Inactive) [ 10/Mar/18 ]

"Isn't --component-flag extension implied by -e10g?"

So to be clear (perhaps it already is, sorry if so), the suggested implementation here is one component which is (or, at least, could be) instantiated, trailed by another component which tracks the extension space (if we limited it) and also serves as our catch for the client. This component will never be instantiated. When the client tries to do I/O to that component, it will ask the server to take care of instantiating it, and at that point, the server instead edits the extent of the previous component. This "ghost"/extension space component will disappear (i.e. we will delete it) when we stop extending the preceding real component. That is to say, internally, a component is not self-extending - it gets that from the ghost component that follows it.

Nothing says we have to leak that "ghost" component to userspace, at least not in the setstripe command, though. -e10g is definitely a clearer interface for users than an explicit "extension" component.

So that implementation doesn't exactly conflict with the idea of potentially having a component start at zero size and possibly never get extended (which makes sense - I hadn't thought of it that way, but it's clearly the way we'd want it to work), but... Hm. Actually, I suppose it would be fine. There are perhaps two choices:

While a 0-length extent (we might flag it somehow to make it easier to ignore...) is going to be a bit odd, I really like the simplicity of keeping the extend behavior the same.
Comment by Nathan Rutman [ 12/Mar/18 ]

@paf - so you're saying the ghost is a simple way to trap writes beyond the end of the allocated layout. I suppose that's fine, bearing in mind that I really would like to keep the lfs command as simple as possible. Along those same lines, if they specify an extendable component, it should show up in the layout with lfs getstripe no matter what its size, i.e. including zero.

@adilger - back on the question of whether to add new components at every extension or not, I would much rather not. That avoids the layout growth problem, and again keeps things simpler for users. If someone really wants to migrate chunks/smaller components later, they can manipulate the layout at that point. In any case, it doesn't seem right to overload this ticket (or Lustre layout handling) with that additional constraint. KISS for now; we can always revisit later.
Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

I wanted to highlight an issue that diverges a little from what we've been discussing, but seems important. There is, today, no threshold for taking an OST out of striping rotation for "out of space" reasons. The current proposal relies on implementing something to do that. Probably not very difficult to do; just something I wanted to write down here.
Comment by Ben Evans (Inactive) [ 15/Mar/18 ]

Patrick, I think that's a separate issue, and this can work well enough without it, simply by having a good "fitness" check, where we see if it might help to restripe. My assumption is that for any OST pool, this will give us better fullness levelling than we currently have, so rather than a single OST hitting full, you'd have all of the OSTs in a pool hitting it at around the same time, and you'd be dealing more with a filesystem out of space than an OST out of space.
Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

Ben, I know this is an area you've thought about more than I have... What would you suggest as a fitness check, specifically? (Or as specifically as you've considered it, at least.)
Comment by Ben Evans (Inactive) [ 15/Mar/18 ]

A simple one: if no OST is more than 75% full, there's no reason to change. Establish the median OST fullness; if, in the current stripe, 75% of the OSTs involved are under the median, that's good enough, and there is no reason to change. If part of the stripe is on the fullest OST, then it's time to change.
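As a rough userspace approximation of the first check (the real decision would live in the MDS allocator; the mount point and threshold here are assumptions):

lfs df /mnt/lustre | awk '/OST/ { sub(/%/, "", $5); if ($5 + 0 > 75) print $1, $5 "% full" }'

Anything printed would be a candidate for switching away from, per the rule above.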
This brings up an odd question: does anyone actually use stripe offsets in production (like always starting at OST 0)?
Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

Interesting. This bit:

Sounds similar to the QOS allocator, which is intended to avoid striping to an OST if it diverges too far from... I'm not sure if it's the least-full OST or an average. Thresholds are set for that already. Could we leverage it, rather than replicating some part of its functionality? (I'm not sure we could; that's a serious question.)
Comment by Ben Evans (Inactive) [ 15/Mar/18 ]

I was thinking we'd call the current allocator (maybe slightly modified) for a new stripe if we need it. This would be about detecting whether we need a new stripe, and setting a relatively high threshold to actually change the stripe and add a segment.
Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

OK, that makes sense. I'll keep thinking about it. I get nervous when you say "segment" - the current proposal is to add space to an existing layout component until this check we're discussing says "no", then we switch ("spill over") to the next layout component, which will already exist.

Actually, as I say this, it occurs to me that we could put two back-to-back extendable components on the same pool, which would have the effect of "checking the rest of the pool" for space before spilling over to the next pool. (Basically we'd try to spill over from our current OSTs to elsewhere in the same pool before spilling over to a new pool. But we'd only do that once. Spilling over to the same pool over and over would hit a pathological case where we'd end up with every file on almost every OST as the OSTs filled up.)
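A minimal sketch of that back-to-back idea, using the -e shorthand proposed earlier in this thread (pool names, sizes, and the path are assumptions):

lfs setstripe -E -1 -e 10g --pool flash -E -1 -e 10g --pool flash -E -1 --pool disk /mnt/lustre/file

The second flash component would only start instantiating (on freshly chosen OSTs from the same pool) once the first one fails an extension check.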
Comment by Andreas Dilger [ 15/Mar/18 ]

One clarification here - the second component inherits all of the values from the first component if they are unspecified, and the first component inherits them from the parent directory or root directory layout, if they exist, or filesystem-wide defaults if nothing else. There was a long discussion in

There is indeed such a mechanism; see osp.*.reserved_mb_high and osp.*.reserved_mb_low on the MDS, added in 2.9 by aboyko. The comments in the code for the high and low watermarks say (and this is even documented in the user manual!):

* Show high watermark (in megabytes). If available free space at OST is greater
* than high watermark and object allocation for OST is disabled, enable it.

* Show low watermark (in megabytes). If available free space at OST is less
* than low watermark, object allocation for OST is disabled.

These values are initialized as a fraction of the OST size (0.2% and 0.1%, respectively), and can be set to an absolute value at runtime. I don't think they allow specifying a percentage at runtime, but if that is important to you, it might be possible to add a decimal value with a '%' unit. This allows the admin to set a low threshold below which an OST is no longer considered for allocation; the OST is then allowed to drain (to reduce free space fragmentation) until it hits the high watermark again. An example of querying and setting these tunables follows this comment.

In terms of more sophisticated changes to the LOV object allocator, this is being discussed in LU-9 and LU-9809. DDN had some proposals on that front, and Nathan and I discussed it with Li Xi at LAD, but nothing further has come of it so far. Please move any allocator discussions over there, so hopefully it can be implemented at some point.

I guess my main question here is what the main benefits of your current proposal are? It seems they are mostly focused on avoiding out-of-space on an OST? The proposal of extending existing components is quite different from my original proposal, where having separate components was desirable to allow partial-file release/restore, migration, and resync. My expectation is that the out-of-space issues will already largely be handled by regular PFL layouts, as long as they are set up reasonably (i.e. in the neighborhood of component_end < reserved_mb_high * stripe_size, with a -c -1 component at the end) and the MDS allocator is working properly.

IMHO, we could spend ages on making complex layouts (which users are relatively unlikely to understand or use), and would be far better off to improve the MDS object allocator to do a better job of balancing space between OSTs, but without the performance hit of the full-random QOS allocator that we have today. With a good allocator and PFL, the OSTs would always be evenly used, so any single OST is only going to fill up at the point when all of them fill up. We would need to take some care with different storage tiers/pools, possibly having the MDS drop some preliminary components completely at file creation time if their pool is close to full. We've already started discussing that in
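For reference, an illustrative example of inspecting and adjusting these watermarks on the MDS (the OSP device name here is an assumption):

lctl get_param osp.*.reserved_mb_high osp.*.reserved_mb_low
lctl set_param osp.testfs-OST0000-osc-MDT0000.reserved_mb_high=2048
lctl set_param osp.testfs-OST0000-osc-MDT0000.reserved_mb_low=1024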
Comment by Nathan Rutman [ 15/Mar/18 ]

More specifically, avoiding out-of-space on a pool, while maximizing use of that pool. The scenario is that a flash pool should be used as long as there is space, but if it runs low we should change the layout. We can't specify this with PFL now. Can we do this with the allocator alone? No; the allocator is only used at create time, and maybe at first-write-into-component time. With that constraint, the only way to change a layout based on current fill is to add a new component; i.e. a self-extending layout description. This ticket is, to me, about changing the static PFL layout into a dynamic one.

Agreed, this is different. Yours has benefit as well, but I think the general problem is that people won't know ahead of time to make a SEPFL layout so that they can later migrate it. We really need a way of changing any layout after data is placed; SEPFL can't solve that, so I think that idea really belongs under a "restripe an existing file efficiently" ticket. My 2 cents.
Comment by Ben Evans (Inactive) [ 16/Mar/18 ]

Why wouldn't SEPFL always be on, with some reasonable minimum/default for the extension size (like 2GB)?
Comment by Nathan Rutman [ 16/Mar/18 ]

Like any layout, it would have to be specified. Sure, you can set it as a FS default, but if you know you have a large file you might specify wide striping from the start, or DoM, or anything else. You also might want your final layout to have components at different boundary points than your original. I think the "move an extent range to a different layout" problem is probably better handled with a range-aware FLR than by depending on PFL components: FLR the segment you want to move into a secondary layout. It doesn't have to line up with a component boundary in the primary.
Comment by Ben Evans (Inactive) [ 16/Mar/18 ]

Well, the default stripe is 1 random OST with a 1MB stripe size. There's no real reason we can't extend that to have a default SEPFL (or whatever we're calling it) chunk size, so it's always there, even if no one ever touches the striping, ever. You bake in the feature so that it is always on.
Comment by Andreas Dilger [ 16/Mar/18 ]

It is possible to set a filesystem-wide default PFL layout by setting it on the root directory. If no default is found on the parent directory, then the layout is taken from the root directory (though soon this may also come from a "template FID" for nodemap+subdir-mount clients that don't have access to the root directory, see LU-9982).

One open question that I'd like input on is how to handle layout inheritance from a composite file if one wants to specify some parameters different from the default. For simple layouts this is straightforward - fill in the parameters (stripe count, stripe size, pool) from the parent/root/filesystem if they are not specified for the new file. However, if the parent/root has a composite layout but the user also wants to create a composite file, what gets inherited?
Comment by Nathan Rutman [ 16/Mar/18 ]

Inheritance - as you pointed out earlier, these layout settings are getting complex, what with PFL, FLR, SEPFL, DoM, pools, flash, etc. Soon, as I've mentioned elsewhere, some elements of policy are implied by layouts and perhaps should be made explicit. So here's a proposal:
Comment by Patrick Farrell (Inactive) [ 30/Mar/18 ]

I've been a little quiet on this (working on the prototype + HLD, etc.), but I had a thought I wanted to describe here and get feedback on. I think Ben may have already been imagining something like this, based on some things he said the other week, but it's a new idea to me.

Specifically, consider the behavior of a file with one self-extending component going to infinity:

lfs setstripe -c 2 -E 100M -E -1 -e 10M --component-flags extension

The first component will end up on two particular OSTs, and then when they fail an extension check (basically, when they're too full, for some definition of "too full"), we'll just stop extending it, and return -ENODATA (I think that's it) when the user tries to write to the part of the layout that doesn't exist. If there's another component following that extension space component, we would of course use it; consider this example with that additional component:

lfs setstripe -c 2 -E 100M -E -1 -e 10M --component-flags extension -E -1 -c 2 --pool="pool2"

The file layout would be the first component getting extended until it failed an extension check, then we'd initialize and use that final component on "pool2" for the rest of the layout.

But what about the first case, where we don't have a component following the self-extending component + extension space? In that case, what if, instead of just stopping, we made a new component with the same striping properties as the old one? Not the same OSTs, but the same striping properties - size, count, pool if specified. Say one of those OSTs comes up "full": the behavior we've described already would just remove the extension space and fail further writes to the file, whereas with the new component we wouldn't have to stop writing to the file because the OSTs it was on are running out of space. (Both outcomes are sketched after this comment.) This would only be desirable in cases where no further components were specified - if further components are specified, the behavior described in them should control. This could either be the default behavior for self-extending PFL components with no further components, or it could be controlled by a flag. Thoughts?
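A rough sketch of the states just described, in the bracket notation used earlier (the OST names and cutoff N are assumptions):

as created:      comp1[OSTs a,b: 0-100M]  extension[100M-EOF, 10M chunks]
stop-extending:  comp1[OSTs a,b: 0-N]     (extension space removed; writes past N fail)
new-component:   comp1[OSTs a,b: 0-N]  comp2[OSTs c,d: N-...]  extension[...-EOF]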
Comment by Patrick Farrell (Inactive) [ 30/Mar/18 ]

Note this behavior has one obvious downside, which is that if we're filling up all available OSTs (either the whole file system or a pool), we could "wander" between different OSTs, ending up with files with many small-ish components on many OSTs. I don't think this is enough of a problem that we should not do this; it's just something to consider.
Comment by Andreas Dilger [ 02/Apr/18 ]

Two notes on practicality of usage here:

Note that the goal shouldn't be to fill OSTs completely before moving on to new OSTs, since that increases the chance that some other file will run out of space before it can move to a new component. Whether we hit out-of-space a second earlier or later is not critical in most cases, since we would run out of space in any case.
Comment by Patrick Farrell (Inactive) [ 02/Apr/18 ]

Absolutely. The 10 MiB examples have purely been to keep the numbers small; I was thinking 1 MiB minimum, 1 GiB if nothing was specified (maybe even 10 GiB?), and 10 or 100 GiB being typical. I suppose if I think 10 GiB is typical, that should be the default, but that's all easy to adjust. And yes to the rest of that as well. It's a tradeoff. Avoiding imbalance is good, but also somewhat separate.

Yes, my intention is that it would be sticky, for the reasons you gave. And also, yes - we don't want to pick OSTs that are almost full.

Agreed on this point as well. Still chewing on details - this "policy" aspect of it should be relatively easy to tweak, though, both in terms of what we implement and potential tunables.
Comment by Patrick Farrell (Inactive) [ 09/May/18 ]

A question, probably mostly for Andreas. This is something from one of the FLR design docs, describing userspace tooling: "A file with a simple layout is converted to a composite layout whose sole component is the previous layout." Has this been implemented/is it planned explicitly anywhere? It has some loose relevance here.
Comment by Andreas Dilger [ 10/May/18 ]

Patrick, there was no command-line interface for doing this, as there was no value in doing so. However, for FLR it is possible to take a plain layout and convert it to a component and then add a mirror to the file.
Comment by Patrick Farrell (Inactive) [ 10/May/18 ]

Andreas, how do you take a plain layout and convert it to a component of a composite layout? It sounds like you're saying there is already a mechanism for doing so? Or are you saying it would be possible to create one? (In which case, yes, I agree and see roughly how it would work. I'm hoping someone has already done it.)
Comment by Zhenyu Xu [ 10/May/18 ]

Converting a plain layout to a component is only done as a LOD-internal function in lod_layout_convert(), which serves lfs_mirror_extend() to add a mirror to an existing plain file and construct the file with two mirrors.
Comment by Patrick Farrell (Inactive) [ 30/May/18 ]

A question for Jinshan, or anyone else interested in this, about the interaction with FLR.

Today, lfs mirror resync (and the mirror_io test) invoke llapi_mirror_resync_one with the "end" of the region to resync set to the end of the relevant component of the mirror, without regard to file size. (They stop copying once "read" returns less than the number of bytes requested.) Is there any problem with using the file size as the end instead?

Here's a sample scenario. Mirror 1 is from 0 to EOF, and it's preferred. We write 1 MiB to the (currently empty) file. Mirror 2 is now stale. Then we do lfs mirror resync. Currently, that results in us attempting a read to EOF, which will fully instantiate a self-extending layout. Alternately, if we use file size as the maximum for end, we will have an end at 1 MiB and we will not unnecessarily extend the self-extending layout.

Is there any reason this wouldn't work/would be a bad idea? [I assume I may have to adjust some sanity tests.] Quick testing suggests it's fine, and without it, self-extending PFL layouts and "lfs mirror resync" won't work well together.
Comment by Andreas Dilger [ 31/May/18 ]

Patrick, it isn't clear why the self-extending layout would be affected by reads? The resync shouldn't be writing any data beyond the file size, so that shouldn't cause the layout to be extended.
Comment by Patrick Farrell (Inactive) [ 11/Jun/18 ]

Ah, yes - sorry, I got this confused with a different stage of the operation, which was causing the layout instantiation. Thanks.
Comment by Patrick Farrell (Inactive) [ 12/Jul/18 ]

The attachment is a design doc for reference. Some small updates are planned, but it is largely complete.
Comment by Gerrit Updater [ 12/Jul/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/32812
Comment by Joseph Gmitter (Inactive) [ 12/Jul/18 ]

Hi Patrick,

Any objection to keeping the design doc on the lustre.org wiki? I would be happy to port it over there into mediawiki format.

Thanks,
Joe
Comment by Patrick Farrell (Inactive) [ 12/Jul/18 ]

Joe,

None whatsoever. In fact, I was planning to get it there eventually. Would having it in another format than PDF - .docx is an easy option - be helpful?
Comment by Joseph Gmitter (Inactive) [ 12/Jul/18 ]

.docx would be great. I use pandoc to convert the docx, which gets it most of the way there.
Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33777

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33785

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33778

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33779

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33781

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33780

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33782

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33783

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33784

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33786
Comment by Patrick Farrell (Inactive) [ 05/Dec/18 ]

There are two to-dos remaining (other than addressing reviewer comments, of course):

But the code is complete and ready for reviewers to start looking at it.
Comment by Patrick Farrell (Inactive) [ 05/Dec/18 ]

Note also that, at least for now, the "trailing components" portion (mentioned in the design doc) is being left out. I may or may not resurrect that - it added a lot of complexity for maybe not enough benefit, and was preventing me from getting this finished. The feature is extremely useful without it, so I pushed it anyway. In any case, the trailing component support is implemented entirely on top of the current patch series, and can be added later if it looks manageable.
Comment by Cory Spitz [ 05/Dec/18 ]

We're still holding to L2.13.0 per http://lustre.org/roadmap/ and http://wiki.lustre.org/Projects, so I set the Fix Version/s field accordingly.
Comment by Patrick Farrell (Inactive) [ 13/Dec/18 ]

An update here: the current version of the patches has no sanity test failures I am aware of. (Maloo is having some issues right now, but I didn't see any clear evidence the failures were caused by these patches.) The main remaining to-do is writing the man pages. I'll do that shortly.
Comment by Gerrit Updater [ 04/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33777/

Comment by Gerrit Updater [ 12/Mar/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34406

Comment by Gerrit Updater [ 16/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34406/

Comment by Gerrit Updater [ 20/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34470

Comment by Gerrit Updater [ 21/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34470/

Comment by Gerrit Updater [ 21/May/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/34909
Comment by Patrick Farrell (Inactive) [ 21/May/19 ]

A general comment: a good model is the lfsck tests added in the foreign layout patches:
https://review.whamcloud.com/#/c/33755/
https://review.whamcloud.com/#/c/34087/

I can think of a few cases I'd want to test:
1. Just create an SEL layout, don't do anything to it (or write a little data, but don't use the SEL portion), run sanity-lfsck and verify no changes (a sketch follows below)
3. Same, but "exhaust" an SEL component so you get component removal, then lfsck
4. Repeat a component and run sanity-lfsck
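A minimal sketch of what case 1 might look like, using the option syntax from this ticket (paths, sizes, and the MDT device name are assumptions):

lfs setstripe -E 256M -c 1 -E -1 -e 64m --component-flag extension /mnt/lustre/sel_file
dd if=/dev/zero of=/mnt/lustre/sel_file bs=1M count=8 conv=notrunc
lfs getstripe /mnt/lustre/sel_file > /tmp/layout.before
lctl lfsck_start -M lustre-MDT0000 -t layout    # wait for the scan to complete
lfs getstripe /mnt/lustre/sel_file > /tmp/layout.after
diff /tmp/layout.before /tmp/layout.after       # expect no changes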
Comment by Cory Spitz [ 06/Jun/19 ]

http://wiki.lustre.org/Release_2.13.0 has been updated with reviewers and testers of record.
Comment by Gerrit Updater [ 10/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35144

Comment by Gerrit Updater [ 11/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35182

Comment by Gerrit Updater [ 14/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35232

Comment by Gerrit Updater [ 19/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35270

Comment by Gerrit Updater [ 25/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35232/

Comment by Gerrit Updater [ 25/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35314

Comment by Gerrit Updater [ 27/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35270/

Comment by Gerrit Updater [ 03/Jul/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35414

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33778/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33780/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33782/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33783/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33784/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33785/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33786/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34909/

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35144/

Comment by Gerrit Updater [ 20/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35314/

Comment by Gerrit Updater [ 20/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35414/

Comment by Gerrit Updater [ 30/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33779/

Comment by Gerrit Updater [ 06/Aug/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35704

Comment by Gerrit Updater [ 21/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35704/
Comment by Peter Jones [ 25/Sep/19 ]

Landed for 2.13
Comment by Gerrit Updater [ 22/Oct/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36554

Comment by Gerrit Updater [ 27/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36554/