[LU-10070] PFL self-extending file layout Created: 04/Oct/17  Updated: 21/Dec/23  Resolved: 25/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.13.0

Type: New Feature Priority: Minor
Reporter: Andreas Dilger Assignee: Vitaly Fertman
Resolution: Fixed Votes: 0
Labels: FLR2

Attachments: PDF File LUSTRE-98540817-120718-1412-90.pdf    
Issue Links:
Related
is related to LU-9479 sanity test 184d 244: don't instantia... Open
is related to LU-9809 RTDS(Real-Time Dynamic Striping): A p... Open
is related to LUDOC-436 Self Extending Layout Documentation Resolved
is related to LU-8998 Progressive File Layout (PFL) Resolved
is related to LU-9771 FLR1: Landing tickets for File Level ... Resolved
is related to LU-10808 DoM: component end should align with ... Resolved
is related to LU-7880 add performance statistics to obd_statfs Open
is related to LU-9096 sanity test_253: File creation failed... Open
is related to LU-12681 Data corruption - due incorrect KMS w... Resolved
is related to LU-12712 sanity-pfl tests triggering “not SEL ... Resolved
is related to LU-13395 unable to set "--comp-flags=prefer" o... Resolved
is related to LU-13589 PFL "lfs setstripe -E 1M -S 65536" in... Resolved
is related to LU-9846 Overstriping - more than stripe per O... Resolved
is related to LU-10169 Spillover space Resolved
is related to LU-13058 Intermediate component removal (PFL/SEL) Open
Sub-Tasks:
Key       Summary                       Type            Status    Assignee
LU-12365  Create test plan for SEL      Technical task  Resolved  Vitaly Fertman
LU-12366  Create documentation for SEL  Technical task  Closed    Vitaly Fertman

 Description   

One interesting idea discussed at LAD was to have a PFL layout that is "self extending".

For several use cases, such as HSM partial-file release/restore, partial-file migration to/from burst buffers, and partial-file FLR resync, it is advantageous to avoid the need to restore/migrate/resync the entire file at once, and instead only to process the required chunks of the file.

Essentially, a PFL file would have the normal few components that define the start of the file (e.g. [0-32MB), [32MB-1GB), [1GB-16GB)) and they would be instantiated as with normal PFL today. What is new for "self-extending layouts" is that the last component becomes the template for additional components (as needed) rather than having a component to "EOF" that freezes the layout of the rest of the file.

This avoids the overhead of explicitly specifying many identical components for the file merely to limit the size of the components that need to be processed.
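
As a concrete sketch (sizes and stripe counts are illustrative, and the "-e" extension-size option is the syntax proposed in the comments below, not a settled interface), the ordinary PFL layout above might be created as:

lfs setstripe -E 32M -c 1 -E 1G -c 4 -E 16G -c 8 -E -1 -c 16 /mnt/lustre/file

A self-extending variant would instead mark the last component as a template that grows in fixed-size chunks, rather than freezing the layout to EOF:

lfs setstripe -E 32M -c 1 -E 1G -c 4 -E 16G -c 8 -E -1 -c 16 -e 16G /mnt/lustre/file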



 Comments   
Comment by Jinshan Xiong (Inactive) [ 12/Oct/17 ]

If the new component is added with the same layout from the last component, what's the point of having a new component? Would it be better to just extend the extent of the last component?

Comment by Andreas Dilger [ 13/Oct/17 ]

As described above, there are several reasons for having separate components for the file. One would be to allow HSM files to be archived and restored incrementally. Another is to limit the amount of data that needs to be resync'd if a replicated file is modified, since only the stale component would need to be resync'd.

Comment by Nathan Rutman [ 30/Oct/17 ]

Please see an alternate idea in LU-10169

Comment by Nathan Rutman [ 25/Jan/18 ]

Are you thinking this would add an explicit new component each time?

Example layout [0-32MB), [32MB-1GB), [1GB-16GB), [SE +16GB)

  1. Client asks for layout at first open, gets [0-32MB), [32MB-1GB), [1GB-16GB)
  2. Happily writes away.
  3. When client tries to write beyond 16GB, LOV says hey, I don't know where to put that, requests layout update (is this a new RPC type?)
  4. MDS takes layout lock, adds new segment to layout [0-32MB), [32MB-1GB), [1GB-16GB), [16GB-32GB)
  5. Client requests layout again, gets the new component.

So after the file grows to 1TB, there will be ~64 components?

Maybe the component definition could include a multiplier instead and we just increment that? Save ourselves the space in the layout description, and also re-use the OST list (so we're not further spreading the stripes around to new OSTs. Hmm, maybe we want this selectable, to rebalance or not?)

 [0-32MB), [32MB-1GB), [1GB-16GB), [SE +16GB)*64

Comment by Andreas Dilger [ 26/Jan/18 ]

My thought is that yes, it would add a new component each time, which also means new OST objects. That allows better space balance, if the last component is not fully striped. It also allows independent handling of each component (migrate, resync, archive/restore, etc), which is the goal of this functionality.

Comment by Nathan Rutman [ 15/Feb/18 ]

I compared this to the LU-10169 idea here, but I'll just copy and paste to stimulate discussion.

Not handled as well by 10070:

  • Files must have a PFL layout already, primed for ENOSPC handling. 10169 could handle any layout type.
  • 10070 grants a fixed amount of space to the next component. Too small, you will end up with too many components; too large, you may still run out of space as other files fill the OST. You're counting on bulk behavior, but one bad file could still cause ENOSPC for everyone else.
  • Layouts continue to grow with 10070, even though the system may have plenty of space, requiring more MDT space and network traffic for updating client layouts.

Not handled as well by 10169:

  • Sparse writer case. Some writer way off at the end of the file may have an empty OST, and someone at the beginning hits a full OST. We can't change the extent of the existing layout to "short" without copying the data for potentially a large component. (If we can detect this, we could just give up and ENOSPC this one).
  • Probably more complicated implementation.
Comment by Patrick Farrell (Inactive) [ 15/Feb/18 ]

Nathan,

About LU-10070, Ben Evans and I discussed it a bit on the Cray ticket. (Sorry, we should've done that here.)

He and I both had a thought, I'll quote (lightly edited for clarity):
"Ben: I think the answer to the number of components is that rather than creating a new one each time, to simply edit the end of the last one if there is no need to change."

"Paf: Agreed. Vitaly noted that it should be possible to extend the existing final component without violating the rules that require "no changing of existing layout components", because it should not impact any ongoing I/O on any clients. The idea would be to combine these two things - When the client reached the end of the current layout, it would go the MDS and either get an extended version of the current component or a new component.

We would need to make sure somehow the client knew to go look for more layout. It might be sufficient to just revoke the layout lock and then the client, when it needed layout information, would have to go get new layout information. That new information could either be an extension OR new components."

"Paf: The other issue Ben and I discussed - which I'm wondering if you've got more on, Ben - is how to decide when to do spillover? If we're talking in terms of LU-10070, then we can't do it just when we would get ENOSPC (that's the more complicated interaction w/grant and ENOSPC and such that's described in LU-10169), we have to some way of deciding when to do it.

Ben and I discussed two ideas, which I hope I'll describe decently here:
1. An explicit high water mark on the OSTs, which would be something new we'd presumably have to have admin interaction for
2. Some sort of "scoring" on the OSTs, possibly derived from the current striping policy. Basically, it would require that we be able to somehow compare the optimality of continuing the existing layout to moving to a new set of OSTs. It seems like we might be able to retrofit the existing striping allocation rules to somehow give a score or ordering... But that is more complicated, and might not be workable. For example, the stripe allocator might put a given set of OSTs last if it were only slightly more full than other OSTs. So we would end up generating new components over and over just to maintain balance, which defeats the point of having extendable components.

Nathan, do you have a good sense of whether or not the allocator could be retrofitted to serve that purpose and whether or not that's a good idea? Or just a reaction in general?"

Comment by Patrick Farrell (Inactive) [ 16/Feb/18 ]

Cory pointed out a flaw in this one, which I hadn't thought through (but I think the other folks here had).

Self-extending layouts do not play very nicely with PFL layouts, for certain values of "nicely". A self extending (or repeating) component must (more or less...) necessarily be the last component of a PFL file while it is being extended. That means in your usual model of a PFL file, striped something like this:
1 stripe ----------|2 stripes--------------------------|4 stripes-----------------------------------|8 stripes ----------|

The last component would have to be the one that was repeated or extended. This means that (at least as this is currently imagined) we would only change striping on files which were in their final component of the PFL layout.

Just for reference, this is how I imagined implementing this:
1 stripe ----------|2 stripes--------------------------|4 stripes-----------------------------------|8 stripes ----------| [magic component describing extension/repeat/spillover behavior, including how much to extend if extending]

When we get to the end of 8 stripes, if we extended (rather than moved to another pool), it would look like this:
1 stripe ----------|2 stripes--------------------------|4 stripes-----------------------------------|8 stripes ------------------------| [magic component describing extension/spillover behavior]

If we repeated, it would look like this:
1 stripe ----------|2 stripes--------------------------|4 stripes-----------------------------------|8 stripes ----------|8 stripes ----------| [magic component describing extension/spillover behavior]

The magic component could specify a pool to spill over to as well.

Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ]

A further thought. We (Cray) have toyed with the idea of extent-level tracked tiering, something to do caching with a small granularity. This is very similar to the idea Andreas describes, where layout components could be moved around independently; the problem is that reasonable extent sizes for automated caching on a faster tier do not match up with reasonable component lengths. We can't have too many components, so they have to be large, but if we want to do extent based caching, extents need to be pretty small.

The goals are not entirely dissimilar, but they don't lend themselves to the same implementation.

Andreas - Are you really not worried about the implications for MDT storage space of having a layout component for every, say, 10 GiB? Just wondering what you're thinking there.

Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ]

I believe the current behavior of O_APPEND and PFL files makes self-extending layouts impossible (assuming O_APPEND support is required, which seems logical).

Comment by Andreas Dilger [ 21/Feb/18 ]

To comment on the several different issues that were raised in the last few comments: in general, I don't see this proposal as the best possible solution to all issues, but it definitely can address a number of common issues, and wouldn't need a huge amount of effort to implement. Existing PFL clients should "just work" with the proposed self-extending layout.

Andreas - Are you really not worried about the implications for MDT storage space of having a layout component for every, say, 10 GiB?

Yes, definitely this is something to think about. We can't efficiently handle arbitrarily large file layouts, so there is some tradeoff between the extent size and the number of components that can be added. This is also partly true of PFL files in general, but the expectation is that the last component would extend to EOF, which puts an upper limit on the layout size. With self-extending layouts, this would no longer be true, but we could probably handle it in a few ways:

  • the MDS could extend the last component using the existing objects if there is no significant reason not to (e.g. an OST is almost full), but that also loses some of the benefits
  • increase the extent size as the file gets larger, so the number of components doesn't scale linearly with file size (sketched below), but again this loses some of the benefits for stale component resync
  • give up on this idea for file resync and use a separate method of tracking sub-component dirtiness (e.g. a bitmap)
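
For illustration, the second option could be expressed with the per-component extension-size syntax proposed later in this ticket (sizes are hypothetical, and "-e" is the proposed option, not a landed interface):

lfs setstripe -E 16G -c 4 -e 1G -E 1T -c 8 -e 16G -E -1 -c 16 -e 256G /mnt/lustre/file

Each region extends in progressively larger chunks, so a large file accumulates far fewer components than it would with a single small extension size.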

the current behavior of O_APPEND and PFL files makes self-extending layouts impossible

This might need a small change on the client, but I don't think it is hard to handle. The client just needs to hold the layout lock, and whether there is a component that goes to layout EOF or not is irrelevant. It just needs to know where the actual file's end of data is and lock all of the objects at or beyond there, whether it is inside a fixed-length component or not. While it is holding the layout lock, no other client can instantiate a component with objects after the end of the file.

An explicit high water mark on the OSTs,

That is already implemented with the osp.*.reserved_mb_high and osp.*.reserved_mb_low tunables. The MDS will stop allocating on an OST once its available space drops below the low watermark, and will not resume until the available space rises above the high watermark again.
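
As a minimal sketch of those tunables in use (run on the MDS; the filesystem name "lustre" and the values, in MiB, are illustrative):

mds# lctl get_param osp.*.reserved_mb_high osp.*.reserved_mb_low
mds# lctl set_param osp.lustre-OST0000-osc-MDT0000.reserved_mb_high=10240
mds# lctl set_param osp.lustre-OST0000-osc-MDT0000.reserved_mb_low=8192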

Some sort of "scoring" on the OSTs

This has been discussed for a long time already. Li Xi implemented a proposal in LU-9809 to allow userspace to periodically supply the scores for each OST, so that the weights can be managed in an arbitrarily complex manner. The actual OST selection would still need to be handled by the MDS in the kernel, since we can't depend on userspace to keep the weights updated for each file creation, but that would be enough for balancing space, performance, maintenance (RAID rebuild, etc). I think that is a wholly separate project, and would welcome feedback/assistance in LU-9809, but it shouldn't be mixed in here.

Comment by Ben Evans (Inactive) [ 28/Feb/18 ]

I think I've been thinking of this slightly differently. For any component, you'd have a "chunk" size where you'd check to see if you need to switch layouts to a new set of OSTs. Most of the time, the answer to this is "no", so you'd extend the current extent one more chunk, up to the end of the current component.

The benefit of this is that for systems with MDT, SSD, HDD pools, you should stay within the performance pool as long as specified, without getting kicked to the end (slowest) component, though that probably needs to be an option if the current pool is full.

In order to do this, the scoring for any chunk needs to check if it's bad enough to require a change, with the default of "no change".

Comment by Patrick Farrell (Inactive) [ 28/Feb/18 ]

Ben,

That's basically how I've been thinking of it too, but Andreas and, I believe, Nathan (in an internal ticket or conversation?) have suggested that we just use the thresholds for making that choice. So no scoring system, though something to do that could be integrated later.

Comment by Patrick Farrell (Inactive) [ 05/Mar/18 ]

Posting some comments from Cray ticket.

Iterating on a few aspects of this, I wanted to post an example of what a setstripe command could look like as a way to explain what I'm thinking and get feedback.

Here's an example of a simple PFL command, I'll break this down first:
lfs setstripe -E1m -c1 -S1m --pool="pool1" -E2m \
-E-1 -c2 -S2m --pool="pool2" testfile

The first component is 1 stripe, with a stripe size of 1 megabyte and an end of 1 megabyte, on pool "pool1".
The second component ends at 2m ("-E2m" is the entire component description) and has no stripe size or stripe count info set, so it gets the defaults (1 stripe and 1 MiB).
The third component goes to infinity (-E-1), has two stripes with a stripe size of 2 MiB, and uses "pool2". (Start points are not specified, they are always the end of the previous component.)

Here's an imagined setstripe command:
lfs setstripe -E1g -c2 -S1m --pool="pool1" -E 100g -e10g --component-flag extension -E -1 -c4 -S1m --pool="pool2"

The first component ends at 1 GB, has two stripes w/1 MB stripe size, and uses pool1. The second component is an "extension" component, which will never be instantiated. It ends at 100G (so the maximum size on pool1 would be 100G), and it will give extensions of 10g at a time. This is followed by the last component, which ends at infinity, and uses 4 stripes (1 MB stripe size) on pool2.

The idea is that when we get to the end of component 1, we hit component 2, and the client sees the extension flag. It uses this to make an extension request to the server. The server can either return a new version of component one with an end at 11G, or it can refuse - then it would extend component 3 (the one on pool2) forward to fill the space previously used by the "extension" component, removing that component.

Eventually, when/if we reached the end of the extension space (component 2), we would also remove component 2 at that time, so component 1 (on pool1, now 101 GB in size) would be adjacent to the old component 3 (the component on pool2).

Sparse files raise some interesting questions, which I'll talk about shortly. There are a few different choices to make.

----------------------
To be clear, the server would refuse to extend component 1 when it noticed that at least one of the OSTs was over the space threshold. I forgot to mention that in my previous comment.

---------------------
Taking the file from my previous comment, consider what to do if someone writes at an offset of, say, 25 GB.
(lfs setstripe -E1g -c2 -S1m --pool="pool1" -E 100g -e10g --component-flag extension -E -1 -c4 -S1m --pool="pool2")

So they're writing in to the "extension" space. Probably the simplest thing to do is just do the space check on the OSTs once, and extend the first component to 30GB, the next multiple of the extension size.

This means that sparse files can kind of mess up this feature, by getting a lot of space allocated... But sparse files are already hard.

Now consider one more example, where the user writes at an offset of 120 GB to that same file.
We are now past the extension space and into the component after it. I think the thing to do is instantiate component 3, starting at the maximum extent of component 2 and going to (in this example) infinity. Nothing would change with component 1 or component 2.

Then, later, if we tried to extend component 1 and the space check failed (so we cannot extend it), we would destroy component 2 and extend component 3 to the end of component 1.

So it would look like this, if component 1 were 30G at that time:

component1[pool1]------30G|component2 [never instantiated]---100G|component3[pool2, instantiated]-----------infinity

Then if we failed to extend component 1, we would change things to look like this:
component1[pool1]------30G|component3[pool2, now component2]-------infinity

Comment by Patrick Farrell (Inactive) [ 05/Mar/18 ]

One note is that as proposed, this doesn't use the space in pool1 per se. Each component will only use the space on its own OSTs. We could imagine a more complex re-layout-ing policy that, when it ran out of space/hit the threshold on current OSTs, would create a new component in the "extension" space, using other OSTs from pool1.

By the way, one interesting detail of the layout implementation as it works today is that it would probably be possible for clients that do not understand the extension flag to use the self-extending layouts anyway. Almost everything is server-side, clients basically bail and ask the server when they need new layout, and restart the i/o entirely, after getting a new layout. This gives a lot of power to the server, letting it do clever things without the client needing to know.

Older clients wouldn't be able to create these layouts with setstripe, so it may not really matter, but... yeah.

Comment by Nathan Rutman [ 09/Mar/18 ]
lfs setstripe -E1g -c2 -S1m --pool="pool1" -E 100g -e10g --component-flag extension -E -1 -c4 -S1m --pool="pool2"

I don't like that we have a pre-set maximum size for component 2. Or maybe that's ok for some, but I want to be able to do this as well:

lfs setstripe -E1g --pool="MDS" -E -1 -e10g --pool="flash" -E+400g -e100g --pool="10Krpm" -E -1 --pool="5Krpm"

I want this to mean:
1. first 1G component on pool MDS.
2. next component extends in 10G chunks in pool Flash, forever, until MDS says no more space for you. Component 2 is now fixed and will never extend again.
3. next component extends in 100G chunks on 10Krpm pool as long as it is <= 400G in size ("+400g"), or MDS says no more space.
4. final component to EOF on 5Krpm pool, forever. If pool runs out of space, writer gets ENOSPC.

Contrary to a previous comment, the extending component is not the last one. The use case here is we want to use all of flash, but we don't want to ENOSPC when we do run out. We need another component on a different pool, to be used only when we can no longer extend. The "+400g" business is new, but if we don't cap the size of component 2, it means we can't know an absolute end point for any subsequent components. So it's a relative max size instead of an absolute. If we don't implement this, then it seems that any following components need to have a "-1" end.
Also note here we have multiple extending components. I don't think this introduces any more complication.

Comment by Patrick Farrell (Inactive) [ 09/Mar/18 ]

Ah, sure, that makes sense. Specifying the endpoint for the extension space would definitely not be mandatory, and, sure, we could use -1 for "no explicit limit". If we did that, then nothing beyond it would be reachable for instantiation - it would "cover up" all later components until we did something to explicitly change that. That makes sense to me.

The E+400g component would have to be handled somewhat differently, it might be easiest to tag that component with a flag saying it's a "relative end", and adjust processing accordingly...

In fact, actually, what you're describing would need a few more components, at least, as I imagined it - We could make "-e10g" shorthand for another component, but...

Here's how I would write out the layout you described:

lfs setstripe -E1g --pool="MDS" \                      [<- component 0]
    -E 10g --pool="flash" \                            [<- component 1]
    -E -1 -e10g --component-flag extension \           [<- component 2, extension space]
    -E 100g --pool="10Krpm" \                          [<- component 3]
    -E+300g -e100g --component-flag extension \        [<- component 4, extension space]
    -E -1 --pool="5Krpm"                               [<- component 5]

component 0 is the MDS component, component 1 is the flash component, component 2 describes the extension behavior of the flash component, component 3 is the 10K RPM component, 4 is the extension space component for that, and component 5 is the final component.

That's obviously a lot more verbose, but it describes the extension space components as distinct components with their own limits, which is (I think) the easiest way to implement this. (The E+300g is deliberate, since it would be a 100g component with 300g of extension space. That may not be the best way to describe this.)

We'd have to tweak things so we could add components after a -1 component, presumably only when such a component is tagged as extension space. That's also fine.

Agreed that multiple extending components shouldn't add any particular complexity. (At least, no more than allowing any component to follow an extending one.)

Comment by Patrick Farrell (Inactive) [ 09/Mar/18 ]

Quick other thought:
Sparse files are kind of an irritant. Basically, a distant write into uninstantiated layout space. Imagine someone writing a byte at an offset of 100 TiB in our example above. I think we probably just have to do a one-off space check with the MDS, asking it if we're over the threshold, and if not, just extend the current layout to that point. So now we've got a 100 TiB layout on our inner tier.

That's not ideal, but I don't see any way to do any better with extent based layouts and sparse files.

Comment by Nathan Rutman [ 09/Mar/18 ]
-E -1 -e10g --component-flag extension


Isn't --component-flag extension implied by -e10g?

Your comment sounds like it mandates a minimum flash component of at least one segment in our example above by doing this: -E10g --pool="flash" -E -1 -e10g --pool "flash". But I don't see why that can't just be -E-1 -e10g --pool "flash" (i.e. just the second half), which would be "zero or more 10g extendable segments". The extension behavior of a component is specified with the -e option on the component itself. There might be zero instances of the extension (or component) if there is no space in the flash pool. I think this is both simpler to write and more flexible behavior. (If you require one segment on flash, make two components, one fixed and one extensible. If not, make just the one extensible component.)

We'd have to tweak things so we could add components after a -1 component

I am imagining we can normally ignore any components after a -1, and that if a -1 extendable runs out of space at the next extension request, then that -1 turns into a hard limit, and the next component is used. Hopefully this makes sense.

Sparse files - if someone writes to offset X, we have to generate all extendable components up to and including X. Sadly, none of these will consume space in the components, meaning future writes at earlier points in the file may well run out of space - we can overallocate the pool with this. I'm not sure that LU-10169 can handle this any better. I propose delaying a resolution as a future improvement in a new ticket once somebody decides they really care about this case. (We could get fancier by only instantiating the extension segments with data, so that the layout itself would have "holes". When a client tries to write into a hole, the MDS must insert a new component before the instantiated extensions with an allocation in a different pool. Sounds like a pain.)

Comment by Patrick Farrell (Inactive) [ 10/Mar/18 ]

"Isn't --component-flag extension implied by -e10g?"
Oh, sure - it certainly could be, hadn't occurred to me.

So to be clear (perhaps it already is, sorry if so), the suggested implementation here is one component which is (or, at least, could be) instantiated, trailed by another component which tracks the extension space (if we limited it) and also serves as our catch for the client. This component will never be instantiated.

When the client tries to do i/o to that component, it will ask the server to take care of instantiating it, and at that point, the server instead edits the extent of the previous component. This "ghost"/extension space component will disappear (i.e. we will delete it) when we stop extending the preceding real component.

That is to say, internally, a component is not self-extending - It gets that from the ghost component that follows it.

Nothing says we have to leak that "ghost" component to userspace, at least not in the setstripe command, though. -e10g is definitely a clearer interface for users than an explicit "extension" component.

So that implementation doesn't exactly conflict with the idea of potentially having a component start at zero size and possibly never get extended (which makes sense - I hadn't thought of it that way, but it's clearly the way we'd want it to work), but... Hm. Actually, I suppose it would be fine. There are perhaps two choices:
1. Have the component that will become "real" once/if we extend it for the first time present but with 0 extent (so we'd hit the extension space and everything would act like normal)
2. Have it not exist and have the extension space create it when hit for the first time.

While a 0 length extent (we might flag it somehow to make it easier to ignore...) is going to be a bit odd, I really like the simplicity of keeping the extend behavior the same.

Comment by Nathan Rutman [ 12/Mar/18 ]

@paf - so you're saying the ghost is a simple way to trap writes beyond end-of-allocated-layout. I suppose that's fine, bearing in mind that I really would like to keep the lfs command as simple as possible. Along those same lines, if they specify an extendable component, it should show up in the layout with lfs getstripe no matter what its size, i.e. including zero.

@adilger Back on the question of whether to add new components at every extension or not, I would much rather not. Avoids the layout growth problem, and again keeps things simpler for users. If someone really wants to migrate chunks/smaller components later, they can manipulate the layout at that point. In any case, it doesn't seem right to overload this ticket (or Lustre layout handling) with that additional constraint. KISS for now, we can always revisit later.

Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

I wanted to highlight an issue that diverges a little from what we've been discussing, but seems important.

There is, today, no threshold for taking an OST out of striping rotation for "out of space" reasons.  The current proposal relies on implementing something to do that.  Probably not very difficult to do, just something I wanted to write down here.

Comment by Ben Evans (Inactive) [ 15/Mar/18 ]

Patrick, I think that's a separate issue, and this can work well enough without that, simply by having a good "fitness" check, where we see if restriping might help.  My assumption is that for any OST pool, this will give us better fullness levelling than we currently have, so rather than a single OST hitting full, you'd have all of the OSTs in a pool hitting it around the same time, and you'd be dealing more with a FS out of space than an OST out of space.

Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

Ben,

I know this is an area you've thought about more than I have...  What would you suggest as a fitness check, specifically?  (Or as specifically as you've considered it, at least.)

Comment by Ben Evans (Inactive) [ 15/Mar/18 ]

Simple one:

If no OST is more than 75% full, there's also no reason to change.

Establish the median OST size.  If, in the current stripe, 75% of the OSTs involved are under the median, that's good enough, and there is no reason to change.  If part of the stripe is on the fullest OST, then it's time to change.
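
A userspace sketch of that heuristic (illustrative only - the real check would live in the MDS allocator, and the mount point and 75% threshold are taken from the example above):

# list OSTs whose fullness exceeds 75%, i.e. candidates to restripe away from
lfs df /mnt/lustre | awk '$1 ~ /OST/ { use = $5; sub(/%/, "", use); if (use + 0 > 75) print $1, use "% full" }'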


This brings up an odd question: does anyone actually use stripe offsets in production (like always starting at OST 0)?

Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

Interesting.  This bit:

Establish the median OST size.  If, in the current stripe, 75% of the OSTs involved are under the median, that's good enough, and there is no reason to change.  If part of the stripe is on the fullest OST, then it's time to change.

Sounds similar to the QOS allocator, which is intended to avoid striping to an OST if it diverges too far from...  I'm not sure if it's the least full OST or an average.  Thresholds are set for that already.  Could we leverage it, rather than replicating some part of its functionality?  (I'm not sure we could, that's a serious question)

Comment by Ben Evans (Inactive) [ 15/Mar/18 ]

I was thinking we'd call the current allocator (maybe slightly modified) for a new stripe if we need it. This would be about detecting if we needed a new stripe, and setting a relatively high threshold to actually change the stripe and add a segment.

Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ]

OK, that makes sense.  I'll keep thinking about it.

I get nervous when you say "segment" - The current proposal is to add space to an existing layout component until this check we're discussing says "no", then we switch ("spill over") to the next layout component, which will already exist.

Actually, as I say this, it occurs to me that we could put two back-to-back extendable components on the same pool, which would have the effect of "checking the rest of the pool" for space before spilling over to the next pool.  (Basically we'd try to spill over from our current OSTs to elsewhere in the same pool before spilling over to a new pool.  But we'd only do that once. [Spilling over to the same pool over and over would hit a pathological case where we'd end up with every file on almost every OST as the OSTs filled up.])

Comment by Andreas Dilger [ 15/Mar/18 ]

The second component ends at 2m ("-E2m" is the entire component description) and has no stripe size or stripe count info set, so it gets the defaults (1 stripe and 1 MiB).

One clarification here - the second component inherits all of the values from the first component if they are unspecified, and the first component inherits them from the parent directory or root directory layout, if they exist, or filesystem-wide defaults if nothing else. There was a long discussion in LU-10561 (flr: remove "--parent" option from lfs mirror command) about what to inherit for a specific component if the parent/root directory layout is a composite layout. This is trivial if the whole parent/root layout is inherited (either simple or composite), but what if the parent has a composite layout and someone uses "lfs setstripe -E32M -c4 -E32G --pool slow"? What should the two components use for stripe_count, stripe_size, and pool?

There is, today, no threshold for taking an OST out of striping rotation for "out of space" reasons. The current proposal relies on implementing something to do that. Probably not very difficult to do, just something I wanted to write down here.

There is indeed such a mechanism, see osp.*.reserved_mb_high and osp.*.reserved_mb_low on the MDS, added in 2.9 by aboyko. The comment in the code for the high and low watermarks say (and this is even documented in the user manual!):

 * Show high watermark (in megabytes). If available free space at OST is greater
 * than high watermark and object allocation for OST is disabled, enable it.

 * Show low watermark (in megabytes). If available free space at OST is less
 * than low watermark, object allocation for OST is disabled.

These values are initialized as a fraction of the OST size (0.2% and 0.1%, respectively), and can be set to an absolute value at runtime. I don't think they allow specifying a percentage at runtime, but if that is important to you it might be possible to add a decimal value with a '%' unit?

This allows the admin to set a low threshold below which an OST is no longer considered for allocation, then it is allowed to drain (to reduce free space fragmentation) until it hits the high watermark again. In terms of more sophisticated changes to the LOV object allocator, this is being discussed in LU-9 and LU-9809. DDN had some proposals on that front, and Nathan and I discussed it with Li Xi at LAD, but nothing further has come of it so far. Please move any allocator discussions over there, so hopefully it can be implemented at some point.

I guess my main question here is what the main benefits of your current proposal are? It seems they are mostly focused on avoiding out-of-space on an OST? The proposal of extending existing components is quite different from my original proposal, where having separate components was desirable to allow partial-file release/restore, migration, and resync. My expectation is that the out-of-space issues will already largely be handled by regular PFL layouts, as long as they are set up reasonably (i.e. in the neighborhood of component_end < reserved_mb_high * stripe_size, with a -c -1 component at the end) and the MDS allocator is working properly.

IMHO, we could spend ages on making complex layouts (which users are relatively unlikely to understand or use), and would be far better off to improve the MDS object allocator to do a better job of balancing space between OSTs, but without the performance hit of the full-random QOS allocator that we have today. With a good allocator and PFL, the OSTs would always be evenly used, so any single OST is only going to fill up at the point when all of them fill up.

We would need to take some care with different storage tiers/pools, possibly having the MDS drop some preliminary components completely at file creation time if their pool is close to full. We've already started discussing that in LU-10808 for the context of DoM components being skipped for new files if the MDT holding the inode is full (creating the inode on a different MDT at that point is a different beast, parts sold separately).

Comment by Nathan Rutman [ 15/Mar/18 ]

 I guess my main question here is what the main benefits of your current proposal are? It seems they are mostly focused on avoiding out-of-space on an OST? 

More specifically, avoiding out-of-space on a pool, while maximizing use of that pool. The scenario is that a flash pool should be used as long as there is space, but if it runs low we should change the layout. We can't specify this with PFL now. LU-10169 was titled "Spillover Space" rather than "self-extending file layouts", but there seems too much overlap to keep them separate.

Can we do this with the allocator alone? No; the allocator is only used at create time, and maybe at first-write-into-component time. With that constraint, the only way to change a layout based on current fill is to add a new component; i.e. a self-extending layout description. This ticket is to me about changing the static PFL layout into a dynamic one.

The proposal of extending existing components is quite different from my original proposal, where having separate components was desirable to allow partial-file release/restore, migration, and resync.

Agreed, this is different. Yours has benefit as well, but I think the general problem is that people won't know ahead of time to make a SEPFL layout so that they can later migrate it. We really need a way of changing any layout after data is placed; SEPFL can't solve that, so I think that idea really belongs under a "restripe an existing file efficiently" ticket. My 2 cents.
@paf tried to thread the needle by suggesting we allow both "extend old segment" and "add new segment" options, via some config flag. I have no objection to this. My only worry with the add new segment option is that it will grow the layouts too big; both options could facilitate the dynamic PFL.

Comment by Ben Evans (Inactive) [ 16/Mar/18 ]

Why wouldn't SEPFL always be on with some reasonable minimum/default for the extension size (like 2GB)?

Comment by Nathan Rutman [ 16/Mar/18 ]

Like any layout it would have to be specified. Sure, you can set it as a FS default, but if you know you have a large file you might specify wide striping from the start, or DoM, or anything else. You also might want your final layout to have components at different boundary points than your original.

I think the "move an extent range to a different layout" problem is probably better handled with a range-aware FLR than depending on PFL components. FLR the segment you want to move into a secondary layout. Doesn't have to line up with a component boundary in the primary.

Comment by Ben Evans (Inactive) [ 16/Mar/18 ]

Well, the default stripe is 1 random OST, with a 1MB stripe size.  There's no real reason we can't extend that to have a default SEPFL (or whatever we're calling it) chunk size.  So it's always there, even if no one ever touches the striping, ever.  You bake in the feature so that it is always-on.

Comment by Andreas Dilger [ 16/Mar/18 ]

It is possible to set a filesystem-wide default PFL layout by setting it on the root directory. If no default is found on the parent directory, then the layout is taken from the root directory (though soon this may also come from a "template FID" for nodemap + subdir mount clients that don't have access to the root directory, see LU-9982).

One open question that I'd like input on is how to handle layout inheritance from a composite file if one wants to specify some different parameters from the default? For simple layouts this is straightforward - fill in the parameters (stripe count, stripe size, pool) from the parent/root/fs if they are not specified for the new file. However, if the parent/root have a composite layout but the user also wants to create a composite file, what gets inherited?

Comment by Nathan Rutman [ 16/Mar/18 ]

Inheritance - as you pointed out earlier, these layout settings are getting complex, what with PFL, FLR, SEPFL, DOM, pools, flash, etc. Soon, as I've mentioned elsewhere, some elements of policy are implied by layouts and perhaps should be made explicit. So here's a proposal:
define a set of template layouts in JSON format. Store and reference them by name. Then allow users to setstripe --name "large_sequential" or "smallguys". Anything that is not defined in the JSON is inherited from the parent dir. Anything that is defined can be overridden in the specific setstripe invocation. JSON files could even inherit like derived classes if we wanted to go that far.
(Potentially we can trade in 'lfs setstripe' for 'ladvise'.)
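
A purely hypothetical sketch of such a template file (the path, field names, and semantics are all invented for illustration; --name is the flag proposed above):

cat > /etc/lustre/layout_templates.json <<'EOF'
{
  "large_sequential": {
    "components": [
      { "end": "1G", "stripe_count": 1,  "pool": "flash" },
      { "end": "-1", "stripe_count": -1, "pool": "capacity", "extension_size": "100G" }
    ]
  }
}
EOF
# then: lfs setstripe --name large_sequential <file>
# anything not defined in the template is inherited from the parent directory
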
BUT - not for this ticket

Comment by Patrick Farrell (Inactive) [ 30/Mar/18 ]

I've been a little quiet on this (working on prototype + HLD, etc), but I had a thought I wanted to describe here and get feedback on.

I think Ben may have already been imagining something like this, based on some things he said the other week, but it's a new idea to me.

Specifically, consider the behavior of a file with one self-extending component going to infinity:

lfs setstripe -c 2 -E 100M -E -1 -e 10M --component-flags extension
(single component, ending at 100M, followed by extension space of 10M at a time, extension space going to infinity)

The first component will end up on two particular OSTs, and then when they fail an extension check (basically, when they're too full, for some definition of "too full"), we'll just stop extending it, and return -ENODATA (I think that's it) when the user tries to write to the part of the layout that doesn't exist.

If there's another component following that extension space component, we would of course use that; consider this example with that additional component:

lfs setstripe -c 2 -E 100M -E -1 -e 10M --component-flags extension -E -1 -c 2 --pool="pool2"

The file layout would be the first component getting extended until it failed an extension check, then we'd initialize and use that final component on "pool2" for the rest of the layout.

But what about that first case, where we don't have a component following the self-extending component + extension space?

In that case, what if instead of just stopping, we made a new component with the same striping properties as the old one?  Not the same OSTs, but the same striping properties - size, count, pool if specified.

So the layout would look like this:

---------comp1, ost0,ost1----------- [extension space .....]

Then one of those OSTs comes up "full".  The behavior we've described already would just remove that extension space, which would look like this:

---------comp1, ost0,ost1-----------  

And fail further writes to the file.

But if we added a new component, like I suggested above, it could look like this:

---------comp1, ost0,ost1----------- comp2, ost2, ost3----- [extension space .....]

And then we wouldn't have to stop writing to the file because the OSTs it was on are running out of space.

This would only be desirable in cases where no further components were specified - If further components are specified, the behavior described in them should control.

This could either be the default behavior for self-extending PFL components with no further components or it could be controlled by a flag.  Thoughts?
(Again, I think some folks may have already been imagining a version of this...)

Comment by Patrick Farrell (Inactive) [ 30/Mar/18 ]

Note this behavior has one obvious downside, which is that if we're filling up all available OSTs (either the whole file system or a pool) we could "wander" between different OSTs, ending up with files with many small-ish components on many OSTs.  I don't think this is enough of a problem that we should not do this, it's just something to consider.

Comment by Andreas Dilger [ 02/Apr/18 ]

Two notes on practicality of usage here:

  • if the layout is extended only in 10MB chunks, that would mean that writes are essentially going to be RPC bound with the MDS, since a single client writing may get about 5GB/s/10MB = 500 RPCs/s to the MDS just for layout extensions, per client, which is totally impractical. The larger these component extensions get, the fewer the RPCs to the MDS, but also the more likely that an OST could run out of space before the file is finished writing. That is why PFL suggests files become more widely striped as they get larger, so that the data is spread across more OSTs rather than filling up only a few OSTs. The best way to avoid premature out-of-space is to avoid the OSTs becoming imbalanced in the first place
  • if the layout extension is not "sticky" on the existing OSTs, it will cause a large number of components to be created, quickly making the layout itself get too large. I believe your goal is that the existing component would continue to be extended on the same OST objects until they are no longer suitable, so the layout size will remain constant until different OSTs need to be selected. When a new component is created (presumably on new OSTs), it should pick OSTs that will remain suitable for some longer period of use to avoid this issue.

Note that the goal shouldn't be to fill OSTs completely before moving on to new OSTs, since that increases the chance that some other file will run out of space before it can move to a new component. Whether we hit out-of-space a second earlier or later is not critical in most cases, since we would run out of space in any case.

Comment by Patrick Farrell (Inactive) [ 02/Apr/18 ]

if the layout is extended only in 10MB chunks, that would mean that writes are essentially going to be RPC bound with the MDS, since a single client writing may get about 5GB/s/10MB = 500 RPCs/s to the MDS just for layout extensions, per client, which is totally impractical. The larger these component extensions get, the fewer the RPCs to the MDS, but also the more likely that an OST could run out of space before the file is finished writing. That is why PFL suggests files become more widely striped as they get larger, so that the data is spread across more OSTs rather than filling up only a few OSTs. The best way to avoid premature out-of-space is to avoid the OSTs becoming imbalanced in the first place

Absolutely.  The 10 MiB examples have purely been to keep the numbers small, I was thinking 1 MiB minimum, 1 GiB if nothing was specified (maybe even 10 GiB?), and 10 or 100 GiB being typical.  I suppose if I think 10 GiB is typical that should be the default, but that's all easy to adjust.

And, yes to the rest of that as well.  It's a tradeoff.  Avoiding imbalance is good but also somewhat separate.

if the layout extension is not "sticky" on the existing OSTs, it will cause a large number of components to be created, quickly making the layout itself get too large. I believe your goal is that the existing component would continue to be extended on the same OST objects until they are no longer suitable, so the layout size will remain constant until different OSTs need to be selected. When a new component is created (presumably on new OSTs), it should pick OSTs that will remain suitable for some longer period of use to avoid this issue.

Yes, my intention is that it would be sticky, for the reasons you gave.  And also, yes - we don't want to pick OSTs that are almost full.

Note that the goal shouldn't be to fill OSTs completely before moving on to new OSTs, since that increases the chance that some other file will run out of space before it can move to a new component. Whether we hit out-of-space a second earlier or later is not critical in most cases, since we would run out of space in any case.

Agreed on this point as well.  Still chewing on details - this "policy" aspect of it should be relatively easy to tweak, though, both in terms of what we implement and potential tunables.

Comment by Patrick Farrell (Inactive) [ 09/May/18 ]

A question, probably mostly for Andreas.  This is something from one of the FLR design docs, describing userspace tooling:

"A file with a simple layout is converted to a composite layout whose sole component is the previous layout."

Has this been implemented/is it planned explicitly anywhere?  It has some loose relevance here.

Comment by Andreas Dilger [ 10/May/18 ]

Patrick, there was not a command line interface for doing this, as there was no value to do so. However, for FLR it is possible to take a plain layout and convert it to a component  and then add a mirror to the file. 

Comment by Patrick Farrell (Inactive) [ 10/May/18 ]

Andreas,

How do you take a plain layout and convert it to a component of a composite layout?  It sounds like you're saying there is already a mechanism for doing so?  Or are you saying it would be possible to create one?  (In which case, yes, I agree and see roughly how it would work.  I'm hoping someone has already done it.)

Comment by Zhenyu Xu [ 10/May/18 ]

Converting a plain layout to a component is only done as a LOD internal function in lod_layout_convert() which serves lfs_mirror_extend() to add a mirror to an existing plain file and constructs the file with two mirrors.

Comment by Patrick Farrell (Inactive) [ 30/May/18 ]

Question for Jinshan or anyone else interested in this about the interaction with FLR.

Today, lfs mirror resync (and the mirror_io test) invoke llapi_mirror_resync_one with the "end" of the region to resync set to the end of the relevant component of the mirror, without regard to file size.  (They stop copying once "read" returns less than the # of bytes requested.)

Is there any problem with using the logic of
if (end > file_size)
         end = file_size;

Instead?

Here's a sample scenario.

Mirror 1 is from 0 to EOF, it's preferred.
Mirror 2 is 0 to 10 GB, followed by extension space to EOF.

We write 1 MiB to the (currently empty) file.  Mirror 2 is now stale.

Then we do lfs mirror resync.

Currently, that results in us attempting a read to EOF, which will fully instantiate that self-extending layout.  Alternately, if we use file size as the maximum for end, we will have an end at 1 MiB and we will not unnecessarily extend the self-extending layout.

Is there any reason this wouldn't work/would be a bad idea?  [I assume I may have to adjust some sanity tests.]  Quick testing suggests it's fine, and without it, self-extending PFL layouts and "lfs mirror resync" won't work well together.
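
A sketch of that scenario (the "prefer" flag on mirror 1 is omitted, the -e extension syntax is the one proposed in this ticket, and the exact mirror-creation flags may differ):

lfs mirror create -N -E -1 -c 2 \
                  -N -E 10G -c 1 -E -1 -e 1G /mnt/lustre/f
dd if=/dev/zero of=/mnt/lustre/f bs=1M count=1   # 1 MiB write leaves mirror 2 stale
lfs mirror resync /mnt/lustre/f                  # with the file-size clamp, this copies ~1 MiB
                                                 # instead of instantiating mirror 2 out to EOF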

Comment by Andreas Dilger [ 31/May/18 ]

Patrick, it isn't clear why the self-extending layout would be affected by reads. The resync shouldn't be writing any data beyond the file size, so that shouldn't cause the layout to be extended.

Comment by Patrick Farrell (Inactive) [ 11/Jun/18 ]

Ah, yes - Sorry, I got this confused with a different stage of the operation which was causing the layout instantiation.  Thanks.

Comment by Patrick Farrell (Inactive) [ 12/Jul/18 ]

Attachment is a design doc for reference.  Some small updates are planned, but it is largely complete.

Comment by Gerrit Updater [ 12/Jul/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/32812
Subject: LU-10070 lod: Self-extending Layouts
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 96afe1091bd2287029023c42aaebac3f2e7aca96

Comment by Joseph Gmitter (Inactive) [ 12/Jul/18 ]

Hi Patrick,

Any objection to keeping the design doc on the lustre.org wiki?  I would be happy to port it over there into mediawiki format.

Thanks.

Joe

Comment by Patrick Farrell (Inactive) [ 12/Jul/18 ]

Joe,

None whatsoever.  In fact, I was planning to get it there eventually.  Would having it in another format than PDF - .docx is an easy option - be helpful?

Comment by Joseph Gmitter (Inactive) [ 12/Jul/18 ]

.docx would be great.  I use pandoc to convert the docx to get it most of the way there.

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33777
Subject: LU-10070 lod: Fix replay-single test_85b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6602b4f0379ad1eaefea487cd84a1a8ffc0ac8c0

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33785
Subject: LU-10070 lod: SEL: Add FLR support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 12cba2a3c58aa8c9203c8d6718df618c69d2af50

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33778
Subject: LU-10070 lod: New test-framework functionality
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc682492c85a6ed73b71e640a65a1c9f44fd034c

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33779
Subject: LU-10070 lod: SEL: llapi_layout_test enhancements
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 56f847b3f3c203025f39beb87543ef7e6553eb7e

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33781
Subject: LU-10070 lod: SEL: split declare_instantiate_comp
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b4fd94b72b13741902a2bddb6a55244138b48567

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33780
Subject: LU-10070 lod: split lod_del_layout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d0bb8777e0cadd4b401b899a066f03d2eaf768e6

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33782
Subject: LU-10070 lod: SEL: Add flag & setstripe support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8b8e63d104cd89e72ba8d06958d2b9a9b1614fa3

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33783
Subject: LU-10070 lod: SEL: Implement basic spillover space
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cd85277472d1eaea8a37fb60c74ddf7e715ee64e

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33784
Subject: LU-10070 lod: SEL: Layout sanity checking
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bbc403d2d92a4d1c98d880a36e17bdd96809c276

Comment by Gerrit Updater [ 05/Dec/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33786
Subject: LU-10070 lod: SEL: Repeated components
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2704bc500cf53649e4581cf1a1c88e2e7e15f55d

Comment by Patrick Farrell (Inactive) [ 05/Dec/18 ]

There are two to-dos remaining (other than addressing reviewer comments, of course):

  • Add at least one test for the YAML based layout interface
  • Add to the man pages (nothing there yet)

But the code is complete & ready for reviewers to start looking at it.

Comment by Patrick Farrell (Inactive) [ 05/Dec/18 ]

Note also that at least for now, the "trailing components" portion (mentioned in the design doc) is being left out.  I may or may not resurrect that - It added a lot of complexity for maybe not enough benefit, and was preventing me from getting this finished.  The feature is extremely useful without it, so I pushed it anyway.

In any case, the trailing component support is implemented entirely on top of the current patch series, and can be added later if it looks manageable.

Comment by Cory Spitz [ 05/Dec/18 ]

We're still holding to L2.13.0 per http://lustre.org/roadmap/ and http://wiki.lustre.org/Projects so I set the Fix Version/s field accordingly.

Comment by Patrick Farrell (Inactive) [ 13/Dec/18 ]

Update here.

The current version of the patches has no sanity test failures I am aware of.  (Maloo is having some issues right now, but I didn't see any clear evidence the failures were caused by these patches.)

The main remaining to-do is writing the man pages.  I'll do that shortly.

Comment by Gerrit Updater [ 04/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33777/
Subject: LU-10070 tests: Fix replay-single test_85b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0b9fb772e68db7cbf0c8a755092c1d8b5de6b83d

Comment by Gerrit Updater [ 12/Mar/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34406
Subject: LU-10070 tests: Fix replay-single test_85b
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: f28ee4f24a8d147180106ca992b099711d796da8

Comment by Gerrit Updater [ 16/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34406/
Subject: LU-10070 tests: Fix replay-single test_85b
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: e501749abdcf5513eb4a4eb19919bcbd295ad410

Comment by Gerrit Updater [ 20/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34470
Subject: LU-10070 tests: Fix replay-single test_85b
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 58e4fd508397464ab67295641f5dbc7c8b11cc9b

Comment by Gerrit Updater [ 21/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34470/
Subject: LU-10070 tests: Fix replay-single test_85b
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c590eaa213e6333531e775b100bfb78c952f4d79

Comment by Gerrit Updater [ 21/May/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/34909
Subject: LU-10070 utils: SEL: lfs find & getstripe support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f05c0aac0d46d75ed3b3284400cef2b3423f349d

Comment by Patrick Farrell (Inactive) [ 21/May/19 ]

A general comment:
We should add a sanity-lfsck test (probably in a separate patch?) to verify that SEL layouts work with lfsck.  I verified this by hand during development, by leaving all the test-generated files in place and running lfsck, but we should have test(s).

A good model is the set of lfsck tests added in the foreign layout patches:

https://review.whamcloud.com/#/c/33755/

https://review.whamcloud.com/#/c/34087/

I can think of a few cases I'd want to test (a rough sketch of the first one follows the list):

1. Just create an SEL layout and don't do anything to it (or write a little data, but don't use the SEL portion), run sanity-lfsck, and verify no changes
2. Same, but extend the layout once, then run lfsck (we may not need both 1 and 2; we could just extend once before testing)
3. Same, but "exhaust" an SEL component so that you get component removal, then run lfsck
4. Repeat a component and run sanity-lfsck
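
For what it's worth, a minimal sketch of case 1, modeled on existing sanity-lfsck tests. Again, this is an illustration only: the test name and the -E/-z values are invented, it assumes a single MDT, and it relies on the standard test-framework helpers (do_facet, wait_update_facet, error) plus the usual lfsck_layout stats (the status line and the repaired_* counters):

    test_sel_lfsck() { # hypothetical sketch for case 1 above
        local file=$DIR/$tdir/$tfile

        mkdir -p $DIR/$tdir || error "mkdir $DIR/$tdir failed"

        # plain SEL layout; write a little data, but stay out of
        # the self-extending portion
        $LFS setstripe -E 256M -S 1M -z 64M -E -1 -z 256M $file ||
            error "setstripe failed"
        dd if=/dev/zero of=$file bs=1M count=1 || error "dd failed"

        # run a layout lfsck on the MDT and wait for it to finish
        do_facet mds1 $LCTL lfsck_start -M $FSNAME-MDT0000 -t layout ||
            error "lfsck_start failed"
        wait_update_facet mds1 \
            "$LCTL get_param -n mdd.$FSNAME-MDT0000.lfsck_layout |
             awk '/^status/ { print \$2 }'" "completed" 32 ||
            error "lfsck did not complete"

        # a freshly created SEL file should need no repairs at all
        local repaired=$(do_facet mds1 $LCTL get_param -n \
            mdd.$FSNAME-MDT0000.lfsck_layout |
            awk '/repaired/ { sum += $2 } END { print sum + 0 }')
        [ "$repaired" -eq 0 ] || error "lfsck repaired $repaired objects"
    }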

Comment by Cory Spitz [ 06/Jun/19 ]

http://wiki.lustre.org/Release_2.13.0 has been updated with reviewers and testers of record.

Comment by Gerrit Updater [ 10/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35144
Subject: LU-10070 lod: SEL: interoperability support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2a651609247d5041bbc2c16935396fa445a9bbcd

Comment by Gerrit Updater [ 11/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35182
Subject: LU-10070 lod: SEL: let's discuss here
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6d4b80add15eb9ad49e8c3fbcf2b830611127c68

Comment by Gerrit Updater [ 14/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35232
Subject: LU-10070 ldlm: layout lock fixes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9e58fe77666aac33bf78d2f6fb67dee2532e3ae7

Comment by Gerrit Updater [ 19/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35270
Subject: LU-10070 lod: layout_del memleak
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d97c1343529fd4c38be5226772d0f2d9543db854

Comment by Gerrit Updater [ 25/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35232/
Subject: LU-10070 ldlm: layout lock fixes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 51f23ffa4dae3015da627203fb6f160db4911bee

Comment by Gerrit Updater [ 25/Jun/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35314
Subject: LU-10070 utils: setstripe component-add support for SEL
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8a87eab09f17c8efacde106f4c74543dad8669df

Comment by Gerrit Updater [ 27/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35270/
Subject: LU-10070 lod: layout_del memleak
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b0ce7701d1e9dd1269b99b1c660a140fe85b9592

Comment by Gerrit Updater [ 03/Jul/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35414
Subject: LU-10070 lod: SEL cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4f7c529cd73b87e093ffcf539e7e6bf2bbec31d1

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33778/
Subject: LU-10070 tests: New test-framework functionality
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c1aaa3e55090c7a5e067ec52cf74b2e6406133d2

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33780/
Subject: LU-10070 lod: SEL: split lod_del_layout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 96bfcd13b6cc3fce12f1e6f5abe4971cc8a59e1f

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33782/
Subject: LU-10070 lod: SEL: Add flag & setstripe support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fed241911f61b1d76aa7d80bfd370c822a3926ef

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33783/
Subject: LU-10070 lod: SEL: Implement basic spillover space
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ff5eb304fa371d879da38621fac3aec7d4548a5e

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33784/
Subject: LU-10070 lod: SEL: Layout sanity checking
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4eca26ddab3186a68888862218fa8904f812e5a1

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33785/
Subject: LU-10070 lod: SEL: Add FLR support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3d7895bf08980530e1e5947a00fc9500f35a55de

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33786/
Subject: LU-10070 lod: SEL: Repeated components
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 76ca884398cae59e455caf3ae2ab1609c5fb1eea

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34909/
Subject: LU-10070 utils: SEL: lfs find & getstripe support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c7cf7a5076440f68dee8f7798f46f50c94404a7e

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35144/
Subject: LU-10070 lod: SEL: interoperability support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c7fe0c7cdae7170e9ec1a6a48423dd20046500e

Comment by Gerrit Updater [ 20/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35314/
Subject: LU-10070 utils: setstripe component-add support for SEL
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e768624788b7b8797c97ca27f2e6f9f63124dba0

Comment by Gerrit Updater [ 20/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35414/
Subject: LU-10070 lod: SEL cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 63e90bdb42a9fedd368726877fd4edfaf8e328c7

Comment by Gerrit Updater [ 30/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33779/
Subject: LU-10070 test: llapi_layout_test enhancements
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4a68bcfd8b0d206d5d5e5c18d5fcc8e55d1732b5

Comment by Gerrit Updater [ 06/Aug/19 ]

Vitaly Fertman (c17818@cray.com) uploaded a new patch: https://review.whamcloud.com/35704
Subject: LU-10070 lod: SEL inheritance fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 44e6e8608de68b61b5ce4b904936456dd59003cf

Comment by Gerrit Updater [ 21/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35704/
Subject: LU-10070 lod: SEL inheritance fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dee41b90e74c0c2021207c0431d6c34642c7019c

Comment by Peter Jones [ 25/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 22/Oct/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36554
Subject: LU-10070 utils: move new SEL find_param fields to end
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ba7fe211563652856c481a694d0fdaddad3d8dfe

Comment by Gerrit Updater [ 27/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36554/
Subject: LU-10070 utils: move new SEL find_param fields to end
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5222ba07e3bd087ddb0812e2185610b725cb9d1a
