Bugzilla – Bug 18547
optimize QOS RR allocator
Last modified: 2010-05-25 11:21:15
This bug is an extension of bug 18334.
Adilger's comment (transferred from bug 18334 comment 25): Once QOS is active, why did setting lq_prio_free = 0 not result in "mostly" even object allocation (i.e. consider the loading at least as important as available space)? It seems in the tests done by Cray that the OST selection ratio barely changed at all with lq_prio_free = 0, and that is wrong. We need to verify that with decreasing lq_prio_free the available space contributes less to the overall weighting. Even with the larger "qos_threshold_free", eventually one of the OSTs will almost run out of space, and then user applications will begin to fail if they are still allocating on this OST instead of limiting themselves to OSTs that have space free. I don't think having a tunable for "qos_threshold_free" is necessarily bad, but I don't think it will fix this problem in the long term; instead we need to understand why QOS isn't doing a better job. It may be that simply changing the threshold from 20% of the least full OST to, say, 75% of the most full OST will be enough to avoid QOS in most cases. Having a large difference when the OSTs are mostly empty isn't harmful, and when some OSTs are very close to full this will trigger QOS regardless. This will avoid QOS mode in most cases (which is really what the customer wants) but still allow QOS mode to be enabled to avoid hitting ENOSPC when there is still lots of space elsewhere.
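The two trigger rules can be contrasted in a few lines. This is a hypothetical sketch, not the lov_qos.c code: the function name and the hard-coded 75% figure are illustrative assumptions taken from the comment above.

```c
#include <assert.h>

/* Proposed rule from the comment above: enter QOS mode only when some
 * OST's available space falls below 75% of the most-free OST's
 * available space.  A large *absolute* difference between mostly-empty
 * OSTs then no longer triggers QOS, but a nearly-full OST always does.
 * Hypothetical sketch; not the actual lov_qos.c interface. */
static int qos_should_activate(const unsigned long long *bavail, int n)
{
        unsigned long long max_avail = 0;
        int i;

        for (i = 0; i < n; i++)
                if (bavail[i] > max_avail)
                        max_avail = bavail[i];

        for (i = 0; i < n; i++)
                if (bavail[i] < max_avail * 3 / 4)
                        return 1;       /* imbalance is relatively large */
        return 0;                       /* stay in plain round-robin */
}
```

Under the old 20%-of-least-full rule, mostly-empty OSTs with free space { 800, 900, 1000 } would already trigger QOS (1000 - 800 > 0.2 * 800); under the sketch above they do not, while { 50, 900, 1000 } still does.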
There is a lot of discussion of this topic in bz18334; please refer to bz18334 for more details.
More comments from the original bug: I have long thought that the "if (!bavail)" check in qos_prep_create(), which only skips OSTs that have 0 free space, is too weak, and will result in applications hitting ENOSPC on OSTs that have almost no free space, because there is statistically always some chance of them being selected for object allocation. Instead, I think this should be something like:

if (bavail < (obd->osfs.os_bavail >> 5) / num_active)

(i.e. if this OST has less than 3% of the space of an average OST it should be skipped entirely, rather than risking objects being allocated there). Of course, as soon as you pick a fixed number (3%) somebody will want something different. We want the OSTs to be evenly filled for as much of their lives as possible, and to be as evenly loaded (disk and network IO) as possible, for optimum filesystem performance. If the weighting were subtle (e.g. round-robin, but skipping more-full OSTs 1/n of the time, where n is proportional to the imbalance) then it could essentially be in use all of the time. If we implemented QOS as part of the round-robin allocator that is always active, just by skipping OSTs periodically (rather than randomly), it would provide much more uniform loading and would prevent the OSTs from getting far out of balance in the first place. Say the OSTs are 100MB in size (1% = 1MB) and OST10-19 have 9x as much free space as OST0-9; then the RR QOS selection would skip OST0-9 about 8/9 of the time, but never skip OST10-19. If each file created was 1MB in size then the first 11 files would use 1MB on each of the 10 emptier OSTs, and 1MB on one of the fuller OSTs. Repeat 10x and 10MB is used uniformly on each of the emptier OSTs, but only 1MB on each of the fuller OSTs. This is what we tried to do with the random QOS weighting, but just as with the original LOV allocator (which randomly picked OSTs instead of round-robin) the chance of having an imbalance (e.g.
average 2 objects/OST but some had 0 and some had 4) was too high. One suggestion is to use an "accumulated error" mechanism, where each of the OSTs is present in the RR list (in the optimized OST/OSS order it normally is), and each time it is skipped by the RR selection (because the OST's available space is below the threshold or average) the OST free space is added to a "bonus" counter for that OST. When the OST's (available space + bonus) is large enough, the OST will be picked by normal RR selection; its bonus is then reset to 0 and its available space will again be below the threshold, so it is skipped. The MDS can estimate the average file size from the statfs data, so we might increment the bonus by the average file size? One thing we might want to do is initialize the bonus with some random amount (proportional to the free space?) at MDS startup so that some of the full OSTs become available at any time instead of all at once. It seems we should also include some fraction (maybe 1/2, tunable via qos_prio_free) of the bonus value in the average, so that less-used OSTs (with higher bonus) are more likely to be used instead of all being chosen equally. The problem with having equal weights (i.e. not taking the bonus into account for $average_free) is that e.g. the OST immediately following the new one will be picked disproportionately often. In the above example, if the file size is 1 then we skip the full OSTs 9 times each, increasing their bonus by 9, and filling the new OST by approximately 9, giving it a negative bonus (penalty) of 9. The penalty can be reset at the next statfs for this OST, but the bonus cannot. Then $average_free is (10 * 9 + 91) / 10 ≈ 18 and all of the OSTs are in the running. For N-striped files, one idea I had was to subtract the just-used OST's space from the current file's "average" so that the remaining OSTs become proportionally more attractive.
Continuing the above example, but with a 2-stripe file: the current 2-stripe file's $average_free is now 10, and any of the OSTs can be selected. Keeping the filesystem's total available space as a running total, updated only in the statfs interpreter callbacks (without having to iterate over 1000 OSTs to compute it each time), is fairly important for efficiency. If one of the full OSTs had been previously selected then its bonus = -1, and the rest have bonus = 1, so the average is 10.9. The just-used OST only has weight 9 and will not be selected this round. One OST will be picked and its weight set to -1, and the remainder get bonus = 2, so average = 11.7 for the next round, and only OSTs with weight 12 can be selected for the second stripe. For 1-stripe files (using 1/2 weighting of the bonus in the average) we would need to allocate 14 files on the free OST before allocating a file on one of the other OSTs. We need a simulator that can just plug in lov_qos.c so that we can both analyze the current code and validate any changes made. In other projects this has been done by adding unit-test code at the bottom of the file, guarded by "#ifdef UNIT_TEST" so it is not in normal compiles but creates a test executable when compiled with -DUNIT_TEST.
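As a rough illustration of the "accumulated error" mechanism described above, here is a toy user-space simulator in the spirit of the proposed #ifdef UNIT_TEST harness. All structure and function names are invented, and the skip rule (skip when free + bonus is below the average free space, crediting the average file size on each skip) is a simplification of the scheme sketched in the comments:

```c
#include <assert.h>

#define NOSTS 4

struct sim_ost {
        long free;      /* available space, arbitrary units */
        long bonus;     /* credit accumulated while being skipped */
};

/* Walk the RR list from *cursor; skip OSTs whose (free + bonus) is below
 * the average free space, crediting them one filesize per skip.  The
 * selected OST has its bonus reset and its free space reduced. */
static int rr_pick(struct sim_ost *o, int n, int *cursor, long filesize)
{
        long sum = 0, avg;
        int i, idx;

        for (i = 0; i < n; i++)
                sum += o[i].free;
        avg = sum / n;

        for (i = 0; i < n; i++) {
                idx = (*cursor + i) % n;
                if (o[idx].free + o[idx].bonus >= avg) {
                        o[idx].bonus = 0;
                        o[idx].free -= filesize;
                        *cursor = (idx + 1) % n;
                        return idx;
                }
                o[idx].bonus += filesize;       /* skipped: earn credit */
        }
        return -1;      /* unreachable: the most-free OST is >= avg */
}
```

Allocating 100 single-stripe files of size 1 on OSTs with free space { 10, 10, 90, 90 } sends the bulk of the objects to the two emptier OSTs, while the full ones are still picked occasionally once their bonus accumulates.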
Updating Lustre 2.0 project priority: P5.
Begin investigation. With permission I'd like to assign this bug to myself. While examining the code, do we want to apply this balancing to alloc_specific as well? With alloc_specific the first stripe offset must be picked, but all other offsets afterwards could be optimized much like the round-robin case. The second difference I noticed is that the parameters passed to obd_precreate in lov_qos.c vary. In alloc_specific, which makes sense, we have:

if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
    (i != 0 || speed >= 2))
        continue;

alloc_rr:

if (obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed)
        continue;

alloc_qos:

if (obd_precreate(lov->lov_tgts[osts->op_array[i]]->ltd_exp) > 2)
        continue;

Which is correct? Next, I noticed that TGT_FFREE and TGT_BAVAIL are tested only in the alloc_qos case. This seems incorrect?
(In reply to comment #5)
> While examining the code do we want to also apply this balancing
> to alloc_specific as well. With alloc_specific the first stripe
> offset must be picked but all other offsets afterwards could be
> optimized much like the round robin case.

I suspect this would be OK. We don't strictly guarantee that the alloc_specific() case follows sequential OST ordering, but in the past it has only skipped OSTs that are unavailable. If the algorithms for the optimized QOS round-robin allocator are good enough, I would want to just move to a single QOS-aware allocator for everything. That would do the best job of keeping the space usage (along with load, RAID state, bandwidth, etc. in the future) balanced all the time.

> The second difference
> I noticed is that for obd_precreate in lov_qos.c the passing
> parameters vary. In alloc_specific, which makes sense, we have
>
> if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
>     (i != 0 || speed >= 2))
>         continue;
>
> alloc_rr:
>
> if (obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed)
>         continue;
>
> alloc_qos:
>
> if (obd_precreate(lov->lov_tgts[osts->op_array[i]]->ltd_exp) > 2)
>         continue;

I think the differences are partially related to what is being requested by the user. In the alloc_specific() case we don't want to skip the requested OST unless it is dead, but the following stripes (i != 0) may skip OSTs if they are slower than others (i.e. objects not being created). For other allocation modes we don't care which OST the first object is placed on (hence no (i != 0 || speed >= 2) check). The alloc_qos() case focuses more on space usage, and less on the fact that one OST might be slow (e.g. speed == 1).

> Next I noticed testing of TGT_FFREE and TGT_BAVAIL only for the
> alloc_qos case. This seems incorrect?
Well, with your patch in the other bug setting OSCC_FLAG_NOSPC if the OST is out of inodes or space, those OSTs should never be selected in the first place. What is better at the LOV level is to do an "is this OST below some threshold vs. other OSTs, then skip" check, rather than the current "sort based on fullness" algorithm. In addition to using the available space as part of the threshold, the code should be designed to allow other weights to be included (e.g. available bandwidth/IOPS). The current QOS code imposes a "penalty" on recently-allocated OSTs based on a fraction of the total space on the OST, but I suspect this would be better handled by imposing a penalty based on the average file size in the filesystem and/or recently created files, based on a running average of the statfs data. For keeping a running average, it makes sense to use a decaying average that takes the total space consumed over the number of files created, based on the delta between two statfs calls, over the intervals in which the statfs data changed (i.e. not based on wall-clock time).
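The statfs-delta average suggested above can be sketched in a few lines. This is illustrative only (not the actual Lustre statfs path): the structure, names, and the 3/4-old, 1/4-new decay weighting are all assumptions.

```c
#include <assert.h>

/* Estimate the average object size on an OST from the delta between two
 * statfs snapshots, folded into a decaying running average so that old
 * history fades.  Hypothetical sketch; names and weighting are invented. */
struct statfs_snap {
        unsigned long long bavail;      /* free blocks */
        unsigned long long ffree;       /* free objects (inodes) */
};

static unsigned long long
update_avg_obj_size(unsigned long long avg,
                    const struct statfs_snap *prev,
                    const struct statfs_snap *cur)
{
        unsigned long long used, created;

        /* only update for intervals in which space was actually consumed
         * and objects were actually created */
        if (cur->bavail >= prev->bavail || cur->ffree >= prev->ffree)
                return avg;

        used = prev->bavail - cur->bavail;
        created = prev->ffree - cur->ffree;

        /* decaying average: 3 parts history, 1 part new sample */
        return (3 * avg + used / created) / 4;
}
```

Because the update is driven by statfs deltas rather than time, an interval in which nothing was created leaves the average untouched, matching the "not based on wall-clock time" point above.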
(In reply to comment #5)
> Begin investigation. With permission I like to assign this bug to myself. While examining the
> code do we want to also apply this balancing to alloc_specific as well. With alloc_specific the
> first stripe offset must be picked but all other offsets afterwards could be optimized much like
> the round robin case. The second difference I noticed is that for obd_precreate in lov_qos.c the
> passing parameters vary. In alloc_specific, which makes sense, we have
>
> if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
>     (i != 0 || speed >= 2))
>         continue;

The comment on that check says: /* This means "if OSC is slow and it is not the requested start OST, then it can be skipped, otherwise skip it only if it is inactive/recovering/out-of-space." */ The speed values are 0=fast, 1=med, 2=slow, >2=unusable. Since we only speed++ if we can't find enough faster OSTs, and the maximum speed is 2:

if (speed < 2) {
        /* Try again, allowing slower OSCs */
        speed++;

it looks like that check is wrong, and we shouldn't test speed in the if:

if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
    (i > 0))

> alloc_rr:
>
> if (obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed)
>         continue;

That is right.

> alloc_qos:
>
> if (obd_precreate(lov->lov_tgts[osts->op_array[i]]->ltd_exp) > 2)
>         continue;

Here we are not doing a "repeat_find" to choose the fastest OSTs. We probably should.

> Which is correct?
>
> Next I noticed testing of TGT_FFREE and TGT_BAVAIL only for the alloc_qos case.
> This seems incorrect?

Well, that is what alloc_qos is all about. The rr allocator does not care about fullness, while the qos allocator does. The qos allocator is used until the OSTs are mostly balanced, at which point the rr allocator is used, so there's no need to use a fullness measure there. But there are so many questions of when to use which OSTs, I think it would be best to remove these three different allocators and rewrite from scratch.
It should be mostly rr; specific is just rr with a given start index, and qos should be included as Andreas describes above (skipping more-full OSTs 1/n of the time). Precreate status should also be taken into account, as well as active/inactive status.

0. Create the RR pattern, based on currently active OSTs. I think the current pattern is good. Recompute the pattern as OSTs are added, removed, activated, or deactivated. Iterate through the pattern, starting at start_ost, and pick OSTs based on usage factors:
1. Inactive: never assign stripes.
2. OSC precreate status 1000: never use.
3. OSC precreate status 1-2: prefer not to use. (We can use the "try again at a lower speed if we can't get enough at the current speed" method currently employed.)
4. qos-space: if an OST has less free space than average by x%, try to skip it y% of the time. Details are in Andreas' comments above.
5. qos-server: if an OSS has fewer OSTs than average, it will not have as much network traffic on it (based on strict rr). So the idea is to increase the number of stripes we assign to low-OST servers (or inversely, if an OSS has more OSTs than average by x%, try to skip it y% of the time; this is what prio_free currently does).

If we can't get as many OSTs as requested, relax some constraints (precreate value, qos skipping) and see if we can. Occasionally move start_ost around to avoid edge cases. This can get extremely tricky (when to skip, how much to skip). We are trying to avoid cases where users writing N-stripe files on M OSTs always end up with stripe 1 on the same OST. (For example, if you always skip one stripe out of M, then you're in trouble when N = M-1.) This is why a simulator (last part of comment #3) would be extremely useful.
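The "skip y% of the time" in steps 4 and 5 need not be random. A deterministic accumulator spreads the skips evenly across the cycle, which matches the earlier observation that random selection produced too much short-term imbalance. This is an illustrative sketch; none of these names exist in lov_qos.c:

```c
#include <assert.h>

/* Skip num out of every den visits, spreading the skips evenly over the
 * cycle instead of deciding randomly (Bresenham-style error spreading).
 * One such state would be kept per OST (or per OSS for step 5). */
struct skip_state {
        int num;        /* skips per cycle */
        int den;        /* visits per cycle */
        int acc;        /* running error accumulator */
};

static int should_skip(struct skip_state *s)
{
        s->acc += s->num;
        if (s->acc >= s->den) {
                s->acc -= s->den;
                return 1;       /* skip this target on this visit */
        }
        return 0;               /* use this target */
}
```

With num = 8, den = 9 (the "9x as much free space" example from comment #3) the target is used exactly once per 9 visits, and the use is never bunched up, avoiding the N = M-1 pathology mentioned above better than a random coin flip would.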
For alloc_rr we have:

*idx_pos = ost_idx;
idx_pos++;
/* We have enough stripes */
if (idx_pos - idx_arr == *stripe_cnt)
        break;
}
if ((speed < 2) && (idx_pos - idx_arr < stripe_cnt_min)) {
        /* Try again, allowing slower OSCs */
        speed++;
        lqr->lqr_start_idx = ost_start_idx_temp;
        goto repeat_find;
}

Whereas in alloc_specific we have:

*idx_pos = ost_idx;
idx_pos++;
/* We have enough stripes */
if (idx_pos - idx_arr == lsm->lsm_stripe_count)
        GOTO(out, rc = 0);
}
if (speed < 2) {
        /* Try again, allowing slower OSCs */
        speed++;
        goto repeat_find;
}

For the round-robin case, even if we find a successful case we end up testing for an OST at a slower speed.
(In reply to comment #8)
> For alloc_rr we have
> /* We have enough stripes */
> if (idx_pos - idx_arr == *stripe_cnt)
>         break;
> }
> if ((speed < 2) && (idx_pos - idx_arr < stripe_cnt_min)) {
>         /* Try again, allowing slower OSCs */
>         speed++;
>         lqr->lqr_start_idx = ost_start_idx_temp;
>         goto repeat_find;
> }
> For the round robin case even if we find a successful case we end up testing for an OST at a
> slower speed.

No, because of the second half of the conditional. idx_pos - idx_arr is the number of allocated OSTs; we only speed++ if we don't have at least stripe_cnt_min. But clearly three slightly different codings are evidence for a unified approach anyhow.
(In reply to comment #9)
> No, because of the second half of the conditional. idx_pos - idx_arr is the number of allocated
> OSTs; we only speed++ if we don't have at least stripe_cnt_min.

So in the case of round-robin EBIG is not possible? I'm considering the case where idx_pos - idx_arr is always less than stripe_cnt_min.
> 5. qos-server. if an OSS has fewer OSTs than average, it will not have as much network traffic on
> it (based on strict rr). So the idea is to increase the number of stripes we assign to low-OST
> servers (or inversely, if an OSS has more OSTs than average by x%, try to skip it y% of the time.
> This is what the prio_free currently does.)

I tested a small simulator for selecting disks for this case. There are two ways to approach this. One is that in the alloc_rr ost_count loop we could skip disks for the balancing. The second option is that in qos_calc_rr we create an optimized lqr_pool. The pro of the second approach is that this special lqr_pool is created only when an OST is added to or subtracted from an OSS, eliminating the need to do a balancing calculation while we are allocating stripes. The con is that the lqr_pool can be larger than the original ost_pool. Playing with the simulator I found this ratio can be very large in the case where we have very few OSSs with lots of OSTs and we add a new OSS with only one OST. This case causes the most severe imbalance. If a new OSS is added, the imbalance shrinks as the number of OSTs on the new OSS approaches the number on the other OSSs. Here are a few examples. In the balanced case each OSS has the same number of disks; the imbalanced case we compare against is when one OSS loses all OSTs except one, which is the worst case.

4 OSS with 2 OSTs: pool size is 8; imbalanced pool size is 10.
4 OSS with 256 OSTs: pool size is 1024; imbalanced pool size is 66304 (65x ratio).
4 OSS with 4096 OSTs: pool size is 16384; imbalanced pool size is 16789504 (1025x ratio).
64 OSS with 2 OSTs: pool size is 128; imbalanced pool size is 130.
64 OSS with 256 OSTs: pool size is 16384; imbalanced pool size is 81664 (5x ratio).
64 OSS with 4096 OSTs: pool size is 262144; imbalanced pool size is 17035264 (65x ratio).
256 OSS with 2 OSTs: pool size is 512; imbalanced pool size is 514.
256 OSS with 256 OSTs: pool size is 65536; imbalanced pool size is 130816 (2x ratio).
256 OSS with 4096 OSTs: pool size is 1048576; imbalanced pool size is 17821696 (17x ratio).

As you can see, the more OSSs there are, even with lots of disks, the smaller the impact of the imbalance. Also, the fewer OSTs per OSS, the quicker the imbalanced pool size approaches the original pool size.
James Simmons <simmonsja@ornl.gov> wrote at 2009-09-04 11:34:02:
> I tested a small simulator for selecting disk for this case. Now there
> are two ways to approach this. One is in alloc_rr ost_count loop
> we could skip the disk for the balancing. The second options is in
> qos_calc_rr we create a optimized lqr_pool.

My thinking on this so far has been that we would dynamically skip entries while walking the OST index array, based on the current "weight" of each particular OST. The weight is potentially made up of many inputs. Currently the weight is the OST's available space, less any penalty for recent allocations, but we may also want to consider having credits that are added onto the OST's weight if it has been skipped enough times. There is a per-OST penalty and a per-OSS penalty, which are very rough estimates for the amount of space that will be consumed on the OST and the remaining IO bandwidth of the OSS. I want it to be flexible enough to include other weighting factors into the mix (e.g. actual IO bandwidth or latency, performance impact due to rebuilding, administrator tuning, etc).

> The pro for the second approach is that this special lqr_pool is
> created only when a new OST is added or substracted to a OSS. Thus
> eliminating the the need to do a balancing calculation when we are
> allocating stripes.

Maybe I'm missing something, but even when OSTs are not being added or removed from the OSS there will be a continuous need for balancing. Various factors like current available space, bandwidth used by recent allocations, etc. are constantly changing and need to be handled. That, IMHO, makes having a static list impractical; the allocator should instead recompute weights frequently, and decide whether an OST is suitable or not by one (or a few) simple threshold checks. It is important to design the algorithm in a way that it can be incrementally updated as new data arrives (e.g.
statfs refresh) if at all possible, instead of requiring a global refresh, because in very large filesystems (i.e. Spider) there are a LOT of OSTs and rescanning all of them would be inefficient. For example, for the available space tracking I would imagine using the average available space on all OSTs as a threshold; to compute the sum(available space) over all OSTs it would track the old available space value, and then when a new statfs arrives it adds the difference between the old and new values into the sum. As OSTs in that pool are selected for an allocation they will get a temporary penalty added, which is the "average file size / num_stripes", but after some time (not sure how long) the temporary penalty is replaced by the actual statfs data (maybe with a decaying penalty average).
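The incremental running-total idea can be shown in a few lines. Names are invented and locking is ignored; this is a sketch of the bookkeeping, not the Lustre statfs interpreter:

```c
#include <assert.h>

struct ost_space {
        unsigned long long bavail;      /* last statfs value seen */
};

/* Fold one fresh statfs result into the filesystem-wide running total by
 * applying only the delta between the old and new values, so no walk
 * over all OSTs is ever needed on a refresh. */
static void statfs_update(struct ost_space *ost,
                          unsigned long long *total_bavail,
                          unsigned long long newval)
{
        *total_bavail += newval;        /* add new value first ... */
        *total_bavail -= ost->bavail;   /* ... then remove the old one */
        ost->bavail = newval;
}
```

The add-then-subtract order keeps the unsigned arithmetic safe, since the total always contains at least the old per-OST value being removed.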
> > The pro for the second approach is that this special lqr_pool is
> > created only when a new OST is added or substracted to a OSS. Thus
> > eliminating the the need to do a balancing calculation when we are
> > allocating stripes.
>
> Maybe I'm missing something, but even when OSTs are not being added
> or removed from the OSS there will be a continous need for balancing.

This is only dealing with the case of OSSs having different numbers of OSTs. Take the case of a system with 2 OSSs, where one OSS has 2 OSTs and the other OSS has 4 OSTs. In this case the OSS with fewer OSTs will have to be selected twice as often to balance the data being processed. This is of course assuming both OSSs have the same network bandwidth; at this point I don't see an easy way to figure that out, but to make things simple let's assume the network bandwidth going to both sets of disks is the same. This is not to say that we are not going to handle the case of the disk space being filled up. Assuming in the above case all disks are equal in size, then the OSS with fewer OSTs will fill faster, thus a counterweight needs to be calculated during each striping to balance it out. Now if the disks on the OSS that had 2 OSTs were twice as large as the other disks, and all disks were being hit at the same rate, then the system would be very balanced. So I'm not removing the weighing-while-striping calculations, just attempting to figure out the optimal arrangement of the pool before any striping begins. What got me thinking about this was the fact that alloc_specific doesn't arrange the OSTs to optimize the OSS bandwidth like RR does. Now if we handle the case of unbalanced network bandwidth then it's a different ball game. Is there any way to know this from the lnet layer?
(In reply to comment #13) > This is only dealing with the case for OSS having different numbers > of OSTs. For the case of say a system with 2 OSS. One OSS has 2 OSTs > and the other OSS has 4 OSTs. For this case the OSS with less OSTs > will have to be selected twice as much to balance the data being processed. That would be needed to balance the network bandwidth, yes. It wouldn't be needed for balancing the space usage or the disk IO bandwidth (assuming that each of the OSTs has the same IO bandwidth). Ideally, we would have both. I think the most common setup is where each OSS has the same number of OSTs, but they are not the same size. As a first-level approximation it should be assumed that the disk IO bandwidth of the OSTs is the same. > Assuming in the above case all disk are equal in size then the > OSS with less OSTs will fill faster I don't think this is correct. Same-sized OSTs should individually fill at about the same speed, if the objects are uniformly allocated and the file sizes are also the same. It does mean that the OSS bandwidth usage is not balanced. > So I'm not removing the weighing while striping calculations. > Just attempting to figure out the optimal arrangement of the pool > before any striping begins. What got me thinking about this was the > fact alloc_specific doesn't arrange the OSTs to optimize the OSS > bandwidth like RR does. Well, if someone is using alloc_specific (which means they have specified a starting OST index, which is not very common) then the assumption is that the user wants specific OSTs to be used. > Now if we handle the case of unbalanced network bandwidth then > its a different ball game. Is there anyway to know this from > the lnet layer? Each OST keeps stats locally for the IOs, including the bytes read and written and the service time. This could be extracted and returned to the MDS as part of the statfs request. 
While this is a desirable feature, and it is worthwhile to plan for additional weighting factors in the new allocator code, I don't think it needs to be implemented in the first version. One possibility is to start with some static bandwidth value per OSS and OST (e.g. 1GB/s and 200MB/s respectively) and base decisions on those values, and later implement the functionality that will make these "real" measured values.
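A sketch of how the static per-OSS/per-OST bandwidth values could seed the weighting until real measurements exist. The 1000 and 200 MB/s figures come from the comment above; the function and macro names are invented:

```c
#include <assert.h>

#define OSS_BW_MBS 1000 /* assumed static OSS network bandwidth */
#define OST_BW_MBS  200 /* assumed static OST disk bandwidth */

/* Effective per-OST bandwidth: an OST cannot exceed its own disk speed,
 * nor its fair share of the OSS network pipe it sits behind. */
static int ost_eff_bandwidth(int ost_bw, int oss_bw, int osts_on_oss)
{
        int share = oss_bw / osts_on_oss;

        return ost_bw < share ? ost_bw : share;
}
```

With the static defaults, an OSS with 4 OSTs leaves each OST disk-limited (200 MB/s), while an OSS with 8 OSTs makes them network-limited (125 MB/s each), so the latter OSS's OSTs would deserve a lower weight, which is the qos-server effect from comment #7.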
From bug 9862, the rr allocator for different-OST-count OSSs looks like this (the same letter represents different OSTs on a single OSS):

3: AAA
3x3: ABABAB
3x4: BBABABA
3x5: BBABBABA
3x5x1: BBABABABC
3x5x2: BABABCBABC
4x6x2: BABABCBABABC

I think this is the "optimized lqr_pool" James refers to in comment #11, and my 0) from comment #7. A new file asks for X stripes; we move a pointer along this list, give it the X stripes, and leave the pointer pointing to the next stripe to be parceled out. This rr pool is (currently) only used if space on all OSTs is approximately equal. The way I see it, this should be the starting point for the OST index array, of which Andreas thinks "we would dynamically skip entries while walking the OST index array" in comment #12. In other words, the array that we're starting from is not simply the array of OST indexes, which is just the order in which the OSTs were added. Starting it as above means that, ignoring any other weighting factors, we load the OSSs as evenly as possible. An additional weighting factor/penalty for lightly/heavily OST'ed OSSs is independent of this, and should depend on relative network/disk performance (and hence is tunable in the current implementation). Other things to be clear about here:

1. Files may have multiple stripes. That's the only reason for the above sorting; if all files had a stripecount of 1 the entire above argument doesn't apply - assuming a large number of files relative to the number of OSTs, we will access all OSTs equally. But for a file with stripecount > 1, we absolutely want to spread the stripes out across OSSs as much as possible (hence the recently-used-OSS weighting in the current code. This actually should be refined - it should be recently-used-within-this-file's-stripe-allocation).

2. Not all stripes are used. If a file is given a stripecount of 5, but the file is only 3MB long, then 2 stripes are empty.
This comes into play if num OSTs mod default stripecount = 0, so that some OSTs never get filled stripes. We had a "precessor" value that would periodically skip a stripe to avoid this in the past, but it could probably be abandoned given that we now check free space.

Something that bothers me about using weights at all: we're comparing apples to oranges to kumquats with a weighting algorithm, and some people are kumquat haters. There needs to be a uniform, clear way to allow users to choose the relative factors. All need to be able to be dialed to 0 if desired.

OST   %full  free_space(MB)  OSS#  precreate_status  disk_speed(MB/s)  network_speed(MB/s)
0000  60%    500             1     2                 40                20
0001  20%    100             2     0                 80                50
0002  50%    500             1     1                 20                20

Which OST do we use next? What is our goal? Greatest average transfer time? Uniform free space? (e.g. I actually want stripe #0 on my SSD drive for all files so that the Windows file browser can draw icons quickly. If I bump up my weighting for disk speed, will more stripes on average live there?) Maybe this is all better done with pools, and we just have hard limits for free-space imbalance and deal better with that case only (comment #3). We need to clarify the scope of what we are trying to accomplish here.
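One possible answer to the apples/oranges/kumquats problem is to normalize every factor to a common 0-256 scale against the best OST for that factor, then combine them with per-factor tunables that can each be dialed to 0. Entirely illustrative; these knobs and names do not exist in the current code:

```c
#include <assert.h>

#define SCALE 256

struct ost_factors {
        unsigned long long free_mb;     /* free space */
        unsigned long long disk_mbs;    /* disk speed */
        unsigned long long net_mbs;     /* network speed */
};

struct qos_tuning {
        unsigned int w_free, w_disk, w_net;     /* 0..256 each; 0 = ignore */
};

/* Each factor is scaled to 0..SCALE relative to the best OST for that
 * factor, then multiplied by its admin-tunable weight and summed. */
static unsigned long long ost_weight(const struct ost_factors *f,
                                     const struct ost_factors *best,
                                     const struct qos_tuning *t)
{
        unsigned long long w = 0;

        w += t->w_free * (f->free_mb  * SCALE / best->free_mb);
        w += t->w_disk * (f->disk_mbs * SCALE / best->disk_mbs);
        w += t->w_net  * (f->net_mbs  * SCALE / best->net_mbs);
        return w;
}
```

Using the table above with only w_free nonzero, OST0000 (500MB free) outweighs OST0001 (100MB free); with only w_net nonzero the ordering flips, which is exactly the "what is our goal?" question in a computable form.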
Created an attachment (id=26578) [details]
First pass at optimized QOS disk space allocator

I apologize for the delay; I was working on some other projects as well. This is a proof-of-concept patch for peer review. As you can see, I went with weight testing in the loop instead of a specially created OST index array. It really is just a basic weighted RR: any OST below the weight is not used. The weight ranges from the OST with the least available space to the OST with the most available space. In the pure round-robin case we set the weight at the smallest disk to ensure no one gets skipped. In qos space mode we go from the most full to the least full. I removed the penalties for now, and I haven't tested at scale yet. Test 116 passed, of course, as well as the regular sanity tests. I also tested at greater imbalances: just using the test 116 settings I get an average of 350% more writes to the more full disk. When I used a 90% full OST the writes were 2500% more to the emptier disk.
> Other things to be clear about here:
> 1. Files may have multiple stripes. That's the only reason for the above sorting; if all files
> had a stripecount of 1 the entire above argument doesn't apply - assuming large number of files
> relative to number of OST's, we will access all OSTs equally. But for a file with stripecount >
> 1, we absolutely want to spread them out across OSS's as much as possible (hence the
> recently-used-OSS weighting in current code. This actually should be refined - it should be
> recently-used-within-this-file's-stripe-allocation).

I was thinking the same thing. Currently I have addressed the issue of disk space balancing. What I noticed is that if one empty disk is added to the system, that disk will be hit hard; hence all OSSs are not used evenly. The greater the imbalance, the faster all the disks are balanced, which means a flood to the OSS that the disk is on. Now, if that OSS is connected by a higher-throughput network connection, and thus can handle more data being thrown at the OST, this becomes less of a problem. If the opposite is true, then the other OSS nodes have to be heavily favored. So data rates into the OSS matter; I just haven't translated that into code yet.

> 2. Not all stripes are used. If a file is given a stripecount of 5, but the file is only 3MB
> long, then 2 stripes are empty. This comes into play if num OSTs mod default stripecount = 0, so
> that some OSTs never get filled stripes. We had a "precessor" value that would periodically skip
> a stripe to avoid this in the past, but it could probably be abandoned given we now check free
> space.

That should not be the case any longer. As for the "precessor" value, I assume you mean lqr_start_idx. Now it makes sense why you had it there in the first place. I guess we can remove it now.

> Something that bothers me about using weights at all - we're comparing apples to oranges to
> kumquats with a weighting algorithm, and some people are kumquat haters.
> There needs to be a uniform, clear way to allow users to choose the relative factors. All need
> to be able to be dialed to 0 if desired.
>
> OST   %full  free_space(MB)  OSS#  precreate_status  disk_speed(MB/s)  network_speed(MB/s)
> 0000  60%    500             1     2                 40                20
> 0001  20%    100             2     0                 80                50
> 0002  50%    500             1     1                 20                20
>
> Which ost do we use next? What is our goal? Greatest average transfer time? Uniform free space?

Currently the goal is uniform free space, but we want to maximize loading as well. I agree it needs admin tunables; thus we have rr_threshold, and I need to get prio_space working.
(From update of attachment 26578 [details])
Inspection template(s):
Bug: 18547
Developer: simmonsja@ornl.gov
Size: 173 Lines of Change
Date: 2009-10-14
Defects:
Type: CODE
Inspector: adilger@sun.com
--------------
>@@ -683,6 +683,9 @@ struct lov_qos {
>+        __u64 lq_ba_max;
>+        __u64 lq_ba_min;
>+        __u64 lq_hcf;

(style) Please add comments for these fields. In particular, what does "hcf" mean?

Can you please add a big comment block to this file which explains the algorithm that you have implemented. I think from your description of the "overloading" of the most-empty OST(s) that you have missed out on some important features of the previous algorithm, namely that it would also add a penalty to the just-selected OSTs and OSSes so that they would not immediately be selected again. The penalty is reduced as other OSTs/OSSes are used for allocations, and it also decays with time, so the most-empty OST(s) will still be selected more often than the others.

>+/* Greatest common divisor */
>+unsigned long gcd(unsigned long a, unsigned long b)
>+{
>+        unsigned long r;
>+
>+        if (a < b)
>+                swap(a, b);
>+        while ((r = a % b) != 0) {
>+                a = b;
>+                b = r;
>+        }
>+        return b;
>+}

This seems like a relatively expensive operation - how often is this called?

> static int qos_calc_ppo(struct obd_device *obd)
> {
>+        down_write(&lov->lov_qos.lq_rw_sem);
>         if (!lov->lov_qos.lq_dirty)
>                 GOTO(out, rc = 0);
>
>         num_active = lov->desc.ld_active_tgt_count - 1;
>         if (num_active < 1)
>+                GOTO(out, rc = 0);
>
>+        /* Count the number of usable OSS's */
>+        lov->lov_qos.lq_active_oss_count = 0;
>         /* find bavail on each OSS */
>         list_for_each_entry(oss, &lov->lov_qos.lq_oss_list, lqo_oss_list) {
>+                if (oss->lqo_ost_count)
>+                        lov->lov_qos.lq_active_oss_count++;
>                 oss->lqo_bavail = 0;
>         }
>
>         /* How badly user wants to select osts "widely" (not recently chosen
>            and not on recent oss's). As opposed to "freely" (free space
>            avail.) 0-256. */
>         prio_wide = 256 - lov->lov_qos.lq_prio_free;
>
>+        lov->lov_qos.lq_ba_max = 0;
>+        lov->lov_qos.lq_ba_min = (__u64)(-1);
>         now = cfs_time_current_sec();
>         /* Calculate OST penalty per object */
>         /* (lov ref taken in alloc_qos) */
>@@ -189,12 +210,15 @@ static int qos_calc_ppo(struct obd_devic
>         temp = TGT_BAVAIL(i);
>         if (!temp)
>                 continue;
>
>+        lov->lov_qos.lq_ba_max = max(temp, lov->lov_qos.lq_ba_max);
>+        lov->lov_qos.lq_ba_min = min(temp, lov->lov_qos.lq_ba_min);
>+        if (i)
>+                lov->lov_qos.lq_hcf = gcd(temp, lov->lov_qos.lq_hcf);
>+        else
>+                lov->lov_qos.lq_hcf = temp;
>+
>+        lov->lov_tgts[i]->ltd_qos.ltq_weight = temp;
>         lov->lov_tgts[i]->ltd_qos.ltq_oss->lqo_bavail += temp;
>
>         /* per-OST penalty is prio * TGT_bavail / (num_ost - 1) / 2 */
>@@ -250,7 +274,9 @@ static int qos_calc_ppo(struct obd_devic
>         /* If each ost has almost same free space,
>          * do rr allocation for better creation performance */
>         lov->lov_qos.lq_same_space = 0;
>+        if ((lov->lov_qos.lq_ba_max * (256 - lov->lov_qos.lq_threshold_rr)) >> 8 <
>+            lov->lov_qos.lq_ba_min) {
>+                /* This will force use of all disk in RR */
>                 lov->lov_qos.lq_same_space = 1;
>                 /* Reset weights for the next time we enter qos mode */
>                 lov->lov_qos.lq_reset = 1;
>@@ -258,99 +284,9 @@ static int qos_calc_ppo(struct obd_devic
>         rc = 0;
>
> out:
>+        up_write(&lov->lov_qos.lq_rw_sem);
>
>+        RETURN(rc);
> }
>
>+static int alloc_stripe(struct lov_obd *lov, struct ost_pool *osts,
>+                        int *idx_arr, int array_start, int *stripe_cnt,
>+                        int stripe_cnt_min, struct lov_qos_rr *lqr)
>+{
>+repeat_find:
>+        if (lov->lov_qos.lq_same_space)
>+                weight = lov->lov_qos.lq_ba_min;
>+        else
>+                weight = lov->lov_qos.lq_ba_max;
>+        array_idx = array_start;
>+        idx_pos = idx_arr;
>+
>+        do {

(style) please use spaces for indenting instead of tabs (in several places).

>+/* return new alloced stripe count on success */
>+static int alloc_idx_array(struct obd_export *exp, struct lov_stripe_md *lsm,
>+                           int newea, int **idx_arr, int *arr_cnt, int flags)
> {
>+
>+        pool = lov_find_pool(lov, lsm->lsm_pool_name);
>         if (pool == NULL) {
>-                osts = &(lov->lov_packed);
>                 lqr = &(lov->lov_qos.lq_rr);
>+                osts = &(lov->lov_packed);

No need to change the order of these, and it is more consistent to keep the assignments in the same order in the "else" clause.

>         } else {
>                 down_read(&pool_tgt_rw_sem(pool));
>                 osts = &(pool->pool_obds);
>                 lqr = &(pool->pool_rr);
>         }
>
>+        rc = qos_calc_rr(lov, osts, lqr);
>+        if (rc)
>+                GOTO(out, rc);
>+        osts = &lqr->lqr_pool;
>
>         rc = qos_calc_ppo(exp->exp_obd);
>         if (rc)
>                 GOTO(out, rc);

Calling qos_calc_ppo() for every object allocation could be very expensive. It doesn't seem that this has been implemented in an incremental manner, and I think in a case like Spider with 1300+ OSTs, it might be too much.

>+        if (newea ||
>+            lsm->lsm_oinfo[0]->loi_ost_idx >= lov->desc.ld_tgt_count)
>+                rc = alloc_rr(lov, tmp_arr, &stripe_cnt, osts, lqr, flags);
>+        else
>+                rc = alloc_specific(lov, lsm, tmp_arr, osts);
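For reference, the gcd() from the patch and the lq_threshold_rr "same space" test can be checked stand-alone. This is a userspace sketch, not the patch code: swap() is expanded by hand, and the helper names mirror the patch fields but are otherwise hypothetical:

```c
/* Greatest common divisor, as in the patch: used to reduce the
 * free-space values so the allocation loop in alloc_stripe can
 * step in larger units. */
static unsigned long gcd(unsigned long a, unsigned long b)
{
        unsigned long r;

        if (a < b) {
                unsigned long t = a;
                a = b;
                b = t;
        }
        while ((r = a % b) != 0) {
                a = b;
                b = r;
        }
        return b;
}

/* The "same space" test from qos_calc_ppo: with threshold_rr on a
 * 0-256 scale, the OSTs are considered balanced (and plain RR is
 * used) when ba_max * (256 - threshold_rr) / 256 < ba_min. */
static int same_space(unsigned long long ba_max,
                      unsigned long long ba_min,
                      unsigned int threshold_rr)
{
        return ((ba_max * (256 - threshold_rr)) >> 8) < ba_min;
}
```

For example, with threshold_rr = 32, a pool whose fullest OST has 100 units free and emptiest has 95 passes the test (87 < 95), while 100 versus 50 does not.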
James A Simmons wrote:
> For bug 18547 you talked about refining the weighting code with a
> recently used within this file's stripe allocation. Do you mean using
> the time difference between when the stripe was created and when the
> stripe was filled with data?

No, recently-used weighting is about ordering, not time. A recently used OSS is one that had a stripe allocated from it more recently than other OSS's. The paragraph I think you're referring to is this:

1. Files may have multiple stripes. That's the only reason for the above sorting; if all files had a stripecount of 1 the entire above argument doesn't apply - assuming large number of files relative to number of OST's, we will access all OSTs equally. But for a file with stripecount > 1, we absolutely want to spread them out across OSS's as much as possible (hence the recently-used-OSS weighting in current code. This actually should be refined - it should be recently-used-within-this-file's-stripe-allocation).

The current code does this: when allocating a stripe from an OSS (i.e. any OST on that OSS), add a penalty to that OSS. Decay that penalty as stripes are allocated from other OSS's. When calculating the weight for any OST, include the per-OSS penalty for its OSS in the calculation.

What it should be is this: when allocating stripes for any single file, try to spread them across OSS's. Using my nomenclature from comment #15:

OSTs:  AABBBB
file1: ABAB
file2: AB
file3: B
file4: B

That is, ignoring any other reasons for allocating on or avoiding a particular OST, we should spread stripes for individual files across OSS's so that network bandwidth is more frequently parallelized. Try to avoid allocations like:

file1: BBBB
file2: AA
> >@@ -683,6 +683,9 @@ struct lov_qos {
> >+        __u64 lq_ba_max;
> >+        __u64 lq_ba_min;
> >+        __u64 lq_hcf;
>
> (style) Please add comments for these fields. In particular, what
> does "hcf" mean?

Highest common factor, also known as greatest common factor.

> Can you please add a big comment block to this file which explains
> the algorithm that you have implemented. I think from your description
> of the "overloading" of the most-empty OST(s) that you have missed out on
> some important features of the previous algorithm, namely that it would
> also add a penalty to the just-selected OSTs and OSSes so that they
> would not immediately be selected again. The penalty is reduced as other
> OSTs/OSSes are used for allocations, and it also decays with time, so the
> most-empty OST(s) will still be selected more often than the others.

No, I have only done disk balancing so far. Next is OSS balancing.

> >+/* Greatest common divisor */
> >+unsigned long gcd(unsigned long a, unsigned long b)
> >+{
> >+        unsigned long r;
> >+
> >+        if (a < b)
> >+                swap(a, b);
> >+        while ((r = a % b) != 0) {
> >+                a = b;
> >+                b = r;
> >+        }
> >+        return b;
> >+}
>
> This seems like a relatively expensive operation - how often is this
> called?

It is called when QOS is enabled, once for each OST. I found using the GCD reduces the loops in alloc_stripe. I have been pondering another approach.

> >+static int alloc_stripe(struct lov_obd *lov, struct ost_pool *osts,
> >+                        int *idx_arr, int array_start, int *stripe_cnt,
> >+                        int stripe_cnt_min, struct lov_qos_rr *lqr)
> >+{
> >+repeat_find:
> >+        if (lov->lov_qos.lq_same_space)
> >+                weight = lov->lov_qos.lq_ba_min;
> >+        else
> >+                weight = lov->lov_qos.lq_ba_max;
> >+        array_idx = array_start;
> >+        idx_pos = idx_arr;
> >+
> >+        do {
>
> (style) please use spaces for indenting instead of tabs (in several places).

Fixed, and hopefully I found them all.
> >+/* return new alloced stripe count on success */
> >+static int alloc_idx_array(struct obd_export *exp, struct lov_stripe_md *lsm,
> >+                           int newea, int **idx_arr, int *arr_cnt, int flags)
> > {
> >+
> >+        pool = lov_find_pool(lov, lsm->lsm_pool_name);
> >         if (pool == NULL) {
> >-                osts = &(lov->lov_packed);
> >                 lqr = &(lov->lov_qos.lq_rr);
> >+                osts = &(lov->lov_packed);
>
> No need to change the order of these, and it is more consistent to
> keep the assignments in the same order in the "else" clause.

Cleaned up.

> >         } else {
> >                 down_read(&pool_tgt_rw_sem(pool));
> >                 osts = &(pool->pool_obds);
> >                 lqr = &(pool->pool_rr);
> >         }
> >
> >+        rc = qos_calc_rr(lov, osts, lqr);
> >+        if (rc)
> >+                GOTO(out, rc);
> >+        osts = &lqr->lqr_pool;
> >
> >         rc = qos_calc_ppo(exp->exp_obd);
> >         if (rc)
> >                 GOTO(out, rc);
>
> Calling qos_calc_ppo() for every object allocation could be very
> expensive. It doesn't seem that this has been implemented in an
> incremental manner, and I think in a case like Spider with 1300+
> OSTs, it might be too much.

In the old code qos_calc_ppo() was also always called. By using the lq_dirty flag you avoid redoing the calculations all the time. Only in QOS mode are the calculations considered.
(From update of attachment 26578 [details])
Inspection template(s):
Bug: 18547
Developer: simmonsja@ornl.gov
Size: 173 Lines of Change
Date: 2009-10-19
Defects:
Type: CODE
Inspector: nathan.rutman@sun.com
--------------

First of all, thanks for the great reduction in common code. Looks like the guts are now all in alloc_stripe. Trying to figure out the general idea from there: assuming OSTs are imbalanced and !lqr and other simplifications, always put the first stripe on the array_start OST. For the remaining stripes, choose in order the OSTs with the most available space. Assuming balanced OSTs, start with array_start and use the OSTs in pool order.

It looks like lq_threshold_rr controls the balanced/imbalanced decision, but once that decision is made we either go fully rr or fully free-space. So although we are removing the randomness, which is probably good, do we really want to be this all-or-nothing about free space? Maybe so. A more complete text description of the algorithm is probably more useful than code at this point.

What I don't see here is (what I think is) one of the key points (from comment #3): "If the weighting was subtle e.g. round-robin, but skipping more-full OSTs 1/n of the times (where n is proportional to the imbalance) then it could essentially be in use all of the time. If we implemented QOS as a part of the round-robin allocator that is always active, just by skipping OSTs periodically (rather than randomly) it would provide much more uniform loading and would prevent the OSTs from getting far out of balance in the first place."

In other words, instead of the all-or-nothing rr or free space, I was imagining we would make lq_threshold_rr a multiplier of how often we skip any OSTs to try to restore the space balance - an "aggressiveness" factor for how quickly we want to restore even loading. You might argue this patch kind of does that, but it looks like it skips the more-full OSTs continuously, until they are no longer more-full. I was hoping for a subtler/slower mechanism.
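The "skip more-full OSTs 1/n of the times" idea from comment #3 can be sketched deterministically (periodic rather than random skipping). This is a hypothetical helper, not code from any patch:

```c
/* Hypothetical sketch of deterministic periodic skipping in a
 * round-robin pass.  The OST with the most free space is never
 * skipped; an OST with 1/n of the maximum free space is used only
 * every n-th pass over the list, i.e. skipped (n-1)/n of the time,
 * matching the ratio described in comment #3. */
static int rr_should_use(unsigned long free, unsigned long max_free,
                         unsigned long pass)
{
        unsigned long n;

        if (free == 0)
                return 0;           /* never allocate on a full OST */
        n = max_free / free;        /* skip factor, >= 1 */
        return (pass % n) == 0;     /* use this OST 1 pass in every n */
}
```

With the 9:1 free-space example from comment #3, an emptier OST (free == max_free) is used on every pass, while a fuller OST with 1/9 of the free space is used only on passes 0, 9, 18, ... - roughly the "skip 8/9 times" behavior, with no randomness.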
(In reply to comment #21)
> What I don't see here is (what I think is) one of the key
> points (from comment #3): "If the weighting was subtle e.g.
> round-robin, but skipping more-full OSTs 1/n of the times
> (where n is proportional to the imbalance) then it could
> essentially be in use all of the time. If we implemented
> QOS as a part of the round-robin allocator that is always
> active, just by skipping OSTs periodically (rather than
> randomly) it would provide much more uniform loading and would
> prevent OSTs from getting far out of balance in the first place."
>
> In other words, instead of the all-or-nothing rr or free
> space, I was imagining instead we would make lq_threshold_rr
> a multiplier of how often we skip any osts to try to restore
> the space balance. An "aggressiveness" factor for how quickly
> we want to restore even loading.
>
> You might argue this patch kind of does that, but it looks
> like it skips the more-full OST's continuously, until they
> are no longer more-full. I was hoping for a slower mechanism.

I agree completely. I was thinking more along the lines of keeping a penalty per OST and per OSS, and then just keeping the OST array in the same order always. We would hit the most-empty OST more often, not by virtue of sorting the array, but by virtue of skipping the less-empty OSTs more often. I would hope that the weighting/penalties on the OSTs would be such that we don't pound on the most-empty OST repeatedly and then skip every other OST when traversing the OST list, but if lq_threshold_rr is set such that space balancing has a high priority over uniformity then that is what would happen. In all cases, we wouldn't allocate from the same OST more than once per file, as that is a sure way to hurt performance.
(In reply to comment #19)
> James A Simmons wrote:
> > For bug 18547 you talked about refining the weighting code with a
> > recently used within this file's stripe allocation. Do you mean using
> > the time difference between when the stripe was created and when the
> > stripe was filled with data?
> No, recently-used weighting is about ordering, not time. A recently used OSS is one that had a
> stripe allocated from it more recently than other OSS's. The paragraph I think you're referring to
> is this:
>
> 1. Files may have multiple stripes. That's the only reason for the above sorting; if all files had
> a stripecount of 1 the entire above argument doesn't apply - assuming large number of files
> relative to number of OST's, we will access all OSTs equally. But for a file with stripecount > 1,
> we absolutely want to spread them out across OSS's as much as possible (hence the recently-used-OSS
> weighting in current code. This actually should be refined - it should be
> recently-used-within-this-file's-stripe-allocation).
>
> The current code does this:
> When allocating a stripe from an OSS (i.e. any OST on that OSS), add a penalty to that OSS. Decay
> that penalty as stripes are allocated from other OSS's. When calculating the weight for any OST,
> include the per-OSS penalty for its OSS in the calculation.
>
> What it should be is this:
> When allocating stripes for any single file, try to spread them across OSS's.
> Using my nomenclature from comment #15,
> OSTs: AABBBB
> file1: ABAB
> file2: AB
> file3: B
> file4: B
> That is, ignoring any other reasons for allocating on or avoiding a particular OST, we should
> spread stripes for individual files across OSS's so that network bandwidth is more frequently
> parallelized. Try to avoid allocations like:
> file1: BBBB
> file2: AA

I have been pondering this. So for the case of the stripe count being less than the number of OSSes, we pick the "best" OSSes. Say for a system with 64 OSSes and default stripe count 4.
The first file will pick the 4 best OSSes. Now the question is what about the second, third pick and so on. It appears the correct thing to do is pick the next best batch of OSSes until we have hit all of them. Then we start over with all the OSSes fresh and pick the best of the best, and so on. Normally, selecting the best and using them will make them not so great, so they have to be marked as not to be used until all OSSes have been used.
(In reply to comment #23)
> I have been pondering this. So for the case of the stripe count being less than the number of OSS
> we pick the "best" OSSs. Say for a system with 64 OSS and default stripe count 4. The first file
> will pick the 4 best OSS. Now the question is what about the second, three pick and so on. It
> appears the correct thing to do is pick the next best batch of OSS until we have hit all OSS.
> Then we start over will all the OSSs fresh and pick the best of the best and so on. Normally
> selecting the best and using them will make them not so great so they have to be marked so not to
> be used until all OSS are used.

Just to be clear, we're using OSS and OST in very particular senses here - they are not synonyms. When you say "64 OSS's" you scare me...

There really is no "best" OSS, although you might say in general that it's better to use OSS's whose total stripe count (sum of all stripes on all OSTs served by that OSS) is less than that of more-utilized OSS's. We can maybe think about it another way: assume the rr allocator is already choosing an order of OSTs such that the list of OSTs spreads nicely among the OSS's. Three OSS's with 2 OSTs each looks like (maybe clearer nomenclature: we'll use OST #n on OSS letter X, so A1 is the first OST on OSS A):

A1 B1 C1 A2 B2 C2

Allocating stripes in order from the list for files of various stripe counts might look like this:

file1: A1
file2: B1 C1 A2
file3: B2
file4: C2 A1

which is fine. The problem comes if, say, C1 and A2 are now pretty full:

file5: B1 (skip full C1) (skip full A2) B2

Now we have two stripes for a two-stripe file on a single OSS, meaning clients cannot access the stripes in parallel. What we want is:

file5: B1 (skip full C1) (skip full A2) (skip sameoss B2) C2

On the other thread here, "full" is a matter of degree, as should be "skip full", but "skip sameoss" is more of an imperative (unless, of course, there is only one OSS).

1. If the number of stripes <= # available OSS's, each stripe should be on a unique OSS.
2. If the number of stripes > # available OSS's, we should try to spread the stripes across OSS's as evenly as possible.
3. If all the OSTs on a particular OSS are overfull, we should treat that OSS as less available. There is a tradeoff here with #2, which qos_prio_free affected.

So in fact, we need to break up qos_prio_free and qos_prio_wide into separate adjustments -- they are not simply the converse of each other, but rather both affect how far we are willing to stray from our optimal rr ordering along two different dimensions. High qos_prio_free will make us more aggressively space balance (skip fuller OSTs), and high qos_prio_wide will make us more aggressively ensure goal #2 in the face of all other factors (free space balance, recently used OSTs, recently used OSS's, anything else we throw into the mix). If qos_prio_wide and qos_prio_free are both set to 0, we will always use the optimal rr ordering (except for the "anything else" factors; maybe we need a third (or more) adjuster, qos_prio_rr, which tries harder to ignore the "anything else"s).
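The "skip sameoss" rule can be sketched with a plain OST-to-OSS index map and a bitmask of OSSes already holding a stripe of the file. All structures and names here are hypothetical, chosen only to illustrate the walk over the A1 B1 C1 A2 B2 C2 example above:

```c
/* Hypothetical sketch of the "skip sameoss" imperative: while
 * choosing stripes for one file, track the OSSes already used as a
 * bitmask and skip any OST whose OSS is already in it, so e.g. a
 * two-stripe file never lands both stripes on one server. */
static int skip_sameoss(unsigned long used_oss_mask, int oss_idx)
{
        return (used_oss_mask >> oss_idx) & 1;
}

/* Walk the OST list in rr order from `start`, returning the index of
 * the next OST on an unused OSS, or -1 if every OSS is already used
 * (at which point a real allocator would relax the rule). */
static int next_stripe_ost(const int *ost_oss, int num_ost,
                           int start, unsigned long used_oss_mask)
{
        int i;

        for (i = 0; i < num_ost; i++) {
                int idx = (start + i) % num_ost;

                if (!skip_sameoss(used_oss_mask, ost_oss[idx]))
                        return idx;
        }
        return -1;
}

/* The example layout above: OSTs A1 B1 C1 A2 B2 C2 on OSSes A=0,
 * B=1, C=2. */
static const int demo_ost_oss[6] = { 0, 1, 2, 0, 1, 2 };

static int demo_next(int start, unsigned long used_oss_mask)
{
        return next_stripe_ost(demo_ost_oss, 6, start, used_oss_mask);
}
```

For file5 above: after B1 is chosen and with A and B marked used, resuming the walk at A2 skips A2 (sameoss A) and B2 (sameoss B) and lands on C2, exactly the desired allocation.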
(In reply to comment #24)
> (In reply to comment #23)
> > I have been pondering this. So for the case of the stripe count being less than the number of OSS
> > we pick the "best" OSSs. Say for a system with 64 OSS and default stripe count 4. The first file
> > will pick the 4 best OSS. Now the question is what about the second, three pick and so on. It
> > appears the correct thing to do is pick the next best batch of OSS until we have hit all OSS.
> > Then we start over will all the OSSs fresh and pick the best of the best and so on. Normally
> > selecting the best and using them will make them not so great so they have to be marked so not to
> > be used until all OSS are used.
>
> Just to be clear, we're using OSS and OST in very particular senses here - they are not synonyms.
> When you say "64 OSS's" you scare me...

I really was talking about the OSSes.

> There really is no "best" OSS, although you might say in general that it's better to use OSS's
> whose total stripe count (sum of all stripes on all OSTs served by that OSS) is less than
> more-utilized OSS's. We can maybe think about it another way: assume the rr allocator is already
> choosing an order of OSTs such that the list of OSTs spreads nicely among the OSS's. Three OSS's
> with 2 OST's each looks like:
> (maybe clearer nomenclature, we'll use the OST #n on OSS letter X: A1 is first ost on oss A.)
> A1 B1 C1 A2 B2 C2
> Allocating stripes in order from the list for files of various stripe counts might look like this:
> file1: A1
> file2: B1 C1 A2
> file3: B2
> file4: C2 A1
> which is fine. The problem comes if say C1 and A2 are now pretty full:
> file5: B1 (skip full C1) (skip full A2) B2
> Now we have two stripes for a two-stripe file on a single OSS, meaning clients cannot access the
> stripes in parallel. What we want is:
> file5: B1 (skip full C1) (skip full A2) (skip sameoss B2) C2

What I was asking is: does the OSS order always have to be the same, i.e. ABC ABC ABC versus ABC BAC ACB, etc.?
That is what I meant by best OSS.
(In reply to comment #23)
> > No, recently-used weighting is about ordering, not time.
> > A recently used OSS is one that had a stripe allocated
> > from it more recently than other OSS's.

I would add a caveat here, Nathan, that you may not be aware of. The QOS code was changed by me to decay the OSS and OST penalties over time (I think they reach 0 in at most 40 minutes, on the assumption that HPC jobs do IO every hour and then sit idle between those times). Otherwise, what happens is that a "good" OST gets a big penalty, sits idle for a long time, and is then ignored by later allocations even though there is no more IO going on.

[note s/OSS/OST/ as needed to clarify below comments]

> I have been pondering this. So for the case of the stripe
> count being less than the number of OSS we pick the "best"
> OSSes. Say for a system with 64 OSTs over 8 OSS nodes, and
> default stripe count 4. The first file will pick the 4 best
> OSTs.

Right, with the value deciding "best" for each OSS and each OST being a time- and load-varying value, WITH HISTORY.

> Now the question is what about the second, three pick and
> so on. It appears the correct thing to do is pick the next
> best batch of OSS until we have hit all OSS.

Now, remember there are multiple variables in play here already (2 for the current QOS; it would be good to have more):
1) which OSS nodes were just used for the previous create
2) which OSTs were just used for the previous create

In the current QOS code there is a "penalty" associated with each OSS and a separate one associated with each OST. Those penalties are in terms of some amount of "negative" available space on the OST and the number of OSTs per OSS. When an OST (on a particular OSS) is selected for an object creation, its penalty is increased, making it a "less good" OST (and hence OSS) to select for the next file. When an OST/OSS is NOT selected for a particular file, its penalty is decreased by some fraction.
The penalties are chosen today so that when all OSS nodes have been selected once, the first OSS's penalty is zero and it will be selected again, and similarly once all OSTs have been selected once, the first OST's penalty will be zero again. The bad thing about the current implementation is that the weight (available - penalty) is only used as a probability for a random number selection. Statistically the allocations will be even, but in practice they are not. When the space or OST distribution is non-uniform, the penalties + available space do not balance out totally evenly, and statistically the allocations will favour OSTs with more free space.

> Then we start over will all the OSSs fresh and pick the
> best of the best and so on.

The one thing to watch out for here is that you don't _necessarily_ want to select all of the OSSes, or all of the OSTs, on every pass if they are far below optimal.

> Normally selecting the best and using them will make them
> not so great so they have to be marked so not to
> be used until all OSS are used.

Again, they should be marked not to be used until they are no longer worse than any other OSS to be used.

I think one thing that could make the new QOS allocator MUCH better than the old one is to use "real" values for the IO bandwidth of the OSTs and network bandwidth of the OSS nodes, so that they can be compared objectively, instead of assuming all OSTs are equal except for available space and all OSSes have bandwidth proportional to the number of OSTs thereon. Since the MDS is already doing OST_STATFS RPCs to the OSTs every 5s, we can return a fairly good idea of the actual disk bandwidth of the OSTs and of the whole OSS. I don't suggest that these values be used to determine the instantaneous load on an OSS/OST (or if they are, they should decay with time as they become less accurate), but rather as a measuring stick by which different OSTs/OSSes can be compared.
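The decay-to-zero penalty arithmetic described here can be modeled with a scalar helper. This is a hypothetical sketch, not the actual lov_qos code:

```c
/* Hypothetical model of the decaying penalty described above:
 * selecting a target charges it a full penalty of
 * penalty_per_obj * (num_tgt - 1); each create that does NOT use it
 * pays back one penalty_per_obj.  So after the other num_tgt - 1
 * targets have each been used once, the penalty is back to zero and
 * the target becomes the preferred candidate again. */
static long penalty_after(long penalty_per_obj, int num_tgt,
                          int creates_since_used)
{
        long p = penalty_per_obj * (long)(num_tgt - 1) -
                 penalty_per_obj * (long)creates_since_used;

        return p > 0 ? p : 0;   /* penalty never goes negative */
}
```

With 4 targets and a per-object penalty of 10, a just-used target carries a penalty of 30, which drains to 0 after the other three targets have each taken one create - matching the "first OSS's penalty is zero and it will be selected again" behavior described above.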
Initially, I have no objection to using some artificial placeholder values for these, until we wire in the lprocfs and adaptive timeout stats to get real performance values at runtime. The important point is that the heuristics should be developed with these metrics in mind initially. We should also make "proxy" estimates of space usage for each file, based on the filesystem-average space usage. I'd suggest using the ORNL/DDN9900 test values as a starting point - 200MB/s per OST, 1200MB/s per OSS. That gives us an algorithm that looks something like what Nathan described (to do the initial OST ordering to spread out the OSSes evenly among the OST array), and then as we walk this array in RR order:

statfs_interpret(new_osfs)
{
        for each pool of this OST {
                pool->available += (new_osfs->b_avail - ost->osfs->b_avail);
                pool->avg_available = pool->available / pool->ost_count;
                pool->used_objects += (new_osfs->used_files - ost->osfs->used_files);
                pool->avg_object_size = pool->available / pool->used_objects;
                pool->bandwidth += (new_osfs->ost_bandwidth - ost->osfs->ost_bandwidth);
                pool->avg_ost_bw = pool->ost_bandwidth / pool->ost_count;
                pool->queue_depth += (new_osfs->oss_queue_depth - ost->osfs->oss_queue_depth);
                pool->queue_rate += (new_osfs->oss_queue_rate - ost->osfs->oss_queue_rate);
        }
        /* This is the same for all OSTs on the OSS, just use the latest value */
        oss->max_bandwidth += (new_osfs->oss_max_bandwidth - ost->osfs->oss_max_bandwidth);
        oss->bw_penalty = (oss->bw_penalty + new_osfs->oss_cur_bandwidth) / 2;
        /* save old values for reuse next time we are called */
        ost->osfs = new_osfs;
}

notify_add_del

alloc_qos()
{
        /* only update statfs data for OSTs in this pool */
        qos_statfs_update(pool);
restart:
        for (cur_stripes = 0, cur_idx = pool->next_index;
             cur_stripes < needed_stripes && cur_idx++ != pool->next_idx;
             cur_idx %= pool->max_idx) {
                /* OSC deactivated, ENOSPC, EROFS, etc */
                if (ost_is_unusable(lov, cur_idx))
                        continue;
                /* This is where all of the heuristic complexity lies.
                 * It needs to combine all of the available metrics to
                 * determine which OSTs are the "best" ones to choose,
                 * from increasingly out-of-date estimates of the current
                 * load on the OSS/OSTs. */
                if (qos_calc_weight(pool, cur_idx) < pool->threshold)
                        continue;
                /* This is an acceptable OST/OSS to use, account for usage */
                qos_used();
                cur_stripes++;
        }
        if (cur_stripes < needed_stripes) {
                reduce pool->threshold;
                recompute pool;
                goto restart;
        }
}

qos_used()
{
        /* create_counter allows us to determine in ost_current_value()
         * how long ago this OSS/OST was last used, instead of having
         * to iterate over all of the OSTs/OSSes on each create to
         * adjust their penalties. */
        counter = atomic_inc_ret(lov->create_counter);
        ost->last_used = counter;
        oss->last_used = counter;
        /* Reduce pool-weighted available resources for this OST/OSS.
         * They are per-pool to avoid skew by OSTs/OSSes outside pool.
         * We assume that an average of the OST bandwidth will be
         * used by this file, so that OSTs/OSSes with more real
         * bandwidth will get proportionately more objects created. */
        ost->avail_penalty += avg_object_size;
        ost->bw_penalty += avg_ost_bw;
        oss->bw_penalty += avg_ost_bw;
}

The most important part of the algorithm is qos_calc_weight() and how we compute the pool->threshold value. Unfortunately, as Nathan previously called this, it is comparing apples to oranges to kumquats, so we need to normalize these values based on the weighting specified by the admin (and possibly the user, later?). Since the alloc_qos() algorithm above is always going through the OST list in round-robin ordering, we don't need to worry about completely eliminating the used OSTs from the selection list each time.

qos_calc_weight(pool, ost_idx)
{
        ost = pool->lov->osts[ost_idx];
        oss = ost->oss;
        /* This OSS/OST has been "unused" since the last time the
         * counter was set, so (assuming even loading) it should
         * have recovered some fraction of its bandwidth. Even if
         * it hasn't, we should assume that the incremental increase
         * in load by adding one more in-use file on an OSS is even
         * across all other OSSes - each one gaining the same fraction.
         *
         * [Note that "/= age" is probably a bad weighting function,
         * but is at least a starting point. Better would be something
         * like ">> (age / factor)" that is exponential decay.] */
        ost_age = lov->create_counter - ost->last_used;
        oss_age = lov->create_counter - oss->last_used;
        ost->avail_penalty /= ost_age;
        ost->bw_penalty /= ost_age;
        oss->bw_penalty /= oss_age;
        oss->lat_penalty /= oss_age; /* this one should probably be computed differently */

        weight = lq_prio_free * (ost->osfs->os_avail - ost->avail_penalty);
        weight += lq_prio_bw * (ost->bandwidth - ost->bw_penalty) +
                  (oss->bandwidth - oss->bw_penalty);
        weight += lq_prio_latency * (oss->queue_depth - oss->queue_penalty);
}
Raising rank, as this is now seen in bug 21258.
(In reply to comment #27)
> Raising rank, as this is now seen in bug 21258.

Could I be given permission to see that bug?
(In reply to comment #28)
> (In reply to comment #27)
> > Raising rank, as this is now seen in bug 21258.
>
> Could I be given permission to see the bug.

Not sure, but the problem in 21258 is this: the current allocator (1.8.1) sometimes allocates stripes from different OSTs on the same OSS before OSTs on different OSS's. I'm not sure it's doing anything "wrong", but optimally it would try to use different OSS's as much as possible. That's already one of the goals of this bug.
Updating Lustre 2.1 project priority: P5.