On 2013/23/03 4:00 PM, "Lee, Brett" <brett.lee@intel.com> wrote:

>Hi Andreas and Oleg,
> 
>Sent on Saturday, but hoping you won't read any further until at least 
>Monday. :)
> 
>Am finding that teaching Lustre is often a humbling experience.  
>Students ask questions and am not always able to figure out the answer.  
>One of those areas is as "simple" as how OSTs are selected when 
>allocating objects for a new file.  Specifically, am trying to clarify 
>the qos_threshold_rr and qos_prio_free settings - hoping one of you can 
>help by skimming the notes below and offering corrections or answers.
> 
>(FWIW, at the bottom of this thread I've added some quotes from the 2.x 
>manual around my area of confusion.)
> 
>Seems like qos_threshold_rr balances between the two allocation 
>algorithms, QOS and RR, like this:
> 
> 0% means 0% of the time use RR (always use QOS)
> 10% means 10% of the time use RR
> 20% means 20% of the time use RR
> ...
> 100% means 100% of the time use RR

Not quite - what it means is that when the free space imbalance between the most full OST and the least full OST is more than N%, QOS is used.  By default, when the free space imbalance is > 16%, the allocator stops using round-robin and uses weighted space balance.
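
As a rough illustration of that switch, here is a minimal Python sketch. It is not the MDS code: the function name, the dict of per-OST percent-free values, and measuring the imbalance as a simple difference in percent free are all assumptions made for the example.

```python
def pick_allocator(pct_free_by_ost, qos_threshold_rr=16):
    """Return "rr" or "qos" for a set of OSTs (hypothetical helper).

    pct_free_by_ost maps an OST index to its free space in percent.
    Assumption: imbalance is the difference in percent free between
    the least full and the most full OST.
    """
    least_full = max(pct_free_by_ost.values())  # most free space
    most_full = min(pct_free_by_ost.values())   # least free space
    imbalance = least_full - most_full
    # Above the threshold, stop using round-robin and weight by space.
    return "qos" if imbalance > qos_threshold_rr else "rr"


balanced = {0: 50, 1: 55, 2: 60}     # 10% spread
unbalanced = {0: 30, 1: 55, 2: 60}   # 30% spread
print(pick_allocator(balanced), pick_allocator(unbalanced))
```

With the default used here, the first set stays on round-robin and the second switches to weighted space balance.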

> 
>RR always allocates objects on OSTs sequentially, both within a file, 
>and for each new file, like this:
> 
>File 1:  OST1, OST2, OST3, OST4
>File 2:  OST2, OST3, OST4
>File 3:  OST3, OST4, OST5, OST6, OST7, OST8
>File 4:  OST4, OST5, OST6

Not quite, since this would put uneven load on the OSTs.  What actually happens (assuming 8 OSTs numbered 0-7) is:

File 1:  OST1, OST2, OST3, OST4
File 2:  OST5, OST6, OST7
File 3:  OST0, OST1, OST2, OST3, OST4, OST5
File 4:  OST6, OST7, OST0
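
The sequence above can be reproduced with a small Python sketch. This is a simplification for illustration only: the class and method names are hypothetical, and it assumes the allocator simply keeps one cursor that carries over from one file to the next.

```python
class RoundRobinAllocator:
    """Toy round-robin allocator with a cursor shared across files."""

    def __init__(self, ost_count, start=0):
        self.ost_count = ost_count
        self.next_ost = start  # where the next file will begin

    def allocate(self, stripe_count):
        # Take stripe_count consecutive OSTs, wrapping at ost_count.
        osts = [(self.next_ost + i) % self.ost_count
                for i in range(stripe_count)]
        # The next file continues where this one stopped.
        self.next_ost = (self.next_ost + stripe_count) % self.ost_count
        return osts


alloc = RoundRobinAllocator(ost_count=8, start=1)
for file_no, stripes in enumerate([4, 3, 6, 3], start=1):
    print(f"File {file_no}:", alloc.allocate(stripes))
```

Because the cursor is never reset, each OST receives the same number of objects over time, which is what keeps the load even.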



> 
>The other setting, qos_prio_free, balances between free space (on the
>OSTs) and other factors (speed and the number of existing objects) like
>this:
> 
>0% means 0% of the time allocation is based upon free space
>100% means 100% of the time allocation is based upon free space

Something like that.  qos_prio_free = 0 means that each OST is selected once (i.e. priority for balance) while qos_prio_free = 100 means that each OST is selected (to some extent) proportional to the free space on each OST.

> 
>This is where I'm unclear on the behavior of Lustre, so I've added some 
>statements below that are my best guess on how Lustre qos_prio_free works.
> 
>Seems like the QOS algorithm may itself have two different algorithms:
>free space (FS) and something else (SE), and only one can be selected 
>at a time.

Right - round-robin is the default, weighted space balance is only used when the free space imbalance between the most and least full OST is at least qos_threshold_rr.  I'd like to unify these into something called weighted round-robin (see LU-9).

> 
>If the FS algorithm is selected, it is implemented only when there is a 
>20% differential in OST utilization.  Further, that differential is 
>taken between the most empty and the most full OST.

Right.

> 
>In FS, when there is a 20%+ differential, the OSTs are sorted (least 
>full to most full) and objects are allocated to the new file from the 
>sorted order.

Not quite - yes, they are sorted in terms of free space, but which one is selected is based on a random number between 0 and {total number of free blocks}.  The one with more free space is proportionally more likely to be selected.
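
A minimal Python sketch of that kind of selection, for illustration only. The function name and the dict of free blocks per OST are assumptions; the real allocator also folds other weights into the choice.

```python
import random


def pick_ost_weighted(free_blocks_by_ost, rng=random):
    """Pick one OST with probability proportional to its free blocks."""
    total = sum(free_blocks_by_ost.values())
    if total <= 0:
        raise ValueError("no free blocks on any OST")
    # Random number in [0, total); each OST "owns" a slice of that
    # range as wide as its free block count.
    target = rng.randrange(total)
    running = 0
    for ost, free in free_blocks_by_ost.items():
        running += free
        if target < running:
            return ost


free = {0: 100, 1: 300}          # OST1 has three times the free space
rng = random.Random(0)           # seeded only to make the run repeatable
picks = [pick_ost_weighted(free, rng) for _ in range(10000)]
print(picks.count(0), picks.count(1))
```

OST1 should be picked roughly three times as often as OST0, but any single pick can still land on OST0.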

> 
>If the FS algorithm is selected and there is not a 20% differential, 
>then the SE algorithm is executed.

Yes.

> 
>If the SE algorithm is executed, then these values (OSS distribution, 
>speed, number of existing objects) are used (please see notes from the 
>manual, below), however I don't understand how values are assigned to 
>OSS distribution or speed, nor can I grasp how they are factored in.

No, the Round-Robin algorithm is purely based on the number of OSTs per OSS.  It distributes all of the OSTs evenly so that each OSS gets allocated once before a second OST on an OSS is allocated.
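
To illustrate the idea, here is a Python sketch that builds such an order. It is an assumption-laden simplification rather than the MDS code: the function name and the OSS-to-OST mapping are hypothetical, and OSSes with fewer OSTs simply drop out of later rounds.

```python
from itertools import zip_longest


def interleave_by_oss(osts_by_oss):
    """Order OSTs so every OSS is used once before any is used twice."""
    order = []
    # Each "round" takes the next unused OST from every OSS.
    for round_ in zip_longest(*osts_by_oss.values()):
        # None marks an OSS that has run out of OSTs.
        order.extend(ost for ost in round_ if ost is not None)
    return order


layout = {"oss0": [0, 1], "oss1": [2, 3], "oss2": [4]}
print(interleave_by_oss(layout))  # [0, 2, 4, 1, 3]
```

Note that the result does not depend on how the OSTs happen to be numbered, only on which OSS each one belongs to.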

> 
>Lastly, and related to this discussion (especially the "OSS distribution"
>factor mentioned above), I was working with a customer (HDS) last week 
>that wanted to ensure that writing to a file would not write to two 
>sequential objects on the same OSS.

This happens automatically already for round-robin allocation, and is done at MDS startup time.

>Thus,
>they distributed OSTs across OSSs such that OST0 went on OSS0, OST1 
>went on OSS1, OST2 went on OSS2, etc - thinking that when the RR 
>algorithm was used, each OST would be written to sequentially and thus 
>each OSS would also be written to sequentially as well.

Yes, this is what RR does internally, regardless of what order the OSTs are actually numbered.

> 
>This seemed like a reasonable idea, but wanted to ask your opinion 
>while on the topic.
> 
>If you've made it this far, I can't thank you enough.  Really really 
>REALLY.  Thank you so much for considering this long question.  -Brett

> 
> 
>From the 2.x manual (p.118 and 234):
> 
>Increasing this value puts more weighting on free space. When the free 
>space priority is set to 100%, then location is no longer used in 
>stripe-ordering calculations and weighting is based entirely on free 
>space.
> 
>Setting the priority to 100% means that OSS distribution does not count 
>in the weighting, but the stripe assignment is still done via 
>weighting. If OST 2 has twice as much free space as OST 1, it is twice 
>as likely to be used, but it is NOT guaranteed to be used.
> 
>Also note that free-space stripe weighting does not activate until two 
>OSTs are imbalanced by more than 20%. Until then, a faster round-robin 
>stripe allocator is used. (The new round-robin order also maximizes 
>network balancing.)
> 
>Quality of Service (QOS) considers an OST's available blocks, speed, 
>and the number of existing objects, etc. Using these criteria, the MDS 
>selects OSTs with more free space more often than OSTs with less free 
>space.
> 
>--
>Brett Lee
>Sr. Systems Engineer
>Intel High Performance Data Division
> 
> 
> 
>


Cheers, Andreas
--
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division