On 2013/23/03 4:00 PM, "Lee, Brett" wrote:
>Hi Andreas and Oleg,
>
>Sent on Saturday, but hoping you won't read any further until at least
>Monday. :-)
>
>Am finding that teaching Lustre is often a humbling experience.
>Students ask questions and am not always able to figure out the answer.
>One of those areas is as "simple" as how OSTs are selected when
>allocating objects for a new file. Specifically, am trying to clarify
>the qos_threshold_rr and qos_prio_free settings - hoping one of you can
>help by skimming the notes below and offering corrections or answers.
>
>(FWIW, at the bottom of this thread I've added some quotes from the 2.x
>manual around my area of confusion.)
>
>Seems like qos_threshold_rr balances between the two allocation
>algorithms, QOS and RR, like this:
>
>  0% means 0% of the time use RR (always use QOS)
> 10% means 10% of the time use RR
> 20% means 20% of the time use RR
> ...
>100% means 100% of the time use RR

Not quite. It is a threshold, not a ratio: when the free space imbalance
between the most full OST and the least full OST is more than N%, QOS is
used. By default, once the free space imbalance exceeds 16% the allocator
stops using round-robin and uses weighted space balance.

>
>RR always allocates objects on OSTs sequentially, both within a file,
>and for each new file, like this:
>
>File 1: OST1, OST2, OST3, OST4
>File 2: OST2, OST3, OST4
>File 3: OST3, OST4, OST5, OST6, OST7, OST8
>File 4: OST4, OST5, OST6

Not quite, since this would put uneven load on the OSTs. What actually
happens (assuming 8 OSTs numbered 0-7) is:

File 1: OST1, OST2, OST3, OST4
File 2: OST5, OST6, OST7
File 3: OST0, OST1, OST2, OST3, OST4, OST5
File 4: OST6, OST7, OST0

>
>The other setting, qos_prio_free, balances between free space (on the
>OSTs) and other factors (speed and the number of existing objects) like
>this:
>
>0% means 0% of the time allocation is based upon free space
>100% means 100% of the time allocation is based upon free space

Something like that. qos_prio_free = 0 means that each OST is selected
once (i.e. priority for balance), while qos_prio_free = 100 means that
each OST is selected (to some extent) in proportion to the free space on
each OST.

>
>This is where I'm unclear on the behavior of Lustre, so I've added some
>statements below that are my best guess on how Lustre qos_prio_free works.
>
>Seems like the QOS algorithm may itself have two different algorithms:
>free space (FS) and something else (SE), and only one can be selected
>at a time.

Right - round-robin is the default, and weighted space balance is only
used when the free space imbalance between the most and least full OST is
at least qos_threshold_rr. I'd like to unify these into something called
weighted round-robin (see LU-9).

>
>If the FS algorithm is selected, it is implemented only when there is a
>20% differential in OST utilization. Further, that differential is
>taken between the most empty and the most full OST.

Right.

>
>In FS, when there is a 20%+ differential, the OSTs are sorted (least
>full to most full) and objects are allocated to the new file from the
>sorted order.

Not quite - yes, they are sorted in terms of free space, but which one is
selected is based on a random number between 0 and {total number of free
blocks}. An OST with more free space is proportionally more likely to be
selected.

>
>If the FS algorithm is selected and there is not a 20% differential,
>then the SE algorithm is executed.

Yes.

>
>If the SE algorithm is executed, then these values (OSS distribution,
>speed, number of existing objects) are used (please see notes from the
>manual, below), however I don't understand how values are assigned to
>OSS distribution or speed, nor can I grasp how they are factored in.

No, the round-robin algorithm is purely based on the number of OSTs per
OSS. It distributes all of the OSTs evenly so that each OSS gets
allocated once before a second OST on an OSS is allocated.
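
As a rough illustration of the allocation behavior described in the replies
above, here is a minimal Python sketch. It is not the Lustre implementation:
the function names (rr_order, rr_allocate, use_weighted, weighted_pick) are
hypothetical, the imbalance is assumed to be measured as a difference in
percent free space, and 16 is simply the default threshold quoted in the
reply.

```python
import random
from itertools import zip_longest


def rr_order(osts_by_oss):
    """Interleave OSTs so every OSS is used once before any OSS is reused.

    osts_by_oss maps a (hypothetical) OSS name to its list of OST indices.
    """
    order = []
    for group in zip_longest(*osts_by_oss.values()):
        order.extend(ost for ost in group if ost is not None)
    return order


def rr_allocate(order, start, stripe_count):
    """Take stripe_count OSTs from the RR order, continuing from `start`.

    Returns the chosen OSTs and the position where the next file starts.
    """
    n = len(order)
    picked = [order[(start + i) % n] for i in range(stripe_count)]
    return picked, (start + stripe_count) % n


def use_weighted(free_pct, threshold_pct=16):
    """True when the most and least full OSTs differ by more than the threshold.

    free_pct maps OST index to percent free space (an assumed metric).
    """
    return max(free_pct.values()) - min(free_pct.values()) > threshold_pct


def weighted_pick(free_blocks, rng=random):
    """Pick one OST with probability proportional to its free blocks."""
    total = sum(free_blocks.values())
    r = rng.randrange(total)  # random number in [0, total free blocks)
    for ost, free in sorted(free_blocks.items(), key=lambda kv: kv[1], reverse=True):
        if r < free:
            return ost
        r -= free


if __name__ == "__main__":
    # Reproduce the four-file example from the reply: 8 OSTs numbered 0-7,
    # first file starting at OST1, stripe counts 4, 3, 6 and 3.
    order, start = list(range(8)), 1
    for count in (4, 3, 6, 3):
        picked, start = rr_allocate(order, start, count)
        print(picked)
```

Each new file continues where the previous one stopped, which is what keeps
the load even across OSTs in the example above.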
>
>Lastly, and related to this discussion (especially the "OSS distribution"
>factor mentioned above), I was working with a customer (HDS) last week
>that wanted to ensure that writing to a file would not write to two
>sequential objects on the same OSS.

This happens automatically already for round-robin allocation, and is
done at MDS startup time.

>Thus,
>they distributed OSTs across OSSs such that OST0 went on OSS0, OST1
>went on OSS1, OST2 went on OSS2, etc - thinking that when the RR
>algorithm was used, each OST would be written to sequentially and thus
>each OSS would also be written to sequentially as well.

Yes, this is what RR does internally, regardless of what order the OSTs
are actually numbered.

>
>This seemed like a reasonable idea, but wanted to ask your opinion
>while on the topic.
>
>If you've made it this far, I can't thank you enough. Really really
>REALLY. Thank you so much for considering this long question. -Brett
>
>
>From the 2.x manual (p.118 and 234):
>
>Increasing this value puts more weighting on free space. When the free
>space priority is set to 100%, then location is no longer used in
>stripe-ordering calculations and weighting is based entirely on free
>space.
>
>Setting the priority to 100% means that OSS distribution does not count
>in the weighting, but the stripe assignment is still done via
>weighting. If OST 2 has twice as much free space as OST 1, it is twice
>as likely to be used, but it is NOT guaranteed to be used.
>
>Also note that free-space stripe weighting does not activate until two
>OSTs are imbalanced by more than 20%. Until then, a faster round-robin
>stripe allocator is used. (The new round-robin order also maximizes
>network balancing.)
>
>Quality of Service (QOS) considers an OST's available blocks, speed,
>and the number of existing objects, etc. Using these criteria, the MDS
>selects OSTs with more free space more often than OSTs with less free
>space.
>
>--
>Brett Lee
>Sr. Systems Engineer
>Intel High Performance Data Division
>

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division