Bugzilla – Bug 18547
optimize QOS RR allocator
Last modified: 2010-05-25 11:21:15
This bug is an extension of bug 18334.
Adilger's comment (transferred from bug 18334 comment 25): Once QOS is active, why did setting lq_prio_free = 0 not result in "mostly" even object allocation (i.e. consider the loading at least as important as available space)? It seems in the tests done by Cray that the OST selection ratio barely changed at all with lq_prio_free = 0, and that is wrong. We need to verify that with decreasing lq_prio_free the available space contributes less to the overall weighting. Even with the larger "qos_threshold_free", eventually one of the OSTs will almost run out of space, and then user applications will begin to fail if they are still allocating on this OST instead of limiting themselves to OSTs that have space free. I don't think having a tunable for "qos_threshold_free" is necessarily bad, but I don't think it will fix this problem in the long term; instead we need to understand why QOS isn't doing a better job. It may be that simply changing the threshold from 20% of the least full OST to, say, 75% of the most full OST will be enough to avoid QOS in most cases. Having a large difference when the OSTs are mostly empty isn't harmful, and when some OSTs are very close to full this will trigger QOS regardless. This will avoid QOS mode in most cases (which is really what the customer wants) but still allow QOS mode to be enabled to avoid hitting ENOSPC when there is still lots of space elsewhere.
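The two trigger rules can be contrasted in a few lines. This is a hypothetical sketch, not the lov_qos.c code: the function name and the hard-coded 75% figure are illustrative assumptions taken from the comment above.

```c
#include <assert.h>

/* Proposed rule from the comment above: enter QOS mode only when some
 * OST's available space falls below 75% of the most-free OST's
 * available space.  A large *absolute* difference between mostly-empty
 * OSTs then no longer triggers QOS, but a nearly-full OST always does.
 * Hypothetical sketch; not the actual lov_qos.c interface. */
static int qos_should_activate(const unsigned long long *bavail, int n)
{
        unsigned long long max_avail = 0;
        int i;

        for (i = 0; i < n; i++)
                if (bavail[i] > max_avail)
                        max_avail = bavail[i];

        for (i = 0; i < n; i++)
                if (bavail[i] < max_avail * 3 / 4)
                        return 1;       /* imbalance is relatively large */
        return 0;                       /* stay in plain round-robin */
}
```

Under the old 20%-of-least-full rule, mostly-empty OSTs with free space { 800, 900, 1000 } would already trigger QOS (1000 - 800 > 0.2 * 800); under the sketch above they do not, while { 50, 900, 1000 } still does.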
There is a lot of discussion of this topic in bz18334; please refer to bz18334 for more details.
More comments from the original bug: I have long thought that the "if (!bavail)" check in qos_prep_create(), which only skips OSTs that have 0 free space, is too weak, and will result in applications hitting ENOSPC on OSTs that have almost no free space, because there is statistically always some chance of them being selected for object allocation. Instead, I think this should be something like:

if (bavail < (obd->osfs.os_bavail >> 5) / num_active)

(i.e. if this OST has less than 3% of the space of an average OST it should be skipped entirely, rather than risking objects being allocated there). Of course, as soon as you pick a fixed number (3%) somebody will want something different. We want the OSTs to be evenly filled for as much of their lives as possible, and to be as evenly loaded (disk and network IO) as possible, for optimum filesystem performance. If the weighting were subtle (e.g. round-robin, but skipping more-full OSTs 1/n of the time, where n is proportional to the imbalance) then it could essentially be in use all of the time. If we implemented QOS as part of the round-robin allocator that is always active, just by skipping OSTs periodically (rather than randomly), it would provide much more uniform loading and would prevent the OSTs from getting far out of balance in the first place. Say the OSTs are 100MB in size (1% = 1MB) and OST10-19 have 9x as much free space as OST0-9; then the RR QOS selection would skip OST0-9 about 8/9 of the time, but never skip OST10-19. If each file created was 1MB in size then the first 11 files would use 1MB on each of the 10 emptier OSTs, and 1MB on one of the fuller OSTs. Repeat 10x and 10MB is used uniformly on each of the emptier OSTs, but only 1MB on each of the fuller OSTs. This is what we tried to do with the random QOS weighting, but just as with the original LOV allocator (which randomly picked OSTs instead of round-robin) the chance of having an imbalance (e.g.
average 2 objects/OST but some had 0 and some had 4) was too high. One suggestion is to use an "accumulated error" mechanism, where each of the OSTs is present in the RR list (in the optimized OST/OSS order it normally is), and each time it is skipped by the RR selection (because the OST's available space is below the threshold or average) the OST free space is added to a "bonus" counter for that OST. When the OST's (available space + bonus) is large enough, the OST will be picked by normal RR selection; its bonus is then reset to 0 and its available space will again be below the threshold, so it is skipped. The MDS can estimate the average file size from the statfs data, so we might increment the bonus by the average file size? One thing we might want to do is initialize the bonus with some random amount (proportional to the free space?) at MDS startup so that some of the full OSTs become available at any time instead of all at once. It seems we should also include some fraction (maybe 1/2, tunable via qos_prio_free) of the bonus value in the average, so that less-used OSTs (with higher bonus) are more likely to be used instead of all being chosen equally. The problem with having equal weights (i.e. not taking the bonus into account for $average_free) is that e.g. the OST immediately following the new one will be picked disproportionately often. In the above example, if the file size is 1 then we skip the full OSTs 9 times each, increasing their bonus by 9, and filling the new OST by approximately 9, giving it a negative bonus (penalty) of 9. The penalty can be reset at the next statfs for this OST, but the bonus cannot. Then $average_free is (10 * 9 + 91) / 10 ≈ 18 and all of the OSTs are in the running. For N-striped files, one idea I had was to subtract the just-used OST's space from the current file's "average" so that the remaining OSTs become proportionally more attractive.
Continuing the above example, but with a 2-stripe file: the current 2-stripe file's $average_free is now 10, and any of the OSTs can be selected. Keeping the filesystem's total available space as a running total, updated only in the statfs interpreter callbacks (without having to iterate over 1000 OSTs to compute it each time), is fairly important for efficiency. If one of the full OSTs had been previously selected then its bonus = -1, and the rest have bonus = 1, so the average is 10.9. The just-used OST only has weight 9 and will not be selected this round. One OST will be picked and its weight set to -1, and the remainder get bonus = 2, so average = 11.7 for the next round, and only OSTs with weight 12 can be selected for the second stripe. For 1-stripe files (using 1/2 weighting of the bonus in the average) we would need to allocate 14 files on the free OST before allocating a file on one of the other OSTs. We need a simulator that can just plug in lov_qos.c so that we can both analyze the current code and validate any changes made. In other projects this has been done by adding unit-test code at the bottom of the file, guarded by "#ifdef UNIT_TEST" so it is not in normal compiles but creates a test executable when compiled with -DUNIT_TEST.
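As a rough illustration of the "accumulated error" mechanism described above, here is a toy user-space simulator in the spirit of the proposed #ifdef UNIT_TEST harness. All structure and function names are invented, and the skip rule (skip when free + bonus is below the average free space, crediting the average file size on each skip) is a simplification of the scheme sketched in the comments:

```c
#include <assert.h>

#define NOSTS 4

struct sim_ost {
        long free;      /* available space, arbitrary units */
        long bonus;     /* credit accumulated while being skipped */
};

/* Walk the RR list from *cursor; skip OSTs whose (free + bonus) is below
 * the average free space, crediting them one filesize per skip.  The
 * selected OST has its bonus reset and its free space reduced. */
static int rr_pick(struct sim_ost *o, int n, int *cursor, long filesize)
{
        long sum = 0, avg;
        int i, idx;

        for (i = 0; i < n; i++)
                sum += o[i].free;
        avg = sum / n;

        for (i = 0; i < n; i++) {
                idx = (*cursor + i) % n;
                if (o[idx].free + o[idx].bonus >= avg) {
                        o[idx].bonus = 0;
                        o[idx].free -= filesize;
                        *cursor = (idx + 1) % n;
                        return idx;
                }
                o[idx].bonus += filesize;       /* skipped: earn credit */
        }
        return -1;      /* unreachable: the most-free OST is >= avg */
}
```

Allocating 100 single-stripe files of size 1 on OSTs with free space { 10, 10, 90, 90 } sends the bulk of the objects to the two emptier OSTs, while the full ones are still picked occasionally once their bonus accumulates.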
Updating Lustre 2.0 project priority: P5.
Begin investigation. With permission I'd like to assign this bug to myself. While examining the code, do we want to apply this balancing to alloc_specific as well? With alloc_specific the first stripe offset must be picked, but all other offsets afterwards could be optimized much like the round-robin case. The second difference I noticed is that the parameters passed to obd_precreate in lov_qos.c vary. In alloc_specific, which makes sense, we have:

if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
    (i != 0 || speed >= 2))
        continue;

alloc_rr:

if (obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed)
        continue;

alloc_qos:

if (obd_precreate(lov->lov_tgts[osts->op_array[i]]->ltd_exp) > 2)
        continue;

Which is correct? Next, I noticed that TGT_FFREE and TGT_BAVAIL are tested only in the alloc_qos case. This seems incorrect?
(In reply to comment #5)
> While examining the code do we want to also apply this balancing
> to alloc_specific as well. With alloc_specific the first stripe
> offset must be picked but all other offsets afterwards could be
> optimized much like the round robin case.

I suspect this would be OK. We don't strictly guarantee that the alloc_specific() case follows sequential OST ordering, but in the past it has only skipped OSTs that are unavailable. If the algorithms for the optimized QOS round-robin allocator are good enough, I would want to just move to a single QOS-aware allocator for everything. That would do the best job of keeping the space usage (along with load, RAID state, bandwidth, etc. in the future) balanced all the time.

> The second difference
> I noticed is that for obd_precreate in lov_qos.c the passing
> parameters vary. In alloc_specific, which makes sense, we have
>
> if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
>     (i != 0 || speed >= 2))
>         continue;
>
> alloc_rr:
>
> if (obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed)
>         continue;
>
> alloc_qos:
>
> if (obd_precreate(lov->lov_tgts[osts->op_array[i]]->ltd_exp) > 2)
>         continue;

I think the differences are partially related to what is being requested by the user. In the alloc_specific() case we don't want to skip the requested OST unless it is dead, but the following stripes (i != 0) may skip OSTs if they are slower than others (i.e. objects not being created). For other allocation modes we don't care which OST the first object is placed on (hence no (i != 0 || speed >= 2) check). The alloc_qos() case focuses more on space usage, and less on the fact that one OST might be slow (e.g. speed == 1).

> Next I noticed testing of TGT_FFREE and TGT_BAVAIL only for the
> alloc_qos case. This seems incorrect?
Well, with your patch in the other bug setting OSCC_FLAG_NOSPC if the OST is out of inodes or space, those OSTs should never be selected in the first place. What is better at the LOV level is to do an "is this OST below some threshold vs. other OSTs, then skip" check, rather than the current "sort based on fullness" algorithm. In addition to using the available space as part of the threshold, the code should be designed to allow other weights to be included (e.g. available bandwidth/IOPS). The current QOS code imposes a "penalty" on recently-allocated OSTs based on a fraction of the total space on the OST, but I suspect this would be better handled by imposing a penalty based on the average file size in the filesystem and/or recently created files, based on a running average of the statfs data. For keeping a running average, it makes sense to use a decaying average that takes the total space consumed over the number of files created, based on the delta between two statfs calls, over the intervals in which the statfs data changed (i.e. not based on wall-clock time).
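The statfs-delta average suggested above can be sketched in a few lines. This is illustrative only (not the actual Lustre statfs path): the structure, names, and the 3/4-old, 1/4-new decay weighting are all assumptions.

```c
#include <assert.h>

/* Estimate the average object size on an OST from the delta between two
 * statfs snapshots, folded into a decaying running average so that old
 * history fades.  Hypothetical sketch; names and weighting are invented. */
struct statfs_snap {
        unsigned long long bavail;      /* free blocks */
        unsigned long long ffree;       /* free objects (inodes) */
};

static unsigned long long
update_avg_obj_size(unsigned long long avg,
                    const struct statfs_snap *prev,
                    const struct statfs_snap *cur)
{
        unsigned long long used, created;

        /* only update for intervals in which space was actually consumed
         * and objects were actually created */
        if (cur->bavail >= prev->bavail || cur->ffree >= prev->ffree)
                return avg;

        used = prev->bavail - cur->bavail;
        created = prev->ffree - cur->ffree;

        /* decaying average: 3 parts history, 1 part new sample */
        return (3 * avg + used / created) / 4;
}
```

Because the update is driven by statfs deltas rather than time, an interval in which nothing was created leaves the average untouched, matching the "not based on wall-clock time" point above.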
(In reply to comment #5)
> Begin investigation. With permission I like to assign this bug to myself. While examining the
> code do we want to also apply this balancing to alloc_specific as well. With alloc_specific the
> first stripe offset must be picked but all other offsets afterwards could be optimized much like
> the round robin case. The second difference I noticed is that for obd_precreate in lov_qos.c the
> passing parameters vary. In alloc_specific, which makes sense, we have
>
> if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
>     (i != 0 || speed >= 2))
>         continue;

The comment on that check says: /* This means "if OSC is slow and it is not the requested start OST, then it can be skipped, otherwise skip it only if it is inactive/recovering/out-of-space." */ The speed values are 0=fast, 1=med, 2=slow, >2=unusable. Since we only speed++ if we can't find enough faster OSTs, and the maximum speed is 2:

if (speed < 2) {
        /* Try again, allowing slower OSCs */
        speed++;

it looks like that check is wrong, and we shouldn't test speed in the if:

if ((obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed) &&
    (i > 0))

> alloc_rr:
>
> if (obd_precreate(lov->lov_tgts[ost_idx]->ltd_exp) > speed)
>         continue;

That is right.

> alloc_qos:
>
> if (obd_precreate(lov->lov_tgts[osts->op_array[i]]->ltd_exp) > 2)
>         continue;

Here we are not doing a "repeat_find" to choose the fastest OSTs. We probably should.

> Which is correct?
>
> Next I noticed testing of TGT_FFREE and TGT_BAVAIL only for the alloc_qos case.
> This seems incorrect?

Well, that is what alloc_qos is all about. The rr allocator does not care about fullness, while the qos allocator does. The qos allocator is used until the OSTs are mostly balanced, at which point the rr allocator is used, so there's no need to use a fullness measure there. But there are so many questions of when to use which OSTs, I think it would be best to remove these three different allocators and rewrite from scratch.
It should be mostly rr; specific is just rr with a given start index, and qos should be included as Andreas describes above (skipping more-full OSTs 1/n of the time). Precreate status should also be taken into account, as well as active/inactive status.

0. Create the RR pattern, based on currently active OSTs. I think the current pattern is good. Recompute the pattern as OSTs are added, removed, activated, or deactivated. Iterate through the pattern, starting at start_ost, and pick OSTs based on usage factors:
1. Inactive: never assign stripes.
2. OSC precreate status 1000: never use.
3. OSC precreate status 1-2: prefer not to use. (We can use the "try again at a lower speed if we can't get enough at the current speed" method currently employed.)
4. qos-space: if an OST has less free space than average by x%, try to skip it y% of the time. Details are in Andreas' comments above.
5. qos-server: if an OSS has fewer OSTs than average, it will not have as much network traffic on it (based on strict rr). So the idea is to increase the number of stripes we assign to low-OST servers (or inversely, if an OSS has more OSTs than average by x%, try to skip it y% of the time; this is what prio_free currently does).

If we can't get as many OSTs as requested, relax some constraints (precreate value, qos skipping) and see if we can. Occasionally move start_ost around to avoid edge cases. This can get extremely tricky (when to skip, how much to skip). We are trying to avoid cases where users writing N-stripe files on M OSTs always end up with stripe 1 on the same OST. (For example, if you always skip one stripe out of M, then you're in trouble when N = M-1.) This is why a simulator (last part of comment #3) would be extremely useful.
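The "skip y% of the time" in steps 4 and 5 need not be random. A deterministic accumulator spreads the skips evenly across the cycle, which matches the earlier observation that random selection produced too much short-term imbalance. This is an illustrative sketch; none of these names exist in lov_qos.c:

```c
#include <assert.h>

/* Skip num out of every den visits, spreading the skips evenly over the
 * cycle instead of deciding randomly (Bresenham-style error spreading).
 * One such state would be kept per OST (or per OSS for step 5). */
struct skip_state {
        int num;        /* skips per cycle */
        int den;        /* visits per cycle */
        int acc;        /* running error accumulator */
};

static int should_skip(struct skip_state *s)
{
        s->acc += s->num;
        if (s->acc >= s->den) {
                s->acc -= s->den;
                return 1;       /* skip this target on this visit */
        }
        return 0;               /* use this target */
}
```

With num = 8, den = 9 (the "9x as much free space" example from comment #3) the target is used exactly once per 9 visits, and the use is never bunched up, avoiding the N = M-1 pathology mentioned above better than a random coin flip would.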
For alloc_rr we have:

*idx_pos = ost_idx;
idx_pos++;
/* We have enough stripes */
if (idx_pos - idx_arr == *stripe_cnt)
        break;
}
if ((speed < 2) && (idx_pos - idx_arr < stripe_cnt_min)) {
        /* Try again, allowing slower OSCs */
        speed++;
        lqr->lqr_start_idx = ost_start_idx_temp;
        goto repeat_find;
}

Whereas in alloc_specific we have:

*idx_pos = ost_idx;
idx_pos++;
/* We have enough stripes */
if (idx_pos - idx_arr == lsm->lsm_stripe_count)
        GOTO(out, rc = 0);
}
if (speed < 2) {
        /* Try again, allowing slower OSCs */
        speed++;
        goto repeat_find;
}

For the round-robin case, even if we find a successful case we end up testing for an OST at a slower speed.
(In reply to comment #8)
> For alloc_rr we have
> /* We have enough stripes */
> if (idx_pos - idx_arr == *stripe_cnt)
>         break;
> }
> if ((speed < 2) && (idx_pos - idx_arr < stripe_cnt_min)) {
>         /* Try again, allowing slower OSCs */
>         speed++;
>         lqr->lqr_start_idx = ost_start_idx_temp;
>         goto repeat_find;
> }
> For the round robin case even if we find a successful case we end up testing for an OST at a
> slower speed.

No, because of the second half of the conditional. idx_pos - idx_arr is the number of allocated OSTs; we only speed++ if we don't have at least stripe_cnt_min. But clearly three slightly different codings are evidence for a unified approach anyhow.
(In reply to comment #9)
> No, because of the second half of the conditional. idx_pos - idx_arr is the number of allocated
> OSTs; we only speed++ if we don't have at least stripe_cnt_min.

So in the case of round-robin EBIG is not possible? I'm considering the case where idx_pos - idx_arr is always less than stripe_cnt_min.
> 5. qos-server. if an OSS has fewer OSTs than average, it will not have as much network traffic on
> it (based on strict rr). So the idea is to increase the number of stripes we assign to low-OST
> servers (or inversely, if an OSS has more OSTs than average by x%, try to skip it y% of the time.
> This is what the prio_free currently does.)

I tested a small simulator for selecting disks for this case. There are two ways to approach this. One is that in the alloc_rr ost_count loop we could skip disks for the balancing. The second option is that in qos_calc_rr we create an optimized lqr_pool. The pro of the second approach is that this special lqr_pool is created only when an OST is added to or subtracted from an OSS, eliminating the need to do a balancing calculation while we are allocating stripes. The con is that the lqr_pool can be larger than the original ost_pool. Playing with the simulator I found this ratio can be very large in the case where we have very few OSSs with lots of OSTs and we add a new OSS with only one OST. This case causes the most severe imbalance. If a new OSS is added, the imbalance shrinks as the number of OSTs on the new OSS approaches the number on the other OSSs. Here are a few examples. In the balanced case each OSS has the same number of disks; the imbalanced case we compare against is when one OSS loses all OSTs except one, which is the worst case.

4 OSS with 2 OSTs: pool size is 8; imbalanced pool size is 10.
4 OSS with 256 OSTs: pool size is 1024; imbalanced pool size is 66304 (65x ratio).
4 OSS with 4096 OSTs: pool size is 16384; imbalanced pool size is 16789504 (1025x ratio).
64 OSS with 2 OSTs: pool size is 128; imbalanced pool size is 130.
64 OSS with 256 OSTs: pool size is 16384; imbalanced pool size is 81664 (5x ratio).
64 OSS with 4096 OSTs: pool size is 262144; imbalanced pool size is 17035264 (65x ratio).
256 OSS with 2 OSTs: pool size is 512; imbalanced pool size is 514.
256 OSS with 256 OSTs: pool size is 65536; imbalanced pool size is 130816 (2x ratio).
256 OSS with 4096 OSTs: pool size is 1048576; imbalanced pool size is 17821696 (17x ratio).

As you can see, the more OSSs there are, even with lots of disks, the smaller the impact of the imbalance. Also, the fewer OSTs per OSS, the quicker the imbalanced pool size approaches the original pool size.
James Simmons <simmonsja@ornl.gov> wrote at 2009-09-04 11:34:02:
> I tested a small simulator for selecting disk for this case. Now there
> are two ways to approach this. One is in alloc_rr ost_count loop
> we could skip the disk for the balancing. The second options is in
> qos_calc_rr we create a optimized lqr_pool.

My thinking on this so far has been that we would dynamically skip entries while walking the OST index array, based on the current "weight" of each particular OST. The weight is potentially made up of many inputs. Currently the weight is the OST's available space, less any penalty for recent allocations, but we may also want to consider having credits that are added onto the OST's weight if it has been skipped enough times. There is a per-OST penalty and a per-OSS penalty, which are very rough estimates for the amount of space that will be consumed on the OST and the remaining IO bandwidth of the OSS. I want it to be flexible enough to include other weighting factors into the mix (e.g. actual IO bandwidth or latency, performance impact due to rebuilding, administrator tuning, etc).

> The pro for the second approach is that this special lqr_pool is
> created only when a new OST is added or substracted to a OSS. Thus
> eliminating the the need to do a balancing calculation when we are
> allocating stripes.

Maybe I'm missing something, but even when OSTs are not being added or removed from the OSS there will be a continuous need for balancing. Various factors like current available space, bandwidth used by recent allocations, etc. are constantly changing and need to be handled. That, IMHO, makes having a static list impractical; the allocator should instead recompute weights frequently, and decide whether an OST is suitable or not by one (or a few) simple threshold checks. It is important to design the algorithm in a way that it can be incrementally updated as new data arrives (e.g.
statfs refresh) if at all possible, instead of requiring a global refresh, because in very large filesystems (i.e. Spider) there are a LOT of OSTs and rescanning all of them would be inefficient. For example, for the available space tracking I would imagine using the average available space on all OSTs as a threshold; to compute the sum(available space) over all OSTs it would track the old available space value, and then when a new statfs arrives it adds the difference between the old and new values into the sum. As OSTs in that pool are selected for an allocation they will get a temporary penalty added, which is the "average file size / num_stripes", but after some time (not sure how long) the temporary penalty is replaced by the actual statfs data (maybe with a decaying penalty average).
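The incremental running-total idea can be shown in a few lines. Names are invented and locking is ignored; this is a sketch of the bookkeeping, not the Lustre statfs interpreter:

```c
#include <assert.h>

struct ost_space {
        unsigned long long bavail;      /* last statfs value seen */
};

/* Fold one fresh statfs result into the filesystem-wide running total by
 * applying only the delta between the old and new values, so no walk
 * over all OSTs is ever needed on a refresh. */
static void statfs_update(struct ost_space *ost,
                          unsigned long long *total_bavail,
                          unsigned long long newval)
{
        *total_bavail += newval;        /* add new value first ... */
        *total_bavail -= ost->bavail;   /* ... then remove the old one */
        ost->bavail = newval;
}
```

The add-then-subtract order keeps the unsigned arithmetic safe, since the total always contains at least the old per-OST value being removed.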
> > The pro for the second approach is that this special lqr_pool is
> > created only when a new OST is added or substracted to a OSS. Thus
> > eliminating the the need to do a balancing calculation when we are
> > allocating stripes.
>
> Maybe I'm missing something, but even when OSTs are not being added
> or removed from the OSS there will be a continous need for balancing.

This is only dealing with the case of OSSs having different numbers of OSTs. Take the case of a system with 2 OSSs, where one OSS has 2 OSTs and the other OSS has 4 OSTs. In this case the OSS with fewer OSTs will have to be selected twice as often to balance the data being processed. This is of course assuming both OSSs have the same network bandwidth; at this point I don't see an easy way to figure that out, but to make things simple let's assume the network bandwidth going to both sets of disks is the same. This is not to say that we are not going to handle the case of the disk space being filled up. Assuming in the above case all disks are equal in size, then the OSS with fewer OSTs will fill faster, thus a counterweight needs to be calculated during each striping to balance it out. Now if the disks on the OSS that had 2 OSTs were twice as large as the other disks, and all disks were being hit at the same rate, then the system would be very balanced. So I'm not removing the weighing-while-striping calculations, just attempting to figure out the optimal arrangement of the pool before any striping begins. What got me thinking about this was the fact that alloc_specific doesn't arrange the OSTs to optimize the OSS bandwidth like RR does. Now if we handle the case of unbalanced network bandwidth then it's a different ball game. Is there any way to know this from the lnet layer?
(In reply to comment #13) > This is only dealing with the case for OSS having different numbers > of OSTs. For the case of say a system with 2 OSS. One OSS has 2 OSTs > and the other OSS has 4 OSTs. For this case the OSS with less OSTs > will have to be selected twice as much to balance the data being processed. That would be needed to balance the network bandwidth, yes. It wouldn't be needed for balancing the space usage or the disk IO bandwidth (assuming that each of the OSTs has the same IO bandwidth). Ideally, we would have both. I think the most common setup is where each OSS has the same number of OSTs, but they are not the same size. As a first-level approximation it should be assumed that the disk IO bandwidth of the OSTs is the same. > Assuming in the above case all disk are equal in size then the > OSS with less OSTs will fill faster I don't think this is correct. Same-sized OSTs should individually fill at about the same speed, if the objects are uniformly allocated and the file sizes are also the same. It does mean that the OSS bandwidth usage is not balanced. > So I'm not removing the weighing while striping calculations. > Just attempting to figure out the optimal arrangement of the pool > before any striping begins. What got me thinking about this was the > fact alloc_specific doesn't arrange the OSTs to optimize the OSS > bandwidth like RR does. Well, if someone is using alloc_specific (which means they have specified a starting OST index, which is not very common) then the assumption is that the user wants specific OSTs to be used. > Now if we handle the case of unbalanced network bandwidth then > its a different ball game. Is there anyway to know this from > the lnet layer? Each OST keeps stats locally for the IOs, including the bytes read and written and the service time. This could be extracted and returned to the MDS as part of the statfs request. 
While this is a desirable feature, and it is worthwhile to plan for additional weighting factors in the new allocator code, I don't think it needs to be implemented in the first version. One possibility is to start with some static bandwidth value per OSS and OST (e.g. 1GB/s and 200MB/s respectively) and base decisions on those values, and later implement the functionality that will make these "real" measured values.
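A sketch of how the static per-OSS/per-OST bandwidth values could seed the weighting until real measurements exist. The 1000 and 200 MB/s figures come from the comment above; the function and macro names are invented:

```c
#include <assert.h>

#define OSS_BW_MBS 1000 /* assumed static OSS network bandwidth */
#define OST_BW_MBS  200 /* assumed static OST disk bandwidth */

/* Effective per-OST bandwidth: an OST cannot exceed its own disk speed,
 * nor its fair share of the OSS network pipe it sits behind. */
static int ost_eff_bandwidth(int ost_bw, int oss_bw, int osts_on_oss)
{
        int share = oss_bw / osts_on_oss;

        return ost_bw < share ? ost_bw : share;
}
```

With the static defaults, an OSS with 4 OSTs leaves each OST disk-limited (200 MB/s), while an OSS with 8 OSTs makes them network-limited (125 MB/s each), so the latter OSS's OSTs would deserve a lower weight, which is the qos-server effect from comment #7.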
From bug 9862, the rr allocator for different-OST-count OSSs looks like this (the same letter represents different OSTs on a single OSS):

3: AAA
3x3: ABABAB
3x4: BBABABA
3x5: BBABBABA
3x5x1: BBABABABC
3x5x2: BABABCBABC
4x6x2: BABABCBABABC

I think this is the "optimized lqr_pool" James refers to in comment #11, and my 0) from comment #7. A new file asks for X stripes; we move a pointer along this list, give it the X stripes, and leave the pointer pointing to the next stripe to be parceled out. This rr pool is (currently) only used if space on all OSTs is approximately equal. The way I see it, this should be the starting point for the OST index array, of which Andreas thinks "we would dynamically skip entries while walking the OST index array" in comment #12. In other words, the array that we're starting from is not simply the array of OST indexes, which is just the order in which the OSTs were added. Starting it as above means that, ignoring any other weighting factors, we load the OSSs as evenly as possible. An additional weighting factor/penalty for lightly/heavily OST'ed OSSs is independent of this, and should depend on relative network/disk performance (and hence is tunable in the current implementation). Other things to be clear about here:

1. Files may have multiple stripes. That's the only reason for the above sorting; if all files had a stripecount of 1 the entire above argument doesn't apply - assuming a large number of files relative to the number of OSTs, we will access all OSTs equally. But for a file with stripecount > 1, we absolutely want to spread the stripes out across OSSs as much as possible (hence the recently-used-OSS weighting in the current code. This actually should be refined - it should be recently-used-within-this-file's-stripe-allocation).

2. Not all stripes are used. If a file is given a stripecount of 5, but the file is only 3MB long, then 2 stripes are empty.
This comes into play if num OSTs mod default stripecount = 0, so that some OSTs never get filled stripes. We had a "precessor" value that would periodically skip a stripe to avoid this in the past, but it could probably be abandoned given that we now check free space.

Something that bothers me about using weights at all: we're comparing apples to oranges to kumquats with a weighting algorithm, and some people are kumquat haters. There needs to be a uniform, clear way to allow users to choose the relative factors. All need to be able to be dialed to 0 if desired.

OST   %full  free_space(MB)  OSS#  precreate_status  disk_speed(MB/s)  network_speed(MB/s)
0000  60%    500             1     2                 40                20
0001  20%    100             2     0                 80                50
0002  50%    500             1     1                 20                20

Which OST do we use next? What is our goal? Greatest average transfer time? Uniform free space? (e.g. I actually want stripe #0 on my SSD drive for all files so that the Windows file browser can draw icons quickly. If I bump up my weighting for disk speed, will more stripes on average live there?) Maybe this is all better done with pools, and we just have hard limits for free-space imbalance and deal better with that case only (comment #3). We need to clarify the scope of what we are trying to accomplish here.
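One possible answer to the apples/oranges/kumquats problem is to normalize every factor to a common 0-256 scale against the best OST for that factor, then combine them with per-factor tunables that can each be dialed to 0. Entirely illustrative; these knobs and names do not exist in the current code:

```c
#include <assert.h>

#define SCALE 256

struct ost_factors {
        unsigned long long free_mb;     /* free space */
        unsigned long long disk_mbs;    /* disk speed */
        unsigned long long net_mbs;     /* network speed */
};

struct qos_tuning {
        unsigned int w_free, w_disk, w_net;     /* 0..256 each; 0 = ignore */
};

/* Each factor is scaled to 0..SCALE relative to the best OST for that
 * factor, then multiplied by its admin-tunable weight and summed. */
static unsigned long long ost_weight(const struct ost_factors *f,
                                     const struct ost_factors *best,
                                     const struct qos_tuning *t)
{
        unsigned long long w = 0;

        w += t->w_free * (f->free_mb  * SCALE / best->free_mb);
        w += t->w_disk * (f->disk_mbs * SCALE / best->disk_mbs);
        w += t->w_net  * (f->net_mbs  * SCALE / best->net_mbs);
        return w;
}
```

Using the table above with only w_free nonzero, OST0000 (500MB free) outweighs OST0001 (100MB free); with only w_net nonzero the ordering flips, which is exactly the "what is our goal?" question in a computable form.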
Created an attachment (id=26578) [details]
First pass at optimized QOS disk space allocator

I apologize for the delay; I was working on some other projects as well. This is a proof-of-concept patch for peer review. As you can see, I went with weight testing in the loop instead of a specially created OST index array. It really is just a basic weighted RR: any OST below the weight is not used. The weight ranges from the OST with the least available space to the OST with the most available space. In the pure round-robin case we set the weight at the smallest disk to ensure no one gets skipped. In qos space mode we go from the most full to the least full. I removed the penalties for now, and I haven't tested at scale yet. Test 116 passed, of course, as well as the regular sanity tests. I also tested at greater imbalances: just using the test 116 settings I get an average of 350% more writes to the more full disk. When I used a 90% full OST the writes were 2500% more to the emptier disk.
> Other things to be clear about here:
> 1. Files may have multiple stripes. That's the only reason for the above sorting; if all files
> had a stripecount of 1 the entire above argument doesn't apply - assuming large number of files
> relative to number of OST's, we will access all OSTs equally. But for a file with stripecount >
> 1, we absolutely want to spread them out across OSS's as much as possible (hence the
> recently-used-OSS weighting in current code. This actually should be refined - it should be
> recently-used-within-this-file's-stripe-allocation).

I was thinking the same thing. Currently I have addressed the issue of disk space balancing. What I noticed is that if one empty disk is added to the system, that disk will be hit hard; hence all OSSs are not used evenly. The greater the imbalance, the faster all the disks are balanced, which means a flood to the OSS that the disk is on. Now, if that OSS is connected by a higher-throughput network connection, and thus can handle more data being thrown at the OST, this becomes less of a problem. If the opposite is true, then the other OSS nodes have to be heavily favored. So data rates into the OSS matter; I just haven't translated that into code yet.

> 2. Not all stripes are used. If a file is given a stripecount of 5, but the file is only 3MB
> long, then 2 stripes are empty. This comes into play if num OSTs mod default stripecount = 0, so
> that some OSTs never get filled stripes. We had a "precessor" value that would periodically skip
> a stripe to avoid this in the past, but it could probably be abandoned given we now check free
> space.

That should not be the case any longer. As for the "precessor" value, I assume you mean lqr_start_idx. Now it makes sense why you had it there in the first place. I guess we can remove it now.

> Something that bothers me about using weights at all - we're comparing apples to oranges to
> kumquats with a weighting algorithm, and some people are kumquat haters.
> There needs to be a uniform, clear way to allow users to choose the relative factors. All need
> to be able to be dialed to 0 if desired.
>
> OST   %full  free_space(MB)  OSS#  precreate_status  disk_speed(MB/s)  network_speed(MB/s)
> 0000  60%    500             1     2                 40                20
> 0001  20%    100             2     0                 80                50
> 0002  50%    500             1     1                 20                20
>
> Which ost do we use next? What is our goal? Greatest average transfer time? Uniform free space?

Currently the goal is uniform free space, but we want to maximize loading as well. I agree it needs admin tunables; thus we have rr_threshold, and I need to get prio_space working.
(From update of attachment 26578 [details])
Inspection template(s):
Bug: 18547
Developer: simmonsja@ornl.gov
Size: 173 Lines of Change
Date: 2009-10-14
Defects:
Type: CODE
Inspector: adilger@sun.com
--------------
>@@ -683,6 +683,9 @@ struct lov_qos {
>+        __u64 lq_ba_max;
>+        __u64 lq_ba_min;
>+        __u64 lq_hcf;

(style) Please add comments for these fields. In particular, what does "hcf" mean?

Can you please add a big comment block to this file which explains the algorithm that you have implemented. I think from your description of the "overloading" of the most-empty OST(s) that you have missed out on some important features of the previous algorithm, namely that it would also add a penalty to the just-selected OSTs and OSSes so that they would not immediately be selected again. The penalty is reduced as other OSTs/OSSes are used for allocations, and it also decays with time, so the most-empty OST(s) will still be selected more often than the others.

>+/* Greatest common divisor */
>+unsigned long gcd(unsigned long a, unsigned long b)
>+{
>+        unsigned long r;
>+
>+        if (a < b)
>+                swap(a, b);
>+        while ((r = a % b) != 0) {
>+                a = b;
>+                b = r;
>+        }
>+        return b;
>+}

This seems like a relatively expensive operation - how often is this called?

> static int qos_calc_ppo(struct obd_device *obd)
> {
>+        down_write(&lov->lov_qos.lq_rw_sem);
>         if (!lov->lov_qos.lq_dirty)
>                 GOTO(out, rc = 0);
>
>         num_active = lov->desc.ld_active_tgt_count - 1;
>         if (num_active < 1)
>+                GOTO(out, rc = 0);
>
>+        /* Count the number of usable OSS's */
>+        lov->lov_qos.lq_active_oss_count = 0;
>         /* find bavail on each OSS */
>         list_for_each_entry(oss, &lov->lov_qos.lq_oss_list, lqo_oss_list) {
>+                if (oss->lqo_ost_count)
>+                        lov->lov_qos.lq_active_oss_count++;
>                 oss->lqo_bavail = 0;
>         }
>
>         /* How badly user wants to select osts "widely" (not recently chosen
>            and not on recent oss's). As opposed to "freely" (free space
>            avail.) 0-256. */
>         prio_wide = 256 - lov->lov_qos.lq_prio_free;
>
>+        lov->lov_qos.lq_ba_max = 0;
>+        lov->lov_qos.lq_ba_min = (__u64)(-1);
>         now = cfs_time_current_sec();
>         /* Calculate OST penalty per object */
>         /* (lov ref taken in alloc_qos) */
>@@ -189,12 +210,15 @@ static int qos_calc_ppo(struct obd_devic
>         temp = TGT_BAVAIL(i);
>         if (!temp)
>                 continue;
>
>+        lov->lov_qos.lq_ba_max = max(temp, lov->lov_qos.lq_ba_max);
>+        lov->lov_qos.lq_ba_min = min(temp, lov->lov_qos.lq_ba_min);
>+        if (i)
>+                lov->lov_qos.lq_hcf = gcd(temp, lov->lov_qos.lq_hcf);
>+        else
>+                lov->lov_qos.lq_hcf = temp;
>+
>+        lov->lov_tgts[i]->ltd_qos.ltq_weight = temp;
>         lov->lov_tgts[i]->ltd_qos.ltq_oss->lqo_bavail += temp;
>
>         /* per-OST penalty is prio * TGT_bavail / (num_ost - 1) / 2 */
>@@ -250,7 +274,9 @@ static int qos_calc_ppo(struct obd_devic
>         /* If each ost has almost same free space,
>          * do rr allocation for better creation performance */
>         lov->lov_qos.lq_same_space = 0;
>+        if ((lov->lov_qos.lq_ba_max * (256 - lov->lov_qos.lq_threshold_rr)) >> 8 <
>+            lov->lov_qos.lq_ba_min) {
>+                /* This will force use of all disk in RR */
>                 lov->lov_qos.lq_same_space = 1;
>                 /* Reset weights for the next time we enter qos mode */
>                 lov->lov_qos.lq_reset = 1;
>@@ -258,99 +284,9 @@ static int qos_calc_ppo(struct obd_devic
>         rc = 0;
>
> out:
>+        up_write(&lov->lov_qos.lq_rw_sem);
>
>+        RETURN(rc);
> }
>
>+static int alloc_stripe(struct lov_obd *lov, struct ost_pool *osts,
>+                        int *idx_arr, int array_start, int *stripe_cnt,
>+                        int stripe_cnt_min, struct lov_qos_rr *lqr)
>+{
>+repeat_find:
>+        if (lov->lov_qos.lq_same_space)
>+                weight = lov->lov_qos.lq_ba_min;
>+        else
>+                weight = lov->lov_qos.lq_ba_max;
>+        array_idx = array_start;
>+        idx_pos = idx_arr;
>+
>+        do {

(style) please use spaces for indenting instead of tabs (in several places).

>+/* return new alloced stripe count on success */
>+static int alloc_idx_array(struct obd_export *exp, struct lov_stripe_md *lsm,
>+                           int newea, int **idx_arr, int *arr_cnt, int flags)
> {
>+
>+        pool = lov_find_pool(lov, lsm->lsm_pool_name);
>         if (pool == NULL) {
>-                osts = &(lov->lov_packed);
>                 lqr = &(lov->lov_qos.lq_rr);
>+                osts = &(lov->lov_packed);

No need to change the order of these, and it is more consistent to keep the assignments in the same order in the "else" clause.

>         } else {
>                 down_read(&pool_tgt_rw_sem(pool));
>                 osts = &(pool->pool_obds);
>                 lqr = &(pool->pool_rr);
>         }
>
>+        rc = qos_calc_rr(lov, osts, lqr);
>+        if (rc)
>+                GOTO(out, rc);
>+        osts = &lqr->lqr_pool;
>
>         rc = qos_calc_ppo(exp->exp_obd);
>         if (rc)
>                 GOTO(out, rc);

Calling qos_calc_ppo() for every object allocation could be very expensive. It doesn't seem that this has been implemented in an incremental manner, and I think in a case like Spider with 1300+ OSTs, it might be too much.

>+        if (newea ||
>+            lsm->lsm_oinfo[0]->loi_ost_idx >= lov->desc.ld_tgt_count)
>+                rc = alloc_rr(lov, tmp_arr, &stripe_cnt, osts, lqr, flags);
>+        else
>+                rc = alloc_specific(lov, lsm, tmp_arr, osts);
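For reference, the gcd() from the patch and the lq_threshold_rr "same space" test can be checked stand-alone. This is a userspace sketch, not the patch code: swap() is expanded by hand, and the helper names mirror the patch fields but are otherwise hypothetical:

```c
/* Greatest common divisor, as in the patch: used to reduce the
 * free-space values so the allocation loop in alloc_stripe can
 * step in larger units. */
static unsigned long gcd(unsigned long a, unsigned long b)
{
        unsigned long r;

        if (a < b) {
                unsigned long t = a;
                a = b;
                b = t;
        }
        while ((r = a % b) != 0) {
                a = b;
                b = r;
        }
        return b;
}

/* The "same space" test from qos_calc_ppo: with threshold_rr on a
 * 0-256 scale, the OSTs are considered balanced (and plain RR is
 * used) when ba_max * (256 - threshold_rr) / 256 < ba_min. */
static int same_space(unsigned long long ba_max,
                      unsigned long long ba_min,
                      unsigned int threshold_rr)
{
        return ((ba_max * (256 - threshold_rr)) >> 8) < ba_min;
}
```

For example, with threshold_rr = 32, a pool whose fullest OST has 100 units free and emptiest has 95 passes the test (87 < 95), while 100 versus 50 does not.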
James A Simmons wrote:
> For bug 18547 you talked about refining the weighting code with a
> recently used within this file's stripe allocation. Do you mean using
> the time difference between when the stripe was created and when the
> stripe was filled with data?

No, recently-used weighting is about ordering, not time. A recently used OSS is one that had a stripe allocated from it more recently than other OSS's. The paragraph I think you're referring to is this:

1. Files may have multiple stripes. That's the only reason for the above sorting; if all files had a stripecount of 1 the entire above argument doesn't apply - assuming large number of files relative to number of OST's, we will access all OSTs equally. But for a file with stripecount > 1, we absolutely want to spread them out across OSS's as much as possible (hence the recently-used-OSS weighting in current code. This actually should be refined - it should be recently-used-within-this-file's-stripe-allocation).

The current code does this: when allocating a stripe from an OSS (i.e. any OST on that OSS), add a penalty to that OSS. Decay that penalty as stripes are allocated from other OSS's. When calculating the weight for any OST, include the per-OSS penalty for its OSS in the calculation.

What it should be is this: when allocating stripes for any single file, try to spread them across OSS's. Using my nomenclature from comment #15:

OSTs:  AABBBB
file1: ABAB
file2: AB
file3: B
file4: B

That is, ignoring any other reasons for allocating on or avoiding a particular OST, we should spread stripes for individual files across OSS's so that network bandwidth is more frequently parallelized. Try to avoid allocations like:

file1: BBBB
file2: AA
> >@@ -683,6 +683,9 @@ struct lov_qos {
> >+        __u64 lq_ba_max;
> >+        __u64 lq_ba_min;
> >+        __u64 lq_hcf;
>
> (style) Please add comments for these fields. In particular, what
> does "hcf" mean?

Highest common factor, also known as greatest common factor.

> Can you please add a big comment block to this file which explains
> the algorithm that you have implemented. I think from your description
> of the "overloading" of the most-empty OST(s) that you have missed out on
> some important features of the previous algorithm, namely that it would
> also add a penalty to the just-selected OSTs and OSSes so that they
> would not immediately be selected again. The penalty is reduced as other
> OSTs/OSSes are used for allocations, and it also decays with time, so the
> most-empty OST(s) will still be selected more often than the others.

No, I have only done disk balancing so far. Next is OSS balancing.

> >+/* Greatest common divisor */
> >+unsigned long gcd(unsigned long a, unsigned long b)
> >+{
> >+        unsigned long r;
> >+
> >+        if (a < b)
> >+                swap(a, b);
> >+        while ((r = a % b) != 0) {
> >+                a = b;
> >+                b = r;
> >+        }
> >+        return b;
> >+}
>
> This seems like a relatively expensive operation - how often is this
> called?

It is called when QOS is enabled, once for each OST. I found using the GCD reduces the loops in alloc_stripe. I have been pondering another approach.

> >+static int alloc_stripe(struct lov_obd *lov, struct ost_pool *osts,
> >+                        int *idx_arr, int array_start, int *stripe_cnt,
> >+                        int stripe_cnt_min, struct lov_qos_rr *lqr)
> >+{
> >+repeat_find:
> >+        if (lov->lov_qos.lq_same_space)
> >+                weight = lov->lov_qos.lq_ba_min;
> >+        else
> >+                weight = lov->lov_qos.lq_ba_max;
> >+        array_idx = array_start;
> >+        idx_pos = idx_arr;
> >+
> >+        do {
>
> (style) please use spaces for indenting instead of tabs (in several places).

Fixed, and hopefully I found them all.
> >+/* return new alloced stripe count on success */
> >+static int alloc_idx_array(struct obd_export *exp, struct lov_stripe_md *lsm,
> >+                           int newea, int **idx_arr, int *arr_cnt, int flags)
> > {
> >+
> >+        pool = lov_find_pool(lov, lsm->lsm_pool_name);
> >         if (pool == NULL) {
> >-                osts = &(lov->lov_packed);
> >                 lqr = &(lov->lov_qos.lq_rr);
> >+                osts = &(lov->lov_packed);
>
> No need to change the order of these, and it is more consistent to
> keep the assignments in the same order in the "else" clause.

Cleaned up.

> >         } else {
> >                 down_read(&pool_tgt_rw_sem(pool));
> >                 osts = &(pool->pool_obds);
> >                 lqr = &(pool->pool_rr);
> >         }
> >
> >+        rc = qos_calc_rr(lov, osts, lqr);
> >+        if (rc)
> >+                GOTO(out, rc);
> >+        osts = &lqr->lqr_pool;
> >
> >         rc = qos_calc_ppo(exp->exp_obd);
> >         if (rc)
> >                 GOTO(out, rc);
>
> Calling qos_calc_ppo() for every object allocation could be very
> expensive. It doesn't seem that this has been implemented in an
> incremental manner, and I think in a case like Spider with 1300+
> OSTs, it might be too much.

In the old code qos_calc_ppo() was also always called. By using the lq_dirty flag you avoid redoing the calculations all the time. Only in QOS mode are the calculations considered.
(From update of attachment 26578 [details])
Inspection template(s):
Bug: 18547
Developer: simmonsja@ornl.gov
Size: 173 Lines of Change
Date: 2009-10-19
Defects:
Type: CODE
Inspector: nathan.rutman@sun.com
--------------

First of all, thanks for the great reduction in common code. Looks like the guts are now all in alloc_stripe. Trying to figure out the general idea from there: assuming OSTs are imbalanced and !lqr and other simplifications, always put the first stripe on the array_start OST. For the remaining stripes, choose in order the OSTs with the most available space. Assuming balanced OSTs, start with array_start and use the OSTs in pool order.

It looks like lq_threshold_rr controls the balanced/imbalanced decision, but once that decision is made we either go fully rr or fully free-space. So although we are removing the randomness, which is probably good, do we really want to be this all-or-nothing about free space? Maybe so. A more complete text description of the algorithm is probably more useful than code at this point.

What I don't see here is (what I think is) one of the key points (from comment #3): "If the weighting was subtle e.g. round-robin, but skipping more-full OSTs 1/n of the times (where n is proportional to the imbalance) then it could essentially be in use all of the time. If we implemented QOS as a part of the round-robin allocator that is always active, just by skipping OSTs periodically (rather than randomly) it would provide much more uniform loading and would prevent the OSTs from getting far out of balance in the first place."

In other words, instead of the all-or-nothing rr or free space, I was imagining we would make lq_threshold_rr a multiplier of how often we skip any OSTs to try to restore the space balance - an "aggressiveness" factor for how quickly we want to restore even loading. You might argue this patch kind of does that, but it looks like it skips the more-full OSTs continuously, until they are no longer more-full. I was hoping for a subtler/slower mechanism.
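The "skip more-full OSTs 1/n of the times" idea from comment #3 can be sketched deterministically (periodic rather than random skipping). This is a hypothetical helper, not code from any patch:

```c
/* Hypothetical sketch of deterministic periodic skipping in a
 * round-robin pass.  The OST with the most free space is never
 * skipped; an OST with 1/n of the maximum free space is used only
 * every n-th pass over the list, i.e. skipped (n-1)/n of the time,
 * matching the ratio described in comment #3. */
static int rr_should_use(unsigned long free, unsigned long max_free,
                         unsigned long pass)
{
        unsigned long n;

        if (free == 0)
                return 0;           /* never allocate on a full OST */
        n = max_free / free;        /* skip factor, >= 1 */
        return (pass % n) == 0;     /* use this OST 1 pass in every n */
}
```

With the 9:1 free-space example from comment #3, an emptier OST (free == max_free) is used on every pass, while a fuller OST with 1/9 of the free space is used only on passes 0, 9, 18, ... - roughly the "skip 8/9 times" behavior, with no randomness.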
(In reply to comment #21)
> What I don't see here is (what I think is) one of the key
> points (from comment #3): "If the weighting was subtle e.g.
> round-robin, but skipping more-full OSTs 1/n of the times
> (where n is proportional to the imbalance) then it could
> essentially be in use all of the time. If we implemented
> QOS as a part of the round-robin allocator that is always
> active, just by skipping OSTs periodically (rather than
> randomly) it would provide much more uniform loading and would
> prevent OSTs from getting far out of balance in the first place."
>
> In other words, instead of the all-or-nothing rr or free
> space, I was imagining instead we would make lq_threshold_rr
> a multiplier of how often we skip any osts to try to restore
> the space balance. An "aggressiveness" factor for how quickly
> we want to restore even loading.
>
> You might argue this patch kind of does that, but it looks
> like it skips the more-full OST's continuously, until they
> are no longer more-full. I was hoping for a slower mechanism.

I agree completely. I was thinking more along the lines of keeping a penalty per OST and per OSS, and then just keeping the OST array in the same order always. We would hit the most-empty OST more often, not by virtue of sorting the array, but by virtue of skipping the less-empty OSTs more often. I would hope that the weighting/penalties on the OSTs would be such that we don't pound on the most-empty OST repeatedly and then skip every other OST when traversing the OST list, but if lq_threshold_rr is set such that space balancing has a high priority over uniformity then that is what would happen. In all cases, we wouldn't allocate from the same OST more than once per file, as that is a sure way to hurt performance.
(In reply to comment #19)
> James A Simmons wrote:
> > For bug 18547 you talked about refining the weighting code with a
> > recently used within this file's stripe allocation. Do you mean using
> > the time difference between when the stripe was created and when the
> > stripe was filled with data?
> No, recently-used weighting is about ordering, not time. A recently used OSS is one that had a
> stripe allocated from it more recently than other OSS's. The paragraph I think you're referring to
> is this:
>
> 1. Files may have multiple stripes. That's the only reason for the above sorting; if all files had
> a stripecount of 1 the entire above argument doesn't apply - assuming large number of files
> relative to number of OST's, we will access all OSTs equally. But for a file with stripecount > 1,
> we absolutely want to spread them out across OSS's as much as possible (hence the recently-used-OSS
> weighting in current code. This actually should be refined - it should be
> recently-used-within-this-file's-stripe-allocation).
>
> The current code does this:
> When allocating a stripe from an OSS (i.e. any OST on that OSS), add a penalty to that OSS. Decay
> that penalty as stripes are allocated from other OSS's. When calculating the weight for any OST,
> include the per-OSS penalty for its OSS in the calculation.
>
> What it should be is this:
> When allocating stripes for any single file, try to spread them across OSS's.
> Using my nomenclature from comment #15,
> OSTs: AABBBB
> file1: ABAB
> file2: AB
> file3: B
> file4: B
> That is, ignoring any other reasons for allocating on or avoiding a particular OST, we should
> spread stripes for individual files across OSS's so that network bandwidth is more frequently
> parallelized. Try to avoid allocations like:
> file1: BBBB
> file2: AA

I have been pondering this. So for the case of the stripe count being less than the number of OSSes, we pick the "best" OSSes. Say for a system with 64 OSSes and default stripe count 4.
The first file will pick the 4 best OSSes. Now the question is what about the second, third pick and so on. It appears the correct thing to do is pick the next best batch of OSSes until we have hit all of them. Then we start over with all the OSSes fresh and pick the best of the best, and so on. Normally, selecting the best and using them will make them not so great, so they have to be marked as not to be used until all OSSes have been used.
(In reply to comment #23)
> I have been pondering this. So for the case of the stripe count being less than the number of OSS
> we pick the "best" OSSs. Say for a system with 64 OSS and default stripe count 4. The first file
> will pick the 4 best OSS. Now the question is what about the second, three pick and so on. It
> appears the correct thing to do is pick the next best batch of OSS until we have hit all OSS.
> Then we start over will all the OSSs fresh and pick the best of the best and so on. Normally
> selecting the best and using them will make them not so great so they have to be marked so not to
> be used until all OSS are used.

Just to be clear, we're using OSS and OST in very particular senses here - they are not synonyms. When you say "64 OSS's" you scare me...

There really is no "best" OSS, although you might say in general that it's better to use OSS's whose total stripe count (sum of all stripes on all OSTs served by that OSS) is less than that of more-utilized OSS's. We can maybe think about it another way: assume the rr allocator is already choosing an order of OSTs such that the list of OSTs spreads nicely among the OSS's. Three OSS's with 2 OSTs each looks like (maybe clearer nomenclature: we'll use OST #n on OSS letter X, so A1 is the first OST on OSS A):

A1 B1 C1 A2 B2 C2

Allocating stripes in order from the list for files of various stripe counts might look like this:

file1: A1
file2: B1 C1 A2
file3: B2
file4: C2 A1

which is fine. The problem comes if, say, C1 and A2 are now pretty full:

file5: B1 (skip full C1) (skip full A2) B2

Now we have two stripes for a two-stripe file on a single OSS, meaning clients cannot access the stripes in parallel. What we want is:

file5: B1 (skip full C1) (skip full A2) (skip sameoss B2) C2

On the other thread here, "full" is a matter of degree, as should be "skip full", but "skip sameoss" is more of an imperative (unless, of course, there is only one OSS).

1. If the number of stripes <= # available OSS's, each stripe should be on a unique OSS.
2. If the number of stripes > # available OSS's, we should try to spread the stripes across OSS's as evenly as possible.
3. If all the OSTs on a particular OSS are overfull, we should treat that OSS as less available. There is a tradeoff here with #2, which qos_prio_free affected.

So in fact, we need to break up qos_prio_free and qos_prio_wide into separate adjustments -- they are not simply the converse of each other, but rather both affect how far we are willing to stray from our optimal rr ordering along two different dimensions. High qos_prio_free will make us more aggressively space balance (skip fuller OSTs), and high qos_prio_wide will make us more aggressively ensure goal #2 in the face of all other factors (free space balance, recently used OSTs, recently used OSS's, anything else we throw into the mix). If qos_prio_wide and qos_prio_free are both set to 0, we will always use the optimal rr ordering (except for the "anything else" factors; maybe we need a third (or more) adjuster, qos_prio_rr, which tries harder to ignore the "anything else"s).
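The "skip sameoss" rule can be sketched with a plain OST-to-OSS index map and a bitmask of OSSes already holding a stripe of the file. All structures and names here are hypothetical, chosen only to illustrate the walk over the A1 B1 C1 A2 B2 C2 example above:

```c
/* Hypothetical sketch of the "skip sameoss" imperative: while
 * choosing stripes for one file, track the OSSes already used as a
 * bitmask and skip any OST whose OSS is already in it, so e.g. a
 * two-stripe file never lands both stripes on one server. */
static int skip_sameoss(unsigned long used_oss_mask, int oss_idx)
{
        return (used_oss_mask >> oss_idx) & 1;
}

/* Walk the OST list in rr order from `start`, returning the index of
 * the next OST on an unused OSS, or -1 if every OSS is already used
 * (at which point a real allocator would relax the rule). */
static int next_stripe_ost(const int *ost_oss, int num_ost,
                           int start, unsigned long used_oss_mask)
{
        int i;

        for (i = 0; i < num_ost; i++) {
                int idx = (start + i) % num_ost;

                if (!skip_sameoss(used_oss_mask, ost_oss[idx]))
                        return idx;
        }
        return -1;
}

/* The example layout above: OSTs A1 B1 C1 A2 B2 C2 on OSSes A=0,
 * B=1, C=2. */
static const int demo_ost_oss[6] = { 0, 1, 2, 0, 1, 2 };

static int demo_next(int start, unsigned long used_oss_mask)
{
        return next_stripe_ost(demo_ost_oss, 6, start, used_oss_mask);
}
```

For file5 above: after B1 is chosen and with A and B marked used, resuming the walk at A2 skips A2 (sameoss A) and B2 (sameoss B) and lands on C2, exactly the desired allocation.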
(In reply to comment #24)
> (In reply to comment #23)
> > I have been pondering this. So for the case of the stripe count being less than the number of OSS
> > we pick the "best" OSSs. Say for a system with 64 OSS and default stripe count 4. The first file
> > will pick the 4 best OSS. Now the question is what about the second, three pick and so on. It
> > appears the correct thing to do is pick the next best batch of OSS until we have hit all OSS.
> > Then we start over will all the OSSs fresh and pick the best of the best and so on. Normally
> > selecting the best and using them will make them not so great so they have to be marked so not to
> > be used until all OSS are used.
>
> Just to be clear, we're using OSS and OST in very particular senses here - they are not synonyms.
> When you say "64 OSS's" you scare me...

I really was talking about the OSSes.

> There really is no "best" OSS, although you might say in general that it's better to use OSS's
> whose total stripe count (sum of all stripes on all OSTs served by that OSS) is less than
> more-utilized OSS's. We can maybe think about it another way: assume the rr allocator is already
> choosing an order of OSTs such that the list of OSTs spreads nicely among the OSS's. Three OSS's
> with 2 OST's each looks like:
> (maybe clearer nomenclature, we'll use the OST #n on OSS letter X: A1 is first ost on oss A.)
> A1 B1 C1 A2 B2 C2
> Allocating stripes in order from the list for files of various stripe counts might look like this:
> file1: A1
> file2: B1 C1 A2
> file3: B2
> file4: C2 A1
> which is fine. The problem comes if say C1 and A2 are now pretty full:
> file5: B1 (skip full C1) (skip full A2) B2
> Now we have two stripes for a two-stripe file on a single OSS, meaning clients cannot access the
> stripes in parallel. What we want is:
> file5: B1 (skip full C1) (skip full A2) (skip sameoss B2) C2

What I was asking is: does the OSS order always have to be the same, i.e. ABC ABC ABC versus ABC BAC ACB, etc.?
That is what I meant by best OSS.
(In reply to comment #23)
> > No, recently-used weighting is about ordering, not time.
> > A recently used OSS is one that had a stripe allocated
> > from it more recently than other OSS's.

I would add a caveat here, Nathan, that you may not be aware of. The QOS code was changed by me to decay the OSS and OST penalties over time (I think they reach 0 in at most 40 minutes, on the assumption that HPC jobs do IO every hour and then sit idle between those times). Otherwise, what happens is that a "good" OST gets a big penalty, sits idle for a long time, and is then ignored by later allocations even though there is no more IO going on.

[note s/OSS/OST/ as needed to clarify below comments]

> I have been pondering this. So for the case of the stripe
> count being less than the number of OSS we pick the "best"
> OSSes. Say for a system with 64 OSTs over 8 OSS nodes, and
> default stripe count 4. The first file will pick the 4 best
> OSTs.

Right, with the value deciding "best" for each OSS and each OST being a time- and load-varying value, WITH HISTORY.

> Now the question is what about the second, three pick and
> so on. It appears the correct thing to do is pick the next
> best batch of OSS until we have hit all OSS.

Now, remember there are multiple variables in play here already (2 for the current QOS; it would be good to have more):
1) which OSS nodes were just used for the previous create
2) which OSTs were just used for the previous create

In the current QOS code there is a "penalty" associated with each OSS and a separate one associated with each OST. Those penalties are in terms of some amount of "negative" available space on the OST and the number of OSTs per OSS. When an OST (on a particular OSS) is selected for an object creation, its penalty is increased, making it a "less good" OST (and hence OSS) to select for the next file. When an OST/OSS is NOT selected for a particular file, its penalty is decreased by some fraction.
The penalties are chosen today so that when all OSS nodes have been selected once, the first OSS's penalty is zero and it will be selected again, and similarly once all OSTs have been selected once, the first OST's penalty will be zero again. The bad thing about the current implementation is that the weight (available - penalty) is only used as a probability for a random number selection. Statistically the allocations will be even, but in practice they are not. When the space or OST distribution is non-uniform, the penalties + available space do not balance out totally evenly, and statistically the allocations will favour OSTs with more free space.

> Then we start over will all the OSSs fresh and pick the
> best of the best and so on.

The one thing to watch out for here is that you don't _necessarily_ want to select all of the OSSes, or all of the OSTs, on every pass if they are far below optimal.

> Normally selecting the best and using them will make them
> not so great so they have to be marked so not to
> be used until all OSS are used.

Again, they should be marked not to be used until they are no longer worse than any other OSS to be used.

I think one thing that could make the new QOS allocator MUCH better than the old one is to use "real" values for the IO bandwidth of the OSTs and network bandwidth of the OSS nodes, so that they can be compared objectively, instead of assuming all OSTs are equal except for available space and all OSSes have bandwidth proportional to the number of OSTs thereon. Since the MDS is already doing OST_STATFS RPCs to the OSTs every 5s, we can return a fairly good idea of the actual disk bandwidth of the OSTs and of the whole OSS. I don't suggest that these values be used to determine the instantaneous load on an OSS/OST (or if they are, they should decay with time as they become less accurate), but rather as a measuring stick by which different OSTs/OSSes can be compared.
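The decay-to-zero penalty arithmetic described here can be modeled with a scalar helper. This is a hypothetical sketch, not the actual lov_qos code:

```c
/* Hypothetical model of the decaying penalty described above:
 * selecting a target charges it a full penalty of
 * penalty_per_obj * (num_tgt - 1); each create that does NOT use it
 * pays back one penalty_per_obj.  So after the other num_tgt - 1
 * targets have each been used once, the penalty is back to zero and
 * the target becomes the preferred candidate again. */
static long penalty_after(long penalty_per_obj, int num_tgt,
                          int creates_since_used)
{
        long p = penalty_per_obj * (long)(num_tgt - 1) -
                 penalty_per_obj * (long)creates_since_used;

        return p > 0 ? p : 0;   /* penalty never goes negative */
}
```

With 4 targets and a per-object penalty of 10, a just-used target carries a penalty of 30, which drains to 0 after the other three targets have each taken one create - matching the "first OSS's penalty is zero and it will be selected again" behavior described above.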
Initially, I have no objection to using some artificial placeholder values for these, until we wire in the lprocfs and adaptive timeout stats to get real performance values at runtime. The important point is that the heuristics should be developed with these metrics in mind initially. We should also make "proxy" estimates of space usage for each file, based on the filesystem-average space usage. I'd suggest using the ORNL/DDN9900 test values as a starting point - 200MB/s per OST, 1200MB/s per OSS. That gives us an algorithm that looks something like what Nathan described (to do the initial OST ordering to spread out the OSSes evenly among the OST array), and then as we walk this array in RR order:

statfs_interpret(new_osfs)
{
        for each pool of this OST {
                pool->available += (new_osfs->b_avail - ost->osfs->b_avail);
                pool->avg_available = pool->available / pool->ost_count;
                pool->used_objects += (new_osfs->used_files - ost->osfs->used_files);
                pool->avg_object_size = pool->available / pool->used_objects;
                pool->bandwidth += (new_osfs->ost_bandwidth - ost->osfs->ost_bandwidth);
                pool->avg_ost_bw = pool->ost_bandwidth / pool->ost_count;
                pool->queue_depth += (new_osfs->oss_queue_depth - ost->osfs->oss_queue_depth);
                pool->queue_rate += (new_osfs->oss_queue_rate - ost->osfs->oss_queue_rate);
        }
        /* This is the same for all OSTs on the OSS, just use the latest value */
        oss->max_bandwidth += (new_osfs->oss_max_bandwidth - ost->osfs->oss_max_bandwidth);
        oss->bw_penalty = (oss->bw_penalty + new_osfs->oss_cur_bandwidth) / 2;
        /* save old values for reuse next time we are called */
        ost->osfs = new_osfs;
}

notify_add_del

alloc_qos()
{
        /* only update statfs data for OSTs in this pool */
        qos_statfs_update(pool);
restart:
        for (cur_stripes = 0, cur_idx = pool->next_index;
             cur_stripes < needed_stripes && cur_idx++ != pool->next_idx;
             cur_idx %= pool->max_idx) {
                /* OSC deactivated, ENOSPC, EROFS, etc */
                if (ost_is_unusable(lov, cur_idx))
                        continue;
                /* This is where all of the heuristic complexity lies.
                 * It needs to combine all of the available metrics to
                 * determine which OSTs are the "best" ones to choose,
                 * from increasingly out-of-date estimates of the current
                 * load on the OSS/OSTs. */
                if (qos_calc_weight(pool, cur_idx) < pool->threshold)
                        continue;
                /* This is an acceptable OST/OSS to use, account for usage */
                qos_used();
                cur_stripes++;
        }
        if (cur_stripes < needed_stripes) {
                reduce pool->threshold;
                recompute pool;
                goto restart;
        }
}

qos_used()
{
        /* create_counter allows us to determine in ost_current_value()
         * how long ago this OSS/OST was last used, instead of having
         * to iterate over all of the OSTs/OSSes on each create to
         * adjust their penalties. */
        counter = atomic_inc_ret(lov->create_counter);
        ost->last_used = counter;
        oss->last_used = counter;
        /* Reduce pool-weighted available resources for this OST/OSS.
         * They are per-pool to avoid skew by OSTs/OSSes outside pool.
         * We assume that an average of the OST bandwidth will be
         * used by this file, so that OSTs/OSSes with more real
         * bandwidth will get proportionately more objects created. */
        ost->avail_penalty += avg_object_size;
        ost->bw_penalty += avg_ost_bw;
        oss->bw_penalty += avg_ost_bw;
}

The most important part of the algorithm is qos_calc_weight() and how we compute the pool->threshold value. Unfortunately, as Nathan previously called this, it is comparing apples to oranges to kumquats, so we need to normalize these values based on the weighting specified by the admin (and possibly the user, later?). Since the alloc_qos() algorithm above is always going through the OST list in round-robin ordering, we don't need to worry about completely eliminating the used OSTs from the selection list each time.

qos_calc_weight(pool, ost_idx)
{
        ost = pool->lov->osts[ost_idx];
        oss = ost->oss;
        /* This OSS/OST has been "unused" since the last time the
         * counter was set, so (assuming even loading) it should
         * have recovered some fraction of its bandwidth. Even if
         * it hasn't, we should assume that the incremental increase
         * in load by adding one more in-use file on an OSS is even
         * across all other OSSes - each one gaining the same fraction.
         *
         * [Note that "/= age" is probably a bad weighting function,
         * but is at least a starting point. Better would be something
         * like ">> (age / factor)" that is exponential decay.] */
        ost_age = lov->create_counter - ost->last_used;
        oss_age = lov->create_counter - oss->last_used;
        ost->avail_penalty /= ost_age;
        ost->bw_penalty /= ost_age;
        oss->bw_penalty /= oss_age;
        oss->lat_penalty /= oss_age; /* this one should probably be computed differently */

        weight = lq_prio_free * (ost->osfs->os_avail - ost->avail_penalty);
        weight += lq_prio_bw * (ost->bandwidth - ost->bw_penalty) +
                  (oss->bandwidth - oss->bw_penalty);
        weight += lq_prio_latency * (oss->queue_depth - oss->queue_penalty);
}
Raising rank, as this is now seen in bug 21258.
(In reply to comment #27)
> Raising rank, as this is now seen in bug 21258.

Could I be given permission to see that bug?
(In reply to comment #28)
> (In reply to comment #27)
> > Raising rank, as this is now seen in bug 21258.
>
> Could I be given permission to see the bug.

Not sure, but the problem in 21258 is this: the current allocator (1.8.1) sometimes allocates stripes from different OSTs on the same OSS before OSTs on different OSS's. I'm not sure it's doing anything "wrong", but optimally it would try to use different OSS's as much as possible. That's already one of the goals of this bug.
Updating Lustre 2.1 project priority: P5.