[LU-9809] RTDS (Real-Time Dynamic Striping): A policy-based striping framework Created: 31/Jul/17 Updated: 27/Dec/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Li Xi (Inactive) | Assignee: | Li Xi |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
Currently, there are several ways to control the striping of files:
1) Default striping inherited from the parent directory
2) OST pool based striping inherited from the parent directory
3) Specific striping on a user defined set of OSTs, specified when creating the file with open

The default or OST pool based striping tries to ensure that space is evenly used across the OSTs. The application can surely control the striping of newly created files by specifying the layout explicitly, but none of these mechanisms lets the administrator adjust the allocation policy dynamically at run time. That is the reason why we implemented an entirely new striping framework.

RTDS controls the striping based on the allocating weights of the OSTs. When OST i has weight W[i], define the cumulative weight

S[i] = W[0] + W[1] + W[2] + ... + W[i]

In the implementation of RTDS, an RTDS tree is used to choose the OST according to the following rules:
1) The leaves of the RTDS tree are an array with the weight values, i.e. W[i];
2) The value of a non-leaf node is S[x], where x is the biggest index of the leaves in its left sub-tree;
3) The left sub-tree of a non-leaf node should always be a complete binary tree.

Because of rule 3), if an RTDS tree has N leaves (N > 2), then its left sub-tree is a complete binary tree, so the shape of the tree is fixed once N is known. Following is the allocation tree for 8 OSTs.
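A minimal sketch of these rules, assuming a power-of-two OST count so the whole tree is complete (the heap-style array layout and helper names below are illustrative, not from the patch):

```python
import random

def build_rtds_tree(weights):
    """Build a binary tree over the OST weights, following the rules above.

    Leaves hold W[i]; each internal node holds S[x] = W[0] + ... + W[x],
    where x is the largest leaf index in its left sub-tree.  For simplicity
    this sketch assumes a power-of-two number of OSTs, so the whole tree
    (not only every left sub-tree) is a complete binary tree.
    """
    n = len(weights)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two OST count"
    prefix, total = [], 0
    for w in weights:
        total += w
        prefix.append(total)              # prefix[i] == S[i]
    tree = [0] * (2 * n)                  # heap layout, root at index 1
    tree[n:] = weights                    # leaves n .. 2n-1 hold W[0..n-1]
    for node in range(n - 1, 0, -1):
        leaf = 2 * node                   # left child of this node
        while leaf < n:                   # walk down to the rightmost leaf below it
            leaf = 2 * leaf + 1
        tree[node] = prefix[leaf - n]     # S[x], x = last leaf index on the left
    return tree, total

def rtds_pick(tree, total, n):
    """Pick an OST with probability proportional to its weight (O(log N))."""
    r = random.uniform(0, total)          # value fed into the tree traversal
    node = 1
    while node < n:                       # go left when r falls below S[x]
        node = 2 * node if r < tree[node] else 2 * node + 1
    return node - n                       # leaf position == OST index

if __name__ == "__main__":
    weights = [1, 1, 2, 2, 4, 4, 8, 8]    # hypothetical weights for 8 OSTs
    tree, total = build_rtds_tree(weights)
    counts = [0] * len(weights)
    for _ in range(100000):
        counts[rtds_pick(tree, total, len(weights))] += 1
    print(counts)                         # roughly proportional to the weights
```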
When choosing an OST to allocate the object, the RTDS policy first generates a random number in the range [0, S[N - 1]) and walks down the RTDS tree, going left when the number is smaller than the node value and right otherwise, until it reaches the leaf of the selected OST.

Currently, only a single RTDS tree is used for default striping. In the future, multiple RTDS trees can be maintained, e.g. one for each OST pool.

By using RTDS, the administrator will be able to fully control the striping of the file system. Some example use cases:
1) When new OSTs are added into the file system, the administrator might want to give them larger weights for a while so that space usage is rebalanced quickly.
2) If the file system has two kinds of OSTs, SSD based OSTs and HDD based OSTs, then different weights can be assigned to the two kinds of OSTs.
3) When OSTs are doing RAID rebuilding, there might be some performance degradation on them, so their weights can be lowered temporarily.
4) In order to provide better QoS guarantees, the administrator might implement a daemon that adjusts the weights according to the real-time load of the OSTs.

Currently, RTDS is only implemented for file striping. However, this general weight-based framework could also be used for other purposes. If RTDS works well, the code of OST pool can be removed completely, since RTDS can provide the same functionality. |
| Comments |
| Comment by Li Xi (Inactive) [ 31/Jul/17 ] |
|
I will push the patch soon. |
| Comment by Gerrit Updater [ 01/Aug/17 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: https://review.whamcloud.com/28292 |
| Comment by Li Xi (Inactive) [ 01/Aug/17 ] |
|
Following are some examples of using RTDS. The test system has two OSTs. create.sh creates 1000 files and checks the stripe OST index of each file.

[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'cat /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
0=1,1=1
[root@server17-el7-vm2 rtds]# sh create.sh
zero 483, one 517, all:1000
[root@server17-el7-vm2 rtds]# sh create.sh
zero 501, one 499, all:1000
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'echo "0=1,1=2" > /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'cat /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
0=1,1=2
[root@server17-el7-vm2 rtds]# sh create.sh
zero 344, one 656, all:1000
[root@server17-el7-vm2 rtds]# sh create.sh
zero 347, one 653, all:1000
[root@server17-el7-vm2 rtds]# sh create.sh
zero 341, one 659, all:1000
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'echo "0=1,1=100" > /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'cat /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
0=1,1=100
[root@server17-el7-vm2 rtds]# sh create.sh
zero 13, one 987, all:1000
[root@server17-el7-vm2 rtds]# sh create.sh
zero 12, one 988, all:1000
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'echo "0=0,1=1" > /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'echo "0=0,1=1" > /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
[root@server17-el7-vm2 rtds]# ssh server17-el7-vm1 'cat /proc/fs/lustre/lod/vm1-MDT0000-mdtlov/rtds_weight'
0=0,1=1
[root@server17-el7-vm2 rtds]# sh create.sh
zero 0, one 1000, all:1000
[root@server17-el7-vm2 rtds]# sh create.sh
zero 0, one 1000, all:1000
[root@server17-el7-vm2 rtds]# sh create.sh
zero 0, one 1000, all:1000 |
| Comment by Peter Jones [ 01/Aug/17 ] |
|
Andreas, can you please review this proposed feature and comment? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 01/Aug/17 ] |
|
Interesting work. A few questions:
1. Is there any tool running on the MDS that iterates all OSTs and adjusts their weights on the fly?
2. When multiple objects are assigned to a file, what enhancement should be implemented to RTDS so that it won't allocate the objects from the same OST, or even more, from the same OSS?
3. If I understand it correctly, in order to implement OST pool by RTDS, a new RTDS tree will be created for each 'pool'. In the RTDS tree, the OSTs outside the pool are assigned with weight value 0 so that they won't be bothered in object allocation. This 'pool' RTDS can be attached to a directory and indicated in a CLI of file creation. However, what's the benefit of doing this?
btw, is the RTDS tree invented by yourself? Otherwise, can you please tell me the name of the tree? |
| Comment by Jinshan Xiong (Inactive) [ 01/Aug/17 ] |
|
It would also be interesting to compare RTDS with the current object creation policy. For example, set one OST to be 50% full and another one completely empty, set the RTDS weights to 50 and 100 respectively, then allocate 1000 objects and investigate the distribution. Surely RTDS has better control over object allocation due to the ability of setting the distribution on the fly. |
| Comment by Li Xi (Inactive) [ 02/Aug/17 ] |
|
"1. Is there any tool running on the MDS that iterates all OSTs and adjusts their weights on the fly?"

Yes. There should be a daemon on the MDS which collects all the necessary information (e.g. free space or bandwidth on all OSTs) and adjusts the weights from time to time. The process of adjusting the weights is quick: it only needs to hold a write lock and change the values in an array. The complexity of the process is O(N), where N is the number of OSTs.

"2. When multiple objects are assigned to a file, what enhancement should be implemented to RTDS so that it won't allocate the objects from the same OST, or even more, from the same OSS?"

Good question. I was assuming that most files only have one stripe. I know users will use multiple stripes if they need to create really big files, but most customers always use stripe count=1.

In order to avoid allocating multiple objects on the same OST, the simplest idea would be dropping the selected OST if the OST has already been selected as one of the stripes of the file. However, this idea has some problems. Let's assume there are only two OSTs, and one OST is configured with a weight of 1% while the other is configured with a weight of 99%. Then it is likely the latter OST will be selected and then discarded many times before the first OST is finally selected. It is a waste of time.

Another idea would be: the weights of the OSTs are changed dynamically when allocating. When allocating multiple objects for a single file, if an OST has already been chosen for one object, then the weight of that OST is set to zero in the following allocation rounds for the file, and the weights are changed back to the original values after the allocation for the file is finished (see the sketch after this comment). However, this solution still has some problems: 1) the real ratios of allocated objects will differ from the configured weights; 2) changing weights would cause contention on the lock. Currently, allocating through RTDS only requires a read lock, which means no thread will be blocked by another one. But if a thread needs to write the RTDS tree, a write lock needs to be held. We can reduce the performance regression by keeping multiple copies of the RTDS tree, but that would make the code much more complex.

Maybe our conclusion is that RTDS might not be a perfect solution for all striping policies. For example, RTDS can never satisfy the following requirement: when running a benchmark, we need the stripes of multiple files to be evenly located on all OSTs. We implemented a dull round robin policy for this requirement, and RTDS cannot implement this. Maybe a general framework should be added to the striping mechanism. The former round-robin and QoS policies can be kept, and RTDS and other policies can be added into this framework. The users can choose from these policies, just like they choose NRS policies.

"3. If I understand it correctly, in order to implement OST pool by RTDS, a new RTDS tree will be created for each 'pool'. In the RTDS tree, the OSTs outside the pool are assigned with weight value 0 so that they won't be bothered in object allocation. This 'pool' RTDS can be attached to a directory and indicated in a CLI of file creation. However, what's the benefit of doing this?"

I am thinking of removing OST pool, and replacing it completely by RTDS. Multiple RTDS trees can be configured on the MDS. When striping a file, one of these RTDS trees can be selected for the file according to the file's attributes (UID/GID/ProjID/JobID etc.). In this way, we can control the striping based on policy rules.
The administrator can focus on defining rules rather than changing the OST pool configuration on each individual directory or file. This will enable a lot of new use cases and make Lustre look smarter.
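A minimal sketch of the weight-zeroing idea described above, assuming each file allocation works on a private copy of the weight table so that the shared table only ever needs a read lock (the helper names are illustrative, not from the patch):

```python
import random

def pick_weighted(weights):
    """Pick an index with probability proportional to the weights (linear scan)."""
    total = sum(weights)
    if total <= 0:
        return None                       # no eligible OST left
    r, acc = random.uniform(0, total), 0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return len(weights) - 1               # guard against rounding at the top end

def allocate_stripes(global_weights, stripe_count):
    """Allocate stripe_count objects, never reusing an OST for the same file.

    The global weight table is only read; each allocation works on a private
    copy, so (in the kernel analogy) only a read lock would be needed and the
    shared tree is never modified.
    """
    local = list(global_weights)          # per-file copy of the weights
    stripes = []
    for _ in range(stripe_count):
        idx = pick_weighted(local)
        if idx is None:
            break                         # fewer eligible OSTs than stripes
        stripes.append(idx)
        local[idx] = 0                    # zero the weight: no-same-OST rule
    return stripes

print(allocate_stripes([1, 1, 2, 4], 3))
```
|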
| Comment by Li Xi (Inactive) [ 02/Aug/17 ] |
|
"It would also be interesting to compare RTDS with the current object creation policy. For example, set one OST to be 50% full and another one completely empty, set the RTDS weights to 50 and 100 respectively, then allocate 1000 objects and investigate the distribution."

Absolutely, we need a lot of tests, including your example. I am expecting a bit of performance improvement because the lock contention has been eliminated. Let's see what the performance results will be, as well as the functional results. |
| Comment by Li Xi (Inactive) [ 02/Aug/17 ] |
|
"Is the RTDS tree invented by yourself? Otherwise, can you please tell me the name of the tree?"

Yeah, I created the tree structure. I am not sure whether there is any well-known existing structure/algorithm. It is not a complex tree, and I hope it works as expected. |
| Comment by Li Xi (Inactive) [ 03/Aug/17 ] |
|
After discussing with friends, we came across an idea of implementing Dull Round Robin on top of RTDS:
1) Set the weight of all OSTs to 1
2) Replace the random number generator with a counter that simply increases by one for every allocation

By changing the random number generator, we might be able to implement other policies based on RTDS. (A sketch of this follows below.)
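A minimal sketch of the idea, assuming the "random" value source is simply swapped for an increasing counter (all names below are illustrative; a flat cumulative-weight array stands in for the RTDS tree):

```python
from itertools import count

def make_rtds_picker(weights, value_source):
    """Return a picker that maps values from value_source onto OST indices
    via the cumulative weights S[i] (a flat stand-in for the RTDS tree)."""
    prefix, total = [], 0
    for w in weights:
        total += w
        prefix.append(total)              # prefix[i] == S[i]
    def pick():
        v = next(value_source) % total
        for idx, s in enumerate(prefix):
            if v < s:
                return idx
    return pick

# Dull Round Robin: every OST weight is 1 and the "random" generator is
# simply an increasing counter, so OSTs are returned as 0, 1, 2, ... in turn.
weights = [1] * 8
pick = make_rtds_picker(weights, count())
print([pick() for _ in range(16)])        # 0..7 twice, strictly in order
```
|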
| Comment by Jinshan Xiong (Inactive) [ 03/Aug/17 ] |
Then you have to copy the RTDS tree every single time objects for striped files are allocated.

Is it possible to enhance the arrangement of the tree by clustering OSTs? This can be easily done by encoding some high bits of the weight with geographic information of the OSTs, so that it is easier for the LOD code to allocate an object from a cluster of OSTs, where they usually belong to the same OSS or rack. Let's say the weight is a 32-bit integer, the highest byte is used to encode the rack number, and the 2nd highest byte the OSS number. Therefore, for multiple object allocations for striped files, the LOD can pack a number as <rack> + <OSS> + <random> for the RTDS tree traversal, and then change the rack or OSS and a new random number to allocate the next one, etc.
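For illustration only, a sketch of the <rack> + <OSS> + <random> packing described above; the exact bit layout (8 bits rack, 8 bits OSS, 16 bits random) and the helper names are assumptions made here, not something defined by the patch:

```python
import random

def pack_key(rack, oss, rand16):
    """Pack <rack> + <OSS> + <random> into one 32-bit traversal key:
    bits 31-24 rack, bits 23-16 OSS, bits 15-0 random."""
    assert 0 <= rack < 256 and 0 <= oss < 256 and 0 <= rand16 < 65536
    return (rack << 24) | (oss << 16) | rand16

def unpack_key(key):
    return (key >> 24) & 0xff, (key >> 16) & 0xff, key & 0xffff

key = pack_key(rack=3, oss=7, rand16=random.getrandbits(16))
print(unpack_key(key))

# To allocate the next object in a different fault domain, change the rack
# and/or OSS byte and draw a fresh random low part.
next_key = pack_key(rack=4, oss=2, rand16=random.getrandbits(16))
```
|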
| Comment by Andreas Dilger [ 05/Aug/17 ] |
|
I haven't had time to look at the patch, but have some comments on the general approach:
- For this patch to land, it should include the current QoS functionality, so that the OST weights are based on OST available space if not otherwise specified. Otherwise, we are duplicating functionality unnecessarily.
- Rather than using weighted random OST allocation, it would be better to do weighted round-robin allocation (see LU-9). The current QoS allocation is weighted random and lots of people complain about that because it doesn't allocate from OSTs very evenly in the short term (e.g. 3 objects from one OST, 4 from most OSTs, and 5 from one OST, which makes the aggregate performance of one OST 5/4=25% slower than most of the others). Using weighted round-robin allocations would avoid this problem, regardless of how the weights are generated.
- There is already a mechanism for OSTs to indicate that they are undergoing RAID rebuild or are otherwise degraded - the ofd.*.degraded parameter, which can be set locally on the OSS and is communicated to the MDS to avoid allocations on that OST if possible.
In summary, I think this idea has merit, but it needs to be integrated into existing functionality better before it can land (I'm assuming it is not, based on the comments here, but don't have time to look at the patch yet). |
| Comment by Andreas Dilger [ 05/Aug/17 ] |
|
PS: as Jinshan wrote, this definitely needs to work properly for striped files, as OST pools and QoS allocators do today. While many sites use 1-stripe files as the default today, I think that PFL will change this significantly in the future. |
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
Andreas and Jinshan, thank you so much! Your comments are really helpful for me to understand the requirements! |
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
I think there are some assumptions or preconditions when we are discussing the striping policy. I think we all agree with the following:

Limit-of-no-same-OST: It should always be avoided to stripe more than one object of a file on the same OST, because that would cause not only a performance problem but also a file size limitation problem. This is a hard limitation, which means that if it is impossible to ensure it, a failure could be returned.

Preference-of-no-same-OSS: It should usually be avoided to stripe more than one object of a file on the same OSS, because that would cause a performance problem. However, this is not a hard limitation, which means that if it is impossible to ensure it, this limitation could be ignored.

I think these two rules are implied by all striping policies. If these two rules didn't exist, the design of RTDS would be very different. Thus, I think it is helpful to list these two rules explicitly.

"Then you have to copy the RTDS tree every single time objects for striped files are allocated."

As long as we can change the weights of OSTs dynamically when allocating objects for a single file, things become much easier. Limit-of-no-same-OST can be implemented by changing the weights of the selected OSTs to zero. Preference-of-no-same-OSS can be implemented in a similar way: first change the weights of the OSTs on the selected OSS to zero, and then change them back if no available OSTs are left.

As mentioned before, multiple stripes for one file would cause the real ratios of objects to be different from the configured weights. However, I think in all cases the following statement is true:

positive-correlation-between-the-probability-and-weight: The statistical probability of selecting an OST would not be reduced by increasing the weight of that OST.

And I think in most (if not all) cases the following statement is true:

same-order-for-probability-and-weight: If an OST A has a smaller weight than OST B, then the probability of selecting A would not be higher than the probability of selecting B.

Ideally, we should find a way to prove positive-correlation-between-the-probability-and-weight. And it would be really helpful if we could find all the possible conditions in which same-order-for-probability-and-weight is false.

Following is an example. Let's assume there are only three OSTs with weights of 1, 2, 4, and all the files will be created with a stripe count of 2. Then the real ratio would be

(1 + 2 * 1 / (1 + 4) + 4 * 1 / (1 + 2)) : (1 * 2 / (2 + 4) + 2 + 4 * 2 / (1 + 2)) : (1 * 4 / (2 + 4) + 2 * 4 / (1 + 4) + 4)

i.e. 41 : 75 : 94. This is different from 1 : 2 : 4, but it still keeps same-order-for-probability-and-weight.
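A small exact check of the 41 : 75 : 94 example above, assuming the second stripe is drawn from the remaining OSTs with renormalized weights (illustrative Python, not part of the patch):

```python
from fractions import Fraction
from itertools import permutations

weights = {"A": 1, "B": 2, "C": 4}

def expected_counts(weights):
    """Exact expected number of objects per OST for a 2-stripe file when the
    second stripe is drawn from the remaining OSTs with renormalized weights."""
    total = sum(weights.values())
    exp = {ost: Fraction(0) for ost in weights}
    for first, second in permutations(weights, 2):
        p = (Fraction(weights[first], total) *
             Fraction(weights[second], total - weights[first]))
        exp[first] += p                    # first stripe lands on 'first'
        exp[second] += p                   # second stripe lands on 'second'
    return exp

exp = expected_counts(weights)
print(exp)                                 # A: 41/105, B: 5/7, C: 94/105
print({k: v * 105 for k, v in exp.items()})  # i.e. 41 : 75 : 94
```
|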
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
"Is it possible to enhance the arrangement of the tree by clustering OSTs? This can be easily done by encoding some high bits of the weight with geographic information of the OSTs, so that it is easier for the LOD code to allocate an object from a cluster of OSTs, where they usually belong to the same OSS or rack. Let's say the weight is a 32-bit integer, the highest byte is used to encode the rack number, and the 2nd highest byte the OSS number. Therefore, for multiple object allocations for striped files, the LOD can pack a number as <rack> + <OSS> + <random> for the RTDS tree traversal, and then change the rack or OSS and a new random number to allocate the next one, etc."

In RTDS, the weight is the value that determines the probability of selecting an OST. I might not understand your idea completely, but I am not sure how we can ensure the Limit-of-no-same-OST rule by changing the meaning of the weight. |
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
"For this patch to land, it should include the current QoS functionality, so that the OST weights are based on OST available space if not otherwise specified. Otherwise, we are duplicating functionality unnecessarily."

The idea of using RTDS is to implement a general framework with which the administrator or a user-space daemon can control the striping policy completely. So, instead of implementing the QoS functionality inside RTDS, a user-space tool should be provided. The tool will monitor the available space and set the OST weights from time to time. I would like to work on the tool in the following months. And it would be extremely easy to extend the tool to support QoS based on available bandwidth. Do you think it is necessary to implement a kernel-space QoS control for RTDS? My preference would be no.
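A rough sketch of what such a user-space daemon could look like; everything here is hypothetical (the fsname in the path, the free-space source and the scaling rule are assumptions), only the "0=w,1=w" rtds_weight format is taken from the test log above:

```python
import time

# Path format follows the test log above; the fsname is assumed.
RTDS_WEIGHT = "/proc/fs/lustre/lod/lustre-MDT0000-mdtlov/rtds_weight"

def collect_free_space():
    """Stub: return free bytes per OST index.  A real daemon would gather this
    from the OSSes (e.g. via 'lfs df' or ssh + statfs); the values are fake."""
    return {0: 2_000_000_000, 1: 6_000_000_000}

def free_space_to_weights(free, max_weight=100):
    """Scale free space into integer weights; the emptiest OST gets max_weight."""
    biggest = max(free.values()) or 1
    return {ost: max(1, round(max_weight * space / biggest))
            for ost, space in free.items()}

def write_weights(weights, path=RTDS_WEIGHT):
    """Write the weights in the same "0=1,1=2" format shown in the test log."""
    line = ",".join(f"{ost}={w}" for ost, w in sorted(weights.items()))
    with open(path, "w") as f:
        f.write(line)

def run(interval=10):
    """Weights only need a periodic refresh, not per-allocation updates."""
    while True:
        write_weights(free_space_to_weights(collect_free_space()))
        time.sleep(interval)

if __name__ == "__main__":
    print(free_space_to_weights(collect_free_space()))   # e.g. {0: 33, 1: 100}
```
|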
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
"Rather than using weighted random OST allocation, it would be better to do weighted round-robin allocation (see LU-9). The current QoS allocation is weighted random and lots of people complain about that because it doesn't allocate from OSTs very evenly in the short term (e.g. 3 objects from one OST, 4 from most OSTs, and 5 from one OST, which makes the aggregate performance of one OST 5/4=25% slower than most of the others). Using weighted round-robin allocations would avoid this problem, regardless of how the weights are generated."

Weighted round-robin allocation with even usage in the short term is not what RTDS is good at, because RTDS depends a lot on the random number generator. However, I think that by changing the random number generator to a sequence generator, it is possible to implement weighted round-robin allocation.

Let's assume that there are N OSTs, and OST i has a weight of W[i]. In order to implement precise weighted round-robin allocation, the policy allocates S[N - 1] = W[0] + W[1] + W[2] + ... + W[N - 1] objects in each round, and in each round W[i] objects are allocated on OST i. An allocation sequence of transfer values for the RTDS tree needs to be generated. The value sequence includes all the numbers in 0, 1, 2, 3, ..., S[N - 1], but the order of the sequence needs to be changed in order to avoid allocating objects on the same OST continuously. The simplest way of generating the sequence would be to simply shuffle the array randomly, e.g. using the Fisher–Yates shuffle (or Knuth shuffle). (A sketch of this follows below.)

The maximum memory for the allocation sequence would be the maximum value of S[N - 1]. If the maximum weight is M, then the maximum value of S[N - 1] is M * N. For an 8-bit weight, it would be 255 * N bytes; 1 MB of memory would be able to support 4112 OSTs. The random shuffling time complexity is O(M * N).

I think this implementation would work for some use cases. However, if the weights of the OSTs differ hugely, shuffling or any other sequence generator won't help at all. For example, if an OST has 100% of the weight, then that OST will always be selected continuously. Also, support for striped files needs extra effort in order to enforce Limit-of-no-same-OST or Preference-of-no-same-OSS. In order to select multiple objects on different OSTs/OSSes, RTDS needs to scan the sequence to choose multiple OSTs/OSSes, and then remove the selected values from the sequence. Thus, the time complexity of allocating a file with multiple objects would be O(M * N), in contrast with the time complexity of allocating a file with a single object, which is O(1). Also, this would require extra memory to save the changed sequence, which has a size of O(M * N).
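A minimal sketch of the shuffled-sequence idea, using a flat cumulative-weight array in place of the RTDS tree (illustrative only):

```python
import random
from bisect import bisect_right

def weighted_round_robin(weights):
    """Generator: each round yields exactly W[i] allocations for OST i,
    in an order randomized by a Fisher-Yates shuffle of the transfer values."""
    prefix, total = [], 0
    for w in weights:
        total += w
        prefix.append(total)               # prefix[i] == S[i]
    while True:
        values = list(range(total))        # one transfer value per allocation
        random.shuffle(values)             # Fisher-Yates shuffle
        for v in values:
            yield bisect_right(prefix, v)  # map the value onto an OST via S[i]

wrr = weighted_round_robin([1, 2, 4])
print([next(wrr) for _ in range(14)])      # each block of 7 holds one 0, two 1s, four 2s
```
|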
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
Yeah, I agree it is too aggressive to remove OST pool. I might need to get rid of this idea. Currently, an RTDS tree is allocated for each OST pool. That means the OST pool is fully supported by RTDS. Also, I am going to implement UID/GID/JobID/ProjID matching rules for OST pool/RTDS. So, RTDS will be able to add new features to OST pools. |
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
"There is already a mechanism for OSTs to indicate that they are undergoing RAID rebuild or are otherwise degraded - the ofd.*.degraded parameter, which can be set locally on the OSS and is communicated to the MDS to avoid allocations on that OST if possible."

Good to know. I think this parameter works like a mute button, while RTDS works like a volume knob. |
| Comment by Li Xi (Inactive) [ 07/Aug/17 ] |
|
"PS: as Jinshan wrote, this definitely needs to work properly for striped files, as OST pools and QoS allocators do today. While many sites use 1-stripe files as the default today, I think that PFL will change this significantly in the future."

Agreed. As discussed before, support for striped files can be implemented by changing the weights of OSTs dynamically when allocating objects for a single file. |
| Comment by Andreas Dilger [ 07/Aug/17 ] |
I would agree that these are desirable properties for any allocator. One caveat is that PFL and FLR allocators may need to break the no-same-OST rule in some cases, but this is OK because the composite files will have components of limited size, so if an OST is used twice within a single file then the amount of conflict is limited. Any new allocator should take PFL into account, and we are also looking at how to improve the allocator for FLR (avoid mirror allocation in a single fault domain). This is easily done for avoiding allocations on the same OSS (avoid all OSTs on one NID) and possibly on the same failover pair, but LOD doesn't yet have enough info for avoiding the same controller, rack, UPS, which is why Jinshan was asking about this.
That said, several of the comments here make me think that you haven't had a close look at the existing allocator code to understand the details. There is already a mechanism in the round-robin allocator to avoid starting on the same OST each time, and also a mechanism to avoid allocating from the same OSS unless necessary. A lot of tuning went into that code to make it work well and be fast, so it shouldn't be lost with the move to RTDS.

I don't think it is practical to be changing the weights from userspace on a regular basis. This is something that would need to be done only periodically (at most every few seconds), and the kernel code would need to deal with the weights in a similar manner as the current round-robin and QoS code does today.

To be honest, I can't really see the benefit of the RTDS tree as described instead of the current QoS weighted random allocator? |
| Comment by Patrick Farrell (Inactive) [ 07/Aug/17 ] |
|
"... may need to break the no-same-OST rule in some cases..." While I haven't evaluated the larger proposals here, I wanted to note that I've had an idea on the backburner for a bit about controlled violations of this rule being highly desirable in certain cases. In particular, lock ahead is designed to address the case were we are limited to a single shared file, but each OST is significantly faster than one client. In a shared file situation, LDLM locking behavior limits us to writing with one client to each OST, so we are unable to fully drive each OST for the shared file. This is an interesting enough case for Cray to drive all the work in lock ahead use a library to achieve it. If we can put multiple stripes of the file on a single OST, we can essentially achieve the same thing, with far less effort. For a variety of reasons, this doesn't remove the need for lockahead (Primarily because we cannot necessarily redefine file striping when we want to write to it), but it is much simpler, and highly desirable for that reason. In addition to the MPIIO aggregation case where we have well controlled I/O and are trying to maximize OST utilization, adding more stripes to a shared file also helps in cases where I/O is poorly controlled, so there are effectively more locks for the badly behaved writers to contend for. I've moved this as far as coding this up and testing it and observing it works great (once you've removed a number of OST-to-stripe-count ratio related sanity checks), but it needed a workable API and sane interaction/implementation with pools and default striping, and I didn't go that far (yet). So, in short, I think it would be very, very desirable if, in a controlled manner, we could ask for more than one stripe to be on a given OST. A simple example is something like "8 stripes but only on these 2 OSTs", giving 4 stripes per OST (and allowing 4 client writers per OST with no fancy locking work). This is a bit outside of the current discussion, but at the same time, it's a feature I'd love to see in any overhaul of the striping/allocation code. If it doesn't fit as part of this, I'll (eventually...) move ahead with it elsewhere. |
| Comment by Andreas Dilger [ 07/Aug/17 ] |
|
Patrick, this is totally achievable with PFL in 2.10. The caveat is that the MDS will still try to avoid this case if possible (exclude stripes on the same OSTs if they were already allocated to the file), unless the requested stripe count for the component exceeds the number of available non-duplicate OSTs. In that case it will allow duplicate OST stripes to be allocated, to handle cases where most/all of the OSTs were used in the first components, then the user wants a fully-striped file for the last component. There is no actual restriction on this, and if you explicitly request OSTs via "lfs setstripe -o" for different components you can force them to be overloaded. |
| Comment by Li Xi (Inactive) [ 08/Aug/17 ] |
|
"I would agree that these are desirable properties for any allocator. One caveat is that PFL and FLR allocators may need to break the no-same-OST rule in some cases, but this is OK because the composite files will have components of limited size, so if an OST is used twice within a single file then the amount of conflict is limited. Any new allocator should take PFL into account, and we are also looking at how to improve the allocator for FLR (avoid mirror allocation in a single fault domain). This is easily done for avoiding allocations on the same OSS (avoid all OSTs on one NID) and possibly on the same failover pair, but LOD doesn't yet have enough info for avoiding the same controller, rack, UPS, which is why Jinshan was asking about this."

Thanks for the info. I didn't realize the special requirements of PFL and FLR. The allocator indeed needs to collect a lot more information about the system to make advanced striping decisions. However, I don't think it is practical to collect all the original info in the kernel-space LOD. The information needed by the allocator might change from time to time. As you have said, the NID, controller, rack and UPS of the OSTs are needed, and if the OSTs have different specifications from each other, the information becomes more complex. The storage types (SSD vs. HDD), network types (IB vs. Ethernet), controller specification and other information might be needed by the allocator. Of course, support for collecting more original info can be added into LOD gradually, but that would make the code complex and hard to maintain. That is part of the reason why I am trying to abstract this info into OST weights that change dynamically. Ideally, when we are adding new features to the allocator, the main framework of the kernel-space code can be kept untouched, and we only need to change the user-space daemon that updates the OST weights. The current RTDS implementation enables a lot of potential new features, e.g. balancing of bandwidth between OSTs. And I'd like to emphasize that these features can be implemented by only changing the user-space tool. That means an experienced administrator could implement a customized policy by writing/changing a Python/Shell script without even knowing how to do kernel-space development.

As an abstraction of the various requirements, I think we can introduce a new concept: the relative allocation weight between OSTs. The relative weight is a value between two OSTs. Let's say the relative weight of OST i and OST j is RW(i, j). RW(i, j) means that if OST i is selected as a stripe of a file, then the weight of OST j (originally W[j]) should be changed to W[j] = W[j] * RW(i, j) for allocating the next stripe of the file. RW(i, j) can be a value between zero and infinity. RW(i, j) is usually equal to RW(j, i), but they could be different. Like the weights of OSTs, relative weights between OSTs could be changed from user-space daemons. By using relative weights, a lot of features are enabled (see the sketch after this comment):
1) RW(i, i) can be set to zero to avoid allocating more than one object on OST i for a single file.
2) For i != j, one is the default value of RW(i, j) for a simple allocation policy, because usually the OSTs are considered unrelated.
3) RW(i, j) can be set to a smaller-than-one value if we prefer not to locate a stripe of a file on OST j when the file already has a stripe on OST i. One example of this is that OST i and OST j are on the same OSS: we might want to avoid locating two stripes on the same OSS for performance reasons.
4) RW(i, j) can be set to a bigger-than-one value if we prefer to locate a stripe of a file on OST j when the file has a stripe on OST i. One example of this is that OST i and OST j have the same specification (e.g. SSD/HDD based OSTs): we might prefer to use the same kind of OSTs for the stripes of a file.
5) An infinite (+INFI) value can be set for RW(i, j) if we want to ensure that the next stripe of the file will be on OST j when the file has a stripe on OST i. One example of this is that we want to locate all the remaining stripes of a file on the same selected OST, so we set RW(i, i) to +INFI. As an implementation detail, instead of actually doing the calculation of W[j] = W[j] * RW(i, j) when RW(i, j) is +INFI, we can set W[j] to one and the other weights to zero.

It might be complex to analyze how relative weights would affect the process of file striping if the relative weights are configured without enough care. But things would be much simpler if all the relative weights were 0, 1/2, 1, 2, or +INFI. And with these relative weight options, we can already implement a lot of features.

"That said, several of the comments here make me think that you haven't had a close look at the existing allocator code to understand the details. There is already a mechanism in the round-robin allocator to avoid starting on the same OST each time, and also a mechanism to avoid allocating from the same OSS unless necessary. A lot of tuning went into that code to make it work well and be fast, so it shouldn't be lost with the move to RTDS."

I must admit that I haven't looked into all the details of the current round-robin or QoS allocator. But I am certainly aware that the current allocators handle a lot of details properly and work pretty well at what they are designed to do. I am not expecting that RTDS can be used without a lot of tuning, and I think that will take months of effort. But I think as soon as we get a stable and fast RTDS framework, adding more features would be much simpler than before.

"I don't think it is practical to be changing the weights from userspace on a regular basis. This is something that would need to be done only periodically (at most every few seconds), and the kernel code would need to deal with the weights in a similar manner as the current round-robin and QoS code does today."

It is of course possible to implement a mechanism in kernel space that updates the OST weights. But I still don't understand why a user-space daemon can't do the work properly. If the daemon needed to change the weights every microsecond, that would be a problem; but since the weights only need to be updated every few seconds, there shouldn't be any problem. And the biggest advantage of implementing the mechanism in a user-space tool is flexibility. A tool written in C/Python/Shell or any other language can be used, and the tool can be changed whenever it is found not to work well. I don't think we have this flexibility if it is implemented completely in LOD. Also, collecting information in a user-space daemon is much simpler than inside LOD: "ssh" commands can be used by the tool to log in to the OSSes and collect all of the necessary statistics. As a side example, we have implemented a tool (LIME, https://github.com/DDNStorage/Lime) which collects the real-time performance statistics of a job and changes the TBF rates every second to provide QoS guarantees or enforce performance limitations. This tool is written in Python, and it works well at least in a very small cluster.
I think a similar tool can be implemented for RTDS.

"To be honest, I can't really see the benefit of the RTDS tree as described instead of the current QoS weighted random allocator?"

I fully agree that the current QoS weighted random allocator works pretty well if there were no upcoming requirements for new features. And there might be a better structure than RTDS which can do similar things. But to support the features that the RTDS tree can, the structure needs to:
1) Support quick updates of weights from a user-space tool. RTDS can finish the update in O(N) time, with N as the OST number, and no memory allocation is needed in this process.
2) Support dynamic weight changing during the process of allocating multiple objects for a single file. The RTDS tree can support it by copying the weight array. That means each allocating thread only needs to hold a read lock on the tree and copy the weight array, and then the weights can be changed freely without influencing other threads.
3) Support weighted round-robin. The RTDS tree can support it by changing the random value generator to a sequence generator based on weights. The time/memory cost has been analyzed before, which I think is affordable.

I will need to investigate the current weighted random allocator more to check whether the RTDS tree can be replaced by it. But I still like the design of the RTDS tree, because it is simple and at the same time very powerful.
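A minimal sketch of the relative weight idea, where a private copy of the weights is rescaled by RW(i, j) after each selected stripe; the RW matrix values and the OSS layout in the example are made up for illustration:

```python
import random

def pick_weighted(weights):
    """Weight-proportional pick; returns None when all weights are zero."""
    total = sum(weights)
    if total <= 0:
        return None
    r, acc = random.uniform(0, total), 0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return len(weights) - 1

def allocate_with_relative_weights(weights, rw, stripe_count):
    """After each selected OST i, scale every W[j] by RW(i, j) before the
    next stripe of the same file is allocated (on a private copy only)."""
    local = list(weights)
    stripes = []
    for _ in range(stripe_count):
        idx = pick_weighted(local)
        if idx is None:
            break
        stripes.append(idx)
        local = [w * rw[idx][j] for j, w in enumerate(local)]
    return stripes

# Hypothetical setup: OSTs 0,1 share one OSS, OSTs 2,3 share another.
# RW(i, i) = 0 enforces no-same-OST; RW = 0.5 discourages the same OSS.
rw = [[0.0, 0.5, 1.0, 1.0],
      [0.5, 0.0, 1.0, 1.0],
      [1.0, 1.0, 0.0, 0.5],
      [1.0, 1.0, 0.5, 0.0]]
print(allocate_with_relative_weights([1, 1, 1, 1], rw, 2))
```
|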
| Comment by Li Xi (Inactive) [ 08/Aug/17 ] |
|
"If we can put multiple stripes of the file on a single OST, we can essentially achieve the same thing, with far less effort."

As discussed before, putting multiple stripes on a single OST can be implemented in RTDS by configuring relative weights. And a policy can be defined for this, which will enforce it only for a selected set of files (for example, only the files with a given project ID), while normal striping policies are used for other files. |
| Comment by Nathan Rutman [ 16/Oct/17 ] |
|
diffusion allocator.xlsx

Instead, we should aim for fully predictable, regular allocations. Andreas' "error diffusion allocator" idea seems fine; I've attached a spreadsheet showing how one implementation might work. Weights are still input from userspace (a good idea), and the "error" (the difference between some fixed reference (e.g. max weight) and the per-OST weight) builds up until, after some threshold, the OST is skipped and the error is reset. You can change weights at any time. It can accommodate the relative weights and the RTDS tree - a form of weighted round-robin. (A rough sketch follows below.)

One other aspect needs to be addressed, which is that if a single file needs a certain number of OSTs (stripes), then the threshold might have to move to allow sufficient OSTs - we can't "randomly choose another OST" like you can with a probabilistic method. So instead of the output being a binary use/don't use, we need to end up with a cumulative-error sorted list, and if stripes_requested > stripes_available, change the threshold so that stripes_available = stripes_requested, for this allocation only. (Normally you want available >> requested, so that we get round-robin type behavior most of the time.)
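A rough sketch of one possible reading of the error diffusion idea (the reference and threshold choices here are assumptions; the attached spreadsheet defines the actual bookkeeping):

```python
from itertools import islice

def error_diffusion_allocator(weights, reference=None, threshold=None):
    """Round-robin over OSTs; each visit adds (reference - weight) to the OST's
    accumulated error, and once the error crosses the threshold the OST is
    skipped for that pass and its error is reduced by the threshold."""
    n = len(weights)
    reference = reference if reference is not None else max(weights)
    threshold = threshold if threshold is not None else reference
    error = [0] * n
    while True:
        for ost in range(n):
            error[ost] += reference - weights[ost]
            if error[ost] >= threshold:
                error[ost] -= threshold   # skip this OST on this pass
                continue
            yield ost

alloc = error_diffusion_allocator([4, 4, 2, 4])   # OST 2 has half the weight
print(list(islice(alloc, 14)))   # OST 2 shows up half as often, with no long runs
```
|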
| Comment by Andreas Dilger [ 23/Oct/17 ] |
|
I think one important thing to remember is that large imbalanced weights should be fairly rare. One of the more important benefits of the weighted round-robin allocator should be that it is always rebalancing the system. If the OST selection algorithms are stable (i.e. don't pile hundreds of new files on a single OST that has no IO load because it was just rebooted) then they should slowly drive all weights to be balanced all the time, so there should only be a little error accumulated on any given round. I agree with Nathan's suggestion that for widely-striped files we should select from the sorted list of available OSTs based on the minimum error, so that we accumulate the least additional error when using those OSTs. The current object allocation policy allows a file to be created if it has at least 3/4 of the requested stripes, so that file creation is not blocked/failed if some OSTs are offline (better to have 75% IO speed than none at all). |