
LU-9809: RTDS (Real-Time Dynamic Striping): A policy-based striping framework

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, there are several ways to control the striping of files:

      1) Default striping inherited from parent directory

      2) OST pool based striping inherited from parent directory

      3) Specific striping on a user-defined set of OSTs, by creating the file with
      the open flag O_LOV_DELAY_CREATE and then calling ioctl(LL_IOC_LOV_SETSTRIPE)
      (see the sketch below)
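
      As an illustration of approach 3), here is a minimal user-space sketch. It
      assumes the declarations from <lustre/lustre_user.h>; the path and layout
      values are arbitrary and error handling is abbreviated:

      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <lustre/lustre_user.h>

      static int create_striped_file(const char *path)
      {
              struct lov_user_md_v1 lum = {
                      .lmm_magic         = LOV_USER_MAGIC_V1,
                      .lmm_pattern       = LOV_PATTERN_RAID0,
                      .lmm_stripe_size   = 1048576,    /* 1 MiB stripes */
                      .lmm_stripe_count  = 4,
                      .lmm_stripe_offset = (__u16)-1,  /* -1 = let the MDS choose */
              };
              int fd;

              /* Create the file without allocating OST objects yet. */
              fd = open(path, O_CREAT | O_WRONLY | O_LOV_DELAY_CREATE, 0644);
              if (fd < 0)
                      return -1;

              /* Apply the layout; the OST objects are allocated here. */
              if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0) {
                      close(fd);
                      return -1;
              }
              return fd;
      }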

      The default and OST pool based striping try to ensure that space is evenly
      used across OSTs and OSSs, so the algorithm allocates OST objects according
      to the free space on the OSTs. The algorithm is implemented in kernel-space
      code, which means administrators have almost no way to change the striping
      policy, even if they realize there is a better policy than the current one.

      An application can certainly control the striping of its newly created files
      by using the open flag O_LOV_DELAY_CREATE and ioctl(LL_IOC_LOV_SETSTRIPE).
      However, modifying application code is usually difficult and rarely happens
      for most applications. Moreover, an individual application has neither enough
      information nor the capability to enforce a specific system-wide striping
      policy.

      That is why we implemented an entirely new striping framework named RTDS
      (Real-Time Dynamic Striping).

      RTDS controls striping based on the allocation weights of the OSTs. When
      allocating an OST object, RTDS chooses an OST at random. The probability of
      choosing a given OST is proportional to that OST's weight. An allocation
      weight is an unsigned number that can be configured by a user-space daemon.
      If an OST has a weight of zero, then no newly created objects will be
      allocated on that OST. Assume there are N OSTs, OST i has a weight of W[i],
      and the probability of allocating an object on OST i is P[i]. Then:

      S[i] = W[0] + W[1] + W[2] + ... + W[i]
      P[i] = W[i] / S[N - 1]
      1 = P[0] + P[1] + P[2] + ... + P[N - 1]
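
      For example, with three OSTs of weights W = {1, 2, 4}: S[2] = 1 + 2 + 4 = 7,
      so P[0] = 1/7, P[1] = 2/7 and P[2] = 4/7.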

      In the implementation of RTDS, an RTDS tree is used to choose the OST
      according to the weights. An RTDS tree is a binary tree with the following
      features:

      1) The leaves of the RTDS tree form an array holding the weight values,
      i.e. W[i];

      2) The value of a non-leaf node is S[x], where x is the largest index of
      the leaves in its left sub-tree;

      3) The left sub-tree of a non-leaf node is always a complete binary tree.

      Because of rule 3), if an RTDS tree has N leaves (N > 2), then its left
      sub-tree has round_down_power_of_2(N) leaves, where round_down_power_of_2(N)
      is the largest power of 2 that is smaller than N (for example,
      round_down_power_of_2(8) = 4 and round_down_power_of_2(6) = 4).

      Following is the allocation tree for 8 OSTs (the original attachment image is
      reconstructed here in ASCII from rules 1) to 3)):
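
                              S[3]
                      /                \
                  S[1]                  S[5]
                 /    \                /    \
              S[0]    S[2]          S[4]    S[6]
              /  \    /  \          /  \    /  \
           W[0] W[1] W[2] W[3]   W[4] W[5] W[6] W[7]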

      When choosing an OST for an object, the RTDS policy first generates a random
      value between 0 and S[N - 1] - 1. This value is used to traverse the RTDS
      tree from the root node: if the random value is smaller than a non-leaf
      node's value, the left sub-tree is chosen, otherwise the right sub-tree is
      chosen. With this policy, objects are allocated on OSTs randomly, while the
      allocation ratios between the OSTs are preserved.
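
      A minimal sketch of this selection in user-space C follows (the function and
      array names are illustrative, not the actual Lustre code; N is assumed to be
      a power of two so the whole tree is complete):

      #include <stdint.h>
      #include <stdlib.h>

      /*
       * tree[1 .. n-1] holds the non-leaf values S[x] in heap order
       * (tree[1] is the root); tree[n .. 2n-1] holds the leaves W[i],
       * filled by the caller.  pre[i] is the prefix sum S[i].
       */
      static void rtds_build(uint64_t *tree, const uint64_t *pre,
                             int node, int lo, int hi)
      {
              int mid;

              if (hi - lo == 1)
                      return;                 /* leaf, already holds W[i] */
              mid = (lo + hi) / 2;
              tree[node] = pre[mid - 1];      /* S[x], x = last leaf on the left */
              rtds_build(tree, pre, 2 * node, lo, mid);
              rtds_build(tree, pre, 2 * node + 1, mid, hi);
      }

      /* Pick an OST index with probability W[i] / S[n-1].  The modulo
       * bias of rand() is ignored for brevity. */
      static int rtds_pick_ost(const uint64_t *tree, int n, uint64_t total)
      {
              uint64_t r = (uint64_t)rand() % total;  /* 0 .. S[n-1] - 1 */
              int node = 1;                           /* start at the root */

              while (node < n)                        /* descend to a leaf */
                      node = (r < tree[node]) ? 2 * node : 2 * node + 1;

              return node - n;                        /* leaf index == OST index */
      }

      Selection costs O(log N) comparisons per object; after a weight change, the
      prefix sums and internal nodes are rebuilt with rtds_build(tree, pre, 1, 0, n),
      which is O(N).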

      Currently, only a single RTDS tree is used for the default striping. In the
      future, multiple RTDS trees will be supported, one per striping policy, and
      matching rules will determine which striping policy applies to a given file.
      The matching rules include a series of rules based on the attributes of the
      newly created file or of the process creating it, e.g. UID, GID, NID,
      Project ID, Job ID, etc.

      With RTDS, administrators will be able to fully control the striping
      policies. Interfaces for configuring the RTDS weights and matching rules will
      enable a range of new use cases, including:

      1) When new OSTs are added to the file system, the administrator might want
      to quickly balance space usage between the old OSTs and the empty ones, so
      the striping policy should allocate more new objects on the new OSTs than on
      the old ones. The administrator can configure higher weights for the empty
      OSTs, which will eventually lead to balanced space usage across all OSTs.

      2) If the file system has two kinds of OSTs, SSD-based and HDD-based, a
      special striping policy might be necessary to control the usage of the
      high-speed OSTs. The administrator might want to allow only a specific job
      to use the SSD-based OSTs. Also, by combining project quota with RTDS, the
      administrator can fully control and limit the space usage of the SSD-based
      OSTs.

      3) When OSTs are doing RAID rebuilds, there may be some performance
      degradation. This is one of the cases where OSTs have different performance
      levels which change dynamically over time. With RTDS, the administrator can
      change the striping weights according to the currently available bandwidth.
      Since RTDS is highly configurable, the administrator can implement many
      kinds of policies in user space, e.g. a policy based on free bandwidth, a
      policy based on free space, etc.

      4) In order to provide better QoS guarantees, the administrator might
      implement a policy that reserves the bandwidth of a subset of the OSTs. For
      example, some OSTs could be excluded from the striping policy and reserved
      for a certain job. Together with the TBF NRS policy, a series of QoS
      solutions could be implemented on top of RTDS.

      Currently, RTDS is only implemented for file striping. However, this general
      framework could be reused for directory striping too.

      If RTDS works well, the OST pool code could be removed entirely, since RTDS
      should be able to satisfy all the requirements that OST pools address. Given
      that only a few people are using OST pools (according to a simple survey
      that I did at LUG), replacing OST pools with RTDS might not be a bad idea.


          Activity


            Patrick, this is totally achievable with PFL in 2.10. The caveat is that the MDS will still try to avoid this case if possible (exclude stripes on the same OSTs if they were already allocated to the file), unless the requested stripe count for the component exceeds the number of available non-duplicate OSTs. In that case it will allow duplicate OST stripes to be allocated, to handle cases where most/all of the OSTs were used in the first components, then the user wants a fully-striped file for the last component.

            There is no actual restriction on this, and if you explicitly request OSTs via "lfs setstripe -o" for different components you can force them to be overloaded.

            adilger Andreas Dilger added a comment

            "... may need to break the no-same-OST rule in some cases..."

            While I haven't evaluated the larger proposals here, I wanted to note that I've had an idea on the backburner for a bit about controlled violations of this rule being highly desirable in certain cases.

            In particular, lock ahead is designed to address the case where we are limited
            to a single shared file, but each OST is significantly faster than one client.
            In a shared-file situation, LDLM locking behavior limits us to writing with one
            client to each OST, so we are unable to fully drive each OST for the shared
            file. This case is interesting enough for Cray to have driven all the work on
            lock ahead, which uses a library to achieve it.

            If we can put multiple stripes of the file on a single OST, we can essentially achieve the same thing, with far less effort. For a variety of reasons, this doesn't remove the need for lockahead (primarily because we cannot necessarily redefine file striping when we want to write to it), but it is much simpler, and highly desirable for that reason. In addition to the MPIIO aggregation case where we have well controlled I/O and are trying to maximize OST utilization, adding more stripes to a shared file also helps in cases where I/O is poorly controlled, so there are effectively more locks for the badly behaved writers to contend for.

            I've moved this as far as coding this up and testing it and observing it works great (once you've removed a number of OST-to-stripe-count ratio related sanity checks), but it needed a workable API and sane interaction/implementation with pools and default striping, and I didn't go that far (yet).

            So, in short, I think it would be very, very desirable if, in a controlled manner, we could ask for more than one stripe to be on a given OST. A simple example is something like "8 stripes but only on these 2 OSTs", giving 4 stripes per OST (and allowing 4 client writers per OST with no fancy locking work).

            This is a bit outside of the current discussion, but at the same time, it's a feature I'd love to see in any overhaul of the striping/allocation code. If it doesn't fit as part of this, I'll (eventually...) move ahead with it elsewhere.

            paf Patrick Farrell (Inactive) added a comment (edited)

            "Limit-of-no-same-OST: It should always be avoided to stripe more than one
            object of a file on the same OST, because that would cause not only performance
            problems but also a file size limitation problem. This is a hard limitation,
            which means that if it is impossible to ensure, a failure could be returned.

            Preference-of-no-same-OSS: It should usually be avoided to stripe more than one
            object of a file on the same OSS, because that would cause performance
            problems. However, this is not a hard limitation, which means that if it is
            impossible to ensure, it can be ignored.

            I think these two rules are implied by all striping policies. If these two
            rules didn't exist, the design of RTDS would be very different. Thus, I think
            it is helpful to list these two rules explicitly."

            I would agree that these are desirable properties for any allocator. One caveat is that PFL and FLR allocators may need to break the no-same-OST rule in some cases, but this is OK because the composite files will have components of limited size, so if an OST is used twice within a single file then the amount of conflict is limited. Any new allocator should take PFL into account, and we are also looking at how to improve the allocator for FLR (avoid mirror allocation in a single fault domain). This is easily done for avoiding allocations on the same OSS (avoid all OSTs on one NID) and possibly on the same failover pair, but LOD doesn't yet have enough info for avoiding the same controller, rack, UPS, which is why Jinshan was asking about this.

            "I think this parameter works like a mute button. And RTDS works like a volume knob."

            • One idea would be to allow setting the degraded parameter to a negative
            value, to reduce the weight of the OST in all of the RTDS policies when
            degraded, rather than turning it off completely. That would allow it to be
            controlled locally by the OSS in addition to a global control by the MDS.

            That said, several of the comments here make me think that you haven't had a close look at the existing allocator code to understand the details. There is already a mechanism in the round-robin allocator to avoid starting on the same OST each time, and also a mechanism to avoid allocating from the same OSS unless necessary. A lot of tuning went into that code to make it work well, and be fast, so it shouldn't be lost with the move to RTDS.

            I don't think it is practical to change the weights from userspace on a
            regular basis. This is something that would need to be done only periodically
            (at most every few seconds), and the kernel code would need to deal with the
            weights in a similar manner to the current round-robin and QoS code.

            To be honest, I can't really see the benefit of the RTDS tree as described instead of the current QoS weighted random allocator?

            adilger Andreas Dilger added a comment

            "PS: as Jinshan wrote, this definitely needs to work properly for striped
            files, as OST pools and QoS allocators do today. While many sites use 1-stripe
            files as the default today, I think that PFL will change this significantly in
            the future."

            Agreed. As discussed before, striped files can be implemented by changing the weights of OSTs dynamically when allocating objects for a single file.

            lixi Li Xi (Inactive) added a comment

            "There is already a mechanism for OSTs to indicate that they are undergoing
            RAID rebuild or are otherwise degraded - the ofd.*.degraded parameter, which
            can be set locally on the OSS and is communicated to the MDS to avoid
            allocations on that OST if possible."

            Good to know. I think this parameter works like a mute button, and RTDS works
            like a volume knob.

            lixi Li Xi (Inactive) added a comment

            "I'm not a big fan of removing OST pools for RTDS, unless RTDS essentially
            includes all of the functionality of OST pools (i.e. can be named and
            explicitly requested by the user). The OST pool functionality has gotten
            better in 2.9/2.10 in terms of inheriting defaults. Being able to specify OST
            pools/weights by UID/GID/jobid/projid (and filename, extension, etc.) is a
            great enhancement, but doesn't mean that OST pools should be removed.
            Essentially, OST pools is RTDS where the weight is 1 for every OST in the
            pool, and 0 for OSTs not in the pool."

            Yeah, I agree it is too aggressive to remove OST pools. I might need to get
            rid of this idea.

            Currently, an RTDS tree is allocated for each OST pool. That means OST pools
            are fully supported by RTDS. Also, I am going to implement
            UID/GID/jobid/projid matching rules for OST pools/RTDS, so RTDS will be able
            to add new features to OST pools.

            lixi Li Xi (Inactive) added a comment

            "Rather than using weighted random OST allocation, it would be better to do
            weighted round-robin allocation (see LU-9). The current QoS allocation is
            weighted random and lots of people complain about that because it doesn't
            allocate from OSTs very evenly in the short term (e.g. 3 objects from one OST,
            4 from most OSTs, and 5 from one OST, which makes the aggregate performance of
            one OST 5/4=25% slower than most of the others). Using weighted round-robin
            allocations would avoid this problem, regardless of how the weights are
            generated."

            Weighted round-robin allocation with even usage in the short term is not what
            RTDS is good at, because RTDS depends a lot on the random number generator.
            However, I think that by changing the random number generator to a sequence
            generator, it is possible to implement weighted round-robin allocation.

            Let's assume that there are N OSTs, and OST i has a weight of W[i]. In order
            to implement precise weighted round-robin allocation, the policy allocates
            S[N - 1] = W[0] + W[1] + W[2] + ... + W[N - 1] objects in each round, and in
            each round W[i] objects are allocated on OST i. A sequence of traversal values
            for the RTDS tree needs to be generated. The value sequence includes all the
            numbers 0, 1, 2, 3, ..., S[N - 1] - 1, but the order of the sequence needs to
            be permuted in order to avoid allocating objects on the same OST continuously.
            The simplest way of generating the sequence would be to shuffle the array
            randomly, e.g. using a Fisher–Yates shuffle (also known as a Knuth shuffle).
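
            A minimal sketch of generating such a sequence (user-space C, illustrative
            names only):

            #include <stdint.h>
            #include <stdlib.h>

            /* Fill seq[] with 0 .. total-1, then Fisher-Yates shuffle it.
             * total is S[N - 1], the sum of all weights. */
            static void rtds_shuffle_sequence(uint32_t *seq, uint32_t total)
            {
                    uint32_t i, j, tmp;

                    if (total < 2)
                            return;
                    for (i = 0; i < total; i++)
                            seq[i] = i;
                    for (i = total - 1; i > 0; i--) {
                            j = (uint32_t)rand() % (i + 1);   /* 0 <= j <= i */
                            tmp = seq[i];
                            seq[i] = seq[j];
                            seq[j] = tmp;
                    }
            }

            Each value of seq[] is then fed to the RTDS tree traversal in place of a
            fresh random number, so each round allocates exactly W[i] objects on OST i.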

            The maximum memory for the allocation sequence is determined by the maximum
            value of S[N - 1]. If the maximum weight is M, then the maximum value of
            S[N - 1] is M * N. For 8-bit weights, that is 255 * N bytes, so 1 MB of
            memory would be able to support 4112 OSTs. The time complexity of the random
            shuffle is O(M * N).

            I think this implementation would work for some use cases. However, if the
            weights of the OSTs differ hugely, shuffling or any other sequence generator
            won't help at all. For example, if one OST has 100% of the total weight, then
            that OST will always be selected continuously. Also, support for striped
            files needs extra effort in order to enforce Limit-of-no-same-OST or
            Preference-of-no-same-OSS.

            In order to select multiple objects on different OSTs/OSSs, RTDS needs to
            scan the sequence to choose multiple OSTs/OSSs, and then remove the selected
            values from the sequence. Thus, the time complexity of allocating a file with
            multiple objects would be O(M * N), whereas the time complexity of allocating
            a file with a single object is O(1). Also, this would require extra memory to
            save the changed sequence, which has a size of O(M * N).

            lixi Li Xi (Inactive) added a comment

            "For this patch to land, it should include the current QoS functionality, so
            that the OST weights are based on OST available space if not otherwise
            specified. Otherwise, we are duplicating functionality unnecessarily."

            The idea of RTDS is to implement a general framework with which the
            administrator or a user-space daemon can completely control the striping
            policy. So, instead of implementing the QoS functionality inside RTDS, a
            user-space tool should be provided. The tool will monitor the available space
            and set the OST weights from time to time. I would like to work on the tool
            in the following months. And it would be extremely easy to extend the tool to
            support QoS based on available bandwidth.
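
            One way such a tool might map free space to weights (purely illustrative; the
            function name and the 8-bit weight range are assumptions, not an existing
            interface):

            #include <stdint.h>

            /* Hypothetical user-space policy: derive 8-bit RTDS weights from
             * the OSTs' free space, to mimic the current free-space-based QoS. */
            static void weights_from_free_space(const uint64_t *free_bytes,
                                                uint8_t *weight, int n)
            {
                    uint64_t max = 1;
                    int i;

                    for (i = 0; i < n; i++)
                            if (free_bytes[i] > max)
                                    max = free_bytes[i];
                    /* scale so the emptiest OST gets the maximum weight 255 */
                    for (i = 0; i < n; i++)
                            weight[i] = (uint8_t)((free_bytes[i] * 255) / max);
            }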

            Do you think it is necessary to implement kernel-space QoS control of RTDS?
            My preference would be no.

            lixi Li Xi (Inactive) added a comment

            "Is it possible to enhance the arrangement of the tree by clusters of OSTs?
            This can be easily done by encoding some high bits of the weight with
            geographic information about the OSTs, so that it is easier for the LOD code
            to allocate an object from a cluster of OSTs, where they usually belong to
            the same OSS or rack.

            Let's say the weight is a 32-bit integer, and the highest byte is used to
            encode the rack number, and the 2nd highest byte the OSS number. Then, for
            multiple object allocations for striped files, the LOD can pack a number as
            <rack> + <OSS> + <random> for the RTDS tree traversal, and then change the
            rack or OSS and a new random number to allocate the next one, etc."

            In RTDS, the weight is the value that determines the probability of selecting
            an OST. I might not understand your idea completely; I am not sure how we can
            ensure the rule Limit-of-no-same-OST by changing the meaning of the weight.

            lixi Li Xi (Inactive) added a comment

            I think there are some assumptions or preconditions when we discuss striping
            policy. I think we all agree with the following:

            Limit-of-no-same-OST: It should always be avoided to stripe more than one
            object of a file on the same OST, because that would cause not only
            performance problems but also a file size limitation problem. This is a hard
            limitation, which means that if it is impossible to ensure, a failure could
            be returned.

            Preference-of-no-same-OSS: It should usually be avoided to stripe more than
            one object of a file on the same OSS, because that would cause performance
            problems. However, this is not a hard limitation, which means that if it is
            impossible to ensure, it can be ignored.

            I think these two rules are implied by all striping policies. If these two
            rules didn't exist, the design of RTDS would be very different. Thus, I think
            it is helpful to list these two rules explicitly.

            "Then you have to copy the RTDS tree every single time objects for striped
            files are allocated."

            I just realized that it is unnecessary to copy the entire RTDS tree. The RTDS
            tree shape would be the same for all weights, so only the weight array needs
            to be copied. Currently the weight value is a 64-bit integer, but I think 8
            bits would be completely enough. What do you think? If the weight value is
            one byte, then each OST needs 5 bytes in the weight array, because the weight
            sum needs 4 bytes (or 3 bytes, which is enough to support 65536 OSTs, but 4
            bytes is simpler). So, by copying only 4 KB of memory, 819 OSTs can be
            supported.

            As long as we can change the weights of the OSTs dynamically while allocating
            objects for a single file, things become much easier. Limit-of-no-same-OST
            can be implemented by changing the weights of the already selected OSTs to
            zero. Preference-of-no-same-OSS can be implemented in a similar way: first
            change the weights of the OSTs on the selected OSSs to zero, and then change
            them back if no available OSTs are left. (A sketch follows.)
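
            A minimal sketch of the no-same-OST part, reusing the hypothetical
            rtds_build()/rtds_pick_ost() helpers sketched in the description above
            (per-file copy of the weights, zeroing each selected OST; N is again assumed
            to be a power of two):

            #include <stdint.h>
            #include <stdlib.h>
            #include <string.h>

            /* Allocate up to `stripes` objects on distinct OSTs by copying the
             * weights for this file and zeroing each OST once selected.
             * Returns the number of OSTs written into out[]. */
            static int rtds_pick_stripes(const uint64_t *weights, int n,
                                         int stripes, int *out)
            {
                    uint64_t *w = malloc(n * sizeof(*w));
                    uint64_t *pre = malloc(n * sizeof(*pre));
                    uint64_t *tree = malloc(2 * n * sizeof(*tree));
                    int i, k, found = 0;

                    if (!w || !pre || !tree)
                            goto out;
                    memcpy(w, weights, n * sizeof(*w));

                    for (k = 0; k < stripes; k++) {
                            uint64_t total = 0;

                            /* rebuild prefix sums and tree with current weights */
                            for (i = 0; i < n; i++) {
                                    total += w[i];
                                    pre[i] = total;
                                    tree[n + i] = w[i];
                            }
                            if (total == 0)
                                    break;          /* no eligible OST left */
                            rtds_build(tree, pre, 1, 0, n);

                            i = rtds_pick_ost(tree, n, total);
                            out[found++] = i;
                            w[i] = 0;               /* Limit-of-no-same-OST */
                    }
            out:
                    free(tree);
                    free(pre);
                    free(w);
                    return found;
            }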

            As mentioned before, multiple stripes for one file would cause the real
            ratios of objects to differ from the configured weights. However, I think in
            all cases the following statement is true:

            positive-correlation-between-the-probability-and-weight: The statistical
            probability of selecting an OST is not reduced by increasing the weight of
            that OST.

            And I think in most (if not all) cases, the following statement is true:

            same-order-for-probability-and-weight: If an OST A has a smaller weight than
            an OST B, then the probability of selecting A is not higher than the
            probability of selecting B.

            Ideally, we should find a way to prove
            positive-correlation-between-the-probability-and-weight. And it would be
            really helpful if we could find all the possible conditions under which
            same-order-for-probability-and-weight is false.

            Following is an example:

            Let's assume there are only three OSTs, with weights 1, 2 and 4, and that all
            files are created with a stripe count of 2. Then the real allocation ratio
            would be (1 + 2 * 1 / (1 + 4) + 4 * 1 / (1 + 2)) : (1 * 2 / (2 + 4) + 2 +
            4 * 2 / (1 + 2)) : (1 * 4 / (2 + 4) + 2 * 4 / (1 + 4) + 4), i.e. 41 : 75 : 94.
            This is different from 1 : 2 : 4, but still keeps
            same-order-for-probability-and-weight.
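
            A quick check of this arithmetic (each value is the expected number of
            objects on an OST per file, scaled by the common denominator 105):

            #include <stdio.h>

            int main(void)
            {
                    double w[3] = { 1, 2, 4 };
                    double total = 7, e;
                    int i, j;

                    for (i = 0; i < 3; i++) {
                            /* picked as the first stripe */
                            e = w[i] / total;
                            /* picked as the second stripe, after some OST j */
                            for (j = 0; j < 3; j++)
                                    if (j != i)
                                            e += (w[j] / total) *
                                                 (w[i] / (total - w[j]));
                            printf("OST%d: %.0f\n", i, e * 105);  /* 41, 75, 94 */
                    }
                    return 0;
            }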

            lixi Li Xi (Inactive) added a comment

            Andreas and Jinshan, thank you so much! Your comments are really helpful for me to understand the requirements!

            lixi Li Xi (Inactive) added a comment

            People

              Assignee: lixi_wc Li Xi
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 18
