
LU-9809 - RTDS (Real-Time Dynamic Striping): A policy-based striping framework

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, there are several ways to control the striping of files:

      1) Default striping inherited from parent directory

      2) OST pool based striping inherited from parent directory

      3) Specific striping on a user-defined set of OSTs, by creating the file with the
      O_LOV_DELAY_CREATE open flag and then calling ioctl(LL_IOC_LOV_SETSTRIPE)

      The default or OST pool based striping tries to ensure that space is evenly
      used across OSTs and OSSs, so the algorithm allocates OST objects according to
      the free space on the OSTs. The algorithm is implemented in kernel-space code,
      which means administrators have almost no way to change the striping policy,
      even if they realize there is a better policy than the current one.

      An application can certainly control the striping of the files it creates by
      using the O_LOV_DELAY_CREATE open flag and ioctl(LL_IOC_LOV_SETSTRIPE). However,
      changing application code is usually difficult and simply never happens for most
      applications. And an individual application has neither enough information nor
      the capability to enforce a specific system-wide striping policy.

      That is why we implemented an entirely new striping framework named
      RTDS (Real-Time Dynamic Striping).

      RTDS controls the striping based on the allocation weights of the OSTs. When
      allocating an OST object, RTDS randomly chooses an OST. The probability of
      choosing a given OST is proportional to the OST's weight. An allocation weight
      is an unsigned number which can be configured by a user-space daemon. If an OST
      has a weight of zero, then none of the newly created objects will be allocated
      on that OST. Assume that there are N OSTs, OST i has a weight of W[i], and the
      probability of allocating an object on OST i is P[i]; then:

      S[i] = W[0] + W[1] + W[2] + ... + W[i]
      P[i] = W[i] / S[N - 1]
      1 = P[0] + P[1] + P[2] + ... + P[N - 1]

      In the implementation of RTDS, an RTDS tree is used to choose the OST according
      to the weights. An RTDS tree is a binary tree with the following features:

      1) The leaves of the RTDS tree form an array holding the weight values, i.e. W[i];

      2) The value of a non-leaf node is S[x], where x is the biggest index of the leaves in its
      left sub-tree;

      3) The left sub-tree of a non-leaf node is always a complete binary tree.

      Because of rule 3), if an RTDS tree has N leaves (N > 2), then its left sub-tree
      has round_down_power_of_2(N) leaves, where round_down_power_of_2(N) is the largest
      power of two that is smaller than N.

      (Figure: the allocation tree for 8 OSTs.)

      When choosing an OST to allocate an object on, the RTDS policy first generates
      a random value between 0 and S[N - 1] - 1. The value is used to traverse the
      RTDS tree from the root node. If the random value is smaller than a non-leaf
      node's value, the left sub-tree is chosen, otherwise the right sub-tree is
      chosen. Using this policy, objects are allocated on OSTs randomly, while the
      striping ratios between OSTs are preserved.
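
      To make the tree layout and the traversal concrete, here is a minimal user-space
      sketch in Python. It only illustrates the idea (it is not the LOD implementation);
      the names build_rtds_tree/rtds_pick and the example weights are made up:

      import random

      def round_down_power_of_2(n):
          # Largest power of two that is smaller than n (assumes n > 2).
          p = 1
          while p * 2 < n:
              p *= 2
          return p

      def build_rtds_tree(prefix, lo, hi):
          # Build the RTDS sub-tree over leaf indices lo..hi (inclusive).
          # Leaves are ('leaf', i); non-leaf nodes are ('node', S[x], left, right),
          # where x is the biggest leaf index in the left sub-tree and
          # S[x] = W[0] + W[1] + ... + W[x].
          if lo == hi:
              return ('leaf', lo)
          n = hi - lo + 1
          left_leaves = 1 if n == 2 else round_down_power_of_2(n)
          x = lo + left_leaves - 1
          return ('node', prefix[x],
                  build_rtds_tree(prefix, lo, x),
                  build_rtds_tree(prefix, x + 1, hi))

      def rtds_pick(tree, total):
          # Pick an OST index with probability W[i] / S[N - 1]; assumes total > 0.
          r = random.randrange(total)               # random value in 0 .. S[N - 1] - 1
          node = tree
          while node[0] == 'node':
              _, s, left, right = node
              node = left if r < s else right       # compare against S[x], no subtraction needed
          return node[1]

      weights = [3, 1, 0, 4, 2, 2, 1, 3]            # example weights; OST 2 never gets objects
      prefix, s = [], 0
      for w in weights:
          s += w
          prefix.append(s)                          # prefix[i] == S[i]
      tree = build_rtds_tree(prefix, 0, len(weights) - 1)
      print(rtds_pick(tree, prefix[-1]))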

      Currently, only a single RTDS tree is used for default striping. In the future,
      multiple RTDS trees will be supported, one RTDS tree for each striping policy,
      and matching rules will be used to determine which of the striping policies
      should be applied to a given file. The matching rules include a series of rules
      based on the attributes of the newly created file or of the process that is
      creating the file, e.g. UID, GID, NID, Project ID, Job ID, etc.

      By using RTDS, administrators will be able to fully control the striping
      policies. Interfaces for configuring the RTDS weights and matching rules will
      enable a number of new use cases, including:

      1) When new OSTs are added to the file system, the administrator might want
      to quickly balance the space usage between the old OSTs and the empty ones. The
      striping policy should therefore allocate more new objects on the new OSTs than
      on the old OSTs. The administrator can configure higher weights for the empty
      OSTs, which will eventually lead to more balanced space usage on all OSTs.

      2) If the file system has two kinds of OSTs, SSD-based OSTs and HDD-based OSTs,
      then a special kind of striping policy might be necessary to control the usage
      of the high-speed OSTs. The administrator might want to allow only a specific
      job to use the SSD-based OSTs. Also, by combining project quota with RTDS, the
      administrator can fully control and limit the space usage of the SSD-based
      OSTs.

      3) When OSTs are doing RAID rebuilding, there might be some performance
      degradation. That is one of the cases where OSTs have different performance
      characteristics which change dynamically from time to time. With RTDS, the
      administrator can change the striping weights according to the currently
      available bandwidth. Since RTDS is highly configurable, the administrator can
      implement many kinds of policies in user space, e.g. a policy based on free
      bandwidth, a policy based on free space, etc.

      4) In order to provide better QoS guarantees, the administrator might implement
      a policy to reserve the bandwidth of a certain subset of OSTs. For example, some
      OSTs could be excluded from the striping policy and reserved for a certain job.
      Together with the TBF NRS policy, a series of QoS solutions could be implemented
      based on RTDS.

      Currently, RTDS is only implemented for file striping. However, this general
      framework could be re-used for directory striping too.

      If RTDS works well, the OST pool code could be removed completely, since RTDS
      should be able to satisfy all the requirements that OST pools address. Given
      the fact that only a few people are using OST pools (according to a simple
      survey that I did at LUG), replacing OST pools with RTDS might not be a bad
      idea.

      Attachments

        Issue Links

          Activity


            adilger Andreas Dilger added a comment:

            I think one important thing to remember is that large imbalanced weights should be fairly rare. One of the more important benefits of the weighted round-robin allocator should be that it is always rebalancing the system. If the OST selection algorithms are stable (i.e. don't pile hundreds of new files on a single OST that has no IO load because it was just rebooted) then they should slowly drive all weights to be balanced all the time, so there should only be a little error accumulated on any given round.

            I agree with Nathan's suggestion that for widely-striped files we should select from the sorted list of available OSTs based on the minimum error, so that we accumulate the least additional error when using those OSTs. The current object allocation policy allows a file to be created if it has at least 3/4 of the requested stripes, so that file creation is not blocked/failed if some OSTs are offline (better to have 75% IO speed than none at all).


            nrutman Nathan Rutman added a comment:

            diffusion allocator.xlsx (attached)
            As discussed in LAD, using a random-number based selection process (as in the current QoS) has a couple of problems:
            1. Allocations will not be "regular" - randomly we may use the same OST a number of times in a row
            2. Allocations are not predictable - hard to debug when and why particular OSTs are chosen.

            Instead, we should aim for fully-predictable, regular allocations. Andreas' "error diffusion allocator" idea seems fine; I've attached a spreadsheet showing how one implementation might work. Weights are still input from userspace (a good idea), and the "error" (the difference between some fixed reference (e.g. max weight) and the per-OST weight) builds up until, after some threshold, the OST is skipped and the error is reset. You can change weights at any time. It can accommodate the relative weights and the RTDS tree - a form of weighted round-robin.

            One other aspect needs to be addressed, which is that if a single file needs a certain number of OSTs (stripes), then the threshold might have to move to allow sufficient OSTs - we can't "randomly choose another OST" like you can with a probabilistic method. So instead of the output being a binary use/don't use, we need to end up with a cumulative-error sorted list, and if stripes_requested > stripes_available, change threshold so that stripes_avail = stripes_requested, for this allocation only. (Normally you want avail >> requested, so that we get round-robin type behavior most of the time.)
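
            (The attached spreadsheet is the authoritative description; the Python sketch below is only a rough reading of the idea. The class name, the reset rule and the threshold handling are illustrative guesses, not the attached design.)

            class DiffusionAllocator:
                """Weighted round-robin by error diffusion (illustrative only)."""

                def __init__(self, weights, threshold):
                    self.weights = list(weights)        # per-OST weights, set from user space
                    self.error = [0] * len(weights)     # accumulated per-OST "error"
                    self.threshold = threshold

                def allocate(self, stripes_requested):
                    n = len(self.weights)
                    ref = max(self.weights)             # fixed reference, e.g. the max weight
                    # each round, every OST accumulates the gap between the reference and its weight
                    for i in range(n):
                        self.error[i] += ref - self.weights[i]
                    # consider OSTs in order of least accumulated error first
                    order = sorted(range(n), key=lambda i: self.error[i])
                    # OSTs past the threshold are normally skipped this round and their error reset,
                    # unless the file needs more stripes than would otherwise remain available, in
                    # which case the threshold is in effect raised for this allocation only
                    usable = [i for i in order if self.error[i] <= self.threshold]
                    if len(usable) < stripes_requested:
                        usable = order[:stripes_requested]
                    for i in order:
                        if self.error[i] > self.threshold and i not in usable:
                            self.error[i] = 0
                    return usable[:stripes_requested]

            alloc = DiffusionAllocator(weights=[4, 4, 2, 1], threshold=6)
            for _ in range(5):
                # max-weight OSTs 0 and 1 are used every round; OSTs 2 and 3 are skipped periodically
                print(alloc.allocate(3))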


            lixi Li Xi (Inactive) added a comment:

            "If we can put multiple stripes of the file on a single OST, we can essentially achieve the same thing, with far less effort."

            As discussed before, putting multiple stripes on a single OST can be implemented in RTDS by configuring relative weights. And a policy can be defined for this, which will enforce it only for a selected set of files (for example, only the files with a given project ID), while using normal striping policies for other files.


            lixi Li Xi (Inactive) added a comment:

            "I would agree that these are desirable properties for any allocator. One caveat is that PFL and FLR allocators may need to break the no-same-OST rule in some cases, but this is OK because the composite files will have components of limited size, so if an OST is used twice within a single file then the amount of conflict is limited. Any new allocator should take PFL into account, and we are also looking at how to improve the allocator for FLR (avoid mirror allocation in a single fault domain). This is easily done for avoiding allocations on the same OSS (avoid all OSTs on one NID) and possibly on the same failover pair, but LOD doesn't yet have enough info for avoiding the same controller, rack, UPS, which is why Jinshan was asking about this."

            Thanks for the info. I didn't realize the special requirement of PFL and FLR.

            The allocator indeed needs to collect a lot more information about the system to make advanced striping decisions. However, I don't think it is practical to collect all the original info in kernel-space LOD. The information needed by the allocator might change from time to time. As you have said, the NID, controller, rack, and UPS of the OSTs are needed. And if the OSTs have different specifications, the information becomes even more complex. The storage types (SSD vs. HDD), network types (IB vs. Ethernet), controller specification and other information might be needed by the allocator. Of course, support for collecting more original info can be added into LOD gradually. But that would make the code complex and hard to maintain.

            That is part of the reason why I am trying to abstract this info into OST weights that change dynamically. Ideally, when we are adding new features to the allocator, the main framework of kernel-space code can be kept untouched, and we only need to change the user-space daemon that updates the OST weights.

            The current RTDS implementation enables a lot of potential new features, e.g. balancing of bandwidth between OSTs. And I'd like to emphasize that these features can be implemented by changing only the user-space tool. That means an experienced administrator could implement a customized policy by writing/changing a Python/Shell script without even knowing how to do kernel-space development.

            As an abstraction of the various requirements, I think we can introduce a new concept: relative allocation weight between OSTs. The relative weight is a value between two OSTs. Let's say the relative weight of OST i and OST j is RW(i, j). RW(i, j) means that if OST i is selected for a stripe of a file, then the weight of OST j (originally W[j]) should be changed to W[j] = W[j] * RW(i, j) when allocating the next stripe of the file. RW(i, j) can be a value between zero and infinity. RW(i, j) is usually equal to RW(j, i), but they could be different. Like the weights of OSTs, relative weights between OSTs could be changed from user-space daemons. By using relative weights, a lot of features are enabled:

            1) RW(i, i) can be set to zero to avoid allocating more than one object on OST i for a single file.

            2) For i != j, one is the default value of RW(i, j) for a simple allocation policy, because usually the OSTs are considered unrelated.

            3) RW(i, j) can be set to smaller-than-one value if we prefer not to locate a stripe of a file on OST j if the file has a stripe on OST i. One example of this is that OST i and OST j are on the same OSS. We might want to avoid locating two stripes on the same OSS for performance goals.

            4) RW(i, j) can be set to bigger-than-one value if we prefer to locate a stripe of a file on OST j if the file has a stripe on OST i. One example of this is that OST i and OST j have the same specification (e.g. SSD/HDD based OSTs). We might prefer to use same kind of OSTs for the stripes of a file.

            5) An infinite (+INFI) value can be set for RW(i, j) if we want to ensure that the next stripe of the file will be on OST j whenever the file has a stripe on OST i. One example of this is: we want to locate all of the remaining stripes of a file on the same selected OST, so we set RW(i, i) to +INFI. As an implementation detail, instead of actually doing the calculation of W[j] = W[j] * RW(i, j) when RW(i, j) is +INFI, we can set W[j] to one, and the other weights to zero.

            It might be complex to analyze how relative weights would affect the process of file striping if relative weights are configured without taking enough care. But things would be much simpler if all the relative weights are 0, 1/2, 1, 2, or +INFI. And with these relative weight options, we can already implement a lot of features (a rough sketch follows below).
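
            (To illustrate how relative weights could drive multi-stripe allocation, here is a rough Python sketch. The function name, the example RW matrix and the use of a plain weighted random pick instead of the RTDS tree are only for illustration, not a concrete design.)

            import random

            INF = float('inf')

            def pick_stripes(base_weights, rw, stripe_count):
                # Allocate stripe_count objects for one file; after each pick of OST i,
                # scale every W[j] by the relative weight RW(i, j) before the next pick.
                # Assumes at least one OST keeps a positive weight at every step.
                w = list(base_weights)                  # per-file working copy of the weights
                stripes = []
                for _ in range(stripe_count):
                    i = random.choices(range(len(w)), weights=w, k=1)[0]
                    stripes.append(i)
                    if any(rw[i][j] == INF for j in range(len(w))):
                        # +INFI shortcut described above: force the following stripes onto OST j
                        w = [1 if rw[i][j] == INF else 0 for j in range(len(w))]
                    else:
                        w = [w[j] * rw[i][j] for j in range(len(w))]
                return stripes

            # Example: 4 OSTs; OSTs 0/1 share one OSS, OSTs 2/3 share another.
            # RW(i, i) = 0 enforces no-same-OST, RW = 1/2 discourages reusing the same OSS.
            weights = [4, 4, 4, 4]
            rw = [[0, 0.5, 1, 1],
                  [0.5, 0, 1, 1],
                  [1, 1, 0, 0.5],
                  [1, 1, 0.5, 0]]
            print(pick_stripes(weights, rw, 3))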

            "That said, several of the comments here make me think that you haven't had a close look at the existing allocator code to understand the details. There is already a mechanism in the round-robin allocator to avoid starting on the same OST each time, and also a mechanism to avoid allocating from the same OSS unless necessary. A lot of tuning went into that code to make it work well, and be fast, so it shouldn't be lost with the move to RTDS."

            I must admit that I haven't looked into all the details of the current round-robin or QoS allocator. But I am surely aware that the current allocators handle a lot of details properly and work pretty well for what they were designed to do. I am not expecting that RTDS can be used without a lot of tuning, and I think that will take months of effort. But I think as soon as we get a stable and fast RTDS framework, adding more features will be much simpler than before.

            "I don't think it is practical to be changing the weights from userspace on a regular basis. This is something that would need to be done only periodically (at most every few seconds), and the kernel code would need to deal with the weights in a similar manner as the current round-robin and QoS code does today."

            It is of course possible to implement a mechanism in kernel space that updates the OST weights. But I still don't understand why a user-space daemon can't do the work properly. If the daemon needed to change the weights every microsecond, that would be a problem. But since the weights only need to be updated every few seconds, there shouldn't be any problem. And the biggest advantage of implementing the mechanism in a user-space tool is flexibility. A tool written in C/Python/Shell or any other language can be used. And the tool can be changed whenever it is found not to work well. I don't think we have this flexibility if it is implemented completely in LOD. Also, collecting information in a user-space daemon is much simpler than inside LOD. "ssh" commands can be used by the tool to log in to the OSSes and collect all of the necessary statistics.

            As a side example, we have implemented a tool (LIME, https://github.com/DDNStorage/Lime) which collects the real-time performance statistics of a job and changes the TBF rates every second to provide QoS guarantees or enforce performance limitations. This tool is written in Python, and it works well, at least in a very small cluster. I think a similar tool can be implemented for RTDS.
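
            (To make the daemon idea concrete, here is a very small Python sketch. None of the interfaces used below exist yet: get_ost_kbytes_free() and set_rtds_weight() are placeholders that could, for example, be built on "lfs df" output and on a future RTDS configuration parameter.)

            import time

            def get_ost_kbytes_free():
                # Placeholder: return {ost_index: kbytes_free}, e.g. parsed from "lfs df"
                # or collected from the OSSes over ssh.
                raise NotImplementedError

            def set_rtds_weight(ost_index, weight):
                # Placeholder: push one OST weight to the MDS through a (not yet
                # existing) RTDS configuration interface.
                raise NotImplementedError

            def rebalance_loop(interval=5):
                # Every few seconds, derive weights from free space so that emptier
                # OSTs receive proportionally more new objects (use case 1 above).
                while True:
                    free = get_ost_kbytes_free()
                    total = sum(free.values()) or 1
                    for ost, kbytes in free.items():
                        # scale to an unsigned integer weight; emptier OSTs get larger weights
                        set_rtds_weight(ost, int(1000 * kbytes / total))
                    time.sleep(interval)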

            "To be honest, I can't really see the benefit of the RTDS tree as described instead of the current QoS weighted random allocator?"

            I fully agree that the current QoS weighted random allocator works pretty well if there were no upcoming requirements for new features. And there might be a better structure than the RTDS tree which can do similar things. But to support the features that the RTDS tree can, the structure needs to:

            1) Support quick update of weights from user space tool. RTDS can finish the update in time of O(N), with N as the OST number. And no memory allocation is needed in this process.

            2) Support dynamic weight changes during the process of allocating multiple objects for a single file. The RTDS tree can support this by copying the weight array. That means each allocating thread only needs to hold a read lock of the tree while copying the weight array, and then the weights can be changed freely without influencing other threads.

            3) Support weighted round-robin. RTDS tree can support it by changing the random value generator to a sequence generator based on weights. The time/memory cost has been analyzed before, which I think is affordable.

            I will need to investigate the current weighted random allocator more to check whether RTDS tree can be replaced by it. But I still like the design of RTDS tree, because it is simple and at the same time is very powerful.


            adilger Andreas Dilger added a comment:

            Patrick, this is totally achievable with PFL in 2.10. The caveat is that the MDS will still try to avoid this case if possible (exclude stripes on the same OSTs if they were already allocated to the file), unless the requested stripe count for the component exceeds the number of available non-duplicate OSTs. In that case it will allow duplicate OST stripes to be allocated, to handle cases where most/all of the OSTs were used in the first components, then the user wants a fully-striped file for the last component.

            There is no actual restriction on this, and if you explicitly request OSTs via "lfs setstripe -o" for different components you can force them to be overloaded.


            "... may need to break the no-same-OST rule in some cases..."

            While I haven't evaluated the larger proposals here, I wanted to note that I've had an idea on the backburner for a bit about controlled violations of this rule being highly desirable in certain cases.

            In particular, lock ahead is designed to address the case where we are limited to a single shared file, but each OST is significantly faster than one client. In a shared file situation, LDLM locking behavior limits us to writing with one client to each OST, so we are unable to fully drive each OST for the shared file. This case is interesting enough for Cray to have driven all the lock ahead work, using a library to achieve it.

            If we can put multiple stripes of the file on a single OST, we can essentially achieve the same thing, with far less effort. For a variety of reasons, this doesn't remove the need for lockahead (Primarily because we cannot necessarily redefine file striping when we want to write to it), but it is much simpler, and highly desirable for that reason. In addition to the MPIIO aggregation case where we have well controlled I/O and are trying to maximize OST utilization, adding more stripes to a shared file also helps in cases where I/O is poorly controlled, so there are effectively more locks for the badly behaved writers to contend for.

            I've moved this as far as coding this up and testing it and observing it works great (once you've removed a number of OST-to-stripe-count ratio related sanity checks), but it needed a workable API and sane interaction/implementation with pools and default striping, and I didn't go that far (yet).

            So, in short, I think it would be very, very desirable if, in a controlled manner, we could ask for more than one stripe to be on a given OST. A simple example is something like "8 stripes but only on these 2 OSTs", giving 4 stripes per OST (and allowing 4 client writers per OST with no fancy locking work).

            This is a bit outside of the current discussion, but at the same time, it's a feature I'd love to see in any overhaul of the striping/allocation code. If it doesn't fit as part of this, I'll (eventually...) move ahead with it elsewhere.


            adilger Andreas Dilger added a comment:

            "Limit-of-no-same-OST: It should always be avoided to stripe more than one object of a file on the same OST, because that would cause not only a performance problem but also a file size limitation problem. This is a hard limitation, which means, if it is impossible to ensure that, a failure could be returned.

            Preference-of-no-same-OSS: It should usually be avoided to stripe more than one object of a file on the same OSS, because that would cause a performance problem. However, this is not a hard limitation, which means if it is impossible to ensure that, this limitation could be ignored.

            I think these two rules are implied by all striping policies. If these two rules didn't exist, the design of RTDS would be very different. Thus, I think it is helpful to list these two rules explicitly."

            I would agree that these are desirable properties for any allocator. One caveat is that PFL and FLR allocators may need to break the no-same-OST rule in some cases, but this is OK because the composite files will have components of limited size, so if an OST is used twice within a single file then the amount of conflict is limited. Any new allocator should take PFL into account, and we are also looking at how to improve the allocator for FLR (avoid mirror allocation in a single fault domain). This is easily done for avoiding allocations on the same OSS (avoid all OSTs on one NID) and possibly on the same failover pair, but LOD doesn't yet have enough info for avoiding the same controller, rack, UPS, which is why Jinshan was asking about this.

            "I think this parameter works like a mute button. And RTDS works like a volume knob."

            • One idea would be to allow setting the degraded parameter to a negative value, to allow it to reduce the weight of the OST in all of the RTDS policies when degraded, rather than turning it off completely. That allows it to be controlled locally by the OSS in addition to a global control by the MDS.

            That said, several of the comments here make me think that you haven't had a close look at the existing allocator code to understand the details. There is already a mechanism in the round-robin allocator to avoid starting on the same OST each time, and also a mechanism to avoid allocating from the same OSS unless necessary. A lot of tuning went into that code to make it work well, and be fast, so it shouldn't be lost with the move to RTDS.

            I don't think it is practical to be changing the weights from userspace on a regular basis. This is something that would need to be done only periodically (at most every few seconds), and the kernel code would need to deal with the weights in a similar manner as the current round-robin and QoS code does today.

            To be honest, I can't really see the benefit of the RTDS tree as described instead of the current QoS weighted random allocator?


            lixi Li Xi (Inactive) added a comment:

            "PS: as Jinshan wrote, this definitely needs to work properly for striped files, as OST pools and QoS allocators do today. While many sites use 1-stripe files as the default today, I think that PFL will change this significantly in the future."

            Agreed. As discussed before, support for striped files can be implemented by changing the weights of OSTs dynamically when allocating objects for a single file.


            lixi Li Xi (Inactive) added a comment:

            "there is already a mechanism for OSTs to indicate that they are undergoing RAID rebuild or are otherwise degraded - the ofd.*.degraded parameter, which can be set locally on the OSS and is communicated to the MDS to avoid allocations on that OST if possible."

            Good to know. I think this parameter works like a mute button, and RTDS works like a volume knob.


            lixi Li Xi (Inactive) added a comment:

            "I'm not a big fan of removing OST pools for RTDS, unless RTDS essentially includes all of the functionality of OST pools (i.e. can be named and explicitly requested by the user). The OST pool functionality has gotten better in 2.9/2.10 in terms of inheriting defaults. Being able to specify OST pools/weights by UID/GID/jobid/projid (and filename, extension, etc) is a great enhancement, but doesn't mean that OST pools should be removed. Essentially, OST pools is RTDS where the weight is 1 for every OST in the pool, and 0 for OSTs not in the pool."

            Yeah, I agree it is too aggressive to remove OST pools. I might need to get rid of this idea.

            Currently, an RTDS tree is allocated for each OST pool. That means the OST pool is fully supported by RTDS. Also, I am going to implement UID/GID/jobid/projid matching rules for OST pools/RTDS. So RTDS will be able to add new features to OST pools.


            People

              Assignee: lixi_wc Li Xi
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 18
