Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 2.16.0

    Description

      For isolation of workloads across multiple sub-tenants of a filesystem, it would be useful to allow registering an NRS TBF rule for a nodemap. This can be proxied to some extent by setting a TBF rule for a project ID, but this doesn't work if there are multiple project IDs used by a single nodemap.
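For background, NRS TBF (Token Bucket Filter) rate-limits RPCs per scheduling class on the server by granting tokens at a configured rate. A minimal sketch of the token-bucket mechanism itself (illustrative Python, not the Lustre implementation; all names here are made up):

```python
import time

class TokenBucket:
    """Minimal token bucket: allows up to `rate` requests per second,
    with bursts bounded by `depth` accumulated tokens."""

    def __init__(self, rate, depth, now=time.monotonic):
        self.rate = float(rate)    # tokens added per second
        self.depth = float(depth)  # maximum stored tokens (burst size)
        self.tokens = float(depth)
        self.now = now
        self.last = now()

    def allow(self):
        """Refill tokens for the elapsed time, then consume one if available."""
        t = self.now()
        self.tokens = min(self.depth, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-nodemap TBF policy would amount to keeping one such bucket per nodemap-derived class instead of per NID, jobid or project ID.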

Activity

            [LU-17902] add NRS TBF policy for nodemap
            qian_wc Qian Yingjin added a comment -

> improved "fair share" balancing between buckets

What does "fair share" mean here?

             

I have a question about nodemap TBF support:

Should we use @lu_nodemap.nm_id or @lu_nodemap.nm_name as the key for the TBF scheduling class?

I think either can be used as a key to identify a nodemap.

            qian_wc Qian Yingjin added a comment -

Please note that a Lustre client should not be rate-controlled by two different TBF classes.

See LU-7982 for details:

When using JobID-based TBF rules, if multiple jobs run on the same client, the RPC rates of those jobs affect each other. More precisely, a job with a high RPC rate limit may actually get a slow RPC rate, because a job with a lower RPC rate limit can exhaust the max-in-flight-RPC-number limit or the max-cache-pages limit.
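The interference mechanism described in LU-7982 can be illustrated with a toy model (a made-up discrete-time simulation, not Lustre code): two jobs on one client share a single max-in-flight RPC window, and RPCs of the slower class sit in the window waiting for tokens, crowding out the faster class.

```python
def simulate(rate_fast, rate_slow, max_in_flight, ticks):
    """Toy model of LU-7982: one client, two jobs ('F' fast, 'S' slow)
    sharing one max-in-flight RPC window. Each tick the client refills
    free slots round-robin, and the server completes at most rate_fast
    'F' RPCs and rate_slow 'S' RPCs (its per-class token budget)."""
    in_flight = []            # tags of outstanding RPCs, oldest first
    done = {'F': 0, 'S': 0}   # completed RPCs per job
    turn = 0
    for _ in range(ticks):
        while len(in_flight) < max_in_flight:      # client fills the window
            in_flight.append('F' if turn % 2 == 0 else 'S')
            turn += 1
        budget = {'F': rate_fast, 'S': rate_slow}  # tokens this tick
        still_waiting = []
        for tag in in_flight:                      # server completes RPCs
            if budget[tag] > 0:
                budget[tag] -= 1
                done[tag] += 1
            else:
                still_waiting.append(tag)          # stuck in the window
        in_flight = still_waiting
    return done
```

With rate_fast=4, rate_slow=1 and an 8-RPC window, the fast job degrades toward the slow job's rate because queued slow-class RPCs fill the shared window; with matched rates it sustains its full limit.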

            qian_wc Qian Yingjin added a comment -

I am currently working on TBF for project ID.

If "This can be proxied to some extent by setting a TBF rule for a project ID" is the direction, then we can borrow from the PCC code's implementation to aggregate a project ID range, so that all requests in that range share a TBF rate.

We have a patch for PCC implementing similar functionality:

LU-13881 pcc: comparator support for PCC rules

https://review.whamcloud.com/c/fs/lustre-release/+/39585

The TBF rule code can borrow from PCC to define a rule over a range of PROJID values with '<' or '>' comparators.

             

            We also have a patch for aggregate shared rate limiting for TBF:

            https://review.whamcloud.com/#/c/fs/lustre-release/+/56351/

             

            i.e. 

            lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start sharerate projid>{100}&projid<{1000} rate=3000 share=1".
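To make the comparator semantics concrete, here is a small sketch of how a rule expression like projid>{100}&projid<{1000} could be parsed and matched (illustrative Python only; the real parsing lives in the patch above, and this is a guess at its semantics):

```python
import operator
import re

# supported comparators in a projid rule term
OPS = {'<': operator.lt, '>': operator.gt, '=': operator.eq}

def parse_projid_rule(expr):
    """Parse a conjunction like 'projid>{100}&projid<{1000}' into
    (comparator, value) pairs, following the lctl example above."""
    conds = []
    for term in expr.split('&'):
        m = re.fullmatch(r'projid([<>=])\{(\d+)\}', term.strip())
        if m is None:
            raise ValueError(f"unrecognized term: {term!r}")
        conds.append((m.group(1), int(m.group(2))))
    return conds

def projid_matches(projid, conds):
    """True if the request's project ID satisfies every condition,
    i.e. the request falls into the shared-rate class."""
    return all(OPS[op](projid, val) for op, val in conds)
```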

             

If we need TBF to support nodemap directly, I need some time to learn the background on nodemaps.


eaujames Etienne Aujames added a comment -

> Sure, after the nodemap lookup has been done, the export holds a reference to the nodemap. So NRS TBF would just have to use this info from the export, instead of doing a whole nodemap lookup again.

Ok, my bad. So there is no overhead to check if an RPC is in a nodemap.

> That is fine, and that rate is up to the administrator to define? I don't think we should be messing with the rates internally:

Ok, I see the issue here. But, at least, can we increase the default rate for the nodemap policy (if the default is not changed via module parameter)? The 10000 RPC/s default rate is adjusted for 1 node.

sebastien Sebastien Buisson added a comment -

> I don't read Sebastien's comment as implying that? Since the RPC handling has already looked up the nodemap I don't think it should be doing another nodemap NID lookup when NRS is processing the request again later? I think the NID->nodemap lookup is the most expensive part of having a nodemap, so it should only be done once if possible.

Sure, after the nodemap lookup has been done, the export holds a reference to the nodemap. So NRS TBF would just have to use this info from the export, instead of doing a whole nodemap lookup again.
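The caching idea above can be sketched as follows (illustrative Python; in Lustre these are C structures, with the export holding a lu_nodemap reference — the helper names here are made up):

```python
class Export:
    """Sketch: resolve the NID -> nodemap mapping once, when the export
    is created at connect time, and cache the result on the export."""

    def __init__(self, nid, nodemap_lookup):
        self.nid = nid
        # expensive range lookup, done once per export rather than per RPC
        self.nodemap = nodemap_lookup(nid)

def tbf_classify(export):
    """NRS TBF classification just reads the cached reference;
    no per-RPC nodemap NID lookup is needed."""
    return export.nodemap
```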
qian_wc Qian Yingjin added a comment - edited

> The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node. So, maybe the default rate for a class can be determined like this:
>
> default_nodemap_rate = nbr_of_nodes_in_nodemap * tbf_default_rate
>
> If a nodemap is bigger than another one, it should have more BW.

Can we directly use NID TBF, with each NID in the "nodemap" having its own rate limit (the same value for all NIDs in a "nodemap")?

Otherwise, all NIDs in a nodemap will share the IOPS bandwidth, so some will exceed the per-NID limit and some will not...


adilger Andreas Dilger added a comment -

> I agree with Sebastien, we do not need to tag any requests, we can just "ask" if the RPC NID is in range of a nodemap via the existing functions.

I don't read Sebastien's comment as implying that? Since the RPC handling has already looked up the nodemap I don't think it should be doing another nodemap NID lookup when NRS is processing the request again later? I think the NID->nodemap lookup is the most expensive part of having a nodemap, so it should only be done once if possible.

> The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node.

That is fine, and that rate is up to the administrator to define? I don't think we should be messing with the rates internally:

            • this is confusing for the admin, if they set a rate of "5000" and then it is silently converted to "500000" if there are 1000 nodes in the nodemap
            • the number of nodes in the nodemap will change over time, which would mean the TBF rate would change as clients are added and removed

I don't think this can or should be emulated with TBF NID rules, both because the NIDs in a nodemap may change over time (especially with Sebastien's dynamic nodemap code) and because of the complexity of managing/specifying complex TBF rules for many NIDs in the nodemap. Specifying the rule with the nodemap name is very clear and directly ties the rates to all nodes that the "tenant" is using in the nodemap, even if the nodes change over time.

eaujames Etienne Aujames added a comment -

I agree with Sebastien, we do not need to tag any requests, we can just "ask" if the RPC NID is in range of a nodemap via the existing functions. This makes the implementation of a new "nodemap" policy relatively easy.

The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node. So, maybe the default rate for a class can be determined like this:

default_nodemap_rate = nbr_of_nodes_in_nodemap * tbf_default_rate

If a nodemap is bigger than another one, it should have more BW.

Note that we could achieve something like this manually with "tbf nid": we can declare TBF rules with the same NID ranges as the nodemaps and compute a rate to apply.
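The proposed default-rate formula, as a runnable sketch (function and parameter names are illustrative, not from the Lustre source):

```python
def default_nodemap_rate(nbr_of_nodes_in_nodemap, tbf_default_rate=10000):
    """Proposed default rate for a nodemap TBF class: scale the
    per-node default (10000 RPC/s, per the thread above) by the number
    of nodes currently in the nodemap, so a bigger nodemap gets
    proportionally more bandwidth."""
    return nbr_of_nodes_in_nodemap * tbf_default_rate
```

Note that this default would change as nodes are added to or removed from the nodemap, which is one of the objections raised in the thread.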

sebastien Sebastien Buisson added a comment -

It looks fine to me, as NRS arbitration is carried out on the server side, just as nodemap handling is. I can see that NRS TBF currently supports matching on uids/gids, jobids, opcodes or nids. So I think it would make sense to have the nodemap <-> NRS association defined on the NRS side, by introducing a new parameter: a nodemap name. As Andreas mentioned, the export has a reference to the nodemap being used, so that would make it easy to find the matching entries, I guess.

adilger Andreas Dilger added a comment -

sebastien, eaujames, qian_wc, what do you think of using TBF for a nodemap? That would allow tagging all the incoming RPCs with the nodemap that the export is using, and directly managing the RPCs of a sub-tenant instead of using projid as a proxy for this. I think the TBF rule would specify the nodemap name, but could internally match based on a pointer to the nodemap directly, if that was more efficient. It would also simplify balancing performance across multiple tenants by using proportional rules (e.g. 5000-2500-2500 based on nodemaps).
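The proportional-share idea (e.g. 5000-2500-2500 across three tenants) could be computed like this (illustrative sketch; nodemap names and weights are made up):

```python
def proportional_rates(total_rate, weights):
    """Split a total RPC rate across nodemaps in proportion to their
    weights, e.g. to balance performance across multiple tenants."""
    total_weight = sum(weights.values())
    return {name: total_rate * w // total_weight
            for name, w in weights.items()}
```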

People

    qian_wc Qian Yingjin
    adilger Andreas Dilger
    Votes: 0
    Watchers: 9