Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.16.0

    Description

      For isolation of workloads across multiple sub-tenants of a filesystem, it would be useful to allow registering an NRS TBF rule for a nodemap. This can be proxied to some extent by setting a TBF rule for a project ID, but this doesn't work if there are multiple project IDs used by a single nodemap.
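
      For illustration, a hedged sketch of the two forms side by side. The projid and nodemap expressions below are illustrative syntax modeled on existing TBF rules (nid/jobid); neither is guaranteed to match the released or final implementation, and the rule names are made up:

        lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start tenant1_prj projid={1001} rate=5000"
        lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start tenant1_map nodemap={tenant1} rate=5000"

      The second form is the feature requested here: a single rule keyed on the nodemap name, regardless of how many project IDs or NIDs the tenant uses.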

          Activity

            [LU-17902] add NRS TBF policy for nodemap
            qian_wc Qian Yingjin added a comment -

            I will rebase it later.

            For your question:

            To make NRS TBF smart enough to detect nodemap changes (adding and removing a nodemap), maybe we should put both the name and the id of the nodemap into the key of a TBF class, and each match would then compare both the name and the id of the nodemap. In the current and previous versions, we only compare one of them (either the name or the id), not both.

            And adding/removing a nodemap will change the id, I think. Thus NRS TBF can detect the change during matching.

            However, this means more memory is used for each TBF class.

            gerrit Gerrit Updater added a comment -

            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57254
            Subject: LU-17902 nrs: add NRS TBF policy for nodemap
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bae391a8d8a698c7b4fc7d193785b5ef55cb7af9


            sebastien Sebastien Buisson added a comment -

            I have a question for node map TBF support:
            Should we use @lu_nodemap.nm_id or @lu_nodemap.nm_name as the key for TBF scheduling class?
            I think both can be used as key to identify a nodemap.

            Good question. nm_id is unique and could be used as the key. However, if a nodemap is removed and then recreated with the same name, its id will be different, but maybe admins would expect it to be treated as before? In this case, it would make sense to use nm_name as the key for TBF: even if the nodemap is recreated, the same TBF rules apply, and do not need to be fixed.
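
            To make the recreate scenario concrete, a sketch using the standard nodemap commands (the TBF rule keyed on the nodemap name is hypothetical, as discussed above):

              lctl nodemap_del tenant1    # nodemap removed
              lctl nodemap_add tenant1    # same nm_name, but a new nm_id is assigned
              # a TBF rule keyed on nm_name ("tenant1") keeps matching without any change,
              # while a rule keyed on nm_id would silently stop matching the recreated nodemap

            This is the behavior the nm_name choice is meant to preserve.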

            qian_wc Qian Yingjin added a comment -

            improved "fair share" balancing between buckets

            What does "fair share" mean?

             

            I have a question for node map TBF support:

            Should we use @lu_nodemap.nm_id or @lu_nodemap.nm_name as the key for TBF scheduling class?

            I think both can be used as key to identify a nodemap. 

            qian_wc Qian Yingjin added a comment -

            Please note that a Lustre client should not have rate control with two different TBF classes:

            See LU-7982 for details:

            When using JobID-based TBF rules, if multiple jobs run on the same client, the RPC rates of those jobs will affect each other. More precisely, the job that has a high RPC rate limit might actually get a slow RPC rate. The reason is that the job with the slower RPC rate limit might exhaust the max-in-flight-RPC-number limitation or the max-cache-pages limitation.
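
            For reference, the client-side limits mentioned in that quote roughly correspond to tunables such as the following (this mapping is an assumption, not something stated in LU-7982 itself):

              lctl get_param osc.*.max_rpcs_in_flight   # per-OSC cap on concurrent RPCs from the client
              lctl get_param llite.*.max_cached_mb      # client-side page cache limit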

            qian_wc Qian Yingjin added a comment -

            I am currently working on TBF for project ID.

            If "This can be proxied to some extent by setting a TBF rule for a project ID" is the direction, then we can borrow the PCC code for aggregating a range of project IDs, so that all requests in that project ID range share a TBF rate.

            We have a patch for PCC implementing similar functionality:

            LU-13881 pcc: comparator support for PCC rules

            https://review.whamcloud.com/c/fs/lustre-release/+/39585

            A TBF rule can borrow the code from PCC to define a rule over a range of PROJID values with a '<' or '>' comparator.

             

            We also have a patch for aggregate shared rate limiting for TBF:

            https://review.whamcloud.com/#/c/fs/lustre-release/+/56351/

             

            For example:

            lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start sharerate projid>{100}&projid<{1000} rate=3000 share=1".

             

            If we need TBF to support nodemaps directly, I need some time to learn the background knowledge about nodemaps.

            eaujames Etienne Aujames added a comment -

            Sure, after the nodemap lookup has been done, the export holds a reference to the nodemap. So NRS TBF would just have to use this info from the export, instead of doing a whole nodemap lookup again.

            Ok, my bad. So there is no overhead to check if an RPC is in a nodemap.

            That is fine, and that rate is up to the administrator to define? I don't think we should be messing with the rates internally:

            Ok, I see the issue here. But, at least, can we increase the default rate for the nodemap policy (if the default is not changed via module parameter)?
            The 10000 RPC/s default rate is sized for a single node.


            sebastien Sebastien Buisson added a comment -

            I don't read Sebastien's comment as implying that? Since the RPC handling has already looked up the nodemap I don't think it should be doing another nodemap NID lookup when NRS is processing the request again later? I think the NID->nodemap lookup is the most expensive part of having a nodemap, so it should only be done once if possible.

            Sure, after the nodemap lookup has been done, the export holds a reference to the nodemap. So NRS TBF would just have to use this info from the export, instead of doing a whole nodemap lookup again.

            qian_wc Qian Yingjin added a comment - - edited

            The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node. So, maybe the default rate for a class can be determined like this:

            default_nodemap_rate = nbr_of_nodes_in_nodemap * tbf_default_rate
            

            If a nodemap is bigger than another one, it should have more BW.
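
            As a worked illustration of that proposed scaling (the numbers are purely illustrative), with the 10000 RPC/s per-node default mentioned elsewhere in this thread:

              default_nodemap_rate = 100 nodes * 10000 RPC/s = 1000000 RPC/s
              default_nodemap_rate =  10 nodes * 10000 RPC/s =  100000 RPC/s

            so a larger nodemap would automatically be granted a proportionally larger share.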

            Can we directly use NID TBF instead, with each NID in the "nodemap" getting its own rate limit (the same value for all NIDs in a "nodemap")?

            Otherwise, all NIDs in a nodemap will share the IOPS bandwidth, and some will exceed what would be the per-NID limit while others will not reach it...


            adilger Andreas Dilger added a comment -

            I agree with Sebastien, we do not need to tag any requests, we can just "ask" if the RPC NID is in range of a nodemap via the existing functions.

            I don't read Sebastien's comment as implying that? Since the RPC handling has already looked up the nodemap I don't think it should be doing another nodemap NID lookup when NRS is processing the request again later? I think the NID->nodemap lookup is the most expensive part of having a nodemap, so it should only be done once if possible.

            The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node.

            That is fine, and that rate is up to the administrator to define? I don't think we should be messing with the rates internally:

            • this is confusing for the admin, if they set a rate of "5000" and then it is silently converted to "5000000" if there are 1000 nodes in the nodemap
            • the number of nodes in the nodemap will change over time, which would mean the TBF rate would change as clients are added and removed

            I don't think this can or should be emulated with TBF NID rules, both because the NIDs in a nodemap may change over time (especially with Sebastien's dynamic nodemap code) as well as adding complexity to manage/specify complex TBF rules for many NIDs in the nodemap. Specifying the rule with the nodemap name is very clear and directly ties the rates to all nodes that the "tenant" is using in the nodemap, even if the nodes change over time.


            People

              Assignee: qian_wc Qian Yingjin
              Reporter: adilger Andreas Dilger