Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 2.16.0

    Description

      For isolation of workloads across multiple sub-tenants of a filesystem, it would be useful to allow registering an NRS TBF rule for a nodemap. This can be proxied to some extent by setting a TBF rule for a project ID, but this doesn't work if there are multiple project IDs used by a single nodemap.
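For background, NRS TBF (Token Bucket Filter) rate-limits RPCs per scheduling class on the server by granting tokens at a configured rate. A minimal sketch of the token-bucket mechanism itself (illustrative Python, not the Lustre implementation; all names here are made up):

```python
import time

class TokenBucket:
    """Minimal token bucket: allows up to `rate` requests per second,
    with bursts bounded by `depth` accumulated tokens."""

    def __init__(self, rate, depth, now=time.monotonic):
        self.rate = float(rate)    # tokens added per second
        self.depth = float(depth)  # maximum stored tokens (burst size)
        self.tokens = float(depth)
        self.now = now
        self.last = now()

    def allow(self):
        """Refill tokens for the elapsed time, then consume one if available."""
        t = self.now()
        self.tokens = min(self.depth, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-nodemap TBF policy would amount to keeping one such bucket per nodemap-derived class instead of per NID, jobid or project ID.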

Activity

            [LU-17902] add NRS TBF policy for nodemap
            qian_wc Qian Yingjin added a comment -

> improved "fair share" balancing between buckets

What does "fair share" mean here?

             

I have a question about nodemap TBF support:

Should we use @lu_nodemap.nm_id or @lu_nodemap.nm_name as the key for the TBF scheduling class?

I think either can be used as a key to identify a nodemap.

            qian_wc Qian Yingjin added a comment -

Please note that a Lustre client should not be rate-controlled by two different TBF classes.

See LU-7982 for details:

When using JobID-based TBF rules, if multiple jobs run on the same client, the RPC rates of those jobs affect each other. More precisely, a job with a high RPC rate limit may actually get a slow RPC rate, because a job with a lower RPC rate limit can exhaust the max-in-flight-RPC-number limit or the max-cache-pages limit.
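The interference mechanism described in LU-7982 can be illustrated with a toy model (a made-up discrete-time simulation, not Lustre code): two jobs on one client share a single max-in-flight RPC window, and RPCs of the slower class sit in the window waiting for tokens, crowding out the faster class.

```python
def simulate(rate_fast, rate_slow, max_in_flight, ticks):
    """Toy model of LU-7982: one client, two jobs ('F' fast, 'S' slow)
    sharing one max-in-flight RPC window. Each tick the client refills
    free slots round-robin, and the server completes at most rate_fast
    'F' RPCs and rate_slow 'S' RPCs (its per-class token budget)."""
    in_flight = []            # tags of outstanding RPCs, oldest first
    done = {'F': 0, 'S': 0}   # completed RPCs per job
    turn = 0
    for _ in range(ticks):
        while len(in_flight) < max_in_flight:      # client fills the window
            in_flight.append('F' if turn % 2 == 0 else 'S')
            turn += 1
        budget = {'F': rate_fast, 'S': rate_slow}  # tokens this tick
        still_waiting = []
        for tag in in_flight:                      # server completes RPCs
            if budget[tag] > 0:
                budget[tag] -= 1
                done[tag] += 1
            else:
                still_waiting.append(tag)          # stuck in the window
        in_flight = still_waiting
    return done
```

With rate_fast=4, rate_slow=1 and an 8-RPC window, the fast job degrades toward the slow job's rate because queued slow-class RPCs fill the shared window; with matched rates it sustains its full limit.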

            qian_wc Qian Yingjin added a comment -

I am currently working on TBF for project ID.

If "This can be proxied to some extent by setting a TBF rule for a project ID" is the direction, then we can borrow from the PCC code's implementation to aggregate a project ID range, so that all requests in that range share a TBF rate.

We have a patch for PCC implementing similar functionality:

LU-13881 pcc: comparator support for PCC rules

https://review.whamcloud.com/c/fs/lustre-release/+/39585

The TBF rule code can borrow from PCC to define a rule over a range of PROJID values with '<' or '>' comparators.

             

            We also have a patch for aggregate shared rate limiting for TBF:

            https://review.whamcloud.com/#/c/fs/lustre-release/+/56351/

             

            i.e. 

            lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start sharerate projid>{100}&projid<{1000} rate=3000 share=1".
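To make the comparator semantics concrete, here is a small sketch of how a rule expression like projid>{100}&projid<{1000} could be parsed and matched (illustrative Python only; the real parsing lives in the patch above, and this is a guess at its semantics):

```python
import operator
import re

# supported comparators in a projid rule term
OPS = {'<': operator.lt, '>': operator.gt, '=': operator.eq}

def parse_projid_rule(expr):
    """Parse a conjunction like 'projid>{100}&projid<{1000}' into
    (comparator, value) pairs, following the lctl example above."""
    conds = []
    for term in expr.split('&'):
        m = re.fullmatch(r'projid([<>=])\{(\d+)\}', term.strip())
        if m is None:
            raise ValueError(f"unrecognized term: {term!r}")
        conds.append((m.group(1), int(m.group(2))))
    return conds

def projid_matches(projid, conds):
    """True if the request's project ID satisfies every condition,
    i.e. the request falls into the shared-rate class."""
    return all(OPS[op](projid, val) for op, val in conds)
```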

             

If we need TBF to support nodemap directly, I need some time to learn the background on nodemaps.


eaujames Etienne Aujames added a comment -

> Sure, after the nodemap lookup has been done, the export holds a reference to the nodemap. So NRS TBF would just have to use this info from the export, instead of doing a whole nodemap lookup again.

Ok, my bad. So there is no overhead to check if an RPC is in a nodemap.

> That is fine, and that rate is up to the administrator to define? I don't think we should be messing with the rates internally:

Ok, I see the issue here. But, at least, can we increase the default rate for the nodemap policy (if the default is not changed via module parameter)? The 10000 RPC/s default rate is adjusted for 1 node.

sebastien Sebastien Buisson added a comment -

> I don't read Sebastien's comment as implying that? Since the RPC handling has already looked up the nodemap I don't think it should be doing another nodemap NID lookup when NRS is processing the request again later? I think the NID->nodemap lookup is the most expensive part of having a nodemap, so it should only be done once if possible.

Sure, after the nodemap lookup has been done, the export holds a reference to the nodemap. So NRS TBF would just have to use this info from the export, instead of doing a whole nodemap lookup again.
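The caching idea above can be sketched as follows (illustrative Python; in Lustre these are C structures, with the export holding a lu_nodemap reference — the helper names here are made up):

```python
class Export:
    """Sketch: resolve the NID -> nodemap mapping once, when the export
    is created at connect time, and cache the result on the export."""

    def __init__(self, nid, nodemap_lookup):
        self.nid = nid
        # expensive range lookup, done once per export rather than per RPC
        self.nodemap = nodemap_lookup(nid)

def tbf_classify(export):
    """NRS TBF classification just reads the cached reference;
    no per-RPC nodemap NID lookup is needed."""
    return export.nodemap
```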
qian_wc Qian Yingjin added a comment - edited

> The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node. So, maybe the default rate for a class can be determined like this:
>
> default_nodemap_rate = nbr_of_nodes_in_nodemap * tbf_default_rate
>
> If a nodemap is bigger than another one, it should have more BW.

Can we directly use NID TBF, with each NID in the "nodemap" having its own rate limit (the same value for all NIDs in a "nodemap")?

Otherwise, all NIDs in a nodemap will share the IOPS bandwidth, so some will exceed the per-NID limit and some will not...


adilger Andreas Dilger added a comment -

> I agree with Sebastien, we do not need to tag any requests, we can just "ask" if the RPC NID is in range of a nodemap via the existing functions.

I don't read Sebastien's comment as implying that? Since the RPC handling has already looked up the nodemap I don't think it should be doing another nodemap NID lookup when NRS is processing the request again later? I think the NID->nodemap lookup is the most expensive part of having a nodemap, so it should only be done once if possible.

> The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node.

That is fine, and that rate is up to the administrator to define? I don't think we should be messing with the rates internally:

            • this is confusing for the admin, if they set a rate of "5000" and then it is silently converted to "500000" if there are 1000 nodes in the nodemap
            • the number of nodes in the nodemap will change over time, which would mean the TBF rate would change as clients are added and removed

I don't think this can or should be emulated with TBF NID rules, both because the NIDs in a nodemap may change over time (especially with Sebastien's dynamic nodemap code) and because of the complexity of managing/specifying complex TBF rules for many NIDs in the nodemap. Specifying the rule with the nodemap name is very clear and directly ties the rates to all nodes that the "tenant" is using in the nodemap, even if the nodes change over time.

eaujames Etienne Aujames added a comment -

I agree with Sebastien, we do not need to tag any requests, we can just "ask" if the RPC NID is in range of a nodemap via the existing functions. This makes the implementation of a new "nodemap" policy relatively easy.

The default rate to apply to a "nodemap" class can be tricky. Unlike the "nid" policy, the rate will apply to a range of nodes instead of one node. So, maybe the default rate for a class can be determined like this:

default_nodemap_rate = nbr_of_nodes_in_nodemap * tbf_default_rate

If a nodemap is bigger than another one, it should have more BW.

Note that we could achieve something like this manually with "tbf nid": we can declare TBF rules with the same NID ranges as the nodemaps and compute a rate to apply.
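The proposed default-rate formula, as a runnable sketch (function and parameter names are illustrative, not from the Lustre source):

```python
def default_nodemap_rate(nbr_of_nodes_in_nodemap, tbf_default_rate=10000):
    """Proposed default rate for a nodemap TBF class: scale the
    per-node default (10000 RPC/s, per the thread above) by the number
    of nodes currently in the nodemap, so a bigger nodemap gets
    proportionally more bandwidth."""
    return nbr_of_nodes_in_nodemap * tbf_default_rate
```

Note that this default would change as nodes are added to or removed from the nodemap, which is one of the objections raised in the thread.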

sebastien Sebastien Buisson added a comment -

It looks fine to me, as NRS arbitration is carried out on the server side, just as nodemap handling is. I can see that NRS TBF currently supports matching on uids/gids, jobids, opcodes or nids. So I think it would make sense to have the nodemap <-> NRS association defined on the NRS side, by introducing a new parameter: a nodemap name. As Andreas mentioned, the export has a reference to the nodemap being used, so that would make it easy to find the matching entries, I guess.

adilger Andreas Dilger added a comment -

sebastien, eaujames, qian_wc, what do you think of using TBF for a nodemap? That would allow tagging all the incoming RPCs with the nodemap that the export is using, and directly managing the RPCs of a sub-tenant instead of using projid as a proxy for this. I think the TBF rule would specify the nodemap name, but could internally match based on a pointer to the nodemap directly, if that was more efficient. It would also simplify balancing performance across multiple tenants by using proportional rules (e.g. 5000-2500-2500 based on nodemaps).
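The proportional-share idea (e.g. 5000-2500-2500 across three tenants) could be computed like this (illustrative sketch; nodemap names and weights are made up):

```python
def proportional_rates(total_rate, weights):
    """Split a total RPC rate across nodemaps in proportion to their
    weights, e.g. to balance performance across multiple tenants."""
    total_weight = sum(weights.values())
    return {name: total_rate * w // total_weight
            for name, w in weights.items()}
```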

People

    qian_wc Qian Yingjin
    adilger Andreas Dilger
    Votes: 0
    Watchers: 9