
NRS TBF UID: limit per "any" user?

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.6
    • CentOS 7

    Description

      In my understanding of the TBF UID rules, it is not possible to set a limit per "any UID". In our testing, the default rule ( default {*} 10000 ) seems to include ALL UIDs, that is, 10,000 requests for all users. Please correct me if I'm wrong.

      Say, we want to limit ost.OSS.ost_io.nrs_tbf_rule to 100 reqs/user. Do we have to add a rule for each UID? In our case, we have ~5,400 users.

          Activity

            [LU-14501] NRS TBF UID: limit per "any" user?
            lixi_wc Li Xi added a comment -

            is there a way to see which UID(s) have reached the rate limit?

            I don't think there is any existing way, and I doubt it could be implemented efficiently. The TBF state changes all the time. The rate limit could be 1000 RPC/s or so, which means a UID could be limited by TBF now and no longer be limited 1 ms later. With changes that quick, there seems to be no efficient way to dump the real-time status.

            But that doesn't mean we cannot collect some statistics or summaries. For example, I think we could implement a mechanism to record the UIDs that reached the limit during a past period (e.g. an hour?). It would take significant effort to implement, though. Before that, we would need to analyze whether it is useful for your use cases, and whether it is useful for a broader set of use cases.
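
The "record the UIDs that reached the limit in the past period" idea could be sketched roughly as below. This is only an illustrative user-space model, not Lustre code: the class `ThrottleLog`, its methods, and the one-hour default window are all hypothetical names and values, and how a server would detect a throttled request is left out.

```python
from collections import Counter, deque


class ThrottleLog:
    """Hypothetical sliding-window record of UIDs that hit their rate limit."""

    def __init__(self, window_s=3600):
        self.window_s = window_s   # length of the "past period" in seconds
        self.events = deque()      # (timestamp, uid) pairs, oldest first
        self.counts = Counter()    # uid -> throttle events inside the window

    def record(self, uid, now):
        """Note that `uid` was throttled at time `now` (seconds)."""
        self.events.append((now, uid))
        self.counts[uid] += 1
        self._expire(now)

    def _expire(self, now):
        # Drop events that are a full window old or older.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, uid = self.events.popleft()
            self.counts[uid] -= 1
            if self.counts[uid] == 0:
                del self.counts[uid]

    def throttled_uids(self, now):
        """Return {uid: event count} for the window ending at `now`."""
        self._expire(now)
        return dict(self.counts)


log = ThrottleLog(window_s=3600)
log.record(1001, now=0)
log.record(1001, now=10)
log.record(1002, now=1800)
print(log.throttled_uids(now=3000))
```

A summary like this only answers "who was limited recently", which sidesteps the problem that the real-time state changes within milliseconds.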


            sthiell Stephane Thiell added a comment -

            Hi Li,

            Thanks for your response and clarification. After further testing, it seems to work as you describe. Reducing default {*} does indeed reduce the rate per UID and not globally. Adding other per-UID rules then properly overrides the default (e.g. I added a rule to exempt UID 0 and it's working).

            One thing I was wondering: is there a way to see which UID(s) have reached the rate limit? I don't think there are any stats about that in /sys, but perhaps with a special Lustre logging debug mask? That would help us adapt our rates.

            lixi_wc Li Xi added a comment - - edited

            the default rule ( default {*} 10000 ) seems to include ALL UIDs, that is, 10,000 requests for all users.

            For TBF with the UID type, I don't think this is correct. Each user with a unique UID has a dedicated bucket in TBF UID, so the rate limit applies per user. If we want to make sure each user can only get a rate of <= 100 requests/sec from each NRS svcpt, setting "...nrs_tbf_rule='change default rate=100'" would be enough. (Please note that there might be several svcpts on a server, meaning the actual limit will be 100 * svcpt.) I don't think 5400 rules are needed for this use case.
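
To illustrate the per-UID bucket behaviour described here, the following is a minimal simulation sketch, not the actual Lustre implementation. It models a single svcpt; the names `Bucket`, `PerUidLimiter`, `simulate`, and `server_ceiling`, the bucket depth of 1, and the svcpt count of 4 are all assumptions made for the example. Integer micro-units are used so the arithmetic is exact.

```python
MICRO = 1_000_000  # micro-tokens per token; time is kept in microseconds


class Bucket:
    """One token bucket; starts full. A depth of 1 is an assumed value."""

    def __init__(self, rate, depth=1):
        self.rate = rate                  # tokens (requests) per second
        self.depth_micro = depth * MICRO  # capacity in micro-tokens
        self.tokens = self.depth_micro
        self.last_us = 0

    def admit(self, now_us):
        # rate [tokens/s] * elapsed [us] yields micro-tokens directly.
        refill = self.rate * (now_us - self.last_us)
        self.tokens = min(self.depth_micro, self.tokens + refill)
        self.last_us = now_us
        if self.tokens >= MICRO:
            self.tokens -= MICRO
            return True
        return False


class PerUidLimiter:
    """Each UID gets its own bucket: default rate unless a rule overrides it."""

    def __init__(self, default_rate, overrides=None):
        self.default_rate = default_rate
        self.overrides = dict(overrides or {})
        self.buckets = {}

    def admit(self, uid, now_us):
        if uid not in self.buckets:
            rate = self.overrides.get(uid, self.default_rate)
            self.buckets[uid] = Bucket(rate)
        return self.buckets[uid].admit(now_us)


def simulate(limiter, uids, requests=1000, spacing_us=1000):
    """Every UID sends one request per millisecond for one second."""
    admitted = {uid: 0 for uid in uids}
    for i in range(requests):
        for uid in uids:
            if limiter.admit(uid, i * spacing_us):
                admitted[uid] += 1
    return admitted


def server_ceiling(rate_per_svcpt, svcpts):
    """Upper bound per UID per server: the rate applies on each svcpt."""
    return rate_per_svcpt * svcpts


limiter = PerUidLimiter(default_rate=100, overrides={0: 10000})
print(simulate(limiter, [0, 1001, 1002]))
print(server_ceiling(100, svcpts=4))
```

Under these assumptions, UIDs 1001 and 1002 are each admitted 100 times in the simulated second regardless of the other's load, while UID 0 (with its own higher-rate rule) has all 1000 requests admitted. With a hypothetical 4 svcpts, the per-server ceiling per UID would be 400 requests/sec.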


            People

              lixi_wc Li Xi
              sthiell Stephane Thiell