[LU-15251] tbf gid rules ignored on MDS Created: 19/Nov/21  Updated: 19/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Li Xi
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9


Attachments: File oak-md1-s2_rpctrace_tbf_gid.dk.log.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello! Today we enabled "tbf gid" on Oak storage, on both the MDS and the OSS, and noticed that the new rules on the MDS are not enforced. Only the gid={0} rule and the default rule {*} appear to be used; all other GID-specific rules are ignored.

We used "tbf uid" on this system before. We disabled it by first switching back to "fifo", and then enabled "tbf gid", roughly as follows:

lctl set_param mds.MDS.mdt.nrs_policies="tbf gid"
lctl set_param mds.MDS.mdt_readpage.nrs_policies="tbf gid"

lctl set_param mds.MDS.mdt.nrs_tbf_rule="start root gid={0} rate=10000"
lctl set_param mds.MDS.mdt_readpage.nrs_tbf_rule="start root gid={0} rate=10000"

lctl set_param mds.MDS.mdt.nrs_tbf_rule="change default rate=1000"

... then we added rules per GID (400+)...
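
A minimal sketch of how rules of this form could be generated in bulk. The helper name gen_tbf_gid_rules, the input file, and its "name gid rate" layout are assumptions for illustration and not the procedure actually used on Oak; only the rule syntax is taken from the "start root gid={0} rate=10000" commands above. The function only prints the commands (dry run), so nothing is changed until the output is reviewed.

```shell
#!/bin/sh
# Hypothetical helper: print one "start" command per service for each
# line of a whitespace-separated "name gid rate" list (dry run only).
gen_tbf_gid_rules() {
    # $1: path to the rule list (assumed format: name gid rate)
    while read -r name gid rate; do
        # Skip blank lines and lines starting with '#'
        case "$name" in ''|\#*) continue ;; esac
        # Per-GID rules are defined on the mdt and mdt_readpage services
        for svc in mdt mdt_readpage; do
            printf 'lctl set_param mds.MDS.%s.nrs_tbf_rule="start %s gid={%s} rate=%s"\n' \
                "$svc" "$name" "$gid" "$rate"
        done
    done < "$1"
}
```

For example, a list containing the line "ruthm 3199 640" would yield two commands, one for mds.MDS.mdt and one for mds.MDS.mdt_readpage, each with "start ruthm gid={3199} rate=640".
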
[root@oak-md1-s2 ~]# lctl get_param mds.MDS.mdt.nrs_tbf_rule
mds.MDS.mdt.nrs_tbf_rule=
regular_requests:
CPT 0:
scg_prj_mvp {7456} 803, ref 0
scg_lab_twc {7122} 638, ref 0
scg_lab_mg1 {9159} 607, ref 0
scg_lab_irv {7152} 607, ref 0
scg_prj_scgs {7458} 709, ref 0
scg_prj_rttp {10137} 605, ref 0
scg_prj_pcgp {7450} 610, ref 0
... many other rules with ref 0...
ruthm {3199} 640, ref 0
yiorgo {3367} 1800, ref 0
root {0} 10000, ref 29                <<<
default {*} 1000, ref 195            <<<
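
With 400+ rules, the ones actually in use are easy to miss in this listing. The following is a small sketch of a filter that keeps only the rules with a non-zero ref count; the helper name tbf_rules_in_use is hypothetical, and the parsing assumes the "name {gid} rate, ref N" line layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read nrs_tbf_rule output on stdin and print only
# the rules whose reference count is greater than zero.
tbf_rules_in_use() {
    awk '{
        # Find the "ref" keyword and test the number that follows it,
        # so trailing annotations on a line do not matter.
        for (i = 1; i < NF; i++)
            if ($i == "ref" && $(i + 1) + 0 > 0)
                print $1, $2, "ref=" $(i + 1)
    }'
}
```

It could be fed from "lctl get_param -n mds.MDS.mdt.nrs_tbf_rule | tbf_rules_in_use". On the listing above, only the root {0} and default {*} rules would be printed.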
 

 
The policy is started:

  - name: tbf gid
    state: started
    fallback: no
    queued: 2                   
    active: 0 

 

A user covered by a defined GID rule (I tested from GID 3199) is limited by the default rule: I lowered the rate of the default rule {*} from 1000 to 10 for the test and immediately noticed throttling. So the rule "ruthm {3199} 640, ref 0" above seems to be simply ignored.

In my case, per-GID rules are defined only for the mdt and mdt_readpage services, not for all MDS services.

On the OSS, the configuration is similar for the ost and ost_io services and per-GID rules are working as expected.

Servers and clients are running Lustre 2.12.7.

I am attaching rpctrace debug output from the MDS as oak-md1-s2_rpctrace_tbf_gid.dk.log.gz.



 Comments   
Comment by Peter Jones [ 19/Nov/21 ]

Li Xi

Could you please advise?

Thanks

Peter

Generated at Sat Feb 10 03:16:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.