[LU-13037] print tbf stats Created: 30/Nov/19  Updated: 01/Apr/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Mahmoud Hanafi Assignee: Li Xi
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

We would like a way to dump the current TBF stats.
For example, print all of the UIDs being tracked and what each UID's bucket usage is.



 Comments   
Comment by Peter Jones [ 30/Nov/19 ]

Li Xi

How possible is this with the current design of TBF?

Peter

Comment by Li Xi [ 06/Dec/19 ]

I am working on a patch that works similarly to jobstats, but that is going to take a while.

Comment by Li Xi [ 07/Dec/19 ]

Two entries are added to each service: nrs_tbf_stats_reg for regular requests and nrs_tbf_stats_hp for high-priority requests. The TBF information for all client classifications can be dumped from each entry. The following is an example of the dumped information:

# cat /sys/kernel/debug/lustre/ost/OSS/ost_io/nrs_tbf_stats_hp
- key:             _10.0.1.253@tcp_10_0_0
  refs:            0
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
- key:             _10.0.1.253@tcp_4_0_0
  refs:            0
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
- key:             dd.0_10.0.1.253@tcp_10_0_0
  refs:            0
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
- key:             dd.0_10.0.1.253@tcp_4_0_0
  refs:            9
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
# cat /sys/kernel/debug/lustre/ost/OSS/ost_io/nrs_tbf_stats_reg 
- key:             _10.0.1.253@tcp_10_0_0
  refs:            0
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
- key:             _10.0.1.253@tcp_4_0_0
  refs:            0
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
- key:             dd.0_10.0.1.253@tcp_10_0_0
  refs:            0
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
- key:             dd.0_10.0.1.253@tcp_4_0_0
  refs:            9
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
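As a side note, the YAML-like dump above is easy to post-process for monitoring. Below is a minimal sketch in Python, assuming the field layout shown in the example output; the parse_tbf_stats helper is hypothetical tooling, not part of the patch:

```python
def parse_tbf_stats(text):
    """Parse an nrs_tbf_stats_* dump (as in the example above) into a
    list of dicts, one dict per '- key:' block."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("- "):   # a new record starts here
            records.append({})
            line = line[2:]
        field, _, value = line.partition(":")
        records[-1][field.strip()] = value.strip()
    return records


sample = """\
- key:             dd.0_10.0.1.253@tcp_4_0_0
  refs:            9
  rule:            default
  rpc_rate:        10000
  ntoken:          2
  token_depth:     3
"""
stats = parse_tbf_stats(sample)
```

On a live server the input would come from files such as /sys/kernel/debug/lustre/ost/OSS/ost_io/nrs_tbf_stats_reg (root only, since the entries live under debugfs).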
Comment by Gerrit Updater [ 07/Dec/19 ]

Li Xi (lixi@ddn.com) uploaded a new patch: https://review.whamcloud.com/36950
Subject: LU-13037 nrs: dump stats of TBF clients
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01af62b6a5305e9d4483dbbbb27aa003cd234099

Comment by Li Xi [ 07/Dec/19 ]

mhanafi Please feel free to let me know whether the dumped information is what you need.

Comment by Peter Jones [ 14/Dec/19 ]

mhanafi what do you think?

Comment by Mahmoud Hanafi [ 26/May/20 ]

Why do we get more than one stats entry for a specific cpt and queue_type? Here, for uid 929411059, we get two entries for cpt=0 with queue_type=reg and two for cpt=0 with queue_type=hp.

 nbp13-srv1 /sys/kernel/debug/lustre/ost/OSS/ost_io # cat /sys/kernel/debug/lustre/ost/OSS/ost_io/nrs_tbf_stats| grep -A 4  929411059
- uid:             929411059
  cpt:             0
  queue_type:      hp
  refs:            1
  rule:            default
--
- uid:             929411059
  cpt:             0
  queue_type:      hp
  refs:            1
  rule:            default
--
- uid:             929411059
  cpt:             0
  queue_type:      reg
  refs:            1
  rule:            default
--
- uid:             929411059
  cpt:             0
  queue_type:      reg
  refs:            1
  rule:            default
--

Later, for cpt=8, we see only one for hp but two for reg.

- uid:             929411059
  cpt:             8
  queue_type:      hp
  refs:            1
  rule:            default
--
- uid:             929411059
  cpt:             8
  queue_type:      reg
  refs:            1
  rule:            default
--
- uid:             929411059
  cpt:             8
  queue_type:      reg
  refs:            1
  rule:            default
Comment by Peter Jones [ 08/Aug/20 ]

mhanafi I noticed this week that you are carrying this patch in your distribution. Sorry that we missed your question above. Have you had any other questions/comments about using this change? Do you think that we should proceed with landing it in its current form or is more work required?

Comment by Li Xi [ 10/Aug/20 ]

Sorry for the late reply.

Why do we get more than 1 stats for a specific cpt and queue_type.

Understood; a single stats entry would be easier to understand. However, this is determined by the internal design and implementation of request handling in Lustre, which has its own good reasons, and TBF has no choice but to build on top of it.

Lustre separates requests into two types: regular (reg) requests and high-priority (hp) requests. Handling of the two types is kept separate so that high-priority requests will not be blocked behind many regular requests.

Lustre also divides CPUs into partitions (cpts), and each partition handles RPC requests independently.

Because of these existing mechanisms, TBF has to set the RPC rate limits separately (possibly with the same values), so the stats are separate for each cpt and request type.
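For readers mapping the dumped fields onto behavior: rpc_rate is the token refill rate, ntoken is the number of currently banked tokens, and token_depth caps how many tokens can be banked. A minimal, illustrative token-bucket model under those assumptions follows; the real nrs_tbf.c keeps per-class, per-cpt state under service locks and uses kernel time, so this is only a sketch, not the actual implementation:

```python
class TokenBucket:
    """Illustrative token bucket: tokens accrue at rpc_rate per second
    and at most token_depth tokens may be banked (matching the fields
    in the nrs_tbf_stats dump)."""

    def __init__(self, rpc_rate, token_depth):
        self.rpc_rate = rpc_rate
        self.token_depth = token_depth
        self.ntoken = token_depth   # start with a full bucket
        self.last = 0.0

    def refill(self, now):
        # Accrue tokens for the elapsed time, capped at token_depth.
        self.ntoken = min(self.token_depth,
                          self.ntoken + (now - self.last) * self.rpc_rate)
        self.last = now

    def try_consume(self, now):
        # Admit one RPC if a whole token is available.
        self.refill(now)
        if self.ntoken >= 1:
            self.ntoken -= 1
            return True
        return False
```

With rpc_rate=2 and token_depth=3, three RPCs pass immediately from the full bucket, a fourth at the same instant is deferred, and after 0.5 s one more token has accrued.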

Comment by Mahmoud Hanafi [ 10/Aug/20 ]

I understand that we have hp and reg. But we get two entries for the same cpt and queue_type:


 - uid:             929411059
  cpt:             0
  queue_type:      reg
  refs:            1
  rule:            default
--
- uid:             929411059
  cpt:             0
  queue_type:      reg
  refs:            1
  rule:            default
-- 

Yes Peter we would like to get this landed.

Comment by Li Xi [ 10/Aug/20 ]

mhanafi Sorry for the misunderstanding. I found a bug in the patch; I am not sure whether it is the cause of the duplicated output. The patch will be refreshed soon in any case.

Comment by Qian Yingjin [ 13/Aug/20 ]

Hi Mahmoud,
I just updated the patch (https://review.whamcloud.com/#/c/36950/), could you please try it again?

Thanks,
Qian

Comment by Jay Lan (Inactive) [ 13/Aug/20 ]

I had a compilation error in lustre-2.12.4 against the CentOS 7.7 kernel:

Making all in .
/tmp/rpmbuild-lustre-jlan-UDyILlEP/BUILD/lustre-2.12.4/lustre/ptlrpc/nrs_tbf.c: In function 'nrs_tbf_stats_seq_show':
/tmp/rpmbuild-lustre-jlan-UDyILlEP/BUILD/lustre-2.12.4/lustre/ptlrpc/nrs_tbf.c:3882:2: error: format '%u' expects argument of type 'unsigned int', but argument 3 has type '__u64' [-Werror=format=]
seq_printf(p, "%u\n", cli->tc_rpc_rate);
^
/tmp/rpmbuild-lustre-jlan-UDyILlEP/BUILD/lustre-2.12.4/lustre/ptlrpc/nrs_tbf.c: At top level:
cc1: error: unrecognized command line option "-Wno-stringop-overflow" [-Werror]
cc1: error: unrecognized command line option "-Wno-stringop-truncation" [-Werror]
cc1: error: unrecognized command line option "-Wno-format-truncation" [-Werror]
cc1: all warnings being treated as errors
make[7]: *** [/tmp/rpmbuild-lustre-jlan-UDyILlEP/BUILD/lustre-2.12.4/lustre/ptlrpc/nrs_tbf.o] Error 1

I do not know how I got those "unrecognized command line option" errors. Before applying this patch, it compiled fine.

Comment by Qian Yingjin [ 14/Aug/20 ]

I built it based on the latest master branch.
It seems that tc_rpc_rate is __u32 on master, while in 2.12.4 it was __u64.

To make it build for 2.12.4, you just need to change the line to:

seq_printf(p, "%llu\n", cli->tc_rpc_rate);

Regards,
Qian

Comment by Peter Jones [ 05/Sep/20 ]

jaylan any updates on testing this patch?

Comment by Peter Jones [ 01/Apr/21 ]

mhanafi jaylan you have not provided any feedback as to whether this patch meets your requirements. However, rumour has it that you are carrying this patch. Does this mean that you can now provide us some feedback as to whether this patch is useful and whether we should proceed with landing it?

Comment by Mahmoud Hanafi [ 01/Apr/21 ]

We have been using the patch, but I think it needs additional work to be more useful. We will need to think about how it could be improved.

Generated at Sat Feb 10 02:57:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.