[LU-11407] Improve stats data Created: 19/Sep/18  Updated: 01/Aug/23  Resolved: 26/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13597 add processing time/latency, IO sizes... Resolved
is related to LU-16231 Lustre stats header incorrectly using... Resolved
is related to LU-15826 jobstats output can produce invalid y... Resolved
is related to LU-13123 Add list of client NIDs to job_stats ... Open
is related to LU-12872 Adding more stats into JOBSTATS Resolved
is related to LU-16555 print more YAML compatible special ch... Resolved
is related to LU-16599 clearing jobstats should match output... Resolved

 Description   

It would be useful to store and report the "job start" time for the JobStats. Currently we show in the obdfilter.*.job_stats file:

- job_id:          mythbackend.0
  snapshot_time:   1537384753
  read_bytes:      { samples:         321, unit: bytes, min:    4096, max: 4194304, sum:      1025404928 }
  write_bytes:     { samples:       12656, unit: bytes, min:   22028, max:  919476, sum:      5413800656 }
  sync:            { samples:       11168, unit:  reqs }
  statfs:          { samples:       31249, unit:  reqs }

but this doesn't tell us anything about when this job started, so we can't find the throughput or IOPS rates. It should be simple to store the first time this job reported IO so that we can have some idea about the rate.
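
As a rough illustration of why a start time matters, the sketch below parses one counter line in the format shown above and derives average throughput and IOPS from it. The start_time value and the helper names parse_counter and average_rates are assumptions made for this example, not part of Lustre; the job_stats output shown above has no such field.

```python
import re

# Matches "key: value" pairs inside the braces of a job_stats counter line.
FIELD_RE = re.compile(r"(\w+):\s*(\w+)")


def parse_counter(line):
    """Split a job_stats counter line into (name, dict of fields)."""
    name, _, body = line.strip().partition(":")
    fields = {}
    for key, value in FIELD_RE.findall(body):
        fields[key] = int(value) if value.isdigit() else value
    return name.strip(), fields


def average_rates(fields, snapshot_time, start_time):
    """Return (units per second, requests per second) averaged since start_time."""
    elapsed = snapshot_time - start_time
    if elapsed <= 0:
        raise ValueError("snapshot_time must be later than start_time")
    # Counters such as sync and statfs have no "sum" field, so default to 0.
    return fields.get("sum", 0) / elapsed, fields["samples"] / elapsed


line = "  read_bytes:      { samples:         321, unit: bytes, min:    4096, max: 4194304, sum:      1025404928 }"
name, fields = parse_counter(line)

# start_time is hypothetical: one hour before the snapshot_time shown above.
throughput, iops = average_rates(
    fields, snapshot_time=1537384753, start_time=1537384753 - 3600
)
```

Without a recorded start time, the denominator in average_rates is unknown, which is the gap this ticket describes.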

A further enhancement would be to store the full brw_stats into the job_stats file, but that is a more complex change.



 Comments   
Comment by Andreas Dilger [ 19/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33201
Subject: LU-11407 obdclass: include start time in job_stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a40d022971d9078175c1f0ba1a399a03cfbc4c7

Comment by Gerrit Updater [ 02/Oct/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33270
Subject: LU-11407 obdclass: add start time to stats files
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b26844d1b6d593810c04bb4663bb578f77ec0b86

Comment by Joe Grund [ 08/Oct/18 ]

What release(s) is this enhancement planned to land in?

Comment by Joe Grund [ 08/Oct/18 ]

Is there a sample of how the new output will look?

Comment by Li Xi [ 09/Oct/18 ]

We might need to create separate tickets, but we have some requirements for the stats improvement:

  1. Print $time_interval, $sum / $time_interval, $sumsquare / $counter, etc. directly, so that the collector doesn't need to parse the content and calculate the values.
  2. To get a meaningful distribution over a given time interval, e.g. the I/O size distribution (what percentage of I/Os fall into each size) during the collection interval, we need to A) read the /proc entry and then B) clear the counters by writing to the /proc entry. This works, but there is still a gap between steps A) and B), at least in theory, during which updates can be lost. It would be nice if Lustre provided a parameter or option so that, when enabled, reading the data from /proc also clears the counters to zero at the same time.

Comment by James A Simmons [ 09/Oct/18 ]

Some time back a kernel patch was pushed for 1) and it was rejected. If you really want it, we could make "lctl get_param **.*stats" a wrapper around a function in liblustreapi that does these calculations for you.
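
A user-space calculation of this kind might look like the sketch below. It computes the derived values from item 1 of Li Xi's list ($sum / $time_interval and $sumsquare / $counter) from raw counter fields, using floating point, which is available in user space. The function name derive, its parameters, and the sample numbers are hypothetical; this is not the liblustreapi interface, which this ticket does not define.

```python
import math


def derive(count, total, sumsq, interval_sec):
    """Derive per-interval values from raw counter fields."""
    if count <= 0 or interval_sec <= 0:
        raise ValueError("count and interval_sec must be positive")
    mean = total / count
    mean_square = sumsq / count
    # Variance from the first two moments; clamp tiny negative rounding error.
    variance = max(mean_square - mean * mean, 0.0)
    return {
        "rate": total / interval_sec,      # $sum / $time_interval
        "mean": mean,                      # $sum / $counter
        "mean_square": mean_square,        # $sumsquare / $counter
        "stddev": math.sqrt(variance),
    }


# Hypothetical counter: four I/Os of 1, 2, 3 and 4 units over 2 seconds.
stats = derive(count=4, total=10, sumsq=30, interval_sec=2)
```

The standard deviation is included only to show why $sumsquare / $counter is useful to a collector; it follows from the mean and the mean square.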

Comment by Andreas Dilger [ 12/Oct/18 ]

I was hoping to include it in 2.12 as a very minor enhancement, but if there is a significant issue affecting the parser then I could wait. I've also changed the job stats to have a sec.nsec timestamp to make it consistent with other stats.

I've refreshed the patch (will push soon) to include an elapsed_time: field, which is the difference between the start time and the current time, so Li Xi's parser doesn't need to compute it. Doing all of the division in the kernel is problematic because the kernel does not support floating-point math.

The output for any stat that has snapshot_time: at the start will get two additional lines:

snapshot_time: 123456789.123456789 (secs.nsec)
start_time:    123456678.012345678
elapsed_time:        111.111111111

Comment by Joe Grund [ 12/Oct/18 ]

No issue on my end, just want to know where I need to target.

Comment by Li Xi [ 19/Oct/18 ]

I don't think floating-point math is necessary, since a 64-bit integer should be enough for most collectors. A high-precision rate doesn't help much for analysis.

Anyway, the elapsed_time helps a lot. Thanks.
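
To illustrate the integer-only approach, the sketch below converts secs.nsec timestamps into integer nanoseconds and computes an elapsed time and a truncated per-second rate without floating point. Python is used only for illustration; its integers are arbitrary precision, so the 64-bit bound is checked explicitly. The helper names and the byte count are assumptions, not Lustre code.

```python
NSEC_PER_SEC = 1_000_000_000
U64_MAX = 2**64 - 1


def to_nsec(stamp):
    """Convert a 'secs.nsec' string into integer nanoseconds."""
    secs, _, frac = stamp.partition(".")
    # Pad or trim the fractional part to exactly nine digits.
    frac = (frac + "000000000")[:9]
    return int(secs) * NSEC_PER_SEC + int(frac)


def rate_per_sec(total, elapsed_nsec):
    """Integer units-per-second rate, truncated toward zero."""
    if elapsed_nsec <= 0:
        raise ValueError("elapsed time must be positive")
    scaled = total * NSEC_PER_SEC
    # A fixed-width implementation would need to handle this overflow;
    # this sketch only flags it.
    if scaled > U64_MAX:
        raise OverflowError("total * NSEC_PER_SEC does not fit in 64 bits")
    return scaled // elapsed_nsec


snapshot = to_nsec("123456789.123456789")
start = to_nsec("123456678.012345678")
elapsed = snapshot - start

# Hypothetical byte count for the interval.
rate = rate_per_sec(1_000_000, elapsed)
```

Scaling the total before dividing keeps sub-second precision in the elapsed time, at the cost of a smaller range before the product overflows 64 bits.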

Comment by Gerrit Updater [ 01/Mar/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37764
Subject: LU-11407 tgt: cleanup job_stats output printing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cf1e6bb6403ff3114976c3c07e1aa65ab9230db3

Comment by Gerrit Updater [ 27/Oct/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/33201/
Subject: LU-11407 obdclass: add start time to stats files
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ea2cd3af7bfabfa6876727ee44495f4c331bea8e

Comment by Gerrit Updater [ 26/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/37764/
Subject: LU-11407 tgt: cleanup job_stats output printing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 338381574b643da064e90e44d75be85d1be3a93c

Comment by Peter Jones [ 26/Jul/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 13/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48539
Subject: LU-11407 tgt: cleanup job_stats output printing
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 8eea90f503a35942c8af25520d6485827f9370f3

Comment by Gerrit Updater [ 26/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48539/
Subject: LU-11407 tgt: cleanup job_stats output printing
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 56c0d6316fbf29eac019f5a7c823199592027b25

Comment by Gerrit Updater [ 25/Apr/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50734
Subject: LU-11407 obdclass: init osc.*.rpc_stats start_time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f8511d3f4ed5a7a2b426383f036208ec64de1cf5

Comment by Gerrit Updater [ 01/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50734/
Subject: LU-11407 obdclass: init osc.*.rpc_stats start_time
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0176531449899c30ebdeaf372464fd0685ca3645
