[LU-13597] add processing time/latency, IO sizes, stats to job_stats Created: 23/May/20  Updated: 11/Jun/21  Resolved: 23/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12631 Report latency of client operations Resolved
is related to LU-13716 Interop: sanity test_205b: wrong job_... Resolved
is related to LU-13733 report client stats sumsq Resolved
is related to LU-7909 new stats to track time spent in spec... Open
is related to LU-11407 Improve stats data Resolved
is related to LU-12872 Adding more stats into JOBSTATS Resolved

 Description   

It would be useful to add request processing times/latency and basic IO size information to the job_stats output, similar to brw_stats and other stats files.

This would allow per-job request processing performance to be monitored, making it possible to distinguish applications that are doing poorly-formed IO from jobs that are suffering because of other jobs running on the system.



 Comments   
Comment by Peter Jones [ 29/May/20 ]

Emoly

Could you please look into this?

Thanks

Peter

Comment by Andreas Dilger [ 29/May/20 ]

Emoly, implementing this should be similar to the change in patch https://review.whamcloud.com/36078 "LU-12631 llite: report latency for filesystem ops", which switched the client stats over to using LPROCFS_TYPE_LATENCY.

We should already be tracking the processing times in the OSS/MDS code; they just need to be added into the per-job stats.

Comment by Emoly Liu [ 01/Jun/20 ]

adilger,

I saw that your patch at https://review.whamcloud.com/#/c/33201/ already added time information to the ***_stats structures. Should I base this change on that one?

BTW, our current job_stats output looks like this:

obdfilter.lustre-OST0000.job_stats=
job_stats:
- job_id:          cp.0
  snapshot_time:   1591005135
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  write_bytes:     { samples:           1, unit: bytes, min:    1689, max:    1689, sum:            1689 }
  getattr:         { samples:           0, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  punch:           { samples:           0, unit:  reqs }
  sync:            { samples:           0, unit:  reqs }
  destroy:         { samples:           0, unit:  reqs }
  create:          { samples:           0, unit:  reqs }
  statfs:          { samples:           0, unit:  reqs }
  get_info:        { samples:           0, unit:  reqs }
  set_info:        { samples:           0, unit:  reqs }
  quotactl:        { samples:           0, unit:  reqs }

Besides adding time information and using LPROCFS_TYPE_LATENCY for all the requests, is there anything else to change?

Comment by Andreas Dilger [ 01/Jun/20 ]

The read_bytes and write_bytes counters would keep bytes units, but the others would change to usec units and show min/max/sum values. It would also be good to add separate usec counters for the read and write operations, so that bandwidth can be calculated.
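For illustration only, here is a minimal sketch (not part of any patch) of how a monitoring script could consume such counters, assuming the YAML layout shown above plus the proposed read_time/write_time fields; the summarize() helper below is hypothetical:

# Feed this the value of e.g. "lctl get_param -n obdfilter.*.job_stats".
# Bandwidth is derived as write_bytes.sum / write_time.sum, and per-op
# average latency as sum / samples.
import yaml

def summarize(job_stats_text):
    data = yaml.safe_load(job_stats_text)
    for job in (data or {}).get("job_stats") or []:
        wb = job.get("write_bytes", {})
        wt = job.get("write_time", {})
        if wt.get("samples"):
            # usec -> seconds, then bytes/s -> MB/s
            bw = wb.get("sum", 0) / (wt["sum"] / 1e6) / 1e6
            print(f'{job["job_id"]}: write bandwidth {bw:.1f} MB/s')
        for op in ("getattr", "setattr", "punch", "sync"):
            st = job.get(op, {})
            if st.get("samples"):
                avg = st["sum"] / st["samples"]
                print(f'  {op}: {st["samples"]} reqs, avg {avg:.1f} usec')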

Comment by Emoly Liu [ 03/Jun/20 ]

The job_stats output from ofd has been changed. It now looks like this:

- job_id:          cp.0
  snapshot_time:   1591174981
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0, sumsq:                  0 }
  write_bytes:     { samples:           1, unit: bytes, min:    2765, max:    2765, sum:            2765, sumsq:            7645225 }
  read_time:       { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  write_time:      { samples:           1, unit: usecs, min:      29, max:      29, sum:              29, sumsq:                841 }
  getattr:         { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  setattr:         { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  punch:           { samples:           1, unit: usecs, min:      49, max:      49, sum:              49, sumsq:               2401 }
  sync:            { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  destroy:         { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  create:          { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  statfs:          { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  get_info:        { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  set_info:        { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  quotactl:        { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  prealloc:        { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }

Do the ones from mdt need to be changed too?

mdt.lustre-MDT0000.job_stats=
job_stats:
- job_id:          cp.0
  snapshot_time:   1591175698
  open:            { samples:           1, unit:  reqs }
  close:           { samples:           1, unit:  reqs }
  mknod:           { samples:           1, unit:  reqs }
  link:            { samples:           0, unit:  reqs }
  unlink:          { samples:           0, unit:  reqs }
  mkdir:           { samples:           0, unit:  reqs }
  rmdir:           { samples:           0, unit:  reqs }
  rename:          { samples:           0, unit:  reqs }
  getattr:         { samples:           1, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  getxattr:        { samples:           1, unit:  reqs }
  setxattr:        { samples:           0, unit:  reqs }
  statfs:          { samples:           0, unit:  reqs }
  sync:            { samples:           0, unit:  reqs }
  samedir_rename:  { samples:           0, unit:  reqs }
  crossdir_rename: { samples:           0, unit:  reqs }
  read_bytes:      { samples:           0, unit:  reqs, min:       0, max:       0, sum:               0 }
  write_bytes:     { samples:           0, unit:  reqs, min:       0, max:       0, sum:               0 }
  punch:           { samples:           0, unit:  reqs }
  migrate:         { samples:           0, unit:  reqs }

Comment by Emoly Liu [ 03/Jun/20 ]

Emoly Liu (emoly@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38816
Subject: LU-13597 ofd: add more information to job_stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a9aff2fad827da1b1dc521ed29b001f003a4b30d

Comment by Andreas Dilger [ 04/Jun/20 ]

  write_bytes:     { samples:           1, unit: bytes, min:    2765, max:    2765, sum:            2765, sumsq:            7645225 }
  write_time:      { samples:           1, unit: usecs, min:      29, max:      29, sum:              29, sumsq:                841 }
  punch:           { samples:           1, unit: usecs, min:      49, max:      49, sum:              49, sumsq:               2401 }

This looks good, writing at about 95 MB/s...
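(A quick arithmetic check of that figure from the write_bytes/write_time sums shown above, for illustration only:)

# 2765 bytes written in 29 usec of write time
print(2765 / (29 / 1e6) / 1e6)   # ~95.3 MB/s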

  read_bytes:      { samples:           0, unit:  reqs, min:       0, max:       0, sum:               0 }
  write_bytes:     { samples:           0, unit:  reqs, min:       0, max:       0, sum:               0 }

This looks like a pre-existing bug - it should be units of "bytes" and not "reqs".

In any case, the MDS jobstats should also be updated to report usec.

Comment by Emoly Liu [ 04/Jun/20 ]

The job_stats output of the MDT now looks like this:

- job_id:          touch.500
  snapshot_time:   1591267859
  open:            { samples:           1, unit: usecs, min:     146, max:     146, sum:             146, sumsq:              21316 }
  close:           { samples:           1, unit: usecs, min:      17, max:      17, sum:              17, sumsq:                289 }
  mknod:           { samples:           1, unit: usecs, min:     124, max:     124, sum:             124, sumsq:              15376 }
  link:            { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  unlink:          { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  mkdir:           { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  rmdir:           { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  rename:          { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  getattr:         { samples:           1, unit: usecs, min:       8, max:       8, sum:               8, sumsq:                 64 }
  setattr:         { samples:           1, unit: usecs, min:      27, max:      27, sum:              27, sumsq:                729 }
  getxattr:        { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  setxattr:        { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  statfs:          { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  sync:            { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  samedir_rename:  { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  crossdir_rename: { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0, sumsq:                  0 }
  write_bytes:     { samples:           0, unit: bytes, min:       0, max:       0, sum:               0, sumsq:                  0 }
  punch:           { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }
  migrate:         { samples:           0, unit: usecs, min:       0, max:       0, sum:               0, sumsq:                  0 }

Since the unit has changed from reqs to usecs, the sumsq value can sometimes look big, for example:

setattr:         { samples:           1, unit: usecs, min:   15492, max:   15492, sum:           15492, sumsq:          240002064 }
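The sumsq value is just the accumulated square of each usec sample (15492^2 = 240002064 here); together with samples and sum it lets a consumer derive the latency variance or standard deviation. A minimal sketch, with the latency_stats() helper name being hypothetical:

import math

# Mean and standard deviation from a samples/sum/sumsq triple, e.g. the
# setattr line above: 1 sample, sum 15492, sumsq 240002064.
def latency_stats(samples, total, sumsq):
    mean = total / samples
    # variance = E[x^2] - E[x]^2; clamp tiny negative rounding error to zero
    var = max(sumsq / samples - mean * mean, 0.0)
    return mean, math.sqrt(var)

print(latency_stats(1, 15492, 240002064))   # (15492.0, 0.0)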

 

Comment by Gerrit Updater [ 23/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38816/
Subject: LU-13597 ofd: add more information to job_stats
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cd8fb1e8d300c0622200d6b25d187d8d0472e79f

Comment by Peter Jones [ 23/Jun/20 ]

Landed for 2.14
