[LU-13597] add processing time/latency, IO sizes, stats to job_stats Created: 23/May/20 Updated: 11/Jun/21 Resolved: 23/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
It would be useful to add request processing times/latency and basic IO size information to the job_stats output, similar to brw_stats and other stats files. This would allow per-job request processing performance to be monitored, and make it possible to distinguish applications that are doing poorly-formed IO from jobs that are suffering because of other jobs. |
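The kind of per-job check this would enable could look roughly like the following (a minimal sketch; the counter layout follows the existing read_bytes/write_bytes entries shown later in this ticket, and the 64KiB threshold is purely illustrative):

def mean(counter):
    # average per-sample value of a job_stats counter
    return counter["sum"] / counter["samples"] if counter.get("samples") else 0.0

# write_bytes entry copied from the example output quoted in the comments below
job = {"job_id": "cp.0",
       "write_bytes": {"samples": 1, "unit": "bytes", "min": 1689, "max": 1689, "sum": 1689}}

avg_write = mean(job["write_bytes"])      # average write size per request
if 0 < avg_write < 64 * 1024:             # flag small writes as potentially poorly-formed IO
    print(f"{job['job_id']}: small average write size, {avg_write:.0f} bytes/request")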
| Comments |
| Comment by Peter Jones [ 29/May/20 ] |
|
Emoly, could you please look into this? Thanks, Peter |
| Comment by Andreas Dilger [ 29/May/20 ] |
|
Emoly, implementing this should be similar to the change in patch https://review.whamcloud.com/36078. We should already be tracking the processing times in the OSS/MDS code; they just need to be added into the per-job stats. |
| Comment by Emoly Liu [ 01/Jun/20 ] |
|
I saw that your patch at https://review.whamcloud.com/#/c/33201/ had already added time information to the *_stats structures. Should I base this one on that? BTW, our current job_stats output looks like:
obdfilter.lustre-OST0000.job_stats=
job_stats:
- job_id: cp.0
snapshot_time: 1591005135
read_bytes: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
write_bytes: { samples: 1, unit: bytes, min: 1689, max: 1689, sum: 1689 }
getattr: { samples: 0, unit: reqs }
setattr: { samples: 0, unit: reqs }
punch: { samples: 0, unit: reqs }
sync: { samples: 0, unit: reqs }
destroy: { samples: 0, unit: reqs }
create: { samples: 0, unit: reqs }
statfs: { samples: 0, unit: reqs }
get_info: { samples: 0, unit: reqs }
set_info: { samples: 0, unit: reqs }
quotactl: { samples: 0, unit: reqs }
Besides adding time information, using LPROCFS_TYPE_LATENCY for all the requests, and anything else? |
| Comment by Andreas Dilger [ 01/Jun/20 ] |
|
The read_bytes and write_bytes would stay with bytes units, but the others would change to usec units and show min/max/sum values. It would also be good to add a separate usec counter for the read and write operations, so that it is possible to calculate bandwidth. |
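For example, with a separate write_time counter, per-job bandwidth could be derived along these lines (a minimal sketch, assuming write_bytes/write_time field names like those added in the next comment, that the job_stats output parses as YAML, and that lctl is in PATH; the OST target name is the one from the example above):

import subprocess
import yaml

# "lctl get_param -n" prints only the parameter value, i.e. the raw job_stats YAML
out = subprocess.run(["lctl", "get_param", "-n", "obdfilter.lustre-OST0000.job_stats"],
                     capture_output=True, text=True, check=True).stdout
stats = yaml.safe_load(out) or {}
for job in stats.get("job_stats") or []:
    wb = job.get("write_bytes", {})
    wt = job.get("write_time", {})
    if wt.get("sum"):
        # bytes per second = bytes / (usecs * 1e-6)
        bw = wb.get("sum", 0) / (wt["sum"] * 1e-6)
        print(f"{job['job_id']}: {bw / 1e6:.1f} MB/s write bandwidth")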
| Comment by Emoly Liu [ 03/Jun/20 ] |
|
The job_stats from the OFD have been changed. The output now looks like:
- job_id: cp.0
snapshot_time: 1591174981
read_bytes: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0, sumsq: 0 }
write_bytes: { samples: 1, unit: bytes, min: 2765, max: 2765, sum: 2765, sumsq: 7645225 }
read_time: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
write_time: { samples: 1, unit: usecs, min: 29, max: 29, sum: 29, sumsq: 841 }
getattr: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
setattr: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
punch: { samples: 1, unit: usecs, min: 49, max: 49, sum: 49, sumsq: 2401 }
sync: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
destroy: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
create: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
statfs: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
get_info: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
set_info: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
quotactl: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
prealloc: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
Do the ones from the MDT need to be changed too?
mdt.lustre-MDT0000.job_stats=
job_stats:
- job_id: cp.0
snapshot_time: 1591175698
open: { samples: 1, unit: reqs }
close: { samples: 1, unit: reqs }
mknod: { samples: 1, unit: reqs }
link: { samples: 0, unit: reqs }
unlink: { samples: 0, unit: reqs }
mkdir: { samples: 0, unit: reqs }
rmdir: { samples: 0, unit: reqs }
rename: { samples: 0, unit: reqs }
getattr: { samples: 1, unit: reqs }
setattr: { samples: 0, unit: reqs }
getxattr: { samples: 1, unit: reqs }
setxattr: { samples: 0, unit: reqs }
statfs: { samples: 0, unit: reqs }
sync: { samples: 0, unit: reqs }
samedir_rename: { samples: 0, unit: reqs }
crossdir_rename: { samples: 0, unit: reqs }
read_bytes: { samples: 0, unit: reqs, min: 0, max: 0, sum: 0 }
write_bytes: { samples: 0, unit: reqs, min: 0, max: 0, sum: 0 }
punch: { samples: 0, unit: reqs }
migrate: { samples: 0, unit: reqs }
|
| Comment by Emoly Liu [ 03/Jun/20 ] |
|
Emoly Liu (emoly@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38816 |
| Comment by Andreas Dilger [ 04/Jun/20 ] |
|
This looks good, writing about 95 MB/s (2765 bytes in 29 usec).
The read_bytes/write_bytes entries in the MDT output look like a pre-existing bug - the unit should be "bytes" and not "reqs". In any case, the MDS job_stats should also be updated to report usecs. |
| Comment by Emoly Liu [ 04/Jun/20 ] |
|
The job_stats output of the MDT now looks like:
- job_id: touch.500
snapshot_time: 1591267859
open: { samples: 1, unit: usecs, min: 146, max: 146, sum: 146, sumsq: 21316 }
close: { samples: 1, unit: usecs, min: 17, max: 17, sum: 17, sumsq: 289 }
mknod: { samples: 1, unit: usecs, min: 124, max: 124, sum: 124, sumsq: 15376 }
link: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
unlink: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
mkdir: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
rmdir: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
rename: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
getattr: { samples: 1, unit: usecs, min: 8, max: 8, sum: 8, sumsq: 64 }
setattr: { samples: 1, unit: usecs, min: 27, max: 27, sum: 27, sumsq: 729 }
getxattr: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
setxattr: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
statfs: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
sync: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
samedir_rename: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
crossdir_rename: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
read_bytes: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0, sumsq: 0 }
write_bytes: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0, sumsq: 0 }
punch: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
migrate: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
Since the unit is changed from reqs to usecs, the sumsq value can sometimes look big, for example:
setattr: { samples: 1, unit: usecs, min: 15492, max: 15492, sum: 15492, sumsq: 240002064 }
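As an aside, the sumsq field is what allows a monitoring tool to derive the variance/standard deviation of the latency without keeping per-sample history; a small sketch using the setattr sample above:

import math

samples, total, sumsq = 1, 15492, 240002064   # setattr example quoted above
mean = total / samples
# variance = E[x^2] - (E[x])^2; usec units naturally make sumsq large
variance = sumsq / samples - mean * mean
print(f"mean={mean:.0f} usec, stddev={math.sqrt(max(variance, 0.0)):.0f} usec")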
|
| Comment by Gerrit Updater [ 23/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38816/ |
| Comment by Peter Jones [ 23/Jun/20 ] |
|
Landed for 2.14 |