[LU-15642] restore server read/write latency measurements Created: 12/Mar/22 Updated: 11/May/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Patrick Farrell |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The patch https://review.whamcloud.com/46075 " While it may be necessary to account for the actual read/write bytes after the RPC transfer is complete, the code should account for the IO latency after the IO is complete, as it did before, rather than after the RPC is complete. The RPC stats at the OST level and on the client will include the full RPC latency, and the ofd stats should only account for the storage latency. |
| Comments |
| Comment by Steve Crusan [ 14/Mar/22 ] |
|
> While it may be necessary to account for the actual read/write bytes after the RPC transfer is complete, the code should account for the IO latency after the IO is complete, as it did before, rather than after the RPC is complete. The RPC stats at the OST level and on the client will include the full RPC latency, and the ofd stats should only account for the storage latency.
I don't know how much work this is, or if this is the best way to do things, but I think it might be useful to have both sets of counters (including for metadata operations):
You can certainly collect the per client counters (via llite), but it's a lot more difficult to collect/munge/etc all of the client data than having an overall average server side for general use, such as identifying network congestion outside of the Lustre servers' control. |
| Comment by Andreas Dilger [ 16/Mar/22 ] |
|
Steve, there are already stats that include the RPC network transfer time:

# lctl get_param obdfilter.*.stats| egrep "read|write|="
obdfilter.testfs-OST0003.stats=
read_bytes 26 samples [bytes] 1048576 4194304 104857600 433207581343744
write_bytes 25 samples [bytes] 4194304 4194304 104857600 439804651110400
read 26 samples [usecs] 269 34432 83969 2362054419
write 25 samples [usecs] 546 23458 87802 1197733848

# lctl get_param ost.OSS.ost_io.stats | egrep "read|write"
ost_read 26 samples [usec] 2238 101196 469557 22435172467
ost_write 25 samples [usec] 6709 106630 774811 36953324201 |
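For anyone wanting to compare the two sets of counters quoted above, here is a minimal sketch in Python. It assumes the usual stats line layout of name, sample count, unit, min, max, sum, and sum of squares, and it reuses the sample output from this comment as input. The names `parse_stats`, `STAT_RE`, `OFD_SAMPLE`, and `OST_IO_SAMPLE` are hypothetical helpers for illustration, not part of Lustre or lctl. The per-operation difference it prints is only a rough indicator of the time spent outside the obdfilter-level measurement, since what that counter covers is exactly what this ticket is about.

```python
import re

# Assumed line layout: name count "samples" [unit] min max sum [sumsq]
STAT_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<count>\d+)\s+samples\s+\[(?P<unit>[^\]]+)\]"
    r"\s+(?P<min>\d+)\s+(?P<max>\d+)\s+(?P<sum>\d+)(?:\s+(?P<sumsq>\d+))?\s*$"
)


def parse_stats(text):
    """Parse stats lines into a dict keyed by counter name.

    Lines that do not match the assumed layout (shell prompts,
    the "param.name=" header) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        m = STAT_RE.match(line.strip())
        if not m:
            continue
        count = int(m["count"])
        total = int(m["sum"])
        stats[m["name"]] = {
            "count": count,
            "unit": m["unit"],
            "min": int(m["min"]),
            "max": int(m["max"]),
            "sum": total,
            "mean": total / count if count else 0.0,
        }
    return stats


# Sample output copied from the comment above.
OFD_SAMPLE = """\
obdfilter.testfs-OST0003.stats=
read_bytes 26 samples [bytes] 1048576 4194304 104857600 433207581343744
write_bytes 25 samples [bytes] 4194304 4194304 104857600 439804651110400
read 26 samples [usecs] 269 34432 83969 2362054419
write 25 samples [usecs] 546 23458 87802 1197733848
"""

OST_IO_SAMPLE = """\
ost_read 26 samples [usec] 2238 101196 469557 22435172467
ost_write 25 samples [usec] 6709 106630 774811 36953324201
"""

ofd = parse_stats(OFD_SAMPLE)
ost_io = parse_stats(OST_IO_SAMPLE)

for op in ("read", "write"):
    lower = ofd[op]["mean"]            # obdfilter-level latency
    full = ost_io["ost_" + op]["mean"]  # full RPC service latency
    print(
        f"{op}: obdfilter mean {lower:.0f} usec, "
        f"ost_io mean {full:.0f} usec, difference {full - lower:.0f} usec"
    )
```

In practice the two input strings would be replaced by the captured output of the `lctl get_param` commands shown above, collected from each OSS.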
| Comment by Gerrit Updater [ 16/Mar/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46833 |
| Comment by Andreas Dilger [ 16/Mar/22 ] |
|
Note that the above patch does not fix the IO latency stats; it only makes some improvements I noticed while looking at this code. |
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46833/ |