Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.14.0, Lustre 2.16.1
- Severity: 3
Description
The following questions appear to be difficult to answer with the existing Lustre statistics:
- What is the average latency of writes of size 1MB?
- What is the tail (p90) latency of reads of size 64MB?
brw_stats can almost answer these, but not quite. The two brw_stats metrics that come closest are:
- I/O time (1/1000s) - Shows the number of operations within a given latency range. This allows me to compute the percentile distribution of read and write latencies, but does not allow me to break that distribution down by operation size. Because larger reads/writes inherently take longer to service, this metric alone doesn't tell me whether the tail latency is abnormal: a max latency of 256ms is probably fine if it's all 64MB writes, but if it's actually composed of 4KB writes, that is very concerning and could point to an infrastructure or starvation issue. (See the sketch after this list.)
- disk I/O size - Shows the number of operations (IOPS) of a given size (in bytes). This lets me compute the percentile distribution of operation sizes, but it can't be cleanly mapped to latencies.
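To make the first limitation concrete, here is a minimal Python sketch of computing a coarse percentile from an "I/O time (1/1000s)"-style histogram. The bucket counts are hypothetical and parsing of real brw_stats output is deliberately elided; the point is that the result is a single latency distribution with no per-size breakdown.

```python
# Minimal sketch: approximate a percentile from a latency histogram shaped
# like brw_stats "I/O time (1/1000s)". All bucket counts are hypothetical.

def histogram_percentile(buckets, pct):
    """Return the upper bound (ms) of the bucket holding the pct-th percentile."""
    total = sum(count for _, count in buckets)
    target = total * pct / 100.0
    running = 0
    for upper_ms, count in buckets:
        running += count
        if running >= target:
            return upper_ms
    return None  # empty histogram

# (upper bound in ms, ios) pairs, mirroring the power-of-two buckets.
io_time_ms = [(1, 1910), (2, 18378), (4, 22310), (8, 4177),
              (16, 993), (32, 401), (64, 130), (128, 12)]

print("p90 latency <=", histogram_percentile(io_time_ms, 90), "ms")
```

Note the answer is only a bucket bound (here, "p90 <= 8 ms"), and there is no way to restrict it to, say, only the 1MB writes, which is exactly the gap this ticket describes.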
At best, the closest I can get to answering the stated questions with the available metrics is to look at disk I/O size, characterize the incoming workload as some distribution of operation sizes, and then look at the latency metrics and try to draw a conclusion, e.g.: because I'm seeing almost all 1MB writes, and most of the write latencies are around 2ms, the average 1MB write must take around 2ms.
But this is very imprecise and falls apart if more than one workload is running concurrently. It's also difficult to estimate tail latencies with this method (sketched below).
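For concreteness, here is a minimal sketch of that workaround, with hypothetical histogram values: the mean of the latency histogram is attributed to the dominant size bucket, which is only defensible when a single operation size overwhelmingly dominates.

```python
# Sketch of the cross-referencing workaround described above; all histogram
# values are hypothetical.

def histogram_mean(buckets):
    """Approximate mean latency, weighting each bucket by its upper bound."""
    total = sum(count for _, count in buckets)
    if total == 0:
        return 0.0
    return sum(upper * count for upper, count in buckets) / total

disk_io_size = [(4096, 12), (1048576, 48211)]   # (bytes, ios)
io_time_ms = [(1, 210), (2, 40100), (4, 7912)]  # (ms upper bound, ios)

sizes_total = sum(count for _, count in disk_io_size)
dominant_size, dominant_count = max(disk_io_size, key=lambda b: b[1])

if dominant_count / sizes_total > 0.95:
    print(f"~{dominant_size}B writes average roughly "
          f"{histogram_mean(io_time_ms):.1f}ms (valid only for a uniform workload)")
else:
    print("mixed workload: latency can't be attributed to a single size")
```

With two concurrent workloads the dominance check fails and nothing can be attributed; even in the single-workload case, a mean says nothing reliable about p90.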
Issue Links
- is related to LU-18993 Add latency stats to rpc_stats (Open)
Comments
- I think skipping the 0 values is fine; it looks like they will be mapped into HashMaps anyway.
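The consumer that comment refers to isn't shown in this ticket, but the point generalizes; here is a minimal sketch (in Python, with a dict standing in for the HashMap mentioned) of why omitting zero-count buckets is lossless for a map-based consumer.

```python
# Sketch: if histogram buckets are stored in a hash map keyed by bucket
# bound (a Python dict here, standing in for the HashMap in the comment),
# skipping zero-count buckets at the source loses nothing, because absent
# keys simply read back as zero.

sparse = {1: 1910, 4: 22310}         # zero buckets skipped at the source
dense = {1: 1910, 2: 0, 4: 22310}    # zero buckets emitted explicitly

for bound in (1, 2, 4):
    assert sparse.get(bound, 0) == dense.get(bound, 0)
```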