Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.14.0, Lustre 2.16.1

    Description

      The following questions appear to be difficult to answer with currently existing Lustre statistics:

      • What is the average latency of writes of size 1MB?
      • What is the tail (p90) latency of reads of size 64MB?

      brw_stats can almost answer this, but not quite. The two brw_stats metrics which come close are:

      • I/O time (1/1000s) - Shows the number of operations within a given latency range. This allows me to compute the percentile distribution of read and write latencies, but does not allow me to break this down by operation size.

      Because larger reads/writes inherently take longer to service, this metric doesn't give me enough information to tell whether the tail latency is abnormal or not. For example, if a max latency of 256ms is all related to 64MB writes, that is probably fine, but if it's actually composed of 4KB writes, that is very concerning, and there could be some infrastructure or starvation issue.

      • disk I/O size - Shows the number of operations (IOPS) of a given size (in bytes). This allows computing the percentile distribution of operation sizes, but can't be cleanly mapped to latencies.

      At best, the closest I can get to answering the stated questions with the available metrics is to look at disk I/O size and characterize the incoming workload as some distribution of operation sizes, then look at the latency metrics and try to draw a conclusion, e.g. because I'm seeing almost all 1MB writes and most of the write latencies are around 2ms, the average 1MB write latency must be around 2ms.

      But this is very imprecise and falls apart if there is more than one workload running concurrently. It's also difficult to estimate tail latencies with this method.
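      For illustration, below is a minimal Python sketch of how a percentile can be estimated from the existing "I/O time (1/1000s)" histogram (the bucket counts are made up, not taken from a real brw_stats file). It shows the limitation: the result is one distribution for all operation sizes combined, with no way to ask for the p90 of 1MB writes only.

      # Sketch: estimating a latency percentile from brw_stats-style histogram buckets.
      # The bucket counts are invented for illustration; in practice they would be
      # parsed out of the "I/O time (1/1000s)" section of brw_stats.
      def percentile_from_buckets(buckets, pct):
          """buckets: {upper_bound_ms: count}; returns the first bucket bound whose
          cumulative count reaches the requested percentile."""
          total = sum(buckets.values())
          threshold = total * pct / 100.0
          running = 0
          for bound in sorted(buckets):
              running += buckets[bound]
              if running >= threshold:
                  return bound
          return max(buckets)

      write_io_time_ms = {1: 4000, 2: 2500, 4: 900, 8: 300, 16: 80, 256: 20}
      print("p90 write latency <=", percentile_from_buckets(write_io_time_ms, 90), "ms")
      # Note: this p90 mixes all write sizes together; there is no way to ask
      # "p90 of 1MB writes only" from this histogram.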

          Activity

            [LU-18934] report latency stats by IO size
            adilger Andreas Dilger added a comment - edited

            Looking at LU-18993 I realize that my proposed statistics were not distinguishing between read and write stats. LU-18993 proposes an output format like:

            RPC latency by size (pages):
            4K: { read: { 256us: 1, }, }
            32K: { read: { 512us: 1, }, write: { 512us: 1, }, }
            1024K: { read: { 2048us: 1, }, }
            4096K: { read: { 2048us: 8, 4096us: 2, }, write: { 2048us: 10, }, }
            

            I'm not incredibly fond of that for a few reasons:

            • units are in pages, which is ambiguous to userspace without extra conversion
            • excessive braces make it visually cluttered
            • if there are a lot of buckets it would have very long lines and be hard to read
            • it is hard to parse from a simple script with "awk" or "grep" since there is no easy separation of read/write stats

            I would propose something like the following, that embeds the read or write into the bucket name so that they can be separated more easily:

            - latency_by_size:
              snapshot_time:   1748227480.980279268 secs.nsecs
              start_time:      1748133878.470624765 secs.nsecs
              elapsed_time:    93602.509654503 secs.nsecs
              wr_4K: { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, }
              wr_8K: { 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              wr_32K: { 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              wr_64K: { 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              wr_1M: { 4096us: 378, 8192us: 134, 16384us: 195, }
              wr_2M: { 8192us: 132, 32768us: 582, }
              rd_4K: { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, }
              rd_8K: { 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              rd_16K: { 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              rd_64K: { 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              rd_1M: { 1024us: 32, 2048us: 4, 4096us: 378, }
              rd_2M: { 2048us: 4, 4096us: 378, }
            

            I don't have any real preference on wr_4K vs. 4K_wr, but am slightly less fond of 4Kw and 4Kr (kilowatt-ish and Kroner?).
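            As a rough illustration of the parseability point (just a sketch, not part of the proposal), the wr_*/rd_* lines above can be split into separate read and write histograms with a few string operations, e.g. in Python:

            # Sketch: splitting the proposed wr_*/rd_* output into per-direction histograms.
            # The sample text is copied from the proposal above; a real consumer would read
            # the stats file instead of a hard-coded string.
            sample = """\
            wr_1M: { 4096us: 378, 8192us: 134, 16384us: 195, }
            rd_1M: { 1024us: 32, 2048us: 4, 4096us: 378, }
            """

            stats = {"rd": {}, "wr": {}}
            for line in sample.splitlines():
                if ":" not in line:
                    continue
                name, _, body = line.partition(":")
                direction, _, size = name.strip().partition("_")
                buckets = {}
                for item in body.strip(" {}").split(","):
                    if item.strip():
                        latency, count = item.split(":")
                        buckets[latency.strip()] = int(count)
                stats[direction][size] = buckets

            print(stats["wr"]["1M"])   # {'4096us': 378, '8192us': 134, '16384us': 195}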

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-18993 [ LU-18993 ]

            utopiabound Nathaniel Clark added a comment -

            I think skipping the 0 values is fine; it looks like it will be mapped into HashMaps anyway.
            flei Feng Lei added a comment -

            It's not necessary to add the first '-'. So the format may be like:

            latency_by_size:
              4K:  { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32 }
              8K:  { 32us:   0, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              ...
            flei Feng Lei added a comment -

            Personally I think it's not necessary to print 0 values.

            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Max Dilger [ mdilger ]

            adilger Andreas Dilger added a comment -

            utopiabound, flei, if the latency report proposed above prints only the latencies seen for each bucket (i.e. not all of the unused buckets), does that pose a problem for YAML parsing (e.g. lustrefs_exporter), or does it need to print all of the latencies with 0 values:

            - latency_by_size:
              4K:  { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32 }
              8K:  { 32us:   0, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              16K: { 32us:   0, 64us: 0, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              32K: { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              64K: { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              128K: { 32us:  0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 32, 2048us: 4, 4096us: 378, }
              256K: { 32us:  0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 4, 4096us: 378, }
              1M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 378, 8192us: 134, 16384us: 195, }
              2M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 132, 32768us: 582, }
              4M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 0, 16384us: 213, }
              8M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 0, 16384us: 0, 32768us: 12,  }
              16M: { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 0, 16384us: 0, 32768us: 0, 65536us: 5, 131072us: 14 }
            

            On the one hand that makes the columns nicely aligned, but could make the lines very long. Would we also need to print the zero values at the end?
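            For what it's worth, a minimal sketch of the sparse option (assuming the consumer loads each line into a map, along the lines of the HashMap approach utopiabound mentioned): any bucket that was never printed can simply default to 0 on lookup.

            # Sketch: consuming sparse buckets where zero-count entries are omitted.
            # Values below are copied from the 4K line of the example above.
            buckets_4k = {"32us": 112, "64us": 3, "128us": 1, "256us": 1, "512us": 1, "1024us": 32}

            # A consumer that wants a fixed set of buckets can fill in the zeros itself:
            all_buckets = ["32us", "64us", "128us", "256us", "512us", "1024us", "2048us", "4096us"]
            dense = {b: buckets_4k.get(b, 0) for b in all_buckets}
            print(dense["2048us"])  # 0, even though no 2048us bucket was printed for 4K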

            adilger Andreas Dilger made changes -
            Link Original: This issue is related to GCP-29 [ GCP-29 ]
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates GCP-29 [ GCP-29 ]
            adilger Andreas Dilger added a comment - edited

            I think it would be best to split the latency reporting into a new latency_stats file that is YAML-formatted, rather than continuing to expand brw_stats. While the information is related, the brw_stats ASCII "formatting" is difficult to parse, and changing the content of brw_stats will likely break existing parsers of that file (e.g. collectd and similar), so it is better to put this information into a new file.

            A proposed output format would be like:

            - latency_by_size:
              4K: { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, }
              8K: { 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              16K: { 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              32K: { 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              64K: { 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              128K: { 1024us: 32, 2048us: 4, 4096us: 378, }
              256K: { 2048us: 4, 4096us: 378, }
              1M: { 4096us: 378, 8192us: 134, 16384us: 195, }
              2M: { 8192us: 132, 32768us: 582, }
              4M: { 16384us: 213, }
              8M: { 32768us: 12,  }
              16M: { 65536us: 5, 131072us: 14 }
            
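            For illustration only, a sketch of how the questions from the description (average and p90 latency of 1MB writes) could then be answered from such a histogram, treating each bucket name as an upper bound in microseconds (that interpretation is my assumption):

            # Sketch: answering "average / p90 latency of 1MB writes" from the proposed
            # latency_by_size histogram. Bucket values are taken from the example above.
            buckets_1m = {"4096us": 378, "8192us": 134, "16384us": 195}

            pairs = sorted((int(k.rstrip("us")), v) for k, v in buckets_1m.items())
            total = sum(v for _, v in pairs)

            # Average, pessimistically counting every operation at its bucket's upper bound.
            avg_us = sum(bound * count for bound, count in pairs) / total

            # p90: first bucket whose cumulative count covers 90% of the operations.
            running, p90_us = 0, pairs[-1][0]
            for bound, count in pairs:
                running += count
                if running >= 0.9 * total:
                    p90_us = bound
                    break

            print(f"~avg {avg_us:.0f}us, p90 <= {p90_us}us for 1M writes")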

            People

              Assignee: mdilger Max Dilger
              Reporter: adilger Andreas Dilger