Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.14.0, Lustre 2.16.1

    Description

      The following questions appear to be difficult to answer with currently existing Lustre statistics:

      • What is the average latency of writes of size 1MB?
      • What is the tail (p90) latency of reads of size 64MB?

      brw_stats can almost answer this, but not quite. The two brw_stats metrics which come close are:

      • I/O time (1/1000s) - Shows the number of operations within a given latency range. This allows me to compute the percentile distribution of read and write latencies, but does not allow me to break this down by operation size.

      Because larger reads/writes inherently take longer to service, this metric doesn't give me enough information to tell whether the tail latency is abnormal or not. For example, if a max latency of 256ms is all related to 64MB writes, that is probably fine, but if it's actually composed of 4KB writes, that is very concerning, and there could be some infrastructure or starvation issue.

      • disk I/O size - Shows the number of operations (IOPS) of a given size (in bytes). This allows computing the percentile distribution of operation sizes, but can't be cleanly mapped to latencies.

      At best, the closest I can get to answering the stated questions with the available metrics is to look at disk I/O size and characterize the incoming workload as some distribution of operation sizes, then look at the latency metrics and try to draw a conclusion, e.g. because I'm seeing almost all 1MB writes and most of the write latencies are around 2ms, the average 1MB write latency must be around 2ms.

      But this is very imprecise and falls apart if there is more than one workload running concurrently. It's also difficult to estimate tail latencies with this method.
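      For illustration, below is a minimal Python sketch of how a percentile can be estimated from the existing "I/O time (1/1000s)" histogram (the bucket counts are made up, not taken from a real brw_stats file). It shows the limitation: the result is one distribution for all operation sizes combined, with no way to ask for the p90 of 1MB writes only.

      # Sketch: estimating a latency percentile from brw_stats-style histogram buckets.
      # The bucket counts are invented for illustration; in practice they would be
      # parsed out of the "I/O time (1/1000s)" section of brw_stats.
      def percentile_from_buckets(buckets, pct):
          """buckets: {upper_bound_ms: count}; returns the first bucket bound whose
          cumulative count reaches the requested percentile."""
          total = sum(buckets.values())
          threshold = total * pct / 100.0
          running = 0
          for bound in sorted(buckets):
              running += buckets[bound]
              if running >= threshold:
                  return bound
          return max(buckets)

      write_io_time_ms = {1: 4000, 2: 2500, 4: 900, 8: 300, 16: 80, 256: 20}
      print("p90 write latency <=", percentile_from_buckets(write_io_time_ms, 90), "ms")
      # Note: this p90 mixes all write sizes together; there is no way to ask
      # "p90 of 1MB writes only" from this histogram.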

          Activity

            [LU-18934] report latency stats by IO size
            adilger Andreas Dilger added a comment - edited

            Looking at LU-18993 I realize that my proposed statistics were not distinguishing between read and write stats. LU-18993 proposes an output format like:

            RPC latency by size (pages):
            4K: { read: { 256us: 1, }, }
            32K: { read: { 512us: 1, }, write: { 512us: 1, }, }
            1024K: { read: { 2048us: 1, }, }
            4096K: { read: { 2048us: 8, 4096us: 2, }, write: { 2048us: 10, }, }
            

            I'm not incredibly fond of that for a few reasons:

            • units are in pages, which is ambiguous to userspace without extra conversion
            • excessive braces make it visually cluttered
            • if there are a lot of buckets it would have very long lines and be hard to read
            • it is hard to parse from a simple script with "awk" or "grep" since there is no easy separation of read/write stats

            I would propose something like the following, that embeds the read or write into the bucket name so that they can be separated more easily:

            - latency_by_size:
              snapshot_time:   1748227480.980279268 secs.nsecs
              start_time:      1748133878.470624765 secs.nsecs
              elapsed_time:    93602.509654503 secs.nsecs
              wr_4K: { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, }
              wr_8K: { 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              wr_32K: { 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              wr_64K: { 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              wr_1M: { 4096us: 378, 8192us: 134, 16384us: 195, }
              wr_2M: { 8192us: 132, 32768us: 582, }
              rd_4K: { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, }
              rd_8K: { 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              rd_16K: { 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              rd_64K: { 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              rd_1M: { 1024us: 32, 2048us: 4, 4096us: 378, }
              rd_2M: { 2048us: 4, 4096us: 378, }
            

            I don't have any real preference on wr_4K vs. 4K_wr, but am slightly less fond of 4Kw and 4Kr (kilowatt-ish and Kroner?).
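            As a rough illustration of the parseability point (just a sketch, not part of the proposal), the wr_*/rd_* lines above can be split into separate read and write histograms with a few string operations, e.g. in Python:

            # Sketch: splitting the proposed wr_*/rd_* output into per-direction histograms.
            # The sample text is copied from the proposal above; a real consumer would read
            # the stats file instead of a hard-coded string.
            sample = """\
            wr_1M: { 4096us: 378, 8192us: 134, 16384us: 195, }
            rd_1M: { 1024us: 32, 2048us: 4, 4096us: 378, }
            """

            stats = {"rd": {}, "wr": {}}
            for line in sample.splitlines():
                if ":" not in line:
                    continue
                name, _, body = line.partition(":")
                direction, _, size = name.strip().partition("_")
                buckets = {}
                for item in body.strip(" {}").split(","):
                    if item.strip():
                        latency, count = item.split(":")
                        buckets[latency.strip()] = int(count)
                stats[direction][size] = buckets

            print(stats["wr"]["1M"])   # {'4096us': 378, '8192us': 134, '16384us': 195}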

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-18993 [ LU-18993 ]

            utopiabound Nathaniel Clark added a comment -

            I think skipping the 0 values is fine; it looks like it will be mapped into HashMaps anyway.
            flei Feng Lei added a comment -

            It's not necessary to add the first '-'. So the format may be like:

            latency_by_size:
              4K:  { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32 }
              8K:  { 32us:   0, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              ...
            flei Feng Lei added a comment -

            Personally I think it's not necessary to print 0 values.

            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Max Dilger [ mdilger ]

            adilger Andreas Dilger added a comment -

            utopiabound, flei, if the latency report proposed above prints only the latencies seen for each bucket (i.e. not all of the unused buckets), does that pose a problem for YAML parsing (e.g. lustrefs_exporter), or does it need to print all of the latencies with 0 values:

            - latency_by_size:
              4K:  { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32 }
              8K:  { 32us:   0, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              16K: { 32us:   0, 64us: 0, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              32K: { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              64K: { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              128K: { 32us:  0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 32, 2048us: 4, 4096us: 378, }
              256K: { 32us:  0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 4, 4096us: 378, }
              1M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 378, 8192us: 134, 16384us: 195, }
              2M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 132, 32768us: 582, }
              4M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 0, 16384us: 213, }
              8M:  { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 0, 16384us: 0, 32768us: 12,  }
              16M: { 32us:   0, 64us: 0, 128us: 0, 256us: 1, 512us: 0, 1024us: 0,  2048us: 0, 4096us: 0, 8192us: 0, 16384us: 0, 32768us: 0, 65536us: 5, 131072us: 14 }
            

            On the one hand that makes the columns nicely aligned, but could make the lines very long. Would we also need to print the zero values at the end?
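            For what it's worth, a minimal sketch of the sparse option (assuming the consumer loads each line into a map, along the lines of the HashMap approach utopiabound mentioned): any bucket that was never printed can simply default to 0 on lookup.

            # Sketch: consuming sparse buckets where zero-count entries are omitted.
            # Values below are copied from the 4K line of the example above.
            buckets_4k = {"32us": 112, "64us": 3, "128us": 1, "256us": 1, "512us": 1, "1024us": 32}

            # A consumer that wants a fixed set of buckets can fill in the zeros itself:
            all_buckets = ["32us", "64us", "128us", "256us", "512us", "1024us", "2048us", "4096us"]
            dense = {b: buckets_4k.get(b, 0) for b in all_buckets}
            print(dense["2048us"])  # 0, even though no 2048us bucket was printed for 4K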

            adilger Andreas Dilger made changes -
            Link Original: This issue is related to GCP-29 [ GCP-29 ]
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates GCP-29 [ GCP-29 ]
            adilger Andreas Dilger added a comment - edited

            I think it would be best to split the latency reporting into a new latency_stats file that is YAML-formatted, rather than continuing to expand brw_stats. While the information is related, the brw_stats ASCII "formatting" is difficult to parse, and changing the content of brw_stats will likely break existing parsers of that file (e.g. collectd and similar), so it is better to put this information into a new file.

            A proposed output format would be like:

            - latency_by_size:
              4K: { 32us: 112, 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, }
              8K: { 64us: 3, 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4 }
              16K: { 128us: 1, 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              32K: { 256us: 1, 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              64K: { 512us: 1, 1024us: 32, 2048us: 4, 4096us: 378 }
              128K: { 1024us: 32, 2048us: 4, 4096us: 378, }
              256K: { 2048us: 4, 4096us: 378, }
              1M: { 4096us: 378, 8192us: 134, 16384us: 195, }
              2M: { 8192us: 132, 32768us: 582, }
              4M: { 16384us: 213, }
              8M: { 32768us: 12,  }
              16M: { 65536us: 5, 131072us: 14 }
            
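            For illustration only, a sketch of how the questions from the description (average and p90 latency of 1MB writes) could then be answered from such a histogram, treating each bucket name as an upper bound in microseconds (that interpretation is my assumption):

            # Sketch: answering "average / p90 latency of 1MB writes" from the proposed
            # latency_by_size histogram. Bucket values are taken from the example above.
            buckets_1m = {"4096us": 378, "8192us": 134, "16384us": 195}

            pairs = sorted((int(k.rstrip("us")), v) for k, v in buckets_1m.items())
            total = sum(v for _, v in pairs)

            # Average, pessimistically counting every operation at its bucket's upper bound.
            avg_us = sum(bound * count for bound, count in pairs) / total

            # p90: first bucket whose cumulative count covers 90% of the operations.
            running, p90_us = 0, pairs[-1][0]
            for bound, count in pairs:
                running += count
                if running >= 0.9 * total:
                    p90_us = bound
                    break

            print(f"~avg {avg_us:.0f}us, p90 <= {p90_us}us for 1M writes")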

            People

              Assignee: mdilger Max Dilger
              Reporter: adilger Andreas Dilger