Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.2

    Description

      It would be useful to store and report the "job start" time for the JobStats. Currently we show in the obdfilter.*.job_stats file:

      - job_id:          mythbackend.0
        snapshot_time:   1537384753
        read_bytes:      { samples:         321, unit: bytes, min:    4096, max: 4194304, sum:      1025404928 }
        write_bytes:     { samples:       12656, unit: bytes, min:   22028, max:  919476, sum:      5413800656 }
        sync:            { samples:       11168, unit:  reqs }
        statfs:          { samples:       31249, unit:  reqs }
      

      but this doesn't tell us anything about when this job started, so we can't find the throughput or IOPS rates. It should be simple to store the first time this job reported IO so that we can have some idea about the rate.

      A further enhancement would be to store the full brw_stats into the job_stats file, but that is a more complex change.
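To illustrate why a start time matters, here is a minimal Python sketch that turns the counters shown above into average rates. The start time passed in is hypothetical (the output above has no such field), and the regular-expression parser is an assumption made for illustration, not a Lustre tool.

```python
import re

# Sample entry copied from the job_stats output above.
ENTRY = """\
- job_id:          mythbackend.0
  snapshot_time:   1537384753
  read_bytes:      { samples:         321, unit: bytes, min:    4096, max: 4194304, sum:      1025404928 }
  write_bytes:     { samples:       12656, unit: bytes, min:   22028, max:  919476, sum:      5413800656 }
  sync:            { samples:       11168, unit:  reqs }
  statfs:          { samples:       31249, unit:  reqs }
"""


def parse_entry(text):
    """Return (snapshot_time, {counter: {field: int}}) for one job entry."""
    snapshot = int(re.search(r"snapshot_time:\s*(\d+)", text).group(1))
    counters = {}
    # Each counter looks like "name: { key: value, ... }".
    for name, body in re.findall(r"(\w+):\s*\{([^}]*)\}", text):
        # Keep only the numeric fields (samples, min, max, sum).
        counters[name] = {k: int(v) for k, v in re.findall(r"(\w+):\s*(\d+)", body)}
    return snapshot, counters


def average_rates(text, start_time):
    """Average ops/s and bytes/s per counter since start_time (in seconds)."""
    snapshot, counters = parse_entry(text)
    elapsed = snapshot - start_time
    if elapsed <= 0:
        raise ValueError("start_time must be earlier than snapshot_time")
    rates = {}
    for name, fields in counters.items():
        total = fields.get("sum")  # only byte counters carry a sum
        rates[name] = {
            "ops_per_sec": fields["samples"] / elapsed,
            "bytes_per_sec": None if total is None else total / elapsed,
        }
    return rates


if __name__ == "__main__":
    # Hypothetical: assume the job first reported I/O one hour before the snapshot.
    assumed_start = 1537384753 - 3600
    for name, rate in average_rates(ENTRY, assumed_start).items():
        print(name, rate)
```

Without a stored start time, the `assumed_start` value has to be guessed by the collector, which is the gap this ticket describes.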

          Activity

            [LU-11407] Improve stats data

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/37764/
            Subject: LU-11407 tgt: cleanup job_stats output printing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 338381574b643da064e90e44d75be85d1be3a93c

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/33201/
            Subject: LU-11407 obdclass: add start time to stats files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ea2cd3af7bfabfa6876727ee44495f4c331bea8e

            gerrit Gerrit Updater added a comment - - edited

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37764
            Subject: LU-11407 tgt: cleanup job_stats output printing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cf1e6bb6403ff3114976c3c07e1aa65ab9230db3

            lixi_wc Li Xi added a comment -

            I don't think floating-point math is necessary, since a 64-bit integer should be enough for most of the collectors. A high-precision rate doesn't help much with analysis.

            Anyway, the elapsed_time helps a lot. Thanks.

            joe.grund Joe Grund added a comment -

            No issue on my end, just want to know where I need to target.


            adilger Andreas Dilger added a comment -

            I was hoping to include it in 2.12 as a very minor enhancement, but if there is a significant issue affecting the parser then I could wait. I've also changed the job stats to have a sec.nsec timestamp to make it consistent with other stats.

            I've refreshed the patch (will push soon) to include an elapsed_time: field, which is the difference between the start time and the current time, so Li Xi's parser doesn't need to compute that. Doing all of the division in the kernel is problematic because the kernel does not support floating-point math.

            The output for any stat that has snapshot_time: at the start will get two additional lines:

            snapshot_time: 123456789.123456789 (secs.nsec)
            start_time:    123456678.012345678
            elapsed_time:        111.111111111
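Since the kernel side avoids floating-point math, a collector can do the division itself using integers only. The following Python sketch is one way to do that, under stated assumptions: the sample text uses illustrative values in the secs.nsec format described above, the helper names are hypothetical, and the byte count is borrowed from the write_bytes sum in the description.

```python
NSEC_PER_SEC = 1_000_000_000

# Illustrative values in the secs.nsec format (not real server output).
SAMPLE = """\
snapshot_time: 123456789.123456789 (secs.nsec)
start_time:    123456678.012345678
elapsed_time:        111.111111111
"""


def to_ns(stamp):
    """Convert a 'secs.nsec' string to integer nanoseconds."""
    secs, _, frac = stamp.partition(".")
    frac = (frac + "000000000")[:9]  # pad or trim the fraction to nine digits
    return int(secs) * NSEC_PER_SEC + int(frac)


def parse_times(text):
    """Return {field_name: nanoseconds} for every 'name: secs.nsec' line."""
    times = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens:
            times[key.strip()] = to_ns(tokens[0])  # ignore trailing "(secs.nsec)"
    return times


def bytes_per_sec(total_bytes, elapsed_ns):
    """Integer-only average rate; 64-bit range is ample for realistic counters."""
    if elapsed_ns <= 0:
        raise ValueError("elapsed time must be positive")
    return total_bytes * NSEC_PER_SEC // elapsed_ns


if __name__ == "__main__":
    t = parse_times(SAMPLE)
    # elapsed_time should equal snapshot_time - start_time.
    print(t["snapshot_time"] - t["start_time"] == t["elapsed_time"])
    print(bytes_per_sec(5413800656, t["elapsed_time"]))
```

Multiplying before dividing keeps nanosecond precision without any floating-point step, at the cost of truncating the result to whole bytes per second.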

            simmonsja James A Simmons added a comment -

            Some time back a patch for the kernel code was pushed for 1) and it was rejected. Now if you really want it we could make "lctl get_param **.*stats" a wrapper around a function in liblustreapi that does these calculations for you.

            lixi_wc Li Xi added a comment -

            We might need to create separate tickets, but we have some requirements for the stats improvement:

            1. Print $time_interval, $sum / $time_interval, $sumsquare / $counter, etc. directly, so that the collector doesn't need to parse the content and calculate the values.
            2. To get a meaningful distribution over a given time interval, e.g. the I/O size distribution (i.e. what percentage of I/Os each size accounts for) during the data collection interval, we need to A) read the /proc entry and then B) clear the data counters by writing to the /proc entry. This works, but it still has a problem: there is a time gap between steps A) and B), at least in theory. It would be nice if Lustre could provide a parameter or option for this. If the option is enabled, reading the data from /proc would reset the counters to zero at the same time.
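For the second requirement, one collector-side alternative (not something proposed in this ticket, and named here only as an assumption) is to leave the counters untouched and difference consecutive cumulative snapshots, which avoids the gap between reading and clearing. A minimal Python sketch, with hypothetical bucket names and counts:

```python
def interval_distribution(prev, curr):
    """Percentage of I/Os per size bucket between two cumulative snapshots.

    prev and curr map a bucket label to its cumulative I/O count.
    """
    deltas = {}
    for bucket, count in curr.items():
        delta = count - prev.get(bucket, 0)
        if delta < 0:
            # Counter went backwards: assume it was cleared between reads,
            # so everything in curr happened during this interval.
            delta = count
        deltas[bucket] = delta
    total = sum(deltas.values())
    if total == 0:
        return {bucket: 0.0 for bucket in deltas}
    return {bucket: 100.0 * d / total for bucket, d in deltas.items()}


if __name__ == "__main__":
    # Hypothetical cumulative counts from two successive reads.
    previous = {"4K": 100, "1M": 50}
    current = {"4K": 130, "1M": 120}
    print(interval_distribution(previous, current))
```

The trade-off is that the collector has to keep the previous snapshot in memory and handle resets itself, whereas a read-and-clear option would move that work into Lustre.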
            joe.grund Joe Grund added a comment -

            Is there a sample of how the new output will look?

            joe.grund Joe Grund added a comment -

            What release(s) is this enhancement planned to land in?

            gerrit Gerrit Updater added a comment - - edited

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33270
            Subject: LU-11407 obdclass: add start time to stats files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b26844d1b6d593810c04bb4663bb578f77ec0b86


            People

              adilger Andreas Dilger
              adilger Andreas Dilger
              Votes: 0
              Watchers: 9
