Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.2

    Description

      It would be useful to store and report the "job start" time for the JobStats. Currently we show in the obdfilter.*.job_stats file:

      - job_id:          mythbackend.0
        snapshot_time:   1537384753
        read_bytes:      { samples:         321, unit: bytes, min:    4096, max: 4194304, sum:      1025404928 }
        write_bytes:     { samples:       12656, unit: bytes, min:   22028, max:  919476, sum:      5413800656 }
        sync:            { samples:       11168, unit:  reqs }
        statfs:          { samples:       31249, unit:  reqs }
      

      but this doesn't tell us anything about when this job started, so we can't find the throughput or IOPS rates. It should be simple to store the first time this job reported IO so that we can have some idea about the rate.

      A further enhancement would be to store the full brw_stats into the job_stats file, but that is a more complex change.
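As a rough illustration of the rate calculation that a stored start time would make possible, the sketch below parses one counter line from the excerpt above and divides by an elapsed time. The helper names `parse_counter` and `rates` are hypothetical, and the elapsed time of 3600 seconds is a made-up value, since the excerpt itself carries no start time.

```python
import re

# One counter line as it appears in the job_stats excerpt above.
LINE = "read_bytes:      { samples:         321, unit: bytes, min:    4096, max: 4194304, sum:      1025404928 }"


def parse_counter(line):
    """Return (name, fields) for one job_stats counter line."""
    name, _, body = line.partition(":")
    fields = {}
    for key, value in re.findall(r"(\w+):\s*(\w+)", body):
        # Numeric fields become ints; the unit stays a string.
        fields[key] = int(value) if value.isdigit() else value
    return name.strip(), fields


def rates(fields, elapsed_seconds):
    """Average ops/s and, when a sum is present, units/s over the elapsed time."""
    result = {"ops_per_sec": fields["samples"] / elapsed_seconds}
    if "sum" in fields:
        result["units_per_sec"] = fields["sum"] / elapsed_seconds
    return result


name, fields = parse_counter(LINE)
# Assumption: 3600 s is an invented elapsed time, used only for illustration.
job_rates = rates(fields, elapsed_seconds=3600)
```

Counters without a `sum:` field, such as `sync` and `statfs`, only yield an operation rate.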

          Activity

            [LU-11407] Improve stats data
            gerrit Gerrit Updater added a comment - - edited

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37764
            Subject: LU-11407 tgt: cleanup job_stats output printing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cf1e6bb6403ff3114976c3c07e1aa65ab9230db3

            lixi_wc Li Xi added a comment -

            I don't think floating-point math is necessary, since 64-bit integers should be enough for most of the collectors. A high-precision rate doesn't help much for analysis.

            Anyway, the elapsed_time helps a lot. Thanks.
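A minimal sketch of the integer-only approach mentioned here, assuming the collector holds the counter sum and an elapsed time in nanoseconds. The function name `rate_per_sec` and the elapsed time used below are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def rate_per_sec(total, elapsed_nsec):
    """Integer rate in units per second, rounded down.

    Multiplying before dividing keeps sub-second precision without floats.
    Python integers never overflow; a C port would need to check that
    total * NSEC_PER_SEC still fits in 64 bits.
    """
    if elapsed_nsec <= 0:
        raise ValueError("elapsed time must be positive")
    return total * NSEC_PER_SEC // elapsed_nsec


# write_bytes sum from the job_stats excerpt in the description, with an
# invented elapsed time of 1500.5 seconds.
write_rate = rate_per_sec(5413800656, 1_500_500_000_000)
```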

            joe.grund Joe Grund added a comment -

            No issue on my end, just want to know where I need to target.


            adilger Andreas Dilger added a comment -

            I was hoping to include it in 2.12 as a very minor enhancement, but if there is a significant issue affecting the parser then I could wait. I've also changed the job stats to have a sec.nsec timestamp to make it consistent with other stats.

            I've refreshed the patch (will push soon) to include an elapsed_time: field, which is the difference between the start time and the current time, so Li Xi's parser doesn't need to compute it. Doing all of the division in the kernel is problematic because the kernel does not support floating-point math.

            The output for any stat that has snapshot_time: at the start will get two additional lines:

            snapshot_time: 123456789.123456789 (secs.nsec)
            start_time:    123456678.012345678
            elapsed_time:        111.111111111
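The elapsed_time value can be derived from the other two fields with integer arithmetic only. The sketch below is one way a parser might do it, not the code from the patch; the helper names `parse_ts`, `elapsed`, and `format_ts` are hypothetical, and it assumes the fractional part always has nine digits.

```python
NSEC_PER_SEC = 1_000_000_000


def parse_ts(text):
    """Split 'sec.nsec' into integer (sec, nsec), assuming nine fractional digits."""
    sec, nsec = text.split(".")
    return int(sec), int(nsec)


def elapsed(snapshot, start):
    """Subtract two (sec, nsec) pairs, borrowing a second when nsec goes negative."""
    sec = snapshot[0] - start[0]
    nsec = snapshot[1] - start[1]
    if nsec < 0:
        sec -= 1
        nsec += NSEC_PER_SEC
    return sec, nsec


def format_ts(ts):
    """Render (sec, nsec) as 'sec.nsec' with zero-padded nanoseconds."""
    return f"{ts[0]}.{ts[1]:09d}"


# The two sample timestamps from the comment above.
delta = elapsed(parse_ts("123456789.123456789"), parse_ts("123456678.012345678"))
print(format_ts(delta))  # 111.111111111
```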

            simmonsja James A Simmons added a comment -

            Some time back a patch for the kernel code was pushed for 1) and it was rejected. Now if you really want it, we could make "lctl get_param **.*stats" a wrapper around a function in liblustreapi that does these calculations for you.
            lixi_wc Li Xi added a comment -

            We might need to create separate tickets, but we have some requirements for the stats improvement:

            1. Printing the $time_interval, $sum / $time_interval, $sumsquare / $counter etc. directly, so that the collector doesn't need to parse the content and calculate the values.
            2. In order to get a meaningful distribution over a given time interval, e.g. the I/O size distribution (i.e. what percentage of I/Os each size accounts for) during the data collection interval, we need to A) read the /proc entry and then B) clear the data counters by writing to the /proc entry. This works, but still has a problem, because there is a time gap between steps A) and B), at least in theory. It would be nice if Lustre could provide some kind of parameter or option so that, when enabled, reading the data from /proc also resets the counters to zero at the same time.
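For illustration, the two requirements above might look like this on the collector side. The first function computes the ratios named in item 1. The second takes the difference between two cumulative reads, which is a collector-side alternative to read-then-clear and is not something proposed in this ticket; it assumes the counters are not reset between the two reads. All function names and bucket counts are hypothetical.

```python
def derived(samples, total, sumsquare, interval_sec):
    """The ratios from item 1: $sum / $time_interval and $sumsquare / $counter."""
    return {
        "sum_per_sec": total / interval_sec,
        "sumsquare_per_sample": sumsquare / samples if samples else 0.0,
    }


def histogram_delta(previous, current):
    """Per-bucket difference between two cumulative reads of a size histogram."""
    return {size: count - previous.get(size, 0) for size, count in current.items()}


def percentages(buckets):
    """Share of each I/O size in percent; empty when nothing was counted."""
    total = sum(buckets.values())
    if not total:
        return {}
    return {size: 100 * count / total for size, count in buckets.items()}


# Made-up cumulative bucket counts keyed by I/O size in bytes.
first = {4096: 100, 1048576: 50}
second = {4096: 130, 1048576: 120}
share = percentages(histogram_delta(first, second))
```

Because nothing is written back between the two reads, no I/O is lost in a gap between reading and clearing, at the cost of the collector keeping the previous snapshot.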
            joe.grund Joe Grund added a comment -

            Is there a sample of how the new output will look?

            joe.grund Joe Grund added a comment -

            What release(s) is this enhancement planned to land in?

            gerrit Gerrit Updater added a comment - - edited

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33270
            Subject: LU-11407 obdclass: add start time to stats files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b26844d1b6d593810c04bb4663bb578f77ec0b86


            adilger Andreas Dilger added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33201
            Subject: LU-11407 obdclass: include start time in job_stats
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a40d022971d9078175c1f0ba1a399a03cfbc4c7

            People

              adilger Andreas Dilger
              adilger Andreas Dilger