Details

    • Improvement
    • Resolution: Fixed
    • Blocker
    • Lustre 2.3.0
    • Lustre 2.3.0
    • None
    • 22,639
    • 4569

    Description

      We should send the timestamp data from the node along with the performance
      counters. This will give us much more accurate performance data on a per-node
      and per-group basis.
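As a rough illustration of why node-side timestamps matter, here is a minimal sketch of a rate calculation from two counter samples. The `CounterSample` layout and the field names are hypothetical and are not taken from the LNet or lst code; the point is only that the elapsed time measured on the node is not skewed by delays in delivering the reply.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a node's counters (hypothetical layout)."""
    node_ts_ms: int    # timestamp taken on the node when the counters were read
    local_ts_ms: int   # timestamp taken by the console when the reply arrived
    bytes_sent: int    # cumulative bytes sent by the node


def rate_mb_per_s(prev, curr, use_node_ts=True):
    """Return the send rate in MB/s between two samples."""
    if use_node_ts:
        elapsed_ms = curr.node_ts_ms - prev.node_ts_ms
    else:
        elapsed_ms = curr.local_ts_ms - prev.local_ts_ms
    if elapsed_ms <= 0:
        raise ValueError("samples must be in increasing time order")
    delta_bytes = curr.bytes_sent - prev.bytes_sent
    return delta_bytes / (elapsed_ms / 1000.0) / 1e6


# The second reply reaches the console 230 ms later than the first one did,
# so the locally measured interval is longer than the real one.
prev = CounterSample(node_ts_ms=10_000, local_ts_ms=10_020, bytes_sent=0)
curr = CounterSample(node_ts_ms=11_000, local_ts_ms=11_250, bytes_sent=100_000_000)

node_rate = rate_mb_per_s(prev, curr, use_node_ts=True)
local_rate = rate_mb_per_s(prev, curr, use_node_ts=False)
```

With these made-up numbers the node-side interval is exactly one second, while the console-side interval is stretched by the late reply and understates the rate.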

      This was originally discussed in Oracle Bug 22639.

        Activity

          [LU-445] Send timestamps with LNet counters

          wang Wally Wang (Inactive) added a comment - http://review.whamcloud.com/#change,3514 is the first patch for b2_2. I will submit the second patch after this one lands.

          doug Doug Oucharek (Inactive) added a comment - Change http://review.whamcloud.com/#change,3192 has landed, addressing the backward-compatibility issue.

          liang Liang Zhen (Inactive) added a comment - Changed this to Blocker.

          liang Liang Zhen (Inactive) added a comment - Please be more cautious about this and use at least 500 as the value. We can't use an interval of less than one second, and although a millisecond stamp may not be that accurate, it should never be less than 500.

          doug Doug Oucharek (Inactive) added a comment - Ok, checking programmatically makes sense. Based on Liang's comment, I'm assuming I can use 100 as the value to check the timestamp against to determine whether to use local or remote timestamps.
          liang Liang Zhen (Inactive) added a comment - edited

          Yes, we can distinguish the two by value: the minimum interval between stat requests is 1 second, which is 1000 milliseconds, while the number of tests running on a node is almost never going to exceed 100.

          I thought about this at the beginning, but felt it could be confusing for code maintainers (OK, we are the maintainers), which is why I suggested that Doug fix it the current way.

          However, I agree that users will like it more if it automatically decides which timestamp to use, so if you think it is an acceptable style, I will not object to choosing the easier way.

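The value-based check discussed in this thread could be sketched roughly as follows. This is only an illustration, not the code from the patch: the function names are hypothetical, and the cutoff of 500 is taken from the "at least 500" suggestion in the thread. It assumes that a single field holds either a small test count (old peers) or a millisecond value of 1000 or more (new peers).

```python
# Assumed cutoff: test counts are expected to stay below 100, while
# millisecond values are expected to be 1000 or more, so 500 sits between.
MIN_TIMESTAMP_MS = 500


def field_is_timestamp(value, threshold=MIN_TIMESTAMP_MS):
    """Guess whether the shared field carries a millisecond timestamp."""
    return value >= threshold


def pick_timestamp_ms(remote_field, local_ts_ms):
    """Use the node's timestamp if the field holds one, else the local one."""
    if field_is_timestamp(remote_field):
        return remote_field
    return local_ts_ms


# An old peer reports 3 running tests in the field: fall back to local time.
from_old_peer = pick_timestamp_ms(3, local_ts_ms=123_456)

# A new peer reports a millisecond value in the field: use it directly.
from_new_peer = pick_timestamp_ms(1_000, local_ts_ms=123_456)
```

The trade-off raised in the thread still applies: the check needs no user input, but it overloads one field with two meanings, which may confuse later maintainers.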

          adilger Andreas Dilger added a comment - Doug, how are the old timestamps invalid? Looking at the old patch, it seems it would be possible to check this programmatically. The number of tests is always going to be some small number, but the milliseconds will typically be larger values. Is it reasonable to say that if they are so small as to be indistinguishable from the test count, it doesn't really matter?

          doug Doug Oucharek (Inactive) added a comment - The patch for making the original work backward compatible is: http://review.whamcloud.com/#change,3192

          doug Doug Oucharek (Inactive) added a comment - On Liang's recommendation, I am reopening this ticket to add a patch that makes the earlier change backward compatible.

          With the change as is, a 2.3 system running against a 2.2 or 2.1 system will get an invalid timestamp for its bandwidth calculations.

          I plan to add a new flag to "lst stat" that triggers the use of remote timestamps. If the flag is not given, the previous behaviour (using the local timestamp) is kept.

          The default will be set up to switch from local timestamps to remote timestamps when the Lustre version hits 2.8.
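The flag-plus-default plan in this comment might look something like the sketch below. The comment does not name the flag, so `flag_given` is a hypothetical boolean standing in for it, and the version is modelled as a `(major, minor)` tuple purely for illustration.

```python
def use_remote_timestamps(flag_given, version, switch_version=(2, 8)):
    """Decide whether to use remote (node-side) timestamps.

    flag_given: True if the user passed the (unnamed, hypothetical) flag.
    version: (major, minor) tuple of the running Lustre version.
    switch_version: version at which remote timestamps become the default.
    """
    return flag_given or version >= switch_version


on_2_3_without_flag = use_remote_timestamps(False, (2, 3))
on_2_3_with_flag = use_remote_timestamps(True, (2, 3))
on_2_8_without_flag = use_remote_timestamps(False, (2, 8))
```

Comparing tuples rather than strings keeps a version such as (2, 10) ordered after (2, 8).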

          People

            doug Doug Oucharek (Inactive)
            wang Wally Wang (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: