Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • Lustre 2.10.0
    • None
    • 17614

    Description

      We'd like to be able to perturb the timing of request processing at the PtlRPC layer, with the goal of simulating high server load and finding and exposing timing-related problems.

      Our initial idea is to create an NRS policy that delays request handling for a configurable amount of time. When the policy is started and a request arrives, the policy calculates an offset, within a user-configurable range, from the request arrival time and uses it to set a request "start time". We can use the cfs_binheap implementation to store these requests, sorted by this "start time". Requests are then removed from the binheap for handling only once their start time has been reached or passed. We could also choose to delay only some percentage of requests by allowing the request enqueue to fall back to FIFO (or whatever).
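
      The idea above can be sketched in a few lines of user-space Python. This is only a toy model under stated assumptions, not the kernel implementation: `heapq` stands in for cfs_binheap, and the names `DelayQueue`, `delay_min`, `delay_max` and `delay_pct` are hypothetical, chosen for illustration.

```python
import heapq
import itertools
import random


class DelayQueue:
    """Toy model of a delay policy (hypothetical names, not Lustre code)."""

    def __init__(self, delay_min, delay_max, delay_pct, rng=None):
        self.delay_min = delay_min      # smallest offset, in seconds
        self.delay_max = delay_max      # largest offset, in seconds
        self.delay_pct = delay_pct      # percentage of requests to delay
        self.rng = rng or random.Random()
        self._heap = []                 # entries: (start_time, seq, request)
        self._seq = itertools.count()   # tie-breaker preserves arrival order

    def enqueue(self, request, arrival_time):
        """Return True if delayed, False if the caller should fall back to FIFO."""
        if self.rng.randrange(100) >= self.delay_pct:
            return False
        offset = self.rng.randint(self.delay_min, self.delay_max)
        heapq.heappush(self._heap,
                       (arrival_time + offset, next(self._seq), request))
        return True

    def dequeue(self, now):
        """Pop the earliest request whose start time has passed, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None
```

      With `delay_pct` set to 100 every request goes through the heap; with 0 every request is handed back for FIFO handling. A request is never returned by `dequeue()` before its start time, which is the property the policy relies on.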

      I have an initial implementation mostly done (just need to finish up lprocfs stuff). I appreciate any thoughts on this approach.

          Activity

            [LU-6283] NRS Delay Policy
            pjones Peter Jones added a comment -

            ok I am going to stop responding to these now. Hopefully you check your email before I get too many more of these...

            cfaber#1 Colin Faber [X] (Inactive) added a comment -

            Should this be closed?

            pjones Peter Jones added a comment -

            Landed for 2.10

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/14701/
            Subject: LU-6283 ptlrpc: Implement NRS Delay Policy
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 588831e9eac38b8514f2a3e71516b44fa7c4bcce

            hornc Chris Horn added a comment -

            test_77l() is added as part of the feature implementation. It provides a good example of how other test cases might be written for other ptlrpc services. It verifies NRS delay works correctly for the ost_io service by generating a specific number of write RPCs, measuring the time it takes for the I/O to complete and comparing that with the amount of delay we configured for the ost_io service.

            I did similar testing in development using a range of delay values to ensure that the random delay logic was working as expected. For example:

            With a range of delay values between 1 and 10:

            lctl set_param ost.OSS.ost_io.nrs_policies=delay \
                           ost.OSS.ost_io.nrs_delay_min=1 \
                           ost.OSS.ost_io.nrs_delay_max=10 \
                           ost.OSS.ost_io.nrs_delay_pct=100
            

            We'd expect a 1MB write to take at least 1 second, and not much more than 10 seconds (for an idle filesystem it should effectively be 10-11 seconds).
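
            A minimal sketch of that kind of timing check, assuming the write is sent as a single RPC and allowing a hypothetical `slack` of one second for normal processing time. The helper names `timed` and `within_delay_bounds` are made up for illustration and are not part of the Lustre test suite.

```python
import time


def timed(fn):
    """Run fn() and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start


def within_delay_bounds(elapsed, delay_min, delay_max, slack=1.0):
    """True if elapsed is at least delay_min and at most delay_max + slack."""
    return delay_min <= elapsed <= delay_max + slack
```

            For the settings above one would call `within_delay_bounds(elapsed, 1, 10)`, where `elapsed` comes from timing the write, for example with `timed()` wrapped around whatever performs the I/O.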

            I performed testing in development to ensure the tunables were doing some proper sanitization of their inputs. Things like setting min > max, delay_pct < 0 and > 100, etc.
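
            The cases listed above can be written down as a small checker. This is a sketch of the checks only: how the real tunables respond to bad input (reject, clamp, or something else) is not described here, and the function name and messages are hypothetical.

```python
def validate_delay_tunables(delay_min, delay_max, delay_pct):
    """Return a list of problems found; an empty list means the values are sane."""
    problems = []
    if delay_min < 0:
        problems.append("delay_min must not be negative")
    if delay_min > delay_max:
        problems.append("delay_min must not exceed delay_max")
    if not 0 <= delay_pct <= 100:
        problems.append("delay_pct must be between 0 and 100")
    return problems
```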

            This was all pretty informal, so I do not have data on the results of that testing except to say that I fixed any bugs that I found.

            If the community feels that additional unit testing of the sort I've described is warranted then I can surely work to generate additional unit tests for sanityn.

            jamesanunez James Nunez (Inactive) added a comment -

            Chris - I think the only thing we are waiting on to land this feature is some kind of a feature test plan or test report. We are looking for some indication of what testing you have done to verify that this feature is working correctly and any tests added to the Lustre test suites to make sure this feature functions correctly in the future.

            Please attach a test plan/report to this ticket and, I think, the feature can move ahead for Lustre 2.10.

            If there are any questions about what we are looking for, you are welcome to contact me.

            Thanks, James

            hornc Chris Horn added a comment -

            LUDOC-366 opened to track doc changes.

            bevans Ben Evans (Inactive) added a comment -

            Also need something to add the design to: http://wiki.lustre.org/Projects

            spitzcor Cory Spitz added a comment -

            We should open an LUDOC ticket to track any needed doc updates for this policy.

            sarah Sarah Liu added a comment -

            Hello Cory,

            If Chris could upload the test plan in this ticket then I can just close LU-6583. LU-6583 is for tracking the test plan.

            People

              hornc Chris Horn
              hornc Chris Horn
              Votes: 0
              Watchers: 17
