Details

    • New Feature
    • Resolution: Fixed
    • Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 13,634
    • 7604

    Description

      Architecture - Network Request Scheduler

      (NB: this is copy of http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler)

      Definitions

      • NRS
        Network Request Scheduler.
      • RPC Concurrency
The number of RPC requests in flight between a given client and a server.
      • Active fan-out
        The number of clients with in-flight requests to a given server at a given time.
      • Offset stream
        The sequence of file or disk offsets in a stream of I/O requests.
      • "Before" relation (≤)
        File system operations that require ordering for correctness are related by "≤". For 2 operations a and b, if a ≤ b, then operation a must complete reading/writing file system state before operation b can start.
      • POP
Partial Order Preservation. A filesystem's POP capability describes how its servers handle any "before" relations required on RPCs sent to them. Servers with no POP capability have no concept of any "before" relation on incoming RPCs, so clients are completely responsible for preserving it. Servers with local POP capability preserve the "before" relation within a single server, but clients are responsible for preserving any required order on RPCs sent to different servers. A set of servers with global POP capability preserves the "before" relation on all RPCs.

        Summary

      • The Network Request Scheduler manages incoming RPC requests on a server to provide improved and consistent performance. It does this primarily by ordering request execution to avoid client starvation and to present a workload to the backend filesystem that can be optimized more easily. It may also change RPC concurrency as active fan-out varies to reduce latency seen by the client and limit request buffering on the server.

        Requirements

      • POP Capability
        The NRS must implement any POP capability its clients require.

Current Lustre servers have no POP capability; therefore clients may never issue RPCs concurrently that have a "before" relation - viz. metadata RPCs are synchronous, and dirty data must have been written back before locks can start to be released. This leaves the NRS free to reorder all incoming RPCs.

        Any POP capability should permit better RPC pipelining for improved throughput to single clients and better latency hiding when resolving lock conflicts.

        The implementation may choose to implement a very simple POP capability that only works for the most important use cases, since it can revert to synchronous client behaviour in complex cases.

        An implementation may create additional "before" relations between RPCs provided they do not conflict with any "real" ordering (i.e. no cycles in the global "before" graph). This may allow a more compact "wire" representation of the "before" relation and/or just a simpler overall implementation, at the expense of reducing the scope to optimize request order.

        Consider RPC requests a ≤ b. Implementations that could allow request b to reach a server before request a will have to log completed requests for the duration of a server epoch.

        A global POP capability seems to require too much and too fine-grained inter-server communication which will make it hard to implement efficiently. It should probably not be considered unless a significant use-case arises.
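The acyclicity constraint on added "before" relations can be checked with a standard depth-first search for back edges. The sketch below is illustrative only (not Lustre code); the fixed node count, adjacency-matrix representation, and function names are assumptions:

```c
/* Sketch (not Lustre code): verifying that artificially added "before"
 * edges keep the global "before" graph acyclic, i.e. they do not
 * conflict with any "real" ordering. */
#include <assert.h>
#include <string.h>

#define MAX_RPCS 8

static int edge[MAX_RPCS][MAX_RPCS];   /* edge[a][b] != 0 means a <= b */

/* DFS colouring: 0 = unvisited, 1 = on the DFS stack, 2 = done */
static int dfs(int node, int n, int *colour)
{
    int next;

    colour[node] = 1;
    for (next = 0; next < n; next++) {
        if (!edge[node][next])
            continue;
        if (colour[next] == 1)
            return 1;                  /* back edge: cycle found */
        if (colour[next] == 0 && dfs(next, n, colour))
            return 1;
    }
    colour[node] = 2;
    return 0;
}

/* Non-zero if the "before" graph on n RPCs contains a cycle, i.e. an
 * added ordering conflicts with a real one. */
static int before_graph_has_cycle(int n)
{
    int colour[MAX_RPCS];
    int i;

    memset(colour, 0, sizeof(colour));
    for (i = 0; i < n; i++)
        if (colour[i] == 0 && dfs(i, n, colour))
            return 1;
    return 0;
}
```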
      • Scalability
The number of RPC requests the server may buffer at any time is the product of RPC concurrency and active fan-out - i.e. potentially many thousands of requests. Request scheduling operations should have complexity of O(log n) at most.
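A binary heap is one way to meet the O(log n) bound with thousands of buffered requests; enqueue and dequeue are both logarithmic in the queue depth. The structure and field names below are illustrative assumptions, not the actual NRS data structures:

```c
/* Sketch: a binary min-heap keyed by a scheduling key (e.g. deadline,
 * offset, or round number) gives O(log n) enqueue/dequeue. */
#include <assert.h>

#define HEAP_MAX 1024

struct nrs_req {
    unsigned long key;          /* illustrative scheduling key */
};

static struct nrs_req heap[HEAP_MAX];
static int heap_n;

static void heap_push(struct nrs_req r)         /* O(log n) sift-up */
{
    int i = heap_n++;

    while (i > 0 && heap[(i - 1) / 2].key > r.key) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = r;
}

static struct nrs_req heap_pop(void)            /* O(log n) sift-down */
{
    struct nrs_req top = heap[0];
    struct nrs_req last = heap[--heap_n];
    int i = 0;

    for (;;) {
        int child = 2 * i + 1;

        if (child >= heap_n)
            break;
        if (child + 1 < heap_n && heap[child + 1].key < heap[child].key)
            child++;
        if (last.key <= heap[child].key)
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return top;
}
```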
      • Offset Stream Consistency
        The backend filesystem allocator determines the disk offset stream when a given file is first written. It may even turn a random file offset stream into a substantially sequential disk offset stream. The disk offset stream is repeated when the file is read, provided the file offset stream hasn't changed. Request ordering should therefore be as reproducible as possible in the face of ordering "noise" caused by network unfairness or client races.

        Clients should pass a "hint" in RPC requests to ensure related offset streams can be identified, reordered and merged consistently on a multi-user cluster. This "hint" should also be passed through to the backend file system and used by its allocator. The "hint" may also become the basis of a resource reservation system to guarantee share of server resource to concurrent jobs.
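One minimal way to use such a hint for reproducible ordering is to sort each batch of buffered requests by (hint, offset), so arrival-order "noise" does not change the offset stream presented to the backend. The struct layout and the hint field are assumptions for illustration:

```c
/* Sketch: reordering a batch of I/O requests by (hint, file offset)
 * with qsort(3), so each identified stream is offset-ordered regardless
 * of the noisy network arrival order. */
#include <assert.h>
#include <stdlib.h>

struct io_req {
    unsigned long hint;     /* stream/job identifier from the client */
    unsigned long offset;   /* file offset of this I/O */
};

static int io_req_cmp(const void *pa, const void *pb)
{
    const struct io_req *a = pa, *b = pb;

    if (a->hint != b->hint)
        return a->hint < b->hint ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Reorder a batch so related offset streams are merged consistently. */
static void reorder_batch(struct io_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), io_req_cmp);
}
```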
      • Request Priority
        Request priorities enable important requests to be serviced with lower latency - e.g. writes required to clean a cache on a locking conflict. Note that high priority requests must not break any POP requirements.
      • RPC Concurrency
        There are conflicting pressures on RPC concurrency. It should be high when maximum individual client performance is required - e.g. when active fan-out is low on the server and there is spare server bandwidth, or when a client must clean its cache on a lock conflict. It should be low at times of high active fan-out to reduce buffering required on the server and to limit the latency of individual client requests.
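The trade-off can be sketched as dividing a fixed server buffering budget by the active fan-out, with a floor so a lone client can still keep its pipeline full. The budget and floor constants below are purely illustrative assumptions:

```c
/* Sketch of the concurrency trade-off: each client's RPC credits shrink
 * as active fan-out grows, bounding total buffering on the server. */
#include <assert.h>

#define SERVER_RPC_BUDGET 4096   /* total buffered RPCs the server accepts */
#define MIN_CLIENT_CREDITS 8     /* floor so a single client can stream */

/* RPC credits granted to each client at the current active fan-out. */
static unsigned int rpc_credits(unsigned int active_fanout)
{
    unsigned int share;

    if (active_fanout == 0)
        return SERVER_RPC_BUDGET;
    share = SERVER_RPC_BUDGET / active_fanout;
    return share > MIN_CLIENT_CREDITS ? share : MIN_CLIENT_CREDITS;
}
```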
      • Extendability
The NRS must inter-operate with non-NRS-aware clients and peers, making "best efforts" scheduling decisions for them. The same policy must apply to successive client and server versions.

      Attachments

        1. 0001-Fix-for-minor-typo-in-nrs_crr_res_get.patch
          1.0 kB
        2. 0002-Add-assertions-where-svc-srv_rq_lock-is-meant-to-be-.patch
          4 kB
        3. 0003-Rework-movement-of-policies-in-nrs_policy_queued-lis.patch
          2 kB
        4. 0004-Make-some-functions-static.patch
          4 kB
        5. acc_sm_summary
          19 kB
        6. acc_sm_summary_2
          0.8 kB
        7. HLD_of_Lustre_NRS.pdf
          270 kB
        8. LUG_2012_NRS_tests_output.tar.bz2
          32 kB
        9. NRS_Bandwidth_Policies_LAD_2012.pdf
          170 kB
        10. nrs_dld_insp_6.patch
          73 kB
        11. nrs_generic_framework_dld_7.patch
          74 kB
        12. NRS_LAD_12.pdf
          583 kB
        13. nrs_liang_v2.patch
          54 kB
        14. nrs_orr_log_phys_offs_v1.patch
          47 kB
        15. nrs_orr_trr_v2.patch
          57 kB
        16. NRS_Scale_Testing_Results_LUG_2012_Nikitas_Angelinas_Xyratex.pdf
          1.41 MB
        17. nrs.patch
          54 kB
        18. NRS Conceptual Design__v1.0.doc
          101 kB
        19. NRS Conceptual Design__v1.0.pdf
          139 kB
        20. NRS Test Plan for Lustre 2.4__v1.2.doc
          110 kB
        21. NRS Test Plan for Lustre 2.4__v1.2.pdf
          182 kB
        22. orr_log_offs_v1.patch
          46 kB

        Issue Links

          Activity

            [LU-398] NRS (Network Request Scheduler)

            hmm, sorry I was trying to close it because I didn't realise there are sub tickets for this...

            liang Liang Zhen (Inactive) added a comment

            Great, thanks for finding that.

            nangelinas Nikitas Angelinas added a comment
            alexxy Alexey Shvetsov (Inactive) added a comment - Seems http://review.whamcloud.com/6141 should fix this error from LU-3179

            It seems like the compilers are catching a type check on the enum; I can upload a one- or few-line patch to fix this. It seems like -Werror=switch is catching errors that might be meant to be caught by -Wswitch-enum. Or I might be reading documentation for a previous version or something similar; please let me double-check.

            nangelinas Nikitas Angelinas added a comment

            After merging http://review.whamcloud.com/4938 the server won't build with new compilers (tested gcc-4.7 and gcc-4.8):

            /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c: In function ‘nrs_orr_ctl’:
            /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:773:2: error: case value ‘33’ not in enumerated type ‘enum ptlrpc_nrs_ctl’
            [-Werror=switch]
            /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:781:2: error: case value ‘34’ not in enumerated type ‘enum ptlrpc_nrs_ctl’
            [-Werror=switch]
            /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:788:2: error: case value ‘35’ not in enumerated type ‘enum ptlrpc_nrs_ctl’
            [-Werror=switch]
            /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:795:2: error: case value ‘36’ not in enumerated type ‘enum ptlrpc_nrs_ctl’
            [-Werror=switch]
            /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:802:2: error: case value ‘37’ not in enumerated type ‘enum ptlrpc_nrs_ctl’
            LD /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/fid/built-in.o
            [-Werror=switch]

            alexxy Alexey Shvetsov (Inactive) added a comment
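The errors above arise because GCC's -Wswitch (promoted by -Werror=switch) rejects case values outside the switched-on enum's enumerated type. Without knowing the exact fix landed in the Lustre tree, one general pattern is to switch on a plain integer so policy-specific opcodes outside the base enum remain legal; the enum, opcode value, and function below are hypothetical:

```c
/* Sketch of the -Werror=switch situation: POLICY_CTL_FOO (33) is not a
 * member of enum base_ctl, so "case POLICY_CTL_FOO:" inside a
 * switch over an enum base_ctl expression trips -Wswitch.  Switching
 * on a plain unsigned int (with a default case) avoids the warning. */
#include <assert.h>

enum base_ctl {
    CTL_START = 1,
    CTL_STOP  = 2,
};

#define POLICY_CTL_FOO 33       /* policy-specific opcode, outside enum */

static int handle_ctl(unsigned int opc)
{
    switch (opc) {              /* plain int: -Wswitch does not apply */
    case CTL_START:
        return 0;
    case CTL_STOP:
        return 1;
    case POLICY_CTL_FOO:
        return 2;
    default:
        return -1;
    }
}
```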
            green Oleg Drokin added a comment -

            Just to recapture discussions that are happening in the http://review.whamcloud.com/#change,5274 patch:

            • A new API needs to be introduced to fetch the next request from a queue, if any is ready to be served according to a policy function. This will also remove the request from the queue right away and should be suitable for use in ptlrpc_request_get, or even as a drop-in replacement for it. ptlrpc_request_get should not require any locking when called.
              (The current poll/remove API still remains for use in other places; currently that's the health_check function, where we just assess how long the next ready request has spent waiting in the queue, but we don't want the request removed. We also need to ensure the request holds some sort of reference, to avoid a race where we fetch a request and then it's processed and freed before we have a chance to look inside, when using such an API.)
            • The HP-request logic needs to be totally removed from the ptlrpc request API (with the exception of still providing a function to determine whether a request is HP or not); all this logic must be implemented inside the policy functions, as they are best positioned to determine request order. The multi-queue fetch/get functions in NRS should be gone as a result, and the only fetching calls remaining would be "get next request so I can serve it" and "get pointer to next request for inspection, don't remove from queue". All HP tracking, including how many requests of what sort are already running, should probably be inside a policy function too, though we might need a way for a policy function to determine the number of idle threads/running normal requests or some such.
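The two fetching calls described above can be sketched as a destructive "get" and a non-destructive "peek" that takes a reference so the request cannot be freed under the inspecting caller. All names, fields, and the refcount scheme here are assumptions for illustration, not the actual ptlrpc/NRS interfaces:

```c
/* Hypothetical sketch of the proposed queue API: nrs_request_get()
 * dequeues the next ready request; nrs_request_peek() returns it
 * without dequeuing, holding a reference so it cannot be processed and
 * freed while e.g. a health check inspects its waiting time. */
#include <assert.h>
#include <stddef.h>

struct nrs_request {
    struct nrs_request *next;
    int refcount;
    long arrival_time;
};

static struct nrs_request *queue_head;

/* Dequeue and return the next ready request. */
static struct nrs_request *nrs_request_get(void)
{
    struct nrs_request *req = queue_head;

    if (req != NULL) {
        queue_head = req->next;
        req->next = NULL;
    }
    return req;
}

/* Return the next request WITHOUT dequeuing; the extra reference
 * guards against the fetch/free race described above. */
static struct nrs_request *nrs_request_peek(void)
{
    struct nrs_request *req = queue_head;

    if (req != NULL)
        req->refcount++;
    return req;
}

static void nrs_request_put(struct nrs_request *req)
{
    req->refcount--;
}
```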

            Also, I now see that there is no test that is enabling NRS policies to verify that they currently work, and continue to work in the future. This is something that the patch inspectors should have caught.

            Please submit a sanityn.sh test that enables each of the available policies in turn, and then runs some kind of test load on multiple mount points (e.g. iozone and racer and fsx for 60s or 600s depending on SLOW=no or SLOW=yes) so that there will be sufficient load to give NRS a workout.

            adilger Andreas Dilger added a comment

            Yes Andreas, I will submit that patch asap.

            nangelinas Nikitas Angelinas added a comment

            Nikitas, just to clarify - can you please submit a follow-up patch to address the issues with Isaac's comments on the http://review.whamcloud.com/4411 patch.

            adilger Andreas Dilger added a comment

            People

              liang Liang Zhen (Inactive)
              liang Liang Zhen (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              20 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: