[LU-398] NRS (Network Request Scheduler) Created: 07/Jun/11  Updated: 07/Oct/16  Resolved: 07/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: New Feature Priority: Major
Reporter: Liang Zhen (Inactive) Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File 0001-Fix-for-minor-typo-in-nrs_crr_res_get.patch     File 0002-Add-assertions-where-svc-srv_rq_lock-is-meant-to-be-.patch     File 0003-Rework-movement-of-policies-in-nrs_policy_queued-lis.patch     File 0004-Make-some-functions-static.patch     PDF File HLD_of_Lustre_NRS.pdf     File LUG_2012_NRS_tests_output.tar.bz2     Microsoft Word NRS Conceptual Design__v1.0.doc     PDF File NRS Conceptual Design__v1.0.pdf     Microsoft Word NRS Test Plan for Lustre 2.4__v1.2.doc     PDF File NRS Test Plan for Lustre 2.4__v1.2.pdf     PDF File NRS_Bandwidth_Policies_LAD_2012.pdf     PDF File NRS_LAD_12.pdf     PDF File NRS_Scale_Testing_Results_LUG_2012_Nikitas_Angelinas_Xyratex.pdf     File acc_sm_summary     File acc_sm_summary_2     Text File nrs.patch     Text File nrs_dld_insp_6.patch     File nrs_generic_framework_dld_7.patch     Text File nrs_liang_v2.patch     File nrs_orr_log_phys_offs_v1.patch     File nrs_orr_trr_v2.patch     File orr_log_offs_v1.patch    
Issue Links:
Blocker
is blocked by LU-3239 ofd_internal.h:518:ofd_info_init()) A... Resolved
Related
is related to LU-2947 OBD_FAIL_PTLRPC_HPREQ_* implementatio... Resolved
is related to LU-2936 nrs_svcpt2nrs()) ASSERTION( (!(hp) ||... Resolved
is related to LU-2981 sanity.sh test_17m test_77i: oops in ... Resolved
is related to LU-3238 ASSERTION( (!(moving_req ? CFS_ALLOC_... Resolved
is related to LU-4493 NRS ORR crash Resolved
is related to LU-3265 nrs_crrn_quantum proc files unreadabl... Closed
is related to LU-6283 NRS Delay Policy Resolved
is related to LU-6336 Refactor ptlrpc_nrs_request structure... Open
is related to LUDOC-79 NRS Doc Changes Closed
is related to LU-765 RPC rate control Closed
Sub-Tasks:
Key Summary Type Status Assignee
LU-1879 Write and attach test plan to Jira ti... Technical task Resolved Liang Zhen  
LU-2667 move NRS structures/definitions from ... Technical task Resolved WC Triage  
LU-3266 Regression tests for NRS policies Technical task Resolved WC Triage  
Bugzilla ID: 13634
Rank (Obsolete): 7604

 Description   

Architecture - Network Request Scheduler

(NB: this is a copy of http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler)

Definitions

  • NRS
    Network Request Scheduler.
  • RPC Concurrency
The number of RPC requests in flight between a given client and a server.
  • Active fan-out
    The number of clients with in-flight requests to a given server at a given time.
  • Offset stream
    The sequence of file or disk offsets in a stream of I/O requests.
  • "Before" relation (≤)
    File system operations that require ordering for correctness are related by "≤". For 2 operations a and b, if a ≤ b, then operation a must complete reading/writing file system state before operation b can start.
  • POP
Partial Order Preservation. A filesystem's POP capability describes how its servers handle any "before" relations required on RPCs sent to them. Servers with no POP capability have no concept of any "before" relation on incoming RPCs, so clients are completely responsible for preserving it. Servers with local POP capability preserve the "before" relation within a single server, but clients are responsible for preserving any required order on RPCs sent to different servers. A set of servers with global POP capability preserves the "before" relation on all RPCs.

Summary

  • The Network Request Scheduler manages incoming RPC requests on a server to provide improved and consistent performance. It does this primarily by ordering request execution to avoid client starvation and to present a workload to the backend filesystem that can be optimized more easily. It may also change RPC concurrency as active fan-out varies to reduce latency seen by the client and limit request buffering on the server.

Requirements

  • POP Capability
    The NRS must implement any POP capability its clients require.

Current Lustre servers have no POP capability; therefore clients may never concurrently issue RPCs that have a "before" relation - viz. metadata RPCs are synchronous, and dirty data must be written back before locks can start to be released. This leaves the NRS free to reorder all incoming RPCs.

    Any POP capability should permit better RPC pipelining for improved throughput to single clients and better latency hiding when resolving lock conflicts.

    The implementation may choose to implement a very simple POP capability that only works for the most important use cases, since it can revert to synchronous client behaviour in complex cases.

    An implementation may create additional "before" relations between RPCs provided they do not conflict with any "real" ordering (i.e. no cycles in the global "before" graph). This may allow a more compact "wire" representation of the "before" relation and/or just a simpler overall implementation, at the expense of reducing the scope to optimize request order.

    Consider RPC requests a ≤ b. Implementations that could allow request b to reach a server before request a will have to log completed requests for the duration of a server epoch.

A global POP capability seems to require too much, and too fine-grained, inter-server communication, which would make it hard to implement efficiently. It should probably not be considered unless a significant use case arises.
  • Scalability
    The number of RPC requests the server may buffer at any time is the product of RPC concurrency and active fan-out - i.e. potentially many thousands of requests. Request scheduling operations should therefore have at most O(log n) complexity; see the sketch after this list.
  • Offset Stream Consistency
    The backend filesystem allocator determines the disk offset stream when a given file is first written. It may even turn a random file offset stream into a substantially sequential disk offset stream. The disk offset stream is repeated when the file is read, provided the file offset stream hasn't changed. Request ordering should therefore be as reproducible as possible in the face of ordering "noise" caused by network unfairness or client races.

    Clients should pass a "hint" in RPC requests to ensure related offset streams can be identified, reordered and merged consistently on a multi-user cluster. This "hint" should also be passed through to the backend file system and used by its allocator. The "hint" may also become the basis of a resource reservation system to guarantee share of server resource to concurrent jobs.
  • Request Priority
    Request priorities enable important requests to be serviced with lower latency - e.g. writes required to clean a cache on a locking conflict. Note that high priority requests must not break any POP requirements.
  • RPC Concurrency
    There are conflicting pressures on RPC concurrency. It should be high when maximum individual client performance is required - e.g. when active fan-out is low on the server and there is spare server bandwidth, or when a client must clean its cache on a lock conflict. It should be low at times of high active fan-out to reduce buffering required on the server and to limit the latency of individual client requests.
  • Extendability
The NRS must inter-operate with non-NRS-aware clients and peers, making "best efforts" scheduling decisions for them. The same policy must apply to successive client and server versions.
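
The O(log n) scalability requirement above points at a priority-queue structure; the prototype discussed in the comments below uses a binary heap ("binheap") for exactly this. A minimal standalone sketch of the idea follows, with hypothetical names (nrs_req, nrs_heap) rather than the actual libcfs heap API, assuming each queued request carries a policy-assigned sort key:

#include <stddef.h>

/* Hypothetical queued-request record: nr_key is whatever the active
 * policy sorts by (arrival order for FIFO, per-client round-robin
 * sequence for CRR, disk offset for ORR/TRR). */
struct nrs_req {
        unsigned long long nr_key;
};

struct nrs_heap {
        struct nrs_req **h_elts;  /* array-backed binary min-heap */
        unsigned int     h_count; /* current number of elements */
};

/* Enqueue: sift the new request up from the bottom, O(log n).
 * The caller is assumed to have ensured capacity in h_elts. */
static void nrs_heap_insert(struct nrs_heap *h, struct nrs_req *req)
{
        unsigned int i = h->h_count++;

        while (i > 0) {
                unsigned int parent = (i - 1) / 2;

                if (h->h_elts[parent]->nr_key <= req->nr_key)
                        break;
                h->h_elts[i] = h->h_elts[parent];
                i = parent;
        }
        h->h_elts[i] = req;
}

/* Dequeue the request with the smallest key: move the last element to
 * the root and sift it down, O(log n). Returns NULL if the heap is
 * empty. */
static struct nrs_req *nrs_heap_extract(struct nrs_heap *h)
{
        struct nrs_req *min, *last;
        unsigned int i = 0;

        if (h->h_count == 0)
                return NULL;

        min = h->h_elts[0];
        last = h->h_elts[--h->h_count];

        for (;;) {
                unsigned int child = 2 * i + 1;

                if (child >= h->h_count)
                        break;
                if (child + 1 < h->h_count &&
                    h->h_elts[child + 1]->nr_key < h->h_elts[child]->nr_key)
                        child++;
                if (last->nr_key <= h->h_elts[child]->nr_key)
                        break;
                h->h_elts[i] = h->h_elts[child];
                i = child;
        }
        h->h_elts[i] = last;
        return min;
}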


 Comments   
Comment by Nathan Rutman [ 05/Jul/11 ]

Note that Xyratex is also working on this issue (MRP-73).
I'm going to attach a current DLD (pre-inspection) for community comment.

Comment by Nathan Rutman [ 05/Jul/11 ]

Xyratex NRS HLD

Comment by Liang Zhen (Inactive) [ 08/Jul/11 ]

Hi, here is my prototype; it has a full implementation of a binheap and of client round-robin based on the binheap. It's not a fully tested patch.

Comment by Liang Zhen (Inactive) [ 08/Jul/11 ]

I just tested the previous patch and found it has a bug, so I fixed it.

Comment by Nikitas Angelinas [ 11/Jul/11 ]

Hi Liang,

There's a minor typo on line 1242 in crr_req_compare(): the second req1 should be req2.

Comment by Liang Zhen (Inactive) [ 11/Jul/11 ]

Nikitas, thanks! It changes the whole logic even though it's just a typo!

I'm doing some cleanup now; after that I will push it to a development branch for your review (in one or two days), and any comments and suggestions will be welcome. Also, I will review your DLD to see if I can merge its ideas about the framework into the prototype (I won't be able to work on OBRR for now).
Would this be acceptable to you? If you think there is something essentially wrong and we need to restart, that's fine with me; we can continue the discussion on the open forum.

Thanks again!

Comment by Nikitas Angelinas [ 12/Jul/11 ]

Hi Liang,

If you are willing to upload your code into a development repository, that sounds great. I have gone through some of the patch you posted, and I don't think there is anything essentially wrong with it; there are maybe some minor things the patch could make use of, but as you said, it is meant to be a prototype. I will follow up with a more detailed discussion soon on the lustre-devel forum. I will get some time to factor in inspection comments for my DLD, so I should be able to upload a new version here soon. Thanks!

Comment by Liang Zhen (Inactive) [ 13/Jul/11 ]

I've uploaded the patch to git://git.whamcloud.com/fs/lustre-dev.git; the branch name is liang/b_nrs
(I did some cleanup and added real client round-robin).

link to it:
http://git.whamcloud.com/?p=fs%2Flustre-dev.git;a=shortlog;h=refs%2Fheads%2Fliang%2Fb_nrs

Comment by Nikitas Angelinas [ 14/Jul/11 ]

A slightly more up-to-date version of our DLD; I'll follow this up with a proper version once I factor in all inspection comments.

Comment by Liang Zhen (Inactive) [ 22/Jul/11 ]

I will update the development branch to make it consistent with the slides.
Nikitas, could you please post your latest DLD so I can review it and see how we can merge our ideas?

Thanks
Liang

Comment by Nikitas Angelinas [ 23/Jul/11 ]

Liang, please use the latest patch file I uploaded; it is the most recent version of our DLD. I will upload a new version if we make any changes to it.

Thanks

Comment by Liang Zhen (Inactive) [ 02/Aug/11 ]

I'm thinking that we could probably remove this whole bunch of things and move it all into NRS; that would make the ptlrpc service cleaner and more readable. It's just preliminary thinking, and I'm putting a note here so I won't forget it in the future:

include/lustre_net.h
        cfs_spinlock_t                  srv_rq_lock __cfs_cacheline_aligned;
        /** # reqs in either of the queues below */
        int                             srv_n_queued_reqs;
        /** reqs waiting for service */
        cfs_list_t                      srv_request_queue;
        /** high priority queue */
        cfs_list_t                      srv_request_hpq;
        /** # reqs being served */
        int                             srv_n_active_reqs;
        /** # HPreqs being served */
        int                             srv_n_active_hpreq;
        /** # hp requests handled */
        int                             srv_hpreq_count;
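
(For illustration only, the consolidation proposed here might look roughly like the sketch below - a hypothetical per-service NRS head owning the queues, counters and lock. The names are guesses, not the structures that eventually landed.)

        /* Hypothetical NRS head: the queues and counters above move out
         * of ptlrpc_service, so the scheduler owns its own state and the
         * external srv_rq_lock is no longer needed. */
        struct ptlrpc_nrs_head {
                cfs_spinlock_t          nrs_lock;       /* replaces srv_rq_lock */
                /** queued regular requests, ordered by the active policy */
                cfs_list_t              nrs_req_queue;
                /** queued high-priority requests */
                cfs_list_t              nrs_req_hpq;
                /** # queued reqs (was srv_n_queued_reqs) */
                int                     nrs_n_queued;
                /** # reqs being served (was srv_n_active_reqs) */
                int                     nrs_n_active;
                /** # HP reqs being served (was srv_n_active_hpreq) */
                int                     nrs_n_active_hp;
                /** # hp requests handled (was srv_hpreq_count) */
                int                     nrs_hpreq_count;
        };
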
Comment by Liang Zhen (Inactive) [ 02/Aug/11 ]

Nikitas,
Do you think my previous comment is reasonable?
I think that if we can move those queues into NRS, then it doesn't make much sense to leave the counters in the ptlrpc_service structure and protect NRS data with an external lock (srv_rq_lock); we could also move some APIs into NRS and modularize the "scheduler" logic in the ptlrpc service.

Liang

Comment by Nikitas Angelinas [ 02/Aug/11 ]

Liang,

That makes perfect sense to me. I suspect NRS heads may be the most appropriate place for replacements for those fields, although the spinlock may require a separate structure (common to HP and normal requests), or should perhaps be left in ptlrpc_service? I suspect that some APIs could be moved into NRS; were there some you were thinking of in particular?

Nikitas

Comment by Nikitas Angelinas [ 12/Aug/11 ]

Hi Liang,

I have gone through some of the updated code in the development repository. In general I am personally quite happy with what is there; I think the resource abstraction is more intuitive than the target/object pair.

I'm attaching some small patches; obviously please review them before applying. I will be away on holiday for the rest of the month, so unfortunately I'm not able to do much more work on this at present, but I will complete the review as soon as I come back; I hope that is ok.

Thanks,
Nikitas

Comment by Liang Zhen (Inactive) [ 15/Aug/11 ]

Nikitas, thanks. I will review and push these patches to the development branch. By the way, Robert told me you should be able to push code to the development branch now.

As I mentioned previously, the next thing I'm going to work on is replacing the active RPC counters in ptlrpc/service.c with counters in the RPC, which will make the code cleaner.

Liang

Comment by Liang Zhen (Inactive) [ 29/Aug/11 ]

Nikitas, I've pushed most of your patches, except the LASSERT_SPIN_LOCKED() ones:

  • I think the locking logic here is simple enough that we don't really need these assertions.
  • service::srv_rq_lock is a high-contention lock, and the overhead of LASSERT_SPIN_LOCKED() is not cheap for this kind of lock.

Thanks

Comment by Nikitas Angelinas [ 03/Nov/11 ]

Hi Liang,

I've started working on an object-based policy; the plan is for it to act as a high-level elevator. I know Eric has said he is sceptical about this being the best approach for an object-based policy, and whilst of course I value his input, I think everyone is also in agreement that getting some performance data from such a policy (or any other policy) would be a good thing.

I suspect the OSD-restructuring work will mean that at some point this policy will need some porting work (e.g. at least to move from objid to FID), but at the moment I am doing this against your b_nrs branch.

Is there a branch or future release that is being targeted for landing the NRS code? In one of your last emails you mentioned there were some thoughts on landing the framework (I suspect with the FIFO policy only?) in around 2.2 or so; is this still the case? In general, how do you see the landing of NRS happening? Will you make use of the current NRS development branch, or will you treat this as only a prototype in order to get performance data, and then rework the branch before considering landing? (sorry for all the questions)

There are some things I wanted to have patched on the current branch, some of these are:

  • improve statistics for policies (I noticed a FIXME noting that the current implementation is just temporary)
  • improve the lprocfs interactions
  • optionally, allow not registering every policy with every service, by having each service declare which policies it wants to support at service initialization time
  • have a method for selecting a startup policy for a given service (at present a newly started server would begin with the FIFO policy and rely on lprocfs interaction to change it; doing this automatically at startup would probably be better suited to a production version)
  • some code cleanups

I'm more than happy to generate some patches for these features if you also agree they would be useful to have.

Cheers,
Nikitas

Comment by Liang Zhen (Inactive) [ 04/Nov/11 ]

Nikitas, I just merged master into b_nrs and I will continue to use the development branch for future work. We are still considering landing the framework in 2.2, but it really depends on our review/testing resources; I will let you know if there is any update on this.

I think your plan is good; thanks for letting me know. I'd like to know more details, but I have to prepare for my trip tomorrow; I will probably ask you some detailed questions about these plans when I have more time.

Liang

Comment by Nikitas Angelinas [ 01/Mar/12 ]

I ran acceptance-small testing at some point against a version of the NRS code rebased onto a recent commit in the lustre-rel repository. Some failures can be seen in the first output file attached, but they are all either known failures which are being worked on (to the extent that resources permit) by other people within Xyratex and will be submitted when resolved, or they passed in the second round of tests (second attached output file); perhaps those had something to do with the testing environment used.

Comment by Nikitas Angelinas [ 01/Mar/12 ]

I have hit a bug while using mdtest to carry out a performance regression test of vanilla code vs. NRS, mainly with the FIFO policy but also with CRR/CRR2; I can only reproduce this using the CRR policy on the MDS:

LustreError: 1607:0:(ptlrpc_nrs.c:222:nrs_policy_put_locked()) ASSERTION(policy->pol_ref > 0) failed
LustreError: 1607:0:(ptlrpc_nrs.c:222:nrs_policy_put_locked()) LBUG
Kernel panic - not syncing: LBUG

I also hit another bug that may be related, but only once after reproducing the previous one many times, and I can't reproduce this one:

LustreError: 22518:0(ptlrpc_nrs.c:366:nrs_request_poll()) ASSERTION(nrs_request_policy(nrq) == policy) failed
LustreError: 22518:0(ptlrpc_nrs.c:366:nrs_request_poll()) LBUG
Kernel panic - not syncing: LBUG

I'll see if I can fix these, but I'm more focused on getting some larger-scale testing under way and finishing an ORR DLD at the moment.

Comment by Nikitas Angelinas [ 01/Mar/12 ]

Hi Liang,

We are in the process of running some NRS performance tests in-house, and maybe more importantly, putting together a test plan that we can execute in one of our beta testing sites; they have a large number of clients available, and we could potentially make use of a good number of these; I think we will have a better chance of making use of more of their resources if we can come up with some solid test cases that would demonstrate the usefulness of different NRS policies. Our current thinking is along the following lines:

  • CRR/CRR2 on the MDS: demonstrate client fairness, by showing that aggressive MD users can be prevented from slowing down the filesystem for other users. For this, we can saturate the MDS with a few clients doing e.g. unlink operations, and time 'ls -l' from another client on a separate large directory; if successful, NRS with CRR/CRR2 should show an improvement over vanilla code.
  • CRR/CRR2 on the OSS: some clients performing high rate I/O, to the point that the performance seen at another client is decreased e.g. a streaming video application's frame rate drops below acceptable levels; if successful, CRR/CRR2 should at least help with the standard deviation of the frame rate value, and maybe also its mean value.
  • TBRR (not developed yet, but should be a simpler variation of ORR for which I am doing a DLD) on the OSS: TBRR should be able to help in cases where one client is performing I/O to one OST, and many clients are performing I/O to the rest of the OSTs handled by an OSS, to the point that the first OST remains idle for some periods of time during the test run.
  • ORR (only applicable on OSS): a notable use case may be backwards read operations, although perhaps random and hopefully sequential reads from many clients may see an improvement with the ORR policy.

If you or anybody else can think of any notable use cases that would be useful in demonstrating the effectiveness of the various policies, please let us know so we can include them in the test plan. I have not developed the NRS framework any further, so testing the multiple (layered) policies that you had suggested is unfortunately not possible with the current code, although adding this should not be too time-consuming and is probably worth doing if there is a good use case for it. The TBRR-on-the-OSS-with-ORR-in-each-OST combination that you had suggested sounds good, although it may benefit from the SMP scaling code landing first, as you had mentioned.

Cheers,
Nikitas

Comment by Liang Zhen (Inactive) [ 02/Mar/12 ]

Hi Nikitas, did you make any progress on those LBUGs? I haven't tested this for a while; could it be a bug introduced by merging with master?
I think you only need to run one of CRR/CRR2, because CRR does round-robin over each logical client (OSC or MDC) while CRR2 does it over NIDs, so I think they will give the same result, since you will most likely have only one MDC instance on each of your clients.
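
(To make the CRR vs. CRR2 distinction concrete, a rough sketch with hypothetical names, not the landed code: both policies hand each queued request a monotonically increasing round-robin sequence within its "client" slot and then serve requests in increasing sequence order; they differ only in what identifies a slot.)

/* CRR slots requests per logical client (one OSC/MDC export);
 * CRR2 slots them per peer NID, so all logical clients on one node
 * share a slot. With one MDC per client node, the two keys coincide
 * on the MDS. */
struct crr_client {
        unsigned long long cc_sequence; /* next RR sequence to hand out */
};

static unsigned long long crr_req_key(struct crr_client *cli)
{
        /* Served in increasing key order (e.g. via a binary heap),
         * this interleaves clients one request per round. */
        return cli->cc_sequence++;
}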

Comment by Nikitas Angelinas [ 12/Mar/12 ]

Hi Liang, I have not made any progress on the bugs, as I am focused on getting test results and working on ORR; I hope to find some time to hunt them down, as ideally I wouldn't want testing to hit them. If I remember correctly, I could also replicate the first assertion with the latest liang/b_nrs branch (I can't replicate the second assertion with any branch), but I have not tried to replicate it with older snapshots, so I guess it may have been introduced by a merge with master as you mention, possibly the latest one (btw, I think lustre-dev has not synced with lustre-rel for a while).

I am attaching a patch that adds ORR with logical offset ordering; I will work on the physical offset ordering using fiemap calls hopefully within the next few days. If you can have a look at it, that would be great. I have only run a smoke test on the patch on a virtual machine, but will test it on a real setup today or tomorrow if nothing comes up. There is a minor issue of slab object deallocation for backend-fs object data that I haven't taken care of (currently the objects will just stay allocated until the cache is destroyed; an easy workaround is to stop the policy temporarily and then restart it for another test run); I may add an LRU or something similar for this shortly.

Comment by Nikitas Angelinas [ 04/Apr/12 ]

I am attaching v2 of the ORR patch, which adds support for ordering requests by their physical on-disk offsets, obtained via fiemap calls, and will also reorder write RPCs based on their logical offsets (I may add an lprocfs tunable to turn support for writes on and off). I'll try to find a case where this policy helps performance, although, knowing the physical offsets, it may make sense to tweak the code slightly to produce a Target-based (OST-based) RR version, since it may be better to mix read requests for different objects based only on their physical offsets; i.e. requests belonging to the same object may still be fairly distant on disk.
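
(As a hedged illustration of the ordering described above, not the actual patch: ORR sorts queued brw requests by a per-request extent - physical offsets resolved via fiemap for reads, logical file offsets for writes. A comparator along these lines, with hypothetical names, could drive the scheduler's queue.)

/* Hypothetical per-request extent: filled from fiemap results for
 * reads (physical disk offsets) or from the RPC's file offsets for
 * writes. */
struct orr_range {
        unsigned long long or_start;
        unsigned long long or_end;
};

/* Order requests by ascending start offset so the backend sees a
 * mostly sequential stream; break ties by shorter extent first. */
static int orr_range_cmp(const struct orr_range *a,
                         const struct orr_range *b)
{
        if (a->or_start != b->or_start)
                return a->or_start < b->or_start ? -1 : 1;
        if (a->or_end != b->or_end)
                return a->or_end < b->or_end ? -1 : 1;
        return 0;
}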

Comment by Nikitas Angelinas [ 13/Apr/12 ]

Hi, I am attaching a patch that adds a TRR (Target-based RR, i.e. RR over OSTs) policy, which also performs logical or physical offset request ordering. It has been developed as an afterthought on top of the ORR policy, so plain TRR (without offset-based request ordering) could be simplified when used in a multi-policy NRS environment, e.g. the combined TRR-on-the-OSS plus ORR-on-each-OST setup that Liang had proposed. The patch also adds tunable read/write support, and optimized physical offset calculations for the one-extent-per-RPC case that is currently used. I am seeing positive results from these policies, and also from CRR2, in some cases, but will expand on this at LUG in about a week's time.

Comment by Andreas Dilger [ 30/Apr/12 ]

Hi Nikitas,
rather than attaching patches here to the Jira ticket, it would be better to submit them to Gerrit, so that it follows the normal patch inspection and testing process, and it makes it much easier to review and track changes. If you don't want the patches considered for upstream submission yet, you can put a comment to that effect in the Git commit description.

I'd also be happy if you could attach a copy of your LUG presentation here, so that the information about NRS is available in a more central location. This will also include the performance metrics, which is important to see when reviewing a major feature patch for landing.

Comment by Nikitas Angelinas [ 01/May/12 ]

Hi Andreas,

Ok, I'll do this soon; as you have guessed correctly, I haven't been uploading the patches to Gerrit as I was not clear on the landing plan for NRS. Since the framework was originally considered a prototype (and in some respects, imo, still is), we'll have to decide with Liang what updates we would have to make in order to submit the code for landing. But as you point out, it may be best for us to also submit the current code that we have to Gerrit. I'll post my presentation here along with details of the testing, although I'm more looking forward to also testing the policies we currently have at scale; I'll check the availability of the collaborator site that performed the previous tests, to schedule another round of testing.

Liang, I'll come up with some suggestions on enhancing the framework soon; I think two things that are definitely required are the layered policies that you had mentioned (i.e. multiple policies operating at the same time), and an lprocfs rework.

A much-requested feature for NRS is QoS. I'm currently looking at LU-694 "Job Stats", since it seems it will provide the information necessary to do at least job-based QoS. The intent of that patch is to export the gathered statistics via lprocfs on the servers, but it does not include an API for internal users, i.e. Lustre subsystems; of course this isn't too much of a problem, it just means that NRS will need to implement that functionality instead. Btw, maybe it would be good to have an open discussion to gather requirements for QoS first, in order to try to address the needs of more users; I'll do that soon.

Cheers,
Nikitas

Comment by Nikitas Angelinas [ 09/May/12 ]

I am attaching a copy of the NRS LUG 2012 presentation and a tarball containing output from the test runs contained in the presentation.

The presentation depicts results from the following tests:

1. Performance regression test using the NRS FIFO policy vs a vanilla Lustre deployment; this includes IOR FPP and SSF testing, and mdtest file and directory operations testing, with 128 and 64 physical clients (and also 12 clients for mdtest); all tests were performed using a 1 process per client configuration. One thing to note is that the unexpectedly low mdtest performance in the create and unlink operations with 128 and 64 clients seems to be due to suboptimal RAID configurations on the systems used for the tests; hence it may be useful to repeat the mdtest runs on a sane configuration. These tests showed no noticeable performance regressions between vanilla code and NRS with the FIFO Policy.

2. A range of tests aiming to explore the effect that the CRR-N policy has on throughput. In these tests, groups of dd processes are used to generate read and write loads between the Lustre filesystem and /dev/zero and /dev/null respectively. 10 filesystem clients take part in this test: either each of the 10 clients runs 10 dd processes, or 9 of them run 11 dd processes each while the remaining client runs only one dd process; the read and write cases are handled separately for both of these '#clients/#dd processes' configurations. In each test run, the throughput from each client and the overall standard deviation are measured, in order to compare the behaviour of vanilla code vs. NRS with the CRR-N policy. One thing to note is that these tests were not performed using widely striped files, so a repeat may be useful. These tests showed an evening-out of write performance between clients, and a large reduction in its standard deviation, when using the CRR-N policy, but the same results were not seen in the read performance tests; this may be due to the synchronous nature of reads, although we probably need to examine the behaviour of CRR-N further before we can make a definite comment.

3. A range of tests, mostly using IOR but also IOzone, for determining the effectiveness of the ORR and TRR policies in increasing read throughput. These tests were performed on a small scale, using only 14 physical clients, and with a reduced number of ost_io threads (128 on each of two OSS nodes), since read operations generated very few RPCs and we wanted to at least emulate a saturated-server scenario; the ORR and TRR policies were expected to show a benefit in cases where a number of requests are 'pending', such that performance can be improved by these sorting algorithms (although this may not be strictly necessary). The results showed that the TRR policy with physical offset ordering seems to be the most promising configuration, since it provided a noticeable performance improvement for IOR FPP tests using 1 process per client, and also for the IOzone test (again, 1 process per client); the ORR policy also improved IOR FPP performance when used in an 8-processes-per-client configuration. However, the worrying thing is that in all other cases the results showed a drop in performance when using these policies. I think further, and also larger-scale, testing of these policies is required at this point; this may give us a better indication of how widely applicable they could be, although for definite results it seems it would be useful to obtain feedback from using them on real-world job runs. Their usefulness may even vary depending on the particular job/application. I'll see if we can run some tests using real jobs on our beta-testing collaborator site with these policies.

Apart from test results, the presentation also includes some annotated debug prints from the OSS nodes, as proof that the aforementioned policies actually implement the sorting behaviour that they claim to.

All tests have been performed using a single Xyratex CS3000 SSU; this includes 2 OSS nodes in HA mode (quad core Intel Xeon C5518 @ 1.73 GHz, 16GB DDR3, HT enabled), and 1 MDS (dual hex core Intel Xeon X5670 @ 2.93 GHz, 32GB DDR3, HT enabled), with servers and clients connected via an Infiniband QDR network.

Comment by Nikitas Angelinas [ 03/Jun/12 ]

patch for master at http://review.whamcloud.com/3015

I made some changes, mostly to the framework, and added the patch to Gerrit. It still needs a few things, so I will be updating it over the following days, but it is in a position for reviews and builds to start. I was hoping this feature could be considered for landing in 2.3, although the feature freeze date is meant to be quite close.

Liang and Eric, this patch also adds your work on NRS and libcfs_heap; please let me know if you want to submit these separately.

Comment by Liang Zhen (Inactive) [ 06/Jun/12 ]

Hi Nikitas, thanks for posting the patch for review; I will definitely review it. The problem is that the 2.3 code freeze is not far away and we still have a bunch of things pending on the landing list, and some of those changes definitely conflict with this patch, so it's very likely that we can't land it in 2.3. But again, it's very helpful to have this patch on Gerrit, because it will be much easier to discuss on that basis.

I will let you know my feedback soon.

Thanks

Comment by Nikitas Angelinas [ 07/Jun/12 ]

Hi Liang, please take your time and plan landings and reviews as you see best; no issue with me. I guess you may be referring to project Apus, since the ptlrpc changes there will cause a conflict; I'm sure that's a good thing, as I think the locking in ptlrpc could do with a rework once the NRS patch is applied anyway. Thanks for letting me know of your status with 2.3.

Comment by Andreas Dilger [ 19/Oct/12 ]

Nikitas, any status on your refactoring of the NRS patch into smaller components for inspection and landing?

Comment by Nikitas Angelinas [ 19/Oct/12 ]

Sorry, I was away for a few days; I've been adding the flexible policy registration ability, and will upload most of the code broken into smaller patches around the start of next week.

Comment by Nikitas Angelinas [ 30/Oct/12 ]

Abandoning the previous Gerrit change and breaking the patch down into smaller patches, each with its own change; the first change is at http://review.whamcloud.com/#change,4411

Comment by Nikitas Angelinas [ 30/Oct/12 ]

The libcfs heap Gerrit change for master is at http://review.whamcloud.com/#change,4412

Comment by Nathan Rutman [ 20/Nov/12 ]

Xyratex MRP-73

Comment by Nikitas Angelinas [ 05/Dec/12 ]

Attaching the LAD '12 presentation slides.

Comment by Nikitas Angelinas [ 29/Dec/12 ]

Additional Gerrit changes for master: http://review.whamcloud.com/#change,4937 for the CRR-N policy and http://review.whamcloud.com/#change,4938 for the ORR and TRR policies.

Comment by Nikitas Angelinas [ 22/Jan/13 ]

I am attaching a first take on the NRS test plan, marked v1.0. I am uploading it both in PDF format for easier reading and in DOC format in case someone wants to make changes. Please let me know if you notice anything.

Comment by Nikitas Angelinas [ 23/Jan/13 ]

I had put together a Conceptual Design document for NRS, but it really needs to be updated in order to be of much use. I will try to do this as soon as the code update tasks are finished (hoping to have it ready by the end of the weekend at the latest), and will then attach the document to this ticket.

Comment by Nikitas Angelinas [ 25/Jan/13 ]

I am uploading an updated version of the test plan document, to account for changes in the lprocfs interface that were included in a refresh of the patches that took place today; I have deleted the old version to avoid any confusion.

Comment by Nikitas Angelinas [ 28/Jan/13 ]

As requested by Isaac, I am attaching a Conceptual Design document, that reviewers of the patches may find useful. It is rather short and informal, but should include enough information to assist in getting familiar with the design concepts used. I will try to update this if people think it is too short, although the plan is to eventually embed the information into comment blocks in the code.

Comment by Nikitas Angelinas [ 29/Jan/13 ]

I am attaching a new version of the test plan to reflect a minor lprocfs change (both regular and hp request handling is now enabled by default); I will remove the older version of the test plan to avoid confusion, please let me know if someone needs that for some reason.

Comment by Andreas Dilger [ 01/Feb/13 ]

Nikitas, just to clarify - can you please submit a follow-up patch to address the issues from Isaac's comments on the http://review.whamcloud.com/4411 patch?

Comment by Nikitas Angelinas [ 01/Feb/13 ]

Yes Andreas, I will submit that patch asap.

Comment by Andreas Dilger [ 01/Feb/13 ]

Also, I now see that there is no test that enables the NRS policies, to verify that they currently work and continue to work in the future. This is something that the patch inspectors should have caught.

Please submit a sanityn.sh test that enables each of the available policies in turn, and then runs some kind of test load on the multiple mount points (e.g. iozone, racer, and fsx for 60s or 600s depending on SLOW=no or SLOW=yes) so that there will be a sufficient load to give NRS a workout.

Comment by Oleg Drokin [ 11/Mar/13 ]

Just to recap the discussions happening in the http://review.whamcloud.com/#change,5274 patch:

  • A new API needs to be introduced to fetch the next request from a queue, if there is any ready to be served according to a policy function. This will also remove the request from the queue right away, and should be suitable for use in ptlrpc_request_get() or even as a drop-in replacement for it; ptlrpc_request_get() should not require any locking when called.
    (The current poll/remove API still remains for use in other places; currently that's the health_check function, where we just assess how long the next ready request has spent waiting in the queue but don't want it removed. We also need to ensure the request holds some sort of reference when using such an API, to avoid a race where we fetch a request and it is then processed and freed before we have a chance to look inside.)
  • The HP-request logic needs to be removed entirely from the ptlrpc request API (with the exception of still providing a function to determine whether a request is HP or not); all of this logic must be implemented inside the policy functions, as they are best positioned to determine request order. The multi-queue fetch/get functions in NRS should be gone as a result, and the only fetching calls remaining would be "get the next request so I can serve it" and "get a pointer to the next request for inspection, without removing it from the queue". All HP tracking, including how many requests of what sort are already running, should probably be inside the policy functions too, though we might need a way for a policy function to determine the number of idle threads/running normal requests or some such.
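
(For readers following that discussion, the two remaining fetch calls might have shapes roughly like the following; these signatures are illustrative guesses, not the agreed API.)

/* Dequeue the next request that is ready to run under the active
 * policy; does its own locking, so callers such as the request
 * handler need none. Returns NULL if nothing is ready. */
struct ptlrpc_request *nrs_request_get(struct ptlrpc_service *svc);

/* Peek at the next ready request without dequeueing it, taking a
 * reference so it cannot be processed and freed under the caller
 * (e.g. for the health check, which only looks at queueing time);
 * the caller must drop the reference when done. */
struct ptlrpc_request *nrs_request_peek(struct ptlrpc_service *svc);
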
Comment by Alexey Shvetsov [ 07/May/13 ]

After merging http://review.whamcloud.com/4938, the server won't build with newer compilers (tested with gcc-4.7 and gcc-4.8):

/var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c: In function ‘nrs_orr_ctl’:
/var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:773:2: error: case value ‘33’ not in enumerated type ‘enum ptlrpc_nrs_ctl’ [-Werror=switch]
/var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:781:2: error: case value ‘34’ not in enumerated type ‘enum ptlrpc_nrs_ctl’ [-Werror=switch]
/var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:788:2: error: case value ‘35’ not in enumerated type ‘enum ptlrpc_nrs_ctl’ [-Werror=switch]
/var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:795:2: error: case value ‘36’ not in enumerated type ‘enum ptlrpc_nrs_ctl’ [-Werror=switch]
/var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c:802:2: error: case value ‘37’ not in enumerated type ‘enum ptlrpc_nrs_ctl’ [-Werror=switch]

Comment by Nikitas Angelinas [ 07/May/13 ]

It seems the compilers are catching a type check on the enum; I can upload a one- or few-line patch to fix this. It seems like -Werror=switch is catching errors that might be meant to be caught by -Wswitch-enum, or I might be reading the documentation for a previous version or something similar; let me double-check.
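
(A minimal reproduction of this class of error, illustrative rather than the actual nrs_orr.c code: with -Wswitch, GCC warns about case labels whose value is not an enumerator of the switched-on enum type, and -Werror=switch promotes that warning to an error. The enumerator values below are made up.)

enum ptlrpc_nrs_ctl {
        PTLRPC_NRS_CTL_START = 1,
        PTLRPC_NRS_CTL_STOP  = 2,
};

static int nrs_ctl(enum ptlrpc_nrs_ctl opc)
{
        switch (opc) {
        case PTLRPC_NRS_CTL_START:
                return 0;
        case 33: /* error: case value '33' not in enumerated type
                  * 'enum ptlrpc_nrs_ctl' [-Werror=switch] */
                return 1;
        default:
                return -1;
        }
}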

Comment by Alexey Shvetsov [ 07/May/13 ]

It seems http://review.whamcloud.com/6141 from LU-3179 should fix this error.

Comment by Nikitas Angelinas [ 07/May/13 ]

Great, thanks for finding that.

Comment by Liang Zhen (Inactive) [ 12/Sep/13 ]

Hmm, sorry - I was trying to close it because I didn't realise there are sub-tickets for this...
