[LU-398] NRS (Network Request Scheduler ) Created: 07/Jun/11 Updated: 07/Oct/16 Resolved: 07/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | Liang Zhen (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Sub-Tasks: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Bugzilla ID: | 13,634 | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 7604 | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
Architecture - Network Request Scheduler(NB: this is copy of http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler) Definitions
|
| Comments |
| Comment by Nathan Rutman [ 05/Jul/11 ] |
|
Note that Xyratex is also working on this issue (MRP-73). |
| Comment by Nathan Rutman [ 05/Jul/11 ] |
|
Xyratex NRS HLD |
| Comment by Liang Zhen (Inactive) [ 08/Jul/11 ] |
|
Hi, here is my prototype and it has full implementation of binheap and client-round-robin based on binheap. It's not a full tested patch. |
| Comment by Liang Zhen (Inactive) [ 08/Jul/11 ] |
|
I just tested the previous patch and found it has a bug, so I fixed it. |
| Comment by Nikitas Angelinas [ 11/Jul/11 ] |
|
Hi Liang, There's a minor typo in line 1242 in crr_req_compare(), the second req1 should be req2. |
| Comment by Liang Zhen (Inactive) [ 11/Jul/11 ] |
|
Nikitas, thanks! it will change the whole logic though it's just a typo! I'm doing some clean up now, after that will push it to a developing branch for your review(in one or two days), any comment and suggestion will be welcome. Also, I will review your DLD to see if I can merge ideas about framework into the prototype (I wouldn't be able to work on OBRR recently). Thanks again! |
| Comment by Nikitas Angelinas [ 12/Jul/11 ] |
|
Hi Liang, If you are willing to upload your code into a development repository, that sounds great. I have gone through some of the patch you posted, and I don't think there is anything essentially wrong with it; maybe some minor things that the patch could make use of, but as you said, it is meant to be a prototype. I will follow up with a more detailed discussion soon in the lustre-devel forum. I will get some time to factor in inspection comments for my dld, so should be able to upload a new version here soon. Thanks! |
| Comment by Liang Zhen (Inactive) [ 13/Jul/11 ] |
|
I've uploaded the patch to git://git.whamcloud.com/fs/lustre-dev.git branch name is liang/b_nrs link to it: |
| Comment by Nikitas Angelinas [ 14/Jul/11 ] |
|
Slightly more up-to-date version of our dld; I'll follow this up with a proper version once I factor in all inspection comments. |
| Comment by Liang Zhen (Inactive) [ 22/Jul/11 ] |
|
I will update the developing branch to make it consistent with the slides. Thanks |
| Comment by Nikitas Angelinas [ 23/Jul/11 ] |
|
Liang, please use the latest patch file I uploaded; it is the most recent version of our DLD. I will upload a new version if we make any changes on that. Thanks |
| Comment by Liang Zhen (Inactive) [ 02/Aug/11 ] |
|
I'm thinking that we probably can remove these whole bunch of things and put them all into NRS, we can make ptlrpc service to be cleaner and more readable by doing this, it's just a preliminary thinking and I just put a note at here so I wouldn't forget it in the future: include/lustre_net.h cfs_spinlock_t srv_rq_lock __cfs_cacheline_aligned;
/** # reqs in either of the queues below */
/** reqs waiting for service */
cfs_list_t srv_request_queue;
/** high priority queue */
cfs_list_t srv_request_hpq;
/** # incoming reqs */
int srv_n_queued_reqs;
/** # reqs being served */
int srv_n_active_reqs;
/** # HPreqs being served */
int srv_n_active_hpreq;
/** # hp requests handled */
int srv_hpreq_count;
|
| Comment by Liang Zhen (Inactive) [ 02/Aug/11 ] |
|
Nikitas, Liang |
| Comment by Nikitas Angelinas [ 02/Aug/11 ] |
|
Liang, That makes perfect sense to me. I suspect nrs heads may be the most appropriate place for replacements to those fields, although the spinlock may require a separate (common to hp and normal requests) structure, or perhaps to be left in ptlrpc_service? I suspect that some APIs could be moved into NRS; were there some you were thinking of in particular? Nikitas |
| Comment by Nikitas Angelinas [ 12/Aug/11 ] |
|
Hi Liang, I have gone through some of the updated code in the development repository. In general I am personally quite happy with what is there; I think the resource abstraction is more intuitive than the pair of target/object. I'm attaching some small patches; obviously please review before applying. I will be away on holiday for the rest of the month, so unfortunately not able to do much more work on this at present, but will complete the review as soon as I come back, hope that is ok. Thanks, |
| Comment by Liang Zhen (Inactive) [ 15/Aug/11 ] |
|
Nikitas, thanks, I will review & push these patc hes into the developing branch, btw, Robert told me you should be able to push code to the developing branch now. As I mentioned previously, the next thing I'm going to work on is replacing active RPC counters from ptlrpc/service.c with counters in RPC, which will made code cleaner. Liang |
| Comment by Liang Zhen (Inactive) [ 29/Aug/11 ] |
|
Nikitas, I've pushed in most of your patches except those LASSERT_SPIN_LOCKED():
Thanks |
| Comment by Nikitas Angelinas [ 03/Nov/11 ] |
|
Hi Liang, I've started working on an object-based policy; the plan is for it to act as a high-level elevator; I know Eric has said he is sceptical about this being the best approach for an object-based policy, and whilst of course I value his input, I think everyone is also in agreement that getting some performance data from such a policy (or any other policy) would be a good thing. I suspect the OSD-restructuring work will mean that at some point this policy will need some porting work (e.g. at least to transfer from objid to FID) but at the moment I am doing this against your b_nrs branch. Is there a branch or future release that is being targeted for landing the NRS code? In one of your last emails you mentioned there were some thoughts on landing the framework (I suspect with fifo policy only?) at around 2.2 or so; is this still the case? In general, how do you see landing of NRS happening? Will you make use of the current NRS development branch, or will you treat this as only a prototype in order to get performance data, and then rework the branch before considering landing? (sorry for all the questions) There are some things I wanted to have patched on the current branch, some of these are:
I'm more than happy to generate some patches for these features if you also agree they would be useful to have. Cheers, |
| Comment by Liang Zhen (Inactive) [ 04/Nov/11 ] |
|
Nikitas, I just merged master into b_nrs and I will continue to use the developing branch for future work, we are still considering to land framework into 2.2 but it really depends on our review/testing resource, I will let you know if there is any update on this. I think your plan is good, thanks for let me know. I'd like to know more details but I have to prepare for my trip tomorrow, probably I will ask you some detail questions about these plans when I get more time, Liang |
| Comment by Nikitas Angelinas [ 01/Mar/12 ] |
|
I ran acceptance-small testing at some point against a version of the NRS code that was rebased against a recent commit in the lustre-rel repository; some failures can be seen in the first output file attached, but they are all either known failures which are being worked on (to the point that resources permit) by other people within Xyratex and will be submitted when resolved, or have passed in the second round of tests (second attached output file, perhaps they had something to do with the testing environment used). |
| Comment by Nikitas Angelinas [ 01/Mar/12 ] |
|
I have hit a bug while using mdtest to carry out a performance regression test of vanilla code vs NRS with mainly FIFO policy, and also CRR/CRR2; I can only reproduce this using the CRR policy on the MDS: LustreError: 1607:0:(ptlrpc_nrs.c:222:nrs_policy_put_locked()) ASSERTION(policy->pol_ref > 0) failed I also hit another bug that may be related, but only once after reproducing the previous one many times, and can't reproduce this one. LustreError: 22518:0(ptlrpc_nrs.c:366:nrs_request_poll()) ASSERTION(nrs_request_policy(nrq) == policy) failed I'll see if I can fix these but I'm more focused on getting some larger-scale testing under way and finishing an ORR DLD at the moment. |
| Comment by Nikitas Angelinas [ 01/Mar/12 ] |
|
Hi Liang, We are in the process of running some NRS performance tests in-house, and maybe more importantly, putting together a test plan that we can execute in one of our beta testing sites; they have a large number of clients available, and we could potentially make use of a good number of these; I think we will have a better chance of making use of more of their resources if we can come up with some solid test cases that would demonstrate the usefulness of different NRS policies. Our current thinking is along the following lines:
If yourself or anybody else can think of any notable use cases that would be useful in demonstrating the effectiveness of the various policies, please let us know so we can include these in the test plan. I have not developed the NRS framework any further, so testing out multiple (layered) policies that you had suggested is unfortunately not possible with current code, although adding this should not be too time consuming and is probably worth doing if there is a good use case for it; TBRR on the OSS with ORR in each OST that you had suggested sounds good, although maybe it can benefit from the SMP scaling code landing first, as you had mentioned. Cheers, |
| Comment by Liang Zhen (Inactive) [ 02/Mar/12 ] |
|
Ni Nikitas, did you make any progress on those LBUGs? I didn't test this for a while, could be bug after merging with master? |
| Comment by Nikitas Angelinas [ 12/Mar/12 ] |
|
Hi Liang, I have not made any progress on the bugs, as i am focused on getting test results and working on ORR; i hope to find some time to hunt them down, as ideally i wouldn't want testing to hit upon them; if i remember correctly, i could also replicate the first assertion with the latest liang/b_nrs branch (i can't replicate the second assertion with any branch), but i have not tried to replicate it with older snapshots, so i guess it may have been introduced by a merge with master as you mention, possibly the latest one (btw, i think lustre-dev has not synced with rel for a while). I am attaching a patch that adds ORR with logical offset ordering, i will work on the physical offset ordering using fiemap calls hopefully within the next few days; if you can have a look at it that would be great. I have only ran a smoke test on the patch on a virtual machine, but will test it on a real setup today or tomorrow if nothing comes up; there is a minor issue of slab object deallocation for backend-fs object data that i haven't taken care of (currently they will just stay allocated until the cache is destroyed, easy way to do it is to stop the policy temporarily and then re-start it for another test run); i may add an LRU or something similar for this shortly. |
| Comment by Nikitas Angelinas [ 04/Apr/12 ] |
|
I am attaching v2 of the ORR patch which includes support for request ordering via their physical offsets on-disk, obtained via fiemap calls, and will also reorder write RPCs based on their logical offsets (I may add an lprocfs tunable to turn support for writes on and off). I'll try and find a case where this policy helps performance, although knowing the physical offsets, it may make sense to tweak the code slightly to produce a Target-based (OST-based) RR version, since it may be better to be able to mix read requests for different objects, based only on their physical offsets; i.e. requests belonging to the same objects may still be fairly distant. |
| Comment by Nikitas Angelinas [ 13/Apr/12 ] |
|
Hi, I am attaching a patch that also adds a TRR (Target-Based RR, i.e. RR over OSTs) policy that also performs logical or physical offset request ordering. It has been developed as an after-thought on top of the ORR policy, so plain TRR (without offset-based request ordering) will be able to be simplified, when used on a multi-policy NRS environment, e.g. combined TRR on the OSS + and ORR on each OST that Liang had proposed. The patch also adds tunable read/write support, and optimized physical offset calculations for the one extent per RPC case that is currently used. I am seeing positive results in some cases from these policies, and also CRR2, but will expand on this at LUG in about a week's time. |
| Comment by Andreas Dilger [ 30/Apr/12 ] |
|
Hi Nikitas, I'd also be happy if you could attach a copy of your LUG presentation here, so that the information about NRS is available in a more central location. This will also include the performance metrics, which is important to see when reviewing a major feature patch for landing. |
| Comment by Nikitas Angelinas [ 01/May/12 ] |
|
Hi Andreas, Ok, I'll do this soon; as you have guessed correctly, I haven't been uploading the patches to Gerrit as I was not clear on the landing plan for NRS. Since the framework was originally considered (and is in some respects imo) a prototype, we'll have to decide with Liang what updates we would have to make in order to submit the code for landing. But as you point out, it may be best for us to also submit the current code that we have to Gerrit. I'll post my presentation here and details on the testing, although I'm more looking forward to also test the policies we currently have at scale; I'll check the availability of our collaborator site that performed the previous tests, to schedule another round of testing. Liang, I'll come up with some suggestions on enhancing the framework soon; I think two things that are definitely required are the layered policies that you had mentioned (i.e. multiple policies operating in the same time), and an lprocfs rework. A much wanted feature that users seem to want from NRS is QoS. I'm currently looking at Cheers, |
| Comment by Nikitas Angelinas [ 09/May/12 ] |
|
I am attaching a copy of the NRS LUG 2012 presentation and a tarball containing output from the test runs contained in the presentation. The presentation depicts results from the following tests: 1. Performance regression test using the NRS FIFO policy vs a vanilla Lustre deployment; this includes IOR FPP and SSF testing, and mdtest file and directory operations testing, with 128 and 64 physical clients (and also 12 clients for mdtest); all tests were performed using a 1 process per client configuration. One thing to note is that the unexpectedly low mdtest performance in the create and unlink operations with 128 and 64 clients seems to be due to suboptimal RAID configurations on the systems used for the tests; hence it may be useful to repeat the mdtest runs on a sane configuration. These tests showed no noticeable performance regressions between vanilla code and NRS with the FIFO Policy. 2. A range of tests aiming to explore the effect that the CRR-N policy has on throughput. In these tests, groups of dd-processes are used to generate read and write loads between the Lustre filesystem and /dev/zero and /dev/null respectively. 10 filesystem clients take part in this test. These are either used to run 10 dd processes, or 9 of them are used to run 11 dd processes, while the remaining one client is used to run only one dd process; the read and write cases are handled separately, for both these two '#clients/#dd processes' configurations. In each test run, the throughput from each client and overall standard deviation is measured, in order to compare the behaviour of vanilla code vs NRS with the CRR-N policy. One thing to note is that these tests were not performed using widely striped files, so a repeat may be useful. These tests showed an evening-out effect and large standard deviation reduction of the write performance between clients when using the CRR-N policy, but these results were not witnessed on the read performance tests; this may be due to the synchronous nature of reads, although we probably need to examine the behaviour of CRR-N further before we can make a definite comment. 3. A range of tests, mostly using IOR, but also IOzone, for determining the effectiveness of the ORR and TRR policies in increasing read throughput. These tests have been performed on a small scale, using only 14 physical clients, and with a reduced number of ost_io threads (128 on each of two OSS nodes), since read operations generated very few RPCs, and we wanted to at least emulate a saturated server scenario, since the ORR and TRR policies were expected to show a benefit in cases where a number of the requests are 'pending', such that performance is improved by using these improved sorting algorithms (although this may not be strictly necessary). These test results showed that the TRR policy when used with physical offset ordering, seems to be the most promising configuration, since it provided a noticeable improvement in performance for IOR FPP tests using 1 process per client, and also for the IOzone test (again, 1 process per client); the ORR policy also provided an improvement in the performance of IOR FPP when used in an 8 processes per client configuration. However the worrying thing is that in all other cases, the test results showed a drop in performance when using these policies. I think further, and also larger scale testing of these policies is required at this point; this may give us a better indication of how widely applicable these policies could be, although for definite results it seems like it would be useful to obtain feedback from using them on real-world job runs. Their usefulness may even vary depending on the particular job/application. I'll see if we can run some tests using real jobs on our beta-testing collaborator site with these policies. Apart from test results, the presentation also includes some annotated debug prints from the OSS nodes, as proof that the aforementioned policies actually implement the sorting behaviour that they claim to. All tests have been performed using a single Xyratex CS3000 SSU; this includes 2 OSS nodes in HA mode (quad core Intel Xeon C5518 @ 1.73 GHz, 16GB DDR3, HT enabled), and 1 MDS (dual hex core Intel Xeon X5670 @ 2.93 GHz, 32GB DDR3, HT enabled), with servers and clients connected via an Infiniband QDR network. |
| Comment by Nikitas Angelinas [ 03/Jun/12 ] |
|
patch for master at http://review.whamcloud.com/3015 I made some changes mostly to the framework and added the patch to Gerrit. It still needs a few things, so I will be updating the patch in the following days, but it is in a position for reviews and builds to start. I was hoping that this feature could be considered for landing in 2.3, although the feature freeze date is meant to be quite close. Liang and Eric, this patch also adds your work on NRS and libcfs_heap; please let me know if you want to submit these separately. |
| Comment by Liang Zhen (Inactive) [ 06/Jun/12 ] |
|
Hi Nikitas,thanks for posting the patch for review, I definitely will review your patch, the problem is 2.3 code freeze is not far away and we still have a bunch of things pending on landing list, and there are some changes definitely conflict with this patch, so it's very likely that we can't land it to 2.3, but again, it's very helpful to submit this patch on Gerrit, because it will be much easier to discuss based on this. I will let you know my feedback soon. Thanks |
| Comment by Nikitas Angelinas [ 07/Jun/12 ] |
|
Hi Liang, please take your time and plan landings and reviews as you see best, no issue with me. I guess you may be referring to project Apus, since I guess the ptlrpc changes there will cause a conflict; I'm sure that's a good thing as i think the locking in ptlrpc could do with a rework with the NRS patch applied anyway. Thanks for letting me know of your status with 2.3. |
| Comment by Andreas Dilger [ 19/Oct/12 ] |
|
Nikitas, any status on your refactoring of the NRS patch into smaller components for inspection and landing? |
| Comment by Nikitas Angelinas [ 19/Oct/12 ] |
|
Sorry, i was away for a few days; I've been adding the flexible policy registration ability, and will upload most of the code broken into smaller patches around the start of next week. |
| Comment by Nikitas Angelinas [ 30/Oct/12 ] |
|
Abandoning previous Gerrit change and breaking down patch into smaller patches, each with its own change; first change is at http://review.whamcloud.com/#change,4411 |
| Comment by Nikitas Angelinas [ 30/Oct/12 ] |
|
libcfs heap Gerrit for master is at http://review.whamcloud.com/#change,4412 |
| Comment by Nathan Rutman [ 20/Nov/12 ] |
|
Xyratex MRP-73 |
| Comment by Nikitas Angelinas [ 05/Dec/12 ] |
|
attaching LAD '12 presentation slides |
| Comment by Nikitas Angelinas [ 29/Dec/12 ] |
|
additional Gerrit changes for master are, http://review.whamcloud.com/#change,4937 for the CRR-N policy and http://review.whamcloud.com/#change,4938 for the ORR and TRR policies |
| Comment by Nikitas Angelinas [ 22/Jan/13 ] |
|
I am attaching a first take on the NRS test plan, marked v1.0. I am uploading it in both a pdf format for easier reading and doc in case someone wants to make any changes. Please let me know of anything. |
| Comment by Nikitas Angelinas [ 23/Jan/13 ] |
|
I had put together a Conceptual Design document for NRS, but that really needs to be updated in order to be of much use. I will try to do this as soon as code update tasks are finished (so hoping to have it ready by the end of the weekend at the latest), and will then attach the document to this ticket. |
| Comment by Nikitas Angelinas [ 25/Jan/13 ] |
|
I am uploading an updated version of the test plan document, to account for changes in the lprocfs interface that were included in a refresh of the patches that took place today; I have deleted the old version to avoid any confusion. |
| Comment by Nikitas Angelinas [ 28/Jan/13 ] |
|
As requested by Isaac, I am attaching a Conceptual Design document, that reviewers of the patches may find useful. It is rather short and informal, but should include enough information to assist in getting familiar with the design concepts used. I will try to update this if people think it is too short, although the plan is to eventually embed the information into comment blocks in the code. |
| Comment by Nikitas Angelinas [ 29/Jan/13 ] |
|
I am attaching a new version of the test plan to reflect a minor lprocfs change (both regular and hp request handling is now enabled by default); I will remove the older version of the test plan to avoid confusion, please let me know if someone needs that for some reason. |
| Comment by Andreas Dilger [ 01/Feb/13 ] |
|
Nikitas, just to clarify - can you please submit a follow-up patch to address the issues with Isaac's comments on the http://review.whamcloud.com/4411 patch. |
| Comment by Nikitas Angelinas [ 01/Feb/13 ] |
|
Yes Andreas, I will submit that patch asap. |
| Comment by Andreas Dilger [ 01/Feb/13 ] |
|
Also, I now see that there is no test that is enabling NRS policies to verify that they currently work, and continue to work in the future. This is something that the patch inspectors should have caught. Please submit a sanityn.sh test that enables each of the available policies in turn, and then runs some kind of test load on the multiple mount points (e.g. iozone and racer and fsx for 60s or 600s depending on SLOW=no or SLOW=yes) so that there will be sufficient loads to give NRS a workout. |
| Comment by Oleg Drokin [ 11/Mar/13 ] |
|
Just to recapture discussions that are happening in the http://review.whamcloud.com/#change,5274 patch:
|
| Comment by Alexey Shvetsov [ 07/May/13 ] |
|
After merging http://review.whamcloud.com/4938 server want build with new compilers (tested gcc-4.7 and gcc-4.8) /var/tmp/portage/sys-cluster/lustre-9999/work/lustre-9999/lustre/ptlrpc/nrs_orr.c: In function ‘nrs_orr_ctl’: |
| Comment by Nikitas Angelinas [ 07/May/13 ] |
|
It seems like the compilers are catching a type check on the enum; I can upload a one/few-liner patch to fix this; it seems like the -Werror=switch is catching errors that might be meant to be caught by -Wswitch-enum. Or I might be reading a documentation for a previous version or something similar, please let me double-check. |
| Comment by Alexey Shvetsov [ 07/May/13 ] |
|
Seems http://review.whamcloud.com/6141 should fix this error from |
| Comment by Nikitas Angelinas [ 07/May/13 ] |
|
Great, thanks for finding that. |
| Comment by Liang Zhen (Inactive) [ 12/Sep/13 ] |
|
hmm, sorry I was trying to close it because I didn't realise there are sub tickets for this... |