[LU-6179] Lock ahead - Request extent locks from userspace Created: 29/Jan/15 Updated: 28/Apr/21 Resolved: 21/Sep/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | bgti, patch | ||
| Attachments: |
|
| Issue Links: |
|
| Rank (Obsolete): | 17290 | ||||||||||||||||||||||||||||||||||||||||
| Description |
|
At the recent developers conference, Jinshan proposed a different method of approaching the performance problems described in the related ticket. Instead of introducing a new type of LDLM lock matching, we'd like to make it possible for user space to explicitly request LDLM locks asynchronously from the IO. I've implemented a prototype version of the feature and will be uploading it for comments. I'll explain the state of the current version in a comment momentarily. |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 29/Jan/15 ] |
|
As suggested by Jinshan and Andreas, this shares the machinery currently used for glimpse locking. That machinery is renamed to cl_request_lock/cl_request_lock0. I am not sure about the naming of the functions, but couldn't think of anything better. Code has been lightly tested & basic functionality verified. LDLM extent locks can be requested from userspace with the additional ioctl, and they remain in the client lock cache after the request. Further question - Given that a new LDLM flag has been added, does some sort of compatibility flag need to be added? If so, where and how? |
| Comment by Gerrit Updater [ 29/Jan/15 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/13564 |
| Comment by Patrick Farrell (Inactive) [ 04/Feb/15 ] |
|
| Jinshan - Wondering about this: it would be nice to somehow tell user space whether or not their locks conflicted. It looks like the -EWOULDBLOCK is returned from the server back to the client, so the idea would be to catch that result in cl_glimpse_lock and then, in the lock ahead code, write it to a "result" field in each of the user-provided extents. Thoughts? |
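To make the idea above concrete, here is a minimal sketch of what such a per-extent "result" field could look like. The structure, field names, and helper below are purely illustrative, not the actual Lustre data structures.

```c
/* Hypothetical sketch of the per-extent "result" idea discussed above.
 * Names (lock_ahead_extent, count_granted) are illustrative only. */
#include <stdint.h>

struct lock_ahead_extent {
	uint64_t start;		/* first byte of the requested extent */
	uint64_t end;		/* last byte of the requested extent */
	int32_t  result;	/* filled in by the lock ahead code:
				 * 0 if the lock was granted,
				 * -EWOULDBLOCK if it conflicted */
};

/* Userspace walks the array after the lock ahead call returns. */
static int count_granted(const struct lock_ahead_extent *ext, int count)
{
	int granted = 0;

	for (int i = 0; i < count; i++)
		if (ext[i].result == 0)
			granted++;
	return granted;
}
```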
| Comment by Patrick Farrell (Inactive) [ 04/Feb/15 ] |
|
| Ah, mistake on my part. AGL locks do not have CEF_NONBLOCK set, so there is no need for any extra handling of -EWOULDBLOCK in cl_glimpse_lock. The only special handling is in ll_lock_ahead, which will continue on to the next extent if a particular lock request receives -EWOULDBLOCK. |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/15 ] |
|
A quick note to my previous: |
| Comment by Patrick Farrell (Inactive) [ 12/Feb/15 ] |
|
A question for anyone inclined to think about it... I'm having trouble with the fact that there's still a reference on locks that have not been granted yet. So if I try an unmount while an asynchronously requested DLM lock has not yet been granted, there's a dangling reference that prevents unmounting. Any thoughts? There's a deadlock on the server that I'm trying to sort out that caused me to see this, but resolving that doesn't solve the fundamental problem of unmounting with an as-yet-incomplete async lock request. Do AGL locks have some way to handle this I've missed or broken? |
| Comment by Patrick Farrell (Inactive) [ 19/Jun/15 ] |
|
| It's been a while here, but the code is ready. The current patch set at http://review.whamcloud.com/#/c/13564/ is ready for review. I've fixed all of the bugs I'm currently aware of. It'd still be great to try to land this for 2.8, so I'd like to ask for review soon if possible. |
| Comment by Jinshan Xiong (Inactive) [ 19/Jun/15 ] |
|
| I will take a look at this patch. Thanks. |
| Comment by Andreas Dilger [ 28/Sep/15 ] |
|
| Patrick, what do you think about using the llapi_ladvise() interface from http://review.whamcloud.com/10029 for this? Having a single ladvise interface (WILLWRITE, WILLNEED, or maybe rename/alias this to WILLREAD; RANDOM) would avoid complexity for application writers. Thoughts? |
| Comment by Patrick Farrell (Inactive) [ 30/Oct/15 ] |
|
| Yeah. Sorry for the delay in getting back to you... I've got a better write up of this that I left elsewhere, which I'll put in Gerrit later today as a reply to you and Vitaly. Briefly, that only gets us one of the two benefits of lock ahead: it avoids the lock exchange situation, where clients take turns holding a large lock on the file. The total RPC traffic is the same, since the clients still have to ask for the locks, but with lock ahead a client can ask for many locks fairly quickly and overlap those requests, so when a write request actually comes in from user space, the client already has the relevant lock and can begin to write immediately. Not having that - which would be the case if we just used ladvise to inform the servers of our intentions - slows things down significantly for an individual client. In that case, it might be possible to overcome that by adding more clients, since we're no longer undergoing lock exchange. I still like the explicit lock ahead interface because it's more efficient, but I am trying some benchmarking at Cray to see if something like this - which I can imitate fairly well just by using request-only locking without lock ahead - is workable. (I did benchmark it last year and got very poor results, but I'm now wondering if there wasn't something wrong with that testing, since the results were much worse than I can easily explain.) |
| Comment by Cory Spitz [ 03/Dec/15 ] |
|
We should open an LUDOC ticket to track doc updates for this policy. |
| Comment by Andreas Dilger [ 04/Dec/15 ] |
|
| Your comments about the benefits of lockahead vs ladvise lead me to think that we may be talking past each other, since in my mind the end results of the two interfaces will be the same. I'm not suggesting that the use of ladvise is intended to change the semantics of lockahead to only advise the OSS of our intended file access pattern. Rather, the LADVISE_WILLWRITE (or maybe LADVISE_WILLWRLOCK?) advice could prefetch one or more LCK_PW locks to the client for each specified extent, like your lockahead implementation does. The main difference would be that the userspace API for "lockahead" would change to be "ladvise", so that it is consistent with other types of advice that an application might give to the filesystem that isn't available via existing kernel APIs. Since ladvise is a Lustre-specific interface we can specify the advice and semantics as we see fit, so that higher-level libraries can better convey their intentions to Lustre. My main goal in suggesting this is to avoid having multiple different ways of passing higher-level file access advice from userspace to the filesystem, and instead give application/library developers a single interface that they can use for different IO patterns (both read and write; sequential, strided and random; synchronous and asynchronous), possibly at the same time on different parts of the same file. As I wrote previously, I'm not dead set on this path, I just wanted to make sure that we are talking about the same thing before we disagree on the solution. |
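For illustration, here is a rough sketch of how a lockahead-style advice could be issued through llapi_ladvise() for a strided write pattern, as discussed above. The advice constant, lock mode value, flag, and per-advice field names are assumptions based on this discussion, not a definitive description of the final API.

```c
/* Rough sketch (assumed names, not the final API) of requesting write
 * locks ahead of a strided I/O pattern via an ladvise-style interface. */
#include <lustre/lustreapi.h>

#define NUM_EXTENTS	16
#define STRIDE		(16ULL << 20)	/* 16 MiB between extents */
#define LENGTH		(1ULL << 20)	/* 1 MiB per extent */

static int request_write_locks(int fd)
{
	struct llapi_lu_ladvise advise[NUM_EXTENTS] = {};
	int i;

	for (i = 0; i < NUM_EXTENTS; i++) {
		advise[i].lla_advice = LU_LADVISE_LOCKAHEAD;	/* assumed advice constant */
		advise[i].lla_lockahead_mode = MODE_WRITE_USER;	/* assumed: maps to LCK_PW */
		advise[i].lla_peradvice_flags = LF_ASYNC;	/* don't block the caller */
		advise[i].lla_start = i * STRIDE;
		advise[i].lla_end = i * STRIDE + LENGTH - 1;
	}

	/* One call submits all extents; per-extent status would come back
	 * in a result field (0 or a negative errno) as discussed earlier. */
	return llapi_ladvise(fd, 0, NUM_EXTENTS, advise);
}
```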
| Comment by Patrick Farrell (Inactive) [ 18/Dec/15 ] |
|
| OK, I think I perhaps follow you better now. I can see fitting lock ahead into the ladvise model; much of what's needed is there. However, it would require some notable changes to ladvise. It currently sends a special type of request over to the OST, rather than making a lock request. Would we split ladvise at a low level to make lock requests (when doing lock ahead or similar) instead of using the OST_LADVISE requests as implemented in osc_ladvise_base? I don't see, at that level, how I could adapt the ladvise infrastructure to requesting locks without just splitting it. There's still a lot of value in having just one interface, but I like it a lot less if I'm only partly using ladvise to make the requests. So I'm still on the fence, partly depending on your thoughts on the above. |
| Comment by Patrick Farrell (Inactive) [ 18/Dec/15 ] |
|
Ah, and one more thought, more pragmatically: ladvise is not landed yet. |
| Comment by Andreas Dilger [ 23/Dec/15 ] |
|
My interest in using the ladvise API is to be consistent in userspace for LADVISE_WILLREAD and LADVISE_WILLWRITE. I'm a bit unhappy that the RPC transport would be different between the two, but I don't think that is fatal. I'd rather use the existing LDLM infrastructure for lock-ahead, instead of shoe-horning it into OST_LADVISE. I expect that the ladvise code will be landed early on with 2.9 so I don't think that would be an obstacle. |
| Comment by Patrick Farrell (Inactive) [ 23/Dec/15 ] |
|
All right. I'm certainly game for rebasing on ladvise, though that will come a bit later as I'll have to adapt ladvise slightly to fit the new functionality. I'm not sure I understand exactly what you're getting at for LADVISE_WILLREAD and LADVISE_WILLWRITE (though I agree with the larger goal of having only one interface as much as possible). Are you talking about that just in terms of how the lock mode is specified? |
| Comment by Andreas Dilger [ 24/Dec/15 ] |
|
I'm thinking lockahead would mostly be related to WILLWRITE. The current WILLREAD support for ladvise is prefetching the data into cache on the OSS side, and I don't know if that makes sense to integrate into lockahead or not? Since the DLM locking for reads isn't going to conflict like it does for writes, the only benefit I can see is that it makes sense to cancel potentially conflicting write locks on the region WILLREAD is covering, unless the lock holder is the requesting node. |
| Comment by Vitaly Fertman [ 25/Dec/15 ] |
|
| AFAICS, this was discussed in the ticket referenced above. Having said that, lock ahead is relevant to fadvise and it seems it should stay separate from ladvise, until we want to re-implement fadvise functionality through ladvise as well; otherwise, the purpose of ladvise becomes very confusing. However, that does not mean you cannot create 2 separate lfs commands sharing the code and the same ioctl between them. |
| Comment by Andreas Dilger [ 26/Dec/15 ] |
|
I don't have any confidence that the upstream fadvise() call will be changed in a way that is useful to Lustre any time soon. Using ladvise for lockahead makes sense to me, since it is essentially telling the server that the client will be writing the given ranges in the near future, and to optimize this the result is to pass write locks to the client if available. Since the lockahead code is advisory, it isn't actually requiring the locks to be sent (e.g. if conflicting with other clients holding the lock) so it fits the current API reasonably well. |
| Comment by Patrick Farrell (Inactive) [ 14/Jun/16 ] |
|
| Andreas made an excellent comment in the review (which I can't reply to there because I'm in the middle of replying to the other comments): "Do you think "lockahead" is the right name for this advice, or "willwrite"? Not necessarily a request to change it, but an open question. Note: I'd prefer "lockahead" over "lock_ahead" since the other advices are also a single word." That's a good point. Lock ahead isn't a very good name for the feature; it's just a name for a particular use of the feature. I'm leery about naming it WILLWRITE, though. It's possible to request READ locks via this mechanism (yeah, there's no obvious application, but it's possible). I also thought it could be used for requesting group locks, potentially. How about "lockrequest/LU_LADVISE_LOCKREQUEST"? I will similarly rename the various other bits. |
| Comment by Patrick Farrell (Inactive) [ 14/Jun/16 ] |
|
Note that perhaps a request to fallocate could be included as a flag which would generate an actual ladvise RPC, as opposed to the lock request RPC generated by lockahead. |
| Comment by Patrick Farrell (Inactive) [ 24/Feb/17 ] |
|
The attached files are instructions for building, installing, and testing MPICH with lockahead, and two things you need for that process - A fully built and up-to-date package of autotools binaries, and the patch to ANL MPICH which enables lockahead via ladvise. (This patch also includes support for the older ioctl based interface.) The patch is NOT the final version that will be submitted to ANL (it needs cleaning up), but it has the same functionality, and does work correctly. |
| Comment by Andreas Dilger [ 29/Mar/17 ] |
|
| Cong, would it be possible for you to test the attached MPICH patch to see what kind of performance improvements you can get? This would need to use a version of Lustre built with patch https://review.whamcloud.com/13564. Patrick, one thing I note is that lockahead is only enabled when explicitly requested in the ADIO hints. That potentially limits the value of this feature to a very small number of users that know about setting the lockahead hints, rather than helping many users by default. |
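As a sketch of what "explicitly requested in the ADIO hints" could look like from application code: the hint key below is hypothetical (the real key name depends on the MPICH patch attached here); romio_cb_write is the standard ROMIO collective-buffering hint.

```c
/* Illustrative only: opting in to lockahead via an MPI-IO info hint.
 * "lustre_lock_ahead" is a hypothetical key name. */
#include <mpi.h>

static int open_with_lockahead(MPI_Comm comm, const char *path, MPI_File *fh)
{
	MPI_Info info;
	int rc;

	MPI_Info_create(&info);
	MPI_Info_set(info, "lustre_lock_ahead", "enable");	/* hypothetical key */
	MPI_Info_set(info, "romio_cb_write", "enable");		/* collective buffering */

	rc = MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
			   info, fh);
	MPI_Info_free(&info);
	return rc;
}
```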
| Comment by Patrick Farrell (Inactive) [ 29/Mar/17 ] |
|
| Andreas, I'll float the idea of turning it on by default, but I'm pretty sure the Cray libraries people will feel strongly that it should be limited to use with hints, particularly because in some of the simplest cases (single writer per stripe) it will reduce performance. We could try to bake in enough smarts to avoid those cases, but at least so far, the preference of the libraries people has been to keep it with the hints. About the Lustre contrib patch: from looking at it as it is in the source currently, I believe that patch has been integrated into MPICH. It's probably time to remove the existing one, but it would be reasonable to add a new one enabling lock ahead. I'll see about getting an updated version from our library developers. Unless it comes in right away, I will probably make that a separate commit from the current one. |
| Comment by Andreas Dilger [ 29/Mar/17 ] |
|
Patrick, sorry I wasn't meaning that you should do anything with the old patches in lustre/contrib, rather that you should add the new lockahead MPICH patch there. It would be great if you could delete the old MPICH patches at the same time, since they are long obsolete. It looks like the lustre/contrib/README file also needs an update. |
| Comment by Cong Xu (Inactive) [ 29/Mar/17 ] |
|
Hi Andreas, Sure, I will run the test. Cong |
| Comment by Patrick Farrell (Inactive) [ 13/Apr/17 ] |
|
@JamesNunez - The attached are the Cray produced documents on the lockahead specific testing we did internally. It's also seen some testing with applications, load testing, and of course the sanity tests. |
| Comment by Patrick Farrell (Inactive) [ 15/May/17 ] |
|
| Cong, I'm looking at your test results, and since the two ways of running gave almost identical results, I think we've got a problem, possibly a bottleneck somewhere else. (There could be a bug on the MPICH or Lustre side as well causing lockahead not to activate, but I did test both, so we'll assume not for the moment.) First: what happens if you try just 4 processes and 4 aggregators, no lockahead? What do the results look like? That should avoid lock contention entirely and give better results, but I bet we're still going to see that same 2.6 GB/s final number. What does 1 aggregator do with a 1-stripe file? What about 2 aggregators with a 1-stripe file, with and without lockahead? And what about what should probably be the maximum performance case, 8-process FPP without collective I/O? |
| Comment by Cong Xu (Inactive) [ 30/May/17 ] |
|
| Hi Patrick, Thanks for the comments! We have conducted 3 tests: a perfect scenario, varying the number of processes, and varying the Lustre stripe size. “LockAheadResults.docx” documents the details. [Test 1] In the perfect scenario, we launch 4 processes on 4 Lustre clients (1 process per client), accessing 4 Lustre OSTs remotely. Both the Original and Lock Ahead cases deliver 2700MB/s bandwidth. This is the maximum bandwidth of our Lustre file system. (Section 2.1 Perfect Scenario (Independent I/O)) [Test 2] To create a scenario in which the Lock Ahead code should deliver superior performance to the original code, we launch up to 512 processes to perform independent I/O to our Lustre file system. The bandwidth of both the Original and Lock Ahead cases is 2000MB/s. (Section 2.2 Vary number of processes (Independent I/O)) [Test 3] We also investigated the effects of various Lustre stripe sizes on the I/O performance. We keep the IOR transfer size constant (4MB) and increase the Lustre stripe size from 1MB to 64MB; both the Original and Lock Ahead cases deliver 2000MB/s bandwidth. (Section 2.3 Vary Lustre Stripe Size (Independent I/O)) |
| Comment by Patrick Farrell (Inactive) [ 30/May/17 ] |
|
| If you're getting the maximum bandwidth already without lockahead, then it's definitely not going to help. I don't completely follow your description of the patterns, but that's OK. Can we try simplifying? Let's try a 1-stripe file with 1 process from one node, then the same thing with 2 processes from 2 nodes. What do the bandwidth numbers look like? If we've got everything set up right and the OSTs are fast enough for this to matter (I think they may not be), then the second case should be slower than the first (and lockahead should help). But it looks like each OST is capable of ~600-700 MB/s, and that may not be enough to show this, depending on network latency, etc. I would expect to see the effect, but it might not show up. We make use of this primarily on much faster OSTs (3-6 GB/s, for example), so if it doesn't show up, maybe you could try RAM backed OSTs? Thanks! |
| Comment by Andreas Dilger [ 30/May/17 ] |
|
Cong, the lockahead code will only show a benefit if there is a single shared file with all threads writing to that file. Otherwise, Lustre will grant a single whole-file lock to each client at first write, and there is no lock contention. |
| Comment by Cong Xu (Inactive) [ 30/May/17 ] |
|
| Hi Andreas, Yes. In my second test, I launch 512 processes on 8 Lustre clients (64 processes/client) to write a single shared file, so there should be lock contention in Lustre. |
| Comment by Patrick Farrell (Inactive) [ 30/May/17 ] |
|
Cong, Yes, that's true, but with that many processes and so few (and relatively slow) OSTs, you may not see any difference. For example, in this case, your OSTs are (best case) capable of 2700 MB/s total. That means each process only needs to provide 42 MB/s of that, and each node only ~340 MB/s. That means per OST, each node only needs to provide ~85 MB/s. That's not much, so I'm not surprised lockahead isn't giving any benefit. Lockahead is really for situations where a single OST is faster than one client can write to it. One process on one client can generally write at between 1-2 GB/s, depending on various network and CPU properties. So these OSTs are quite slow for this testing. So, this testing is sensitive to scale and latency issues. Are you able to do the small tests I requested? They should shed some light. |
| Comment by Jinshan Xiong (Inactive) [ 30/May/17 ] |
|
Let's reduce the number of processes per client and see how it goes. For example, let's do 1 process per client with 8 clients, and then 2 processes per client, etc. |
| Comment by Cong Xu (Inactive) [ 30/May/17 ] |
|
| Hi Jinshan, Thanks for the suggestions! Yes, in our second test (Section 2.2 Vary number of processes (Independent I/O)), we went from 1 process per client with 8 clients (8 processes total) to 64 processes per client with 8 clients (512 processes total). Hi Patrick, Yes, we have tried the simple test as you suggested. Please have a look at the results in section 2.4: Simple Test (1 process and 2 processes accessing a single shared file on one OST). |
| Comment by Patrick Farrell (Inactive) [ 06/Jun/17 ] |
|
| Cong, Sorry to take a bit to get back to you. Given the numbers in section 2.4, you're barely seeing the problem, and lockahead does have some overhead, so I wouldn't necessarily expect it to help in that case. It would be much easier to see if you had faster OSTs, so I'd like to request RAM backed OSTs. It's also possible something is wrong with the library. While I think we'll need RAM backed OSTs (or at least much faster OSTs) to see benefit, we can explore this possibility as well. Let's take one of the very simple tests, like a 1 stripe file with 1 process per client on 2 clients. I assume you're creating the file fresh before the test, but if not, please remove it and re-create it right before the test. Then, let's look at the lock count before and after running IOR (add the -k option so the file isn't deleted, otherwise the locks will be cleaned up). Specifically, on one of the clients, check the lock count for the OST where the file is, before and after the test. If the file is not deleted and the lock count hasn't gone up, lock ahead didn't work for some reason. Again, I think we'll need RAM backed OSTs regardless... But this would be useful even without that. |
| Comment by Patrick Farrell (Inactive) [ 06/Jun/17 ] |
|
Slides and paper from Cray User Group 2017 attached. They contain real performance #s on real hardware, including from real applications. Just for reference in case anyone is curious. |
| Comment by Cong Xu (Inactive) [ 06/Jun/17 ] |
|
| Hi Patrick, Thanks for the great suggestions! We conducted more tests recently and were able to demonstrate the power of Lock Ahead in test "2.3 Vary Lustre Stripe Size (Independent I/O)". In this test, the transfer size of each process is configured to be 1MB and the stripe size grows from 256KB to 16MB. When the stripe size equals 16MB, 16 processes write a single stripe simultaneously, leading to lock contention. In this test, the Lock Ahead code performs 21.5% better than the original code. |
| Comment by Patrick Farrell (Inactive) [ 06/Jun/17 ] |
|
Huh, OK! That's a clever way to show it. A faster OST will show much larger benefits, of course. |
| Comment by Cong Xu (Inactive) [ 06/Jun/17 ] |
|
Hi Patrick, In the very simple test you suggested, the lock count does increase from 0 to 4010, so the lock ahead works well. |
| Comment by Mikhail Pershin [ 20/Jun/17 ] |
|
| Patrick, the new test 255c in sanity.sh reports the following:

== sanity test 255c: suite of ladvise lockahead tests ================================================ 04:54:04 (1495688044)
Starting test test10 at 1495688045
Finishing test test10 at 1495688045
Starting test test20 at 1495688045
cannot give advice: Invalid argument (22)
cannot give advice: Invalid argument (22)
cannot give advice: Invalid argument (22)
cannot give advice: Invalid argument (22)
cannot give advice: Invalid argument (22)
cannot give advice: Invalid argument (22)
Finishing test test20 at 1495688045

Is that expected, or is the test not working properly? This is from the latest test results: https://testing.hpdd.intel.com/sub_tests/38e1f10a-4129-11e7-91f4-5254006e85c2
|
| Comment by Patrick Farrell (Inactive) [ 20/Jun/17 ] |
|
| Oh. Hm. No - it's skipping some of the tests. Sorry about that, and thanks for pointing it out. Some development stuff I was doing escaped into what I pushed upstream... I'll fix that when I rebase to merge. |
| Comment by Gerrit Updater [ 21/Sep/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/13564/ |
| Comment by Peter Jones [ 21/Sep/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 01/Apr/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38109 |
| Comment by Gerrit Updater [ 06/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38109/ |
| Comment by Gerrit Updater [ 08/Apr/20 ] |
|
Patrick Farrell (farr0186@gmail.com) uploaded a new patch: https://review.whamcloud.com/38179 |
| Comment by Gerrit Updater [ 28/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38179/ |