[LU-6529] Server side lock limits to avoid unnecessary memory exhaustion Created: 27/Apr/15 Updated: 06/Mar/23 Resolved: 16/Sep/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
As seen in tickets like

We require the servers, the MDS in particular, to instead limit ldlm lock usage to something reasonable to avoid OOM conditions on their own. It is not good design to leave the MDS's memory usage entirely up to the very careful administrative limiting of ldlm lock usage limits across all of the client nodes.

Consider that some sites have many thousands of clients across many clusters where such careful balancing and coordinated client limits may be difficult to achieve. Consider also WAN usages, where some clients might not ever reside at the same organization as the servers. Consider also bugs in the client, again like

See also the attached graph showing MDS memory usage. Clearly the ldlm lock usage grows without bound, and other parts of the kernel memory usage are put under undue pressure. 70+ GiB of ldlm lock usage is not terribly reasonable for our setup.

Some might argue that the SLV code needs to be fixed, and I have no argument against pursuing that work. That could certainly be worked in some other ticket. But even if SLV is fixed, we still require enforcement of good memory usage on the server side. There will always be client bugs or misconfiguration on clients, and the server OOMing is not a reasonable response to those issues.

I would propose a configurable hard limit on the number of locks (or space used by locks) on the server side. I am open to other solutions, of course. |
| Comments |
| Comment by Peter Jones [ 28/Apr/15 ] |
|
Oleg, can you summarize our discussions on this? Thanks, Peter |
| Comment by Oleg Drokin [ 28/Apr/15 ] |
|
So, the idea of having a limit on the number of locks per client (or a bunch of clients or such) is a sane one. SLV initially had some goals in that direction. The real question here is how we go about enforcement. There's not too much choice we have here currently:

What are your ideas on this front, though? |
| Comment by Christopher Morrone [ 28/Apr/15 ] |
I did not say a per-client limit; that is not what we want. We want a global limit on the server, not a per-client one. |
| Comment by Oleg Drokin [ 28/Apr/15 ] |
|
Even with the per-server limit, we'd still need to somehow enforce it, returning us to the choices above. |
| Comment by Christopher Morrone [ 29/Apr/15 ] |
Ideally, I think we would send the equivalent of EAGAIN to any client that requests a lock while we are at the limit. The operation would only need to hang until the server has a chance to free up locks and then it would continue to run as normal. I don't understand why you think it would deadlock and require a reboot. That would indeed be a poor design.
We already evict clients that fail to respond to lock revocations, so no change in behavior is needed.
The behavior under lock load is pathological now. We are talking about the server managing its memory reasonably and avoiding a pathological OOM state. I'm not really clear on what new pathologies are worrying you.

At a high level I think that we want something like this:

1) There is a lock count limit (or memory-usage-by-locks limit) on the server. |
| Comment by Oleg Drokin [ 30/Apr/15 ] |
|
I see, so basically you want to delegate it to the server to revoke some locks instead of asking clients to free what they do not need somewhat voluntarily, pretty much equivalent to #3 on my list. Well, the pathologies I have in mind are not new but are much like what we observe now: clients hogging locks without releasing them when the server is in need (as instructed per SLV and such). We know this is broken now, but once servers start forcing lock revocations, we'll stop noticing it; yet it's actually more effective (or so I think) if the clients proactively cancel locks, because clients certainly have a much better idea about what's used and what's not. But of course, on the other hand, the server must still defend itself somehow from badly acting clients that don't release locks. |
| Comment by Andreas Dilger [ 13/May/15 ] |
|
Chris, do you already cancel the client's MDT DLM locks at the beginning of a job in the prologue script? Something like:

  lctl set_param ldlm.namespaces.*MDT*.lru_size=clear

That would free up some memory usage on the MDS by cancelling all the MDS locks used by the previous job, as a short-term workaround to avoid the need to monitor and restart the MDS so often while we are working on a patch to fix this issue. |
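For reference, a minimal sketch of such a prologue (not from the ticket): it assumes the script runs on each client node at job start and that dropping unused MDT locks is acceptable for the workload.

  #!/bin/sh
  # Hypothetical job-prologue sketch: drop cached (unused) MDT DLM locks left
  # over from the previous job on this client node.  The lctl command is the
  # one suggested in the comment above; the "|| true" is an illustrative
  # assumption so that a node without a Lustre mount does not fail the prologue.
  lctl set_param ldlm.namespaces.*MDT*.lru_size=clear || true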
| Comment by Christopher Morrone [ 13/May/15 ] |
|
Yes, thanks, we already do that where practical. In some places there is no such prologue script available, and we are using other workarounds. We would like to not have to use workarounds. |
| Comment by Andreas Dilger [ 14/May/15 ] |
|
Niu, I think there are a couple of potential approaches to fixing this. I think the ultimate goal is to fix the existing "lock volume pool" handling so that clients are asked to cancel locks from their LRU in a timely manner before the MDS is running out of memory. However, it also makes sense to implement a safety net, which will be easier to understand and to ensure correctness, faster to implement and port to older Lustre releases, and will still be useful in case the "lock volume pool" approach isn't working properly (as it is today).

I think the first and easiest thing to look at is whether it is possible to change ldlm_new_lock() to force an "old" lock to be freed (cancelled) if there are "too many" locks. This will ensure that when there are too many locks on the server, one lock will be cancelled for each new lock that is granted.

It is straightforward to add a /proc tunable to the server namespace (lock_count_max?) that defines the absolute maximum number of locks. This should default to something reasonable (maybe 1/8 or 1/10 of RAM based on roundup_pow_of_two(sizeof(struct ldlm_lock))) and should be a global percpu_counter so that we don't need to balance lock allocations across namespaces, while still avoiding CPU contention on a single counter.

The more difficult part is how to find "old" locks on the server. There is no ideal way to do this, because the server doesn't see which locks are in use by the client, but it can at least make a guess. One option is to grab an old resource from the namespace, and cancel one or more locks from the end of the lr_granted list, which should be roughly in LRU order. The hard part is finding the oldest resource in the namespace, since these are kept in a hash. While not necessarily the best way to find a lock, the cost to any single client should be relatively low, and not worse than any other kind of lock contention. There would be some overhead from locking and searching the namespace, but it is still better than OOM, and if several locks are cancelled at once (e.g. min(100 locks, locks on resource), excluding locks with l_last_activity or l_last_used within 30s[*]) the per-lock overhead can be reduced.

The other option is to find some way to pick random resources from the ns_hash, and then cancel the oldest half of the locks. This cannot start at the same place each time, or the same resources would be candidates for cancellation repeatedly.

[*] It looks like ldlm_refresh_waiting_lock() should also update l_last_activity, maybe by moving this update into __ldlm_add_waiting_lock so that l_last_activity is updated whenever a client sends an RPC to that object, so this code can make a better decision. Alternately, the server could also update l_last_used so that the lock scanning can have a better chance of finding an old lock.

Another minor improvement that is possible is to shrink the size of ldlm_resource, by filling in holes in the structure. The current size is 320 bytes, with 12 bytes of holes (use "pahole lustre/ptlrpc/ptlrpc.ko" from the "dwarves" RPM to see this). Moving lr_refcount after lr_lock would remove 8 bytes of holes. Removing the unused lr_most_restr field would save another 16 bytes. That would allow 13 ldlm_resource structures to fit into a 4096-byte page instead of only 12. Making lr_itree optional (e.g. allocating it from its own slab in ldlm_extent_add_lock() only for extent locks, and making the pointer to it a union with lr_lvb_inode, which is only used on the client) would reduce the size by 128 bytes, and allow 24 ldlm_resource structures per 4096-byte page, saving about 15GB on the MDS in the graph.

Shrinking ldlm_lock to fit more per slab would be a bit harder, but would be worthwhile because there are so many locks, and it would save many MB of space on the server. The ldlm_*callback pointers could be replaced by a single ldlm_callback_suite pointer, since these are always the same, saving 16 bytes. The l_exp_flock_hash and l_tree_node fields are used on different lock types, and are only used on the server. l_lru, l_conn_export, and l_lvb* are only used on the client, while l_export is only used on the server. There are also a number of other fields in ldlm_lock already listed as being client- or server-only. Putting these in a union could save a lot of space. This is relatively more complex to do and should be in a separate bug, and discussed with Alex.

It wouldn't hurt to try changing ldlm_lock_new() to use GFP_KERNEL for locks allocated on a server namespace. The same could be done for ldlm_resource_new() if it was passed the namespace as an argument. While this won't directly cause lock cancellations from the server, it will allow these threads to call slab shrinkers that may free up other memory and also put pressure on the LDLM slabs to free other client locks via the pool mechanism. I don't think this will cause problems, since lock allocations are not done under memory pressure (all dirty files should already hold a lock when trying to flush pages from memory), so even with a client and server on the same node there shouldn't be a risk of deadlocks. |
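As a quick way to reproduce the structure-layout numbers quoted above, pahole from the "dwarves" package can be pointed at the ptlrpc module; a sketch (the module path is the one from the comment and may differ per build):

  # Show the size and padding holes of the ldlm structures discussed above.
  # "-C" restricts pahole's output to the named structure.
  pahole -C ldlm_resource lustre/ptlrpc/ptlrpc.ko
  pahole -C ldlm_lock lustre/ptlrpc/ptlrpc.ko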
| Comment by Niu Yawei (Inactive) [ 14/May/15 ] |
|
Thank you for all these ideas, Andreas. I'll start with these safety approaches. Besides them, I think we may consider adding a "maximum lock age" to the LRUR policy, meaning that if a lock has not been used for a long enough time, it should be cancelled regardless of whether its CLV is greater than the SLV. That guarantees the client can always cancel some unused locks in a timely manner until the SLV is fixed, and I think it makes sense even after the SLV is fixed. |
| Comment by Christopher Morrone [ 14/May/15 ] |
|
Lots of good ideas here. I would like to emphasize that we want the lock limit work done before the other improvements like shrinking ldlm_lock. I think the shrinking of ldlm_lock is worthy of a separate jira ticket. For the proc tunable, I think we really want that to be a size (bytes) rather than a lock count limit. Converting a size limit into a count limit is easy for code, but is difficult for humans (system administrators). I do agree that we'll need to free more than one lock at a time when the limit is hit. |
| Comment by Christopher Morrone [ 14/May/15 ] |
|
I had a conversation with Brian here, and I want to share some thoughts from that:

First, perhaps we just need to add an LRU for ldlm locks. Ultimately that makes the lock reclaim decisions much easier than searching the hash. You have already found ways to more than offset the size increase in the ldlm_lock structure to account for LRU pointers.

Next, just to take a step back and state the obvious, our goal with Lustre memory management is not really to have lots of individual hard limits. We'll never get the balance of a number of fixed limits correct all the time, and we'll either have under-utilized memory or still have OOMs. The bigger issue is that Lustre's memory use does not play well with the standard Linux memory reclaim mechanisms. Ultimately, we really want to be able to just set Linux's min_free_kbytes to something reasonable and not have the system OOM due to the inability of the kernel to rein in its own memory use.

Since we do not currently have a way to identify ldlm locks to free under load, we must not be tying in to the kernel reclaim very well. Partly that is probably due to the fact that revoking ldlm locks requires sending RPCs. We should not do any blocking operations in the direct memory reclaim path, so it is obvious why we don't reclaim locks there. But in indirect memory reclaim, such as kswapd, it is much more reasonable to go the extra mile to reclaim locks.

So if we do the work to identify which locks can be freed, we should probably go the extra step to get that tied into the kernel indirect reclaim path. That might even obviate the need for an explicit hard lock limit.

And then there are probably lots of other areas in Lustre where similar lock reclaim needs to be enabled to work well with the normal kernel mechanisms. But locks are still a really good place to start since they are obviously a large issue in production for us. |
| Comment by Niu Yawei (Inactive) [ 15/May/15 ] |
To my understanding, the only way to connect with the kernel (in)direct reclaim mechanism is a registered slab shrinker: when the kernel is under memory pressure (or kswapd wants to reclaim memory), the shrinker is called to ask the slab user to free objects. The shrinker interface doesn't fit ldlm server locks very well, because the kernel assumes objects are freed in the shrinker and the number of freed objects is returned by the shrinker; however, a server lock shrinker can only ask clients to cancel locks (or revoke locks), and server locks won't be freed instantly unless we wait in the shrinker until everything is done. (I think it's impractical to block the shrinker for that long, even in the indirect reclaim path; imagine if some clients failed to respond and it had to wait for those clients to be evicted.) So the best we can do in the shrinker is to ask clients to cancel locks (or choose some locks and revoke them).

Lustre does have a shrinker for ldlm server locks; it just decreases the SLV and hopes clients start to cancel locks when they become aware of the decrease. This scheme doesn't seem to work well so far, because:

I think we may consider decoupling the shrinker from the SLV to make sure the shrinker works as expected. There are two ways to reconstruct the shrinker:

1. Inform clients to cancel certain locks: we'd inform clients that the server wants some locks to be reclaimed immediately, and a client should choose and cancel a certain amount of unused locks once it gets this notification (no matter what the SLV/CLV is). The advantage of this method is that the client always knows better than the server which locks are unused, and the lock-choosing work can be offloaded to clients. The problem is how to inform all clients effectively and instantly rather than relying on PING. An idea in my mind is to use a global ldlm lock like quota does (each client enqueues a global lock, and the server informs clients by glimpse callback). Actually, this global lock could be a generic tunnel (server -> client) for all kinds of other purposes, but I'm not sure how much complexity it would bring us.

2. The server identifies unused locks and sends blocking ASTs for them directly. This way looks like it requires fewer code changes.

Chris, I think this is the "extra step to get that tied into the kernel indirect reclaim path" you mentioned? Am I understanding right? |
| Comment by Andreas Dilger [ 15/May/15 ] |
|
Chris, I agree that adding yet another tunable isn't ideal, but that is what you asked for ("I would propose a configurable hard limit on the number of locks (or space used by locks) on the server side.") and I agree this is a reasonable short-term solution. The long term solution you propose (better integration with the kernel slab shrinkers) is exactly what the LDLM pool mechanism is supposed to do, but clearly it isn't working. Whether that is a core LDLM pool design/implementation problem, or something that has crept in over several generations of kernel slab shrinker interfaces, isn't yet clear. Having a server-side LRU would potentially introduce lock contention, but maybe a per-CPU LRU would be enough, and easily implemented.

Niu, I fully agree that having a client-side max lock age is very useful, and there already is such a thing - lru_max_age. One problem is that the default age is 10h, which is much too long. It should be at most 65min - just long enough to keep files around for checkpoints every hour. If that age is not being used in the LRUR policy, then it should be. |
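For illustration, a hedged sketch of inspecting and lowering lru_max_age on a client (the parameter exists as described above, but its units differ between Lustre releases, so check the current value before setting it):

  # Show the current maximum lock age for each namespace on this client.
  lctl get_param ldlm.namespaces.*.lru_max_age
  # Lower it to roughly 65 minutes.  The value below assumes milliseconds;
  # adjust it to whatever units the get_param output above implies.
  lctl set_param ldlm.namespaces.*.lru_max_age=3900000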
| Comment by Andreas Dilger [ 15/May/15 ] |
|
Niu, as for the LDLM pool notification, I think a DLM blocking callback on some "NULL" lock (e.g. resource 0,0,0,0) with a count and/or LV might be easily implemented as a mechanism to indicate to a client to "cancel some lock(s)", but this would need a protocol change. |
| Comment by Christopher Morrone [ 15/May/15 ] |
|
Yes, I was thinking of identifying least recently used locks on the server and revoking the locks directly under the indirect reclaim path. If the lock revocations are too heavyweight to even do in the indirect reclaim path, then we can certainly have an ldlm cleanup process that is woken up by the direct and/or indirect reclaim paths to do the heavy lifting.

I am not sure that the SLV concept can ever really work if it only adjusts SLV as a result of calls from the shrinker. By the time that the kernel knows that it needs to reclaim memory, it is too late to impose soft, voluntary lock reductions on the clients. When the kernel needs memory we need to take the most direct actions that we can to free the memory as soon as possible. For SLV to work reasonably at all, it would likely need to be actively monitoring system memory usage and adjusting the value to constrain aggregate lock usage well in advance of the shrinker demanding memory. |
| Comment by Andreas Dilger [ 15/May/15 ] |
|
In the long term, I think the whole mechanism of calculating and using the LV from the server needs to be examined. It isn't clear that this starts lock cancellation early enough on the clients, or whether the pressure is there to actually cancel locks. It would be useful to monitor lock volume calculations over time and under MDS memory pressure to see if it is doing anything useful at all, and fix it if not. |
| Comment by Christopher Morrone [ 15/May/15 ] |
Yes, I agree. |
| Comment by Niu Yawei (Inactive) [ 18/May/15 ] |
I agree, 10h is too long, and right now it's only used for the AGED policy (when lru_resize is not enabled). I'll add it to LRUR.
Yes, I think we may need this in the long-term solution.
Totally agreed; that's why I want to decouple the SLV mechanism from the shrinker.
Yes, I think that was the design goal of SLV, but it looks like it didn't work as expected. Now it looks like the conclusions are:

Short-term solution:
Long-term solution:
Let's start with the short-term solution. |
| Comment by Gerrit Updater [ 19/May/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14856 |
| Comment by Niu Yawei (Inactive) [ 25/May/15 ] |
|
There will be two patches for the short-term solution. The first one has been posted, and I'm working on the second one; most of the coding is done, and I will push it soon. |
| Comment by Gerrit Updater [ 25/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14856/ |
| Comment by Gerrit Updater [ 26/May/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14931 |
| Comment by Gerrit Updater [ 19/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14931/ |
| Comment by Peter Jones [ 27/Jul/15 ] |
|
Landed for 2.8 |
| Comment by Christopher Morrone [ 28/Aug/15 ] |
|
I am reopening this ticket because I think there are a few things about the patch that really should be addressed before Lustre 2.8 is released.

First of all, the number that one writes to the watermark files is not necessarily the number that one reads back. This is going to be very confusing for users. The problem seems to be some kind of rounding error, where the code winds up rounding down to the closest multiple of 1 MiB. Especially confusing is to write "1M" to the file and then read back "0". "0" means that the threshold is disabled, so that is really not acceptable. Instead of doing multiple steps of math (including do_div) on the number supplied by the user, the number supplied by the user needs to be recorded unmolested and supplied back to the user when they read from the file.

Next, it was pointed out to me that "watermark_mb_low" is not a terribly good name. That implies that the number can't go any lower. Perhaps better names for these files would be something like "reclaim_threshold_mb" and "lock_limit_mb".
I think that would make it clearer to users that "reclaim_threshold_mb" is the point where reclaim kicks in, and "lock_limit_mb" is the limit beyond which no further locks will be permitted.

Next there is this comment, which I think is misleading:

For the case that we are talking about here in this ticket, I disagree rather strongly with the claim that letting the clients decide on the lock reclaim is "ideal". It would only be reasonable to let the clients decide when the server is not under memory pressure. This entire ticket is explicitly about dealing with the situation where the server is under memory pressure. Ideal would be to integrate with the kernel shrinker instead of having these thresholds. When kernel pressure is high enough, using additional memory to send messages to the clients and then waiting a potentially very long time for a response is simply not reasonable. The server will always need to be able to summarily discard locks. |
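To make the first issue concrete, a hedged illustration (the parameter name "ldlm.watermark_mb_low" is an assumption based on the file name discussed above; the landed patch may expose the tunable under a different path):

  # Illustration of the reported behavior only; parameter name is assumed.
  lctl set_param ldlm.watermark_mb_low=1M
  lctl get_param ldlm.watermark_mb_low   # reads back 0, i.e. "disabled"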
| Comment by Niu Yawei (Inactive) [ 28/Aug/15 ] |
|
I agree that the watermark numbers are confusing (the numbers written are not equal to the numbers read back); I'll change that, and I'm fine with changing the names as you suggested.

Maybe the wording of the comment isn't accurate. What I wanted to express is that choosing the locks to be reclaimed should be done by the client (because the client knows better which locks are not used), while the server of course should decide when to start lock reclaim (because the server knows its memory pressure). It's not ideal, but I think it's the best we can do so far. Anyway, given that it's misleading, I'm going to remove it.

I think that's the point of always having an upper threshold: it's the safeguard that keeps the server from entering the situation you mentioned (under severe memory pressure, without enough memory left to revoke locks). It may not be enough, and we can add other restrictions later if necessary. The lower threshold may be removed if we integrate with the kernel shrinker in the future (let the shrinker take over in non-urgent situations). Discarding locks means some clients have to be evicted, and I think that's the case we must try to avoid (by other means, such as setting the upper limit). |
| Comment by Gerrit Updater [ 28/Aug/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16123 |
| Comment by Christopher Morrone [ 28/Aug/15 ] |
|
Another bug: The proc files accept negative values. Negative values should be rejected. |
| Comment by Christopher Morrone [ 31/Aug/15 ] |
|
I am also disappointed that the patch passed review with so little in the way of function comments. Aren't function comments a landing requirement? |
| Comment by Christopher Morrone [ 31/Aug/15 ] |
|
Under Lustre 2.5.4 + local patches, we seem to be hitting the high lock limit prematurely, at least as far as we can tell from the number of ldlm_locks active on the slab. Is there some other way to get an idea of what the Lustre server thinks the current lock count is? |
| Comment by Niu Yawei (Inactive) [ 01/Sep/15 ] |
Well, it's the same as other Lustre proc files, which rely on some basic helper functions. I think it's worth a new ticket to fix this.
I'll try to add more comments in the next patch.
There is a counter for ldlm locks, but it's not exported; maybe I'll export it via proc for debugging purposes (together with the proc interface changes). In current Lustre, you can roughly get the number by adding up /proc/fs/lustre/ldlm/namespaces/$target/pool/granted for all of the MDT/OST targets on the server. |
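For example, a small sketch that totals those per-target counters on a server node (it simply sums the files named in the comment above; the exact set of namespaces depends on which targets the node serves):

  # Rough count of granted ldlm locks on this server, summed across targets.
  total=0
  for f in /proc/fs/lustre/ldlm/namespaces/*/pool/granted; do
      total=$((total + $(cat "$f")))
  done
  echo "granted locks: $total"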
| Comment by Gerrit Updater [ 16/Sep/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16123/ |
| Comment by Joseph Gmitter (Inactive) [ 16/Sep/15 ] |
|
All patches have landed for 2.8. |
| Comment by D. Marc Stearman (Inactive) [ 19/Feb/16 ] |
|
Chris, do we have this patch in our local release? |
| Comment by Christopher Morrone [ 19/Feb/16 ] |
|
Patches. Yes. |
| Comment by D. Marc Stearman (Inactive) [ 19/Feb/16 ] |
|
Thanks. I'll close the issue. |