
Server side lock limits to avoid unnecessary memory exhaustion

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      As seen in tickets like LU-5727, we currently rely almost entirely on the good aggregate behavior of the lustre clients to avoid memory exhaustion on the MDS (and other servers, no doubt).

      We need the servers, the MDS in particular, to instead limit ldlm lock usage to something reasonable on their own in order to avoid OOM conditions. It is not good design to leave the MDS's memory usage entirely up to very careful administrative tuning of ldlm lock limits across all of the client nodes.

      Consider that some sites have many thousands of clients across many clusters where such careful balancing and coordinated client limits may be difficult to achieve. Consider also WAN usages, where some clients might not ever reside at the same organization as the servers. Consider also bugs in the client, again like LU-5727.

      See also the attached graph showing MDS memory usage. Clearly the ldlm lock usage grows without bound, and other parts of the kernel memory usage are put under undue pressure. 70+ GiB of ldlm lock usage is not terribly reasonable for our setup.

      Some might argue that the SLV code needs to be fixed, and I have no argument against pursuing that work. That could certainly be worked on in some other ticket.

      But even if SLV is fixed, we still require enforcement of good memory usage on the server side. There will always be client bugs or misconfiguration on clients and the server OOMing is not a reasonable response to those issues.

      I would propose a configurable hard limit on the number of locks (or space used by locks) on the server side.

      I am open to other solutions, of course.
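
      As a rough sketch of the kind of check I have in mind (hypothetical names, plain userspace C11 for illustration, not actual ldlm code):

      #include <stdatomic.h>
      #include <stdbool.h>

      /* Hypothetical tunables/counters, not existing Lustre symbols. */
      static atomic_long lock_limit;      /* 0 == no limit; set via a proc tunable */
      static atomic_long granted_locks;   /* server-wide count of granted locks    */

      /* Called on the enqueue path before allocating a new lock.  Returns
       * false when the configured hard limit has been reached, so the
       * request can be rejected (or deferred) instead of consuming more
       * server memory. */
      static bool lock_limit_allows_new_lock(void)
      {
              long limit = atomic_load(&lock_limit);

              if (limit == 0)
                      return true;                    /* limit disabled */
              if (atomic_fetch_add(&granted_locks, 1) >= limit) {
                      atomic_fetch_sub(&granted_locks, 1);
                      return false;                   /* at the hard limit */
              }
              return true;
      }

      /* Called whenever a lock is cancelled or destroyed. */
      static void lock_limit_lock_released(void)
      {
              atomic_fetch_sub(&granted_locks, 1);
      }

      A limit on the space used by locks would be the same pattern with a byte counter instead of a lock counter.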

          Activity

            [LU-6529] Server side lock limits to avoid unnecessary memory exhaustion

            morrone Christopher Morrone (Inactive) added a comment - Patches. Yes.

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment - Chris, do we have this patch in our local release?

            jgmitter Joseph Gmitter (Inactive) added a comment - All patches have landed for 2.8.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16123/
            Subject: LU-6529 ldlm: improve proc interface of lock reclaim
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 33b55f223a42f20916bc417f7e5a21f68b59cd02

            gerrit Gerrit Updater added a comment

            > Another bug: The proc files accept negative values. Negative values should be rejected.

            Well, it's the same as other Lustre proc files, which rely on some basic helper functions. I think it's worth a new ticket to fix this.

            > I am also disappointed that the patch passed review with so little in the way of function comments. Aren't function comments a landing requirement?

            I'll try to add more comments in the next patch.

            > Under Lustre 2.5.4 + local patches, we seem to be hitting the high lock limit prematurely, at least as far as we can tell from the number of ldlm_locks active on the slab. Is there some other way to get an idea of what the lustre server thinks is the current lock count?

            There is a counter for ldlm locks, but it's not exported; maybe I'll export it via proc for debugging purposes (together with the proc interface changes).

            In current Lustre, you can roughly get the number by adding up /proc/fs/lustre/ldlm/namespaces/$target/pool/granted for all the MDT/OST targets on the server.
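
            For example, a quick userspace sketch along these lines (the glob pattern simply follows the proc path above; adjust it if your layout differs) would add those counters up:

            #include <glob.h>
            #include <stdio.h>

            int main(void)
            {
                    glob_t g;
                    unsigned long long total = 0;
                    size_t i;

                    /* One "granted" counter per ldlm namespace; on a pure server
                     * node these are the MDT/OST target namespaces. */
                    if (glob("/proc/fs/lustre/ldlm/namespaces/*/pool/granted",
                             0, NULL, &g) != 0) {
                            fprintf(stderr, "no ldlm namespaces found\n");
                            return 1;
                    }

                    for (i = 0; i < g.gl_pathc; i++) {
                            FILE *f = fopen(g.gl_pathv[i], "r");
                            unsigned long long granted = 0;

                            if (f == NULL)
                                    continue;
                            if (fscanf(f, "%llu", &granted) == 1)
                                    total += granted;
                            fclose(f);
                    }
                    globfree(&g);

                    printf("approximate granted lock count: %llu\n", total);
                    return 0;
            }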

            niu Niu Yawei (Inactive) added a comment - edited

            morrone Christopher Morrone (Inactive) added a comment - Under Lustre 2.5.4 + local patches, we seem to be hitting the high lock limit prematurely, at least as far as we can tell from the number of ldlm_locks active on the slab. Is there some other way to get an idea of what the lustre server thinks is the current lock count?

            morrone Christopher Morrone (Inactive) added a comment - I am also disappointed that the patch passed review with so little in the way of function comments. Aren't function comments a landing requirement?

            morrone Christopher Morrone (Inactive) added a comment - Another bug: The proc files accept negative values. Negative values should be rejected.

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16123
            Subject: LU-6529 ldlm: improve proc interface of lock reclaim
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cd15ce613f958e14fd8c8a01a97cdd67cb17e249

            gerrit Gerrit Updater added a comment

            I agree that the watermark numbers are confusing (the numbers written are not equal to the numbers read back); I'll change that, and I'm fine with changing the names as you suggested.

            > For the case that we are talking about here in this ticket, I disagree rather strongly with the claim that letting the clients decide on the lock reclaim is "ideal". It would only be reasonable to let the clients decide when the server is not under memory pressure. This entire ticket is explicitly about dealing with the situation where the server is under memory pressure.

            Maybe the wording of the comment isn't accurate. What I wanted to express is that choosing which locks to reclaim should be done by the client (because the client knows better which locks are not in use), while the server should of course decide when to start lock reclaim (because the server knows its memory pressure). It's not ideal, but I think it's the best we can do so far. Anyway, given that it's misleading, I'm going to remove it.

            > Ideal would be to integrate with the kernel shrinker instead of having these thresholds. When kernel pressure is high enough, using additional memory to send messages to the clients and then waiting a potentially very long time for a response is simply not reasonable. The server will always need to be able to summarily discard locks.

            I think that's exactly why we should always have an upper threshold: it is the safeguard that keeps the server from entering the situation you mentioned (under severe memory pressure, without enough memory left to revoke locks). It may not be enough, and we can add other restrictions later if necessary. The lower threshold may be removed if we integrate with the kernel shrinker in the future (letting the shrinker take over in non-urgent situations).

            Discarding locks means some clients have to be evicted, and I think that's the case we must try to avoid (by other means, for instance by setting the upper limit).

            niu Niu Yawei (Inactive) added a comment

            I am reopening this ticket, because I think there are a few things about the patch that really should be addressed before Lustre 2.8 is released.

            First of all, the number that one writes to the watermark files is not necessarily the number that one reads back. This is going to be very confusing for users. The problem seems to be some kind of rounding error, where the code winds up rounding down to the closest multiple of 1 MiB. Especially confusing is to write "1M" to the file and then read back "0". "0" means that the threshold is disabled, so that is really not acceptable.

            Instead of doing multiple steps of math (including do_div) on the number supplied by the user, the number supplied by the user needs to be recorded unmolested and supplied back to the user when they read from the file.
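
            In other words, something like the following pattern (a simplified userspace sketch, not the actual proc handler; the per-lock size constant is made up): keep the value exactly as written and derive anything internal from it on demand.

            #include <stdio.h>

            #define PER_LOCK_BYTES 512ULL            /* hypothetical, for illustration only */

            static unsigned long long watermark_bytes;   /* the user's value, verbatim */

            static void watermark_store(unsigned long long bytes)
            {
                    watermark_bytes = bytes;         /* no do_div, no rounding here */
            }

            static unsigned long long watermark_show(void)
            {
                    return watermark_bytes;          /* round-trips exactly */
            }

            /* Derived values are computed where they are needed instead of being
             * baked in at store time. */
            static unsigned long long watermark_lock_estimate(void)
            {
                    return watermark_bytes / PER_LOCK_BYTES;
            }

            int main(void)
            {
                    watermark_store(1ULL << 20);     /* user writes "1M" */
                    printf("read back: %llu bytes (~%llu locks)\n",
                           watermark_show(), watermark_lock_estimate());
                    return 0;
            }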

            Next, it was pointed out to me that "watermark_mb_low" is not a terribly good name. That implies that the number can't go any lower. Perhaps better names for these files would be something like:

            • lock_limit_mb
            • reclaim_threshold_mb

            I think that would make it clearer to users that "reclaim_threshold_mb" is the point where reclaim kicks in, and "lock_limit_mb" is the limit beyond which no further locks will be permitted.

            Next there is this comment, which I think is misleading:

            /*
             * FIXME:
             *
             * In current implementation, server identifies which locks should be
             * revoked by choosing locks from namespace/resource in a roundrobin
             * manner, which isn't optimal. The ideal way should be server notifies
             * clients to cancel locks voluntarily, because only client knows exactly
             * when the lock is last used.

            For the case that we are talking about here in this ticket, I disagree rather strongly with the claim that letting the clients decide on the lock reclaim is "ideal". It would only be reasonable to let the clients decide when the server is not under memory pressure. This entire ticket is explicitly about dealing with the situation where the server is under memory pressure.

            Ideal would be to integrate with the kernel shrinker instead of having these thresholds. When kernel pressure is high enough, using additional memory to send messages to the clients and then waiting a potentially very long time for a response is simply not reasonable. The server will always need to be able to summarily discard locks.
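
            For reference, the rough shape of that integration (schematic only; the ldlm_server_* helpers below are placeholders, not existing Lustre functions, and the count_objects/scan_objects interface is the one available in 3.12 and later kernels):

            #include <linux/shrinker.h>

            /* Stubs standing in for real server-side lock accounting and reclaim
             * logic; not actual Lustre symbols. */
            static unsigned long ldlm_server_lock_count(void)
            {
                    return 0;        /* would return the number of freeable locks */
            }

            static unsigned long ldlm_server_reclaim_locks(unsigned long nr)
            {
                    return 0;        /* would cancel up to nr locks, summarily if
                                      * clients cannot be waited on, and return
                                      * how many were actually freed */
            }

            static unsigned long ldlm_srv_count_objects(struct shrinker *s,
                                                        struct shrink_control *sc)
            {
                    return ldlm_server_lock_count();
            }

            static unsigned long ldlm_srv_scan_objects(struct shrinker *s,
                                                       struct shrink_control *sc)
            {
                    return ldlm_server_reclaim_locks(sc->nr_to_scan);
            }

            static struct shrinker ldlm_srv_shrinker = {
                    .count_objects = ldlm_srv_count_objects,
                    .scan_objects  = ldlm_srv_scan_objects,
                    .seeks         = DEFAULT_SEEKS,
            };

            /* register_shrinker(&ldlm_srv_shrinker) at server setup and
             * unregister_shrinker(&ldlm_srv_shrinker) at teardown. */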

            morrone Christopher Morrone (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 15
