[LU-6529] Server side lock limits to avoid unnecessary memory exhaustion - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.8.0
Affects Version/s: None
Labels:
- llnl

Severity: 3

Description

As seen in tickets like LU-5727, we currently rely almost entirely on the good aggregate behavior of the Lustre clients to avoid memory exhaustion on the MDS (and, no doubt, on other servers).

Instead, we require the servers, the MDS in particular, to limit ldlm lock usage to something reasonable on their own in order to avoid OOM conditions. It is not good design to leave the MDS's memory usage entirely dependent on careful administrative tuning of ldlm lock limits across all of the client nodes.

Consider that some sites have many thousands of clients across many clusters, where such careful balancing and coordinated client limits may be difficult to achieve. Consider also WAN usage, where some clients might never reside at the same organization as the servers. Consider also bugs in the client, again like LU-5727.

See also the attached graph showing MDS memory usage. Clearly the ldlm lock usage grows without bound, and other parts of the kernel memory usage are put under undue pressure. 70+ GiB of ldlm lock usage is not terribly reasonable for our setup.

Some might argue that the SLV (server lock volume) code needs to be fixed, and I have no argument against pursuing that work. That could certainly be handled in a separate ticket.

But even if SLV is fixed, we still require enforcement of good memory usage on the server side. There will always be client bugs or client misconfiguration, and the server OOMing is not a reasonable response to those issues.

I would propose a configurable hard limit on the number of locks (or space used by locks) on the server side.

I am open to other solutions, of course.
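
A minimal sketch of the kind of count-based hard limit proposed above, written as a standalone Python model and not as Lustre code. The class name `ServerLockBudget`, its fields, and its methods are all hypothetical names chosen for illustration; nothing here reflects an actual ldlm interface or the fix that was eventually merged.

```python
from dataclasses import dataclass


@dataclass
class ServerLockBudget:
    """Hypothetical server-side accounting of granted locks."""

    max_locks: int      # configurable hard limit (assumed to be an admin tunable)
    granted: int = 0    # locks currently granted across all clients

    def try_grant(self) -> bool:
        """Grant one lock only while the server is under its hard limit."""
        if self.granted >= self.max_locks:
            # At the limit: the caller would have to wait, retry,
            # or trigger reclaim of existing locks.
            return False
        self.granted += 1
        return True

    def release(self) -> None:
        """Account for one cancelled or released lock."""
        if self.granted > 0:
            self.granted -= 1


budget = ServerLockBudget(max_locks=2)
results = [budget.try_grant() for _ in range(3)]
print(results)             # third request is refused
budget.release()
print(budget.try_grant())  # capacity is available again after a release
```

The point of the model is that the decision is made on the server using only server-side state, so a misbehaving or misconfigured client cannot push lock memory past the configured bound. A space-based variant would track bytes instead of a lock count.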

Attachments

MemUsage.png (91 kB, 27/Apr/15 8:37 PM)

Issue Links

is related to

- LU-7266 Fix LDLM pool to make LRUR working properly (Open)
- LU-6390 lru_size on the OSC is not honored (Resolved)
- LU-6929 typo in cfs_hash_for_each_relax() (Resolved)
- LU-1520 client fails MDS connection and stack threads on another client (Resolved)
- LU-6775 Reduce memory footprint of ldlm_lock and ldlm_resource (Resolved)
- LU-11672 improving lru_max_age policy when lru resize is disabled (Resolved)
- LU-14221 Client hangs when using DoM with a fixed mdc lru_size (Closed)
- LU-11509 LDLM: replace client lock LRU with improved cache algorithm (Open)
- LU-14858 kernfs tree to dump/traverse ldlm lock resources for debug (Open)
- LU-14859 cancel client DLM locks from the server (Open)
- LU-14517 Decrease default lru_max_age value (Resolved)

is related to

- LU-5727 MDS OOMs with 2.5.3 clients and lru_size != 0 (Resolved)
- LU-8209 glimpse lock request does not engage ELC to drop unneeded locks (Resolved)
- LU-17428 reduce default value for lru_max_age to 300s (Resolved)

People

Assignee: Niu Yawei (Inactive)
Reporter: Christopher Morrone (Inactive)
Votes: 0
Watchers: 15

Dates

Created: 27/Apr/15 8:37 PM
Updated: 27/Aug/24 11:48 PM
Resolved: 16/Sep/15 2:29 PM