
LU-7266: Fix LDLM pool to make LRUR work properly


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical

    Description

      The current LDLM pool doesn't work as expected, which means the server can sometimes be overburdened by too many cached LDLM locks (observed in LU-6529). To fix the LDLM pool and make LRUR (LRU resize) work properly, I think the following issues need to be addressed:

      1. No hard limit for the server lock count;

      There are always exceptional conditions (all locks actively used on the client, client failure, network lag, etc.) that can leave a client unable to cancel locks in time, so a hard limit on the server lock count is crucial to make sure the server is never overburdened by LDLM locks. When the server lock count exceeds the hard limit, the server should reject any incoming lock enqueue request (letting the client retry on -EINPROGRESS) until the lock count shrinks back to a safe zone.

      This issue has been addressed by LU-6529; a minimal sketch of the idea follows.
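
      The sketch below assumes hypothetical names (granted_locks, lock_hard_limit, check_lock_limit); the real counters come from the LU-6529 patches:

      /*
       * Server-side hard limit on the granted lock count, in the
       * spirit of LU-6529.  All identifiers here are illustrative,
       * not the actual Lustre ones.
       */
      #include <errno.h>
      #include <stdatomic.h>

      static atomic_long granted_locks;        /* server-wide granted count */
      static long lock_hard_limit = 1L << 20;  /* tunable hard limit */

      /* Called for each incoming enqueue request; on -EINPROGRESS the
       * client is expected to retry once the count has shrunk back. */
      static int check_lock_limit(void)
      {
              if (atomic_load(&granted_locks) >= lock_hard_limit)
                      return -EINPROGRESS;
              atomic_fetch_add(&granted_locks, 1);
              return 0;
      }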

      2. Lock load can't be balanced between the namespaces on the same server;

      The current LDLM pool divides the memory quota equally among the namespaces (MDTs, OSTs) on the same host, which can result in a lot of memory being reserved by idle namespaces but never used.

      I think we could leverage the global lock counter introduced in LU-6529 to address this problem, for example along the lines of the sketch below.
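
      All names in this sketch are hypothetical; the point is that every namespace charges one server-wide counter instead of holding a static per-namespace share:

      /*
       * Balancing lock load across namespaces: instead of statically
       * giving each of the N namespaces limit/N locks, charge them all
       * against a single server-wide counter, so idle namespaces can't
       * strand quota that busy ones need.
       */
      #include <stdatomic.h>
      #include <stdbool.h>

      static atomic_long granted_locks;   /* shared by every namespace */
      static long server_lock_limit;      /* one server-wide limit */

      static bool ns_can_grant(void)
      {
              /* Busy namespaces (MDTs, OSTs) naturally take the share
               * that idle ones leave unused, since the check is global. */
              return atomic_load(&granted_locks) < server_lock_limit;
      }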

      3. Client needs to cancel locks more aggressively;

      According to the SLV/CLV formula, the server lock count won't decrease even after it has exceeded pool_limit (25% of total memory by default). A simulation program shows that the server lock count only starts decreasing after locks have consumed more than 32% of total memory, and that number was calculated under the assumption that all locks on the client are unused and the client can always cancel locks instantly; I expect the real-world number to be larger.
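
      For reference, a simplified model of the client-side LRUR cancel decision; what matters is the shape of the formula (lock volume grows with lock age and the unused-lock count, scaled by lock_volume_factor, and is compared against the SLV from the server), while the names and units here are illustrative:

      #include <stdbool.h>
      #include <stdint.h>

      static bool lrur_should_cancel(uint64_t slv, uint64_t lock_age,
                                     uint64_t unused_locks, uint64_t lvf)
      {
              /* A lock's volume grows as it ages and as the client
               * accumulates unused locks; a small SLV (server under
               * pressure) pushes the comparison toward cancelling. */
              uint64_t lock_volume = lvf * lock_age * unused_locks;

              return lock_volume >= slv;
      }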

      As a workaround, I think this problem could be mitigated by tweaking the LDLM pool parameters (such as pool_limit, lock_volume_factor, etc.).

      As a longer-term solution, I think we should get rid of the complexity of SLV recalculation and instead notify the client directly with an estimated CLV when the server becomes aware of memory pressure (or when the server decides to reclaim some memory).

      4. Current server pool shrinker is barely functional;

      The current LDLM server pool shrinker decreases the SLV by only a small amount, which is not enough to trigger lock cancellation on the client. To make it worse, the decreased SLV can be overwritten by the SLV recalculation thread before it is carried back to the client by some random RPC.

      As mentioned in the longer-term solution under the 3rd item, I think the server pool shrinker should proactively notify clients with an estimated CLV; that's simpler and more reliable. A sketch of how such an estimate might be derived is shown below.
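
      In this sketch the function name and the proportional scaling are purely illustrative:

      #include <stdint.h>

      /*
       * Estimate the CLV to push to clients: scale the current volume
       * down in proportion to the reclaim goal, so clients cancel
       * roughly locks_to_reclaim/granted_locks of their unused locks.
       */
      static uint64_t estimate_target_clv(uint64_t current_slv,
                                          uint64_t granted_locks,
                                          uint64_t locks_to_reclaim)
      {
              if (granted_locks == 0 || locks_to_reclaim >= granted_locks)
                      return 0;   /* ask clients to drop all unused locks */

              return current_slv * (granted_locks - locks_to_reclaim) /
                     granted_locks;
      }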

      5. Improve LRU algorithm

      Using strict LRU to replace cached locks is sub-optimal due to cache thrashing and removal of valuable locks. A better algorithm, such as LFRU or ARC, would improve lock cache reuse and retention of valuable locks (LU-11509); a toy scoring function illustrating the idea is sketched below.
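
      This toy LFRU-style score is hypothetical (not from LU-11509): it combines recency and frequency so one-shot scan traffic can't evict frequently re-used locks the way it does under strict LRU:

      #include <stdint.h>

      struct cached_lock {
              uint64_t last_used;   /* time of the last lock match */
              uint64_t hit_count;   /* how often the lock was re-used */
      };

      /* Higher score = better eviction candidate: old and cold.
       * Frequent re-use shields a valuable lock from eviction. */
      static uint64_t eviction_score(const struct cached_lock *lck,
                                     uint64_t now)
      {
              uint64_t age = now - lck->last_used;

              return age / (lck->hit_count + 1);
      }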

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Niu Yawei (niu) (Inactive)
              Votes: 0
              Watchers: 3
