Type: Improvement
Resolution: Unresolved
Priority: Medium
Fix Version/s: Lustre 2.18.0
Affects Version/s: None
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

System sizes keep growing, including RAM and client counts.

This leads to greatly increased lock counts in the wild; the other day I saw a system that had nearly 100 million locks on one of its servers. That may not sound like much, but with our object hash size of 1<<16 buckets (which already consumes 1M of RAM), every bucket ends up holding about 1500 entries. That is how many entries we need to iterate over in the worst case to do the handle -> object lookup, and we do a lot of those!
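The per-bucket figure is just the lock count divided by the bucket count. The sketch below spells out that arithmetic for a table with a power-of-two number of buckets. With 2^16 buckets it reproduces the roughly 1500 entries per bucket and about 1M of table memory quoted above. The helper names and the 16-byte bucket head (two pointers on a 64-bit machine) are assumptions made for illustration, not taken from the Lustre source.

```c
#include <stdint.h>

/*
 * Average entries per bucket for a hash table with 2^bits buckets,
 * assuming handles are spread evenly across the buckets.
 */
static inline uint64_t avg_chain_len(uint64_t nr_locks, unsigned int bits)
{
	return nr_locks >> bits;
}

/*
 * Memory used by the bucket array alone. bucket_bytes is an assumed
 * per-bucket size (e.g. 16 for a two-pointer list head on 64-bit).
 */
static inline uint64_t table_bytes(unsigned int bits, uint64_t bucket_bytes)
{
	return ((uint64_t)1 << bits) * bucket_bytes;
}
```

For example, `avg_chain_len(100000000ULL, 16)` gives 1525 and `table_bytes(16, 16)` gives 1048576. Each extra bit halves the chain length but doubles the table memory, so at this lock count even 2^20 buckets would still leave about 95 entries per bucket.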

Certain workloads (e.g. cancel RPCs with ELC, where every request can carry hundreds of lock handles that are then rapidly iterated over, though there could be others) are disproportionately affected (see LU-20205 for this particular case).

For the handle->object hash table, possible solutions include:

increasing the hash size even more (but that is only a temporary fix again, and it uses lots of RAM even on smaller systems where it's not really needed)

replacing the hashtable entirely with something else (possible options: xarray and rbtrees)
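As a rough illustration of the second option, the sketch below keys lock objects by their 64-bit handle cookie in a balanced search tree, so the lookup cost grows with the logarithm of the lock count instead of with the chain length. It is a userspace stand-in only: it uses the POSIX `tsearch`/`tfind`/`tdelete` calls (which glibc implements as a red-black tree) rather than the kernel rbtree or xarray APIs, and `struct demo_lock` and the `demo_*` helpers are hypothetical names, not Lustre code.

```c
#include <search.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a lock object; only the cookie matters here. */
struct demo_lock {
	uint64_t cookie;	/* handle value used as the lookup key */
	int payload;		/* placeholder for the real object state */
};

/* Order two locks by cookie without risking integer overflow. */
static int cookie_cmp(const void *a, const void *b)
{
	const struct demo_lock *la = a;
	const struct demo_lock *lb = b;

	return (la->cookie > lb->cookie) - (la->cookie < lb->cookie);
}

/* Insert a lock; returns 0 on success, -1 on ENOMEM or duplicate cookie. */
static inline int demo_insert(void **root, struct demo_lock *lock)
{
	void *node = tsearch(lock, root, cookie_cmp);

	if (node == NULL)
		return -1;
	/* tsearch returns the existing entry if the key is already present. */
	return (*(struct demo_lock **)node == lock) ? 0 : -1;
}

/* Handle -> object lookup: O(log n) comparisons in a balanced tree. */
static inline struct demo_lock *demo_lookup(void **root, uint64_t cookie)
{
	struct demo_lock key = { .cookie = cookie };
	void *node = tfind(&key, root, cookie_cmp);

	return node ? *(struct demo_lock **)node : NULL;
}

/* Remove a lock from the tree; the caller still owns the object. */
static inline void demo_remove(void **root, struct demo_lock *lock)
{
	tdelete(lock, root, cookie_cmp);
}
```

With 100 million entries a balanced tree needs on the order of 27 comparisons per lookup, compared with walking a chain of about 1500 entries. A real kernel implementation would also have to deal with locking or RCU and with per-node memory overhead, which this sketch ignores.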

For other hash tables (such as the per-namespace resource hashes), making them resizable might make sense. I'll probably file a separate ticket for that case so we can concentrate on this one type of hash table here.

Is related to: LU-20205 canceld cpubound in hpreq_check with many locks (Resolved)

Assignee: WC Triage

Reporter: Oleg Drokin

Votes: 0

Watchers: 6

Created: 29/Apr/26 10:28 PM

Updated: 20/Jun/26 3:42 PM
