Details
Improvement
Resolution: Unresolved
Medium
Description
System sizes keep growing, including RAM and client counts.
This leads to greatly increasing lock counts in the wild: the other day I saw a system that had nearly 100 million locks on one of the servers. That does not sound like much, but with our object hash size of 1<<16 buckets (which already consumes 1M of RAM), every bucket ends up holding about 1500 entries. That is how many entries we need to iterate over in the worst case to do the handle -> object lookup, and we do a lot of those!
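To make the arithmetic above easy to redo for other table sizes, here is a minimal userspace sketch. It is not Lustre code: the helper names are made up for illustration, and the 16 bytes per bucket is an assumption chosen to match the 1M figure (two 8-byte pointers per bucket head). It also assumes the entries spread evenly over the buckets.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average number of entries per bucket, assuming an even spread.
 * hash_bits is the table size exponent (table has 1 << hash_bits buckets).
 */
static uint64_t avg_chain_len(uint64_t entries, unsigned int hash_bits)
{
	assert(hash_bits < 64);
	return entries >> hash_bits;
}

/*
 * Memory taken by the bucket array alone, in bytes.
 * bucket_size is an assumption (16 matches the 1M figure for 1 << 16).
 */
static uint64_t table_bytes(unsigned int hash_bits, uint64_t bucket_size)
{
	assert(hash_bits < 64);
	return ((uint64_t)1 << hash_bits) * bucket_size;
}
```

For 100 million entries, avg_chain_len() gives 1525 with 16 hash bits, 95 with 20 bits and 5 with 24 bits, while table_bytes() with 16-byte buckets grows from 1M to 16M to 256M, which is the trade-off a plain size increase has to pay for.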
Certain workloads (e.g. cancel RPC with ELC, where every request could carry hundreds of lock handles that are then rapidly iterated over, but there could be more) get disproportionately affected (see LU-20205 for this particular case).
For the handle -> object hash table, possible solutions include:
- increasing the hash size even more (but that is only a temporary fix again, and it uses lots of RAM even on smaller systems where it is not really needed)
- replacing the hash table entirely with something else (possible options: xarray and rbtrees)
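As a rough way to compare the two options, the sketch below puts the worst-case chain walk of a fixed-size hash next to the standard height bound of a red-black tree, 2 * log2(n + 1). This is plain userspace arithmetic, not kernel code; the function names are hypothetical, an even spread over buckets is assumed, and xarray is not modelled because its depth depends on how the keys are distributed.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest k such that 2^k >= n (n must be at least 1). */
static unsigned int ceil_log2(uint64_t n)
{
	unsigned int k = 0;

	assert(n >= 1);
	while (k < 64 && ((uint64_t)1 << k) < n)
		k++;
	return k;
}

/* Entries walked in the fullest bucket, assuming an even spread. */
static uint64_t chain_steps(uint64_t entries, unsigned int hash_bits)
{
	uint64_t buckets;

	assert(hash_bits < 64);
	buckets = (uint64_t)1 << hash_bits;
	return (entries + buckets - 1) / buckets;
}

/* Integer upper bound on red-black tree height, 2 * log2(n + 1). */
static unsigned int rbtree_height_bound(uint64_t entries)
{
	assert(entries < UINT64_MAX);
	return 2 * ceil_log2(entries + 1);
}
```

With 100 million entries, chain_steps() returns 1526 for 1<<16 buckets and 6 for 1<<24 buckets, while rbtree_height_bound() returns 54 regardless of any preallocated table size. The tree cost grows only logarithmically with the lock count, but each step is a pointer chase, so real numbers would need measuring.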
For other hash tables (like the per-namespace resource hashes), making them resizable might make sense. I guess I'll make a separate ticket for that case so we can concentrate on one type of hash table here.
Issue Links
- is related to: LU-20205 canceld cpubound in hpreq_check with many locks (Open)