Affects Version/s: None
Fix Version/s: None
When doing shared file writing in Lustre, a common problem is "lock exchange", where clients writing nearby (but not overlapping) blocks of a file end up repeatedly 'exchanging' write locks on stripes of a file. (Google "lock ahead" for more on this behavior)
The details of this behavior are complex, but one particularly bad part of it is that in the worst case scenario, each client only gets to do 1 i/o before it has to give back the lock & sync the data it just wrote.
This is mitigated by the fact that in practice clients often manage to do more than one i/o under a lock before it is called back. (*See note on measuring this at the end of this description.)
It occurred to me that it should be possible to encourage this behavior by modifying the way LDLM lock matching works.
Specifically, today, once a BL callback is received for a lock, the lock can no longer be matched by new i/o requests. It can still be matched by page writeout (necessary as part of lock cancellation).
This guarantees that no new i/os can start under a lock after the BL callback has been received, so the lock can be cancelled as soon as the pages are written out.
This is important because without it one client could hold a lock forever just by starting new i/o under it.
But there is a possible middle ground, where we allow a few i/os to match the lock after the BL callback is received if they arrive in time.
This means that if there is no activity under a lock, or only one thread doing i/o under it, there will be no delay: the lock is cancelled as normal. But if there are multiple threads doing i/o under a particular lock, they will have a brief window in which they can still match the lock, with a limit on the total number of i/os allowed to do so.
I refer to this as "sticky" LDLM locks, and this idea as "sticky" matching. The idea is that under contention, we exchange a little bit of latency in giving up the lock for getting more i/o done under each "exchange" of the lock. But, crucially, this will only occur under contention, where the lock is being used from multiple threads on a client at the time it's cancelled.
Performance testing is needed to verify this, but I think this could significantly increase performance in contended write scenarios.
*I refer to this as the lock usage ratio; the number is the count of writes divided by the count of locks (writes/locks).
Measuring this is pretty easy if you know how many writes your node is making during a job (which is trivial for benchmarks like IOR), so writes is known. "locks" can be closely approximated by counting bl callbacks on the client:
lctl get_param ldlm.services.ldlm_cbd.stats
Higher ratios are better, indicating that more i/o is being done per lock request/cancellation.
One way to look at this "sticky LDLM lock" idea is as an attempt to improve the lock usage ratio.