Details
Improvement
Resolution: Fixed
Major
Lustre 1.8.6
None
10451
Description
Description of the problem:
In the 1.8 codebase, all parallel modifying operations on a single directory are bottlenecked on a single ldlm lock. To make matters worse, the DLM lock is held not only while the operation is performed, but also until a confirmation is received from the client executing each operation, one by one (a process known as rep-ack).
Another downside is that MDT threads are actually blocked waiting for the lock, so if sufficiently many clients try to modify a directory's contents in parallel, they occupy all MDT threads, and even unrelated processes cannot perform any metadata operations until the backlog of requests is cleared. This often leads to delays of many seconds to several minutes (at ORNL during large-scale file-creating jobs) in much of the filesystem activity when certain jobs are run.
Proposed solution:
Introduce an extra 'hash' property for inodebits locks. Users will pass it along with the bits they want to lock. Locks with intersecting bits and incompatible lock modes would be declared incompatible, and block each other, only if their hash values match or if either lock has the special hash value of zero.
The MDT threads handling modifying operations would initialize the value to the hash of the name of the file the operation modifies.
Clients, and the MDS for other operations, would keep the hash value as zero in requested locks.
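The conflict rule described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `struct ibits_lock`, `name_hash`, `modes_compatible` and `ibits_locks_conflict` are hypothetical names, the lock mode is reduced to a single boolean, and FNV-1a is used as a stand-in because the description does not say which hash function is intended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified lock description; not the real ldlm structures. */
struct ibits_lock {
    uint64_t bits;       /* inodebits requested */
    uint64_t hash;       /* 0 means "no hash": conflicts regardless of name */
    bool     write_mode; /* stand-in for the lock mode: true = exclusive-like */
};

/* Stand-in name hash (64-bit FNV-1a); the real choice of hash is an assumption. */
static uint64_t name_hash(const char *name)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */

    assert(name != NULL);
    for (; *name != '\0'; name++) {
        h ^= (unsigned char)*name;
        h *= 1099511628211ULL;              /* FNV prime */
    }
    /* Zero is reserved as the "matches everything" value, so never return it. */
    return h != 0 ? h : 1;
}

/* Simplified mode check: two locks are compatible only if neither writes. */
static bool modes_compatible(const struct ibits_lock *a,
                             const struct ibits_lock *b)
{
    return !a->write_mode && !b->write_mode;
}

/*
 * Locks block each other only if their modes are incompatible, their bits
 * intersect, and their hashes match or either hash is zero.
 */
static bool ibits_locks_conflict(const struct ibits_lock *a,
                                 const struct ibits_lock *b)
{
    if (modes_compatible(a, b))
        return false;
    if ((a->bits & b->bits) == 0)
        return false;
    if (a->hash == 0 || b->hash == 0)
        return true;
    return a->hash == b->hash;
}
```

With this rule, two MDT threads creating different names in the same directory would set `hash = name_hash(name)` and normally not block each other, while a client lock with `hash = 0` on the same bits still conflicts with both. Two different names that happen to hash to the same value would also block each other, which costs parallelism but not correctness.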
Since the parent lock obtained during modifying metadata operations on the MDS is not passed to the client, there is no protocol change.
The ldlm policy descriptor in the wire structure also has enough space to accommodate 64 bits of hash value.
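To illustrate why a 64-bit hash can fit without growing the wire structure, here is a hedged sketch of a policy union whose size is set by its largest member. The type and field names (`sketch_extent`, `sketch_inodebits_old`, and so on) and the exact member sizes are assumptions made for illustration, not the actual Lustre definitions.

```c
#include <stdint.h>

/* Illustrative layouts only; names and sizes are assumptions. */
struct sketch_extent {
    uint64_t start;
    uint64_t end;
    uint64_t gid;
};

struct sketch_inodebits_old {
    uint64_t bits;
};

struct sketch_inodebits_new {
    uint64_t bits;
    uint64_t hash;   /* proposed addition; 0 = no hash */
};

/* The policy descriptor is a union, so its size is that of the largest member. */
union sketch_policy_old {
    struct sketch_extent        extent;
    struct sketch_inodebits_old ibits;
};

union sketch_policy_new {
    struct sketch_extent        extent;
    struct sketch_inodebits_new ibits;
};

/* Adding the hash to the smaller member leaves the union size unchanged. */
_Static_assert(sizeof(union sketch_policy_new) == sizeof(union sketch_policy_old),
               "hash field must fit in the existing policy descriptor");
```

Under these assumed layouts the inodebits member grows from 8 to 16 bytes, still smaller than the 24-byte extent member, so the union and any message that embeds it keep the same size.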
As a result, the MDS threads would stop blocking on shared-directory operations, except when multiple threads attempt to create the same file. This use case will need to be addressed separately.
Downsides:
The lock resource on the MDS corresponding to the parent directory in which shared operations are performed will accumulate many locks (one for each name being modified in the directory), and traversing them might become CPU-intensive as the number of modifications grows.
This problem could be significantly compounded should the Commit on Share (COS) logic later be implemented in the 1.8 codebase, as that lets locks accumulate in the resource for several seconds until the underlying filesystem commits the transactions, instead of just for the duration of an RPC round trip for the rep-ack process.
Still, in the absence of COS logic the end result should be better than the current behavior, as forward progress could be made on multiple CPUs.
UNMEASURED: large SMP systems need to be tested separately to see how data ping-pong between CPUs iterating over locks on the resource affects overall performance when many clients are doing shared-dir creates/unlinks.
Alternatives:
An approach implemented in the 2.0+ code could be adopted, where instead of a value in the inodebits policy, the hash value of the name being modified would be used as part of the resource name. This would eliminate the downsides listed for the first method.
This approach would work as follows: every modifying operation would need to take two locks. One is an exclusive lock on the resource for the modified name, to protect against concurrent operations on the same name. The other is a lock on just the parent resource id, taken in a "semi-shared" way so that multiple MDS threads can obtain compatible locks while the lock used by clients to guard READDIR pages still conflicts with them.
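A minimal sketch of how the two resource names might be built, assuming a resource name is a small fixed array of 64-bit words. The struct, the helper names, and the choice of which slot carries the hash are all hypothetical and are not taken from the 2.0 code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RES_NAME_SIZE 4

/* Hypothetical resource name: a small array of 64-bit words. */
struct sketch_res_id {
    uint64_t name[SKETCH_RES_NAME_SIZE];
};

/* Resource for the parent directory as a whole (hash slot left at zero). */
static struct sketch_res_id parent_res(uint64_t dir_id, uint64_t dir_gen)
{
    struct sketch_res_id r;

    memset(&r, 0, sizeof(r));
    r.name[0] = dir_id;
    r.name[1] = dir_gen;
    return r;
}

/*
 * Resource for one name inside the directory. Placing the hash in slot 2 is
 * an assumption; the hash must be nonzero, or this would equal parent_res().
 */
static struct sketch_res_id name_res(uint64_t dir_id, uint64_t dir_gen,
                                     uint64_t hash)
{
    struct sketch_res_id r = parent_res(dir_id, dir_gen);

    r.name[2] = hash;
    return r;
}

/* Locks can only contend with each other if they live on the same resource. */
static bool same_resource(const struct sketch_res_id *a,
                          const struct sketch_res_id *b)
{
    return memcmp(a->name, b->name, sizeof(a->name)) == 0;
}
```

In this sketch a modifying operation would take an exclusive lock on `name_res(...)` and a semi-shared lock on `parent_res(...)`. Operations on different names land on different per-name resources, so their exclusive locks never sit in the same lock list.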
Since all modifying metadata operations would now need to take two locks on the server, the obvious downside of this approach is that CPU utilization on the MDS would grow, and metadata operations not done in a shared directory would still be somewhat penalized by the extra lock.
UNMEASURED: it is possible that on large SMP systems the overhead from CPU data and spinlock ping-pong in the case of shared-dir modifications might be quite substantial.
Protocol compatibility is still not a concern for this alternative approach: clients currently obtain a PR lock on the parent directory, so if the server threads were to use a CW lock on the parent dir, they would not conflict with each other but would conflict with PR locks held by clients.
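The CW/PR relationship relied on here follows the classic six-mode DLM compatibility matrix. The sketch below encodes that matrix with illustrative names and bit values (`M_EX`, `COMPAT_CW`, and so on are assumptions, not the Lustre identifiers) to show that CW is compatible with CW but not with PR.

```c
#include <stdbool.h>

/* Illustrative mode bits; the names and values are assumptions. */
enum sketch_mode {
    M_EX = 1,   /* exclusive */
    M_PW = 2,   /* protected write */
    M_PR = 4,   /* protected read */
    M_CW = 8,   /* concurrent write */
    M_CR = 16,  /* concurrent read */
    M_NL = 32   /* null */
};

/* For each mode, the set of modes it can coexist with. */
#define COMPAT_EX (M_NL)
#define COMPAT_PW (COMPAT_EX | M_CR)
#define COMPAT_PR (COMPAT_PW | M_PR)
#define COMPAT_CW (COMPAT_PW | M_CW)
#define COMPAT_CR (COMPAT_CW | M_PR | M_PW)
#define COMPAT_NL (COMPAT_CR | M_EX)

static int compat_mask(enum sketch_mode m)
{
    switch (m) {
    case M_EX: return COMPAT_EX;
    case M_PW: return COMPAT_PW;
    case M_PR: return COMPAT_PR;
    case M_CW: return COMPAT_CW;
    case M_CR: return COMPAT_CR;
    case M_NL: return COMPAT_NL;
    }
    return 0;   /* unknown mode: compatible with nothing */
}

/* True if a lock in mode 'req' can be granted alongside one held in 'held'. */
static bool modes_compat(enum sketch_mode held, enum sketch_mode req)
{
    return (compat_mask(held) & req) != 0;
}
```

Here `modes_compat(M_CW, M_CW)` is true, so server threads taking CW on the parent do not block each other, while `modes_compat(M_PR, M_CW)` is false, so a client's PR lock guarding READDIR pages is still invalidated by a modification.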
Out of scope:
Modification of the VFS and backend filesystem to allow parallel processing of multiple modifying requests within a single directory (the so-called pdirops patch).
Useful information:
The ldlm modifications outlined for the first approach are implemented in bug 21660 (patch id 30074) for the 2.0 codebase, which should be pretty similar to 1.8 in the ldlm area.