A "common" hang we see on our filesystems occurs when some clients experience temporary IB problems while other clients issue an occasional `mkdir -p` on a full directory tree.
Taking `/mnt/lustre/path/to/dir` as an example: basically all the active clients hold a CR or PR lock on `/mnt/lustre`, but the client issuing `mkdir -p /mnt/lustre/path/to/dir` will try to get a CW lock on each intermediate directory, which makes the server recall all PR locks on heavily accessed base directories.
Under bad weather, that recall can take time (as long as the ldlm timeout), during which clients that did give the lock back will try to reestablish it and get blocked as well, until the CW is granted and the no-op mkdir is done. With many clients, this in turn can starve all the threads on the MDS and render it unresponsive for a while, even after the IB issues are resolved.
I think we would be much less likely to experience such hangs if mkdir() would, on the client side, first check under a read lock whether the child directory exists before upgrading that to a write lock and sending the request to the server. If the directory already exists, we can safely return EEXIST without sending the mkdir to the server; in most other cases we would need to upgrade the lock and send the request as we currently do.
This incurs a small overhead when the mkdir actually has to create something, but I believe it should be well worth it for large clusters, and in the common case a lock would already be held (e.g. when doing multiple operations on a directory in a row).
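To illustrate the idea from userspace, here is a minimal sketch of a "lookup first" `mkdir -p`. It is not the proposed client change, only an analogue of it: it assumes that a plain stat() on an existing component needs no more than a read-mode lock on the parent, which is the premise of the proposal above. The function name `mkdir_p_lookup_first` is hypothetical.

```python
import errno
import os
import stat


def mkdir_p_lookup_first(path, mode=0o777):
    """Create `path` and any missing parents, like `mkdir -p`.

    mkdir() is only issued for components that a prior stat() reported
    as missing. Returns the list of directories actually created.
    """
    path = os.path.abspath(path)

    # Collect every prefix of the path, then order them top-down.
    prefixes = []
    current = path
    while True:
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        prefixes.append(current)
        current = parent
    prefixes.reverse()

    created = []
    for prefix in prefixes:
        try:
            st = os.stat(prefix)  # read-only lookup
        except FileNotFoundError:
            st = None

        if st is not None:
            if not stat.S_ISDIR(st.st_mode):
                raise NotADirectoryError(
                    errno.ENOTDIR, os.strerror(errno.ENOTDIR), prefix)
            continue  # already there: no mkdir() is sent at all

        try:
            os.mkdir(prefix, mode)
            created.append(prefix)
        except FileExistsError:
            # Someone else created it between stat() and mkdir().
            if not os.path.isdir(prefix):
                raise
    return created
```

Called on an already existing tree, this returns an empty list and never calls mkdir(), which is the case that matters for the heavily shared top-level directories. The extra stat() per missing component is the small overhead mentioned above.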
On the last occurrence here we had 324 mdt0[0-3]_[0-9]* threads stuck on this trace:
And 4 threads on this:
A closer look finds a struct ldlm_resource in these threads that points to a single directory (actually two in this case: we have a directory common to all users just below the root, so it sees just as much contention as the filesystem root itself).
The ldlm_resource also has lists (lr_granted, lr_waiting) which show a few granted CR locks (not revoked), a handful of granted PR locks that are not being given back, and many waiting PR locks plus a handful of CW locks waiting for the last granted PR to go away.
I'm afraid I do not have exact traces of this anymore, as it was on a live system and we didn't crash the MDT (and it was on a black site anyway), but if there are other things to look at, I'm sure it will happen again eventually, so please ask.
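For reference, the queue state described above can be reproduced with a toy model. This is only a sketch: it uses the classic DLM mode compatibility matrix (which, as far as I know, is what ldlm uses for these modes), ignores inodebits entirely, and assumes a simple FIFO policy where a new request must also be compatible with everything already waiting. The names `COMPATIBLE`, `blockers` and `can_grant` are made up for the illustration.

```python
# Classic DLM compatibility: for each requested mode, the set of
# already-held modes it can coexist with.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def blockers(requested, held):
    """Return the modes in `held` that conflict with `requested`."""
    return [mode for mode in held if mode not in COMPATIBLE[requested]]


def can_grant(requested, granted, waiting):
    """Assumed FIFO policy: grant only if nothing granted conflicts
    and nothing queued ahead of us conflicts either."""
    return not blockers(requested, granted) and not blockers(requested, waiting)


# State similar to what we saw: a few CR, one PR that is not coming back.
granted = ["CR", "CR", "PR"]
waiting = []

# mkdir wants CW: the CR locks are fine, the PR lock blocks it.
print(blockers("CW", granted))            # ['PR']
waiting.append("CW")

# A client that gave its PR back now asks for it again: it is
# compatible with everything granted, but queued behind the CW.
print(can_grant("PR", granted, waiting))  # False

# Once the last granted PR finally goes away, the CW can be granted.
granted.remove("PR")
print(can_grant("CW", granted, []))       # True
```

This matches the lists we observed: the CR locks are never recalled, a single stuck PR is enough to hold back the CW, and every new PR request piles up behind that CW.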