Continuing to pursue this, there are four identified bottlenecks for unlink performance:
a. Parent directory mutex in the Linux kernel VFS (in do_unlinkat). This greatly affects shared-directory operations within a single client (avg latency: 1ppn=0 µs, 4ppn=6460 µs, while the Lustre unlink RPC itself is 250 µs, constant for 1 and 4 ppn). Measured with a directory per client, to exclude the MDT shared-directory ldiskfs lock and the shared-lock ldlm callbacks. A simplified sketch of this locking pattern follows this list.
b. Single MDC RPC-in-flight (rpc_lock) serializing RPCs. This also greatly affects operations with multiple ppn (1ppn=0, 2ppn=180, 4ppn=1080, 8ppn=3530). Measured with a directory per process to avoid the parent mutex, but with the same directories shared between clients to include the ldiskfs and ldlm effects. While it may be possible to remove this restriction (MRP-59), doing so may be very complex due to ordering issues. (Note that LU-933 has a patch that removes it in an unsafe way, for testing.) See the second sketch after this list.
c. Shared-dir MDT ldlm lock. Lock callback time increases slowly with ppn (1ppn=130, 8ppn=240), possibly due to increasing client context-switch time. Measured as in b above. This is not expected to increase with client count. It could possibly be eliminated by having the MDS act as a proxy lock holder for multiple shared-dir clients, but there is not much gain to be had here.
d. Shared-dir ldiskfs lock. Contention for the same lock on the MDT directory, mostly independent of ppn but will increase with client count (subtracting out lock callback latency, 1ppn=90, 8ppn=130, measured as in b with 8 clients). Pdirops (LU-50) in Lustre 2.2.0 should help with this.
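For reference on (a), here is a heavily simplified sketch of the serialization in the VFS unlink path (do_unlinkat in fs/namei.c, for kernels of roughly this era). Error handling, mountpoint checks, and security hooks are omitted, and lookup_one_len stands in for the kernel's internal lookup helper; it is a paraphrase showing where the per-directory i_mutex is held, not the actual kernel code:

```c
#include <linux/fs.h>
#include <linux/namei.h>
#include <linux/dcache.h>
#include <linux/mutex.h>
#include <linux/string.h>
#include <linux/err.h>

/*
 * Simplified paraphrase of the do_unlinkat() pattern: the parent
 * directory's i_mutex is held across both the name lookup and the
 * filesystem ->unlink call, so all unlinks in one directory on one
 * client node execute one at a time, with the Lustre unlink RPC
 * issued while the mutex is held.
 */
static int unlink_in_dir_sketch(struct dentry *parent, const char *name)
{
        struct inode *dir = parent->d_inode;    /* shared parent directory */
        struct dentry *victim;
        int error;

        mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
        victim = lookup_one_len(name, parent, strlen(name));
        if (IS_ERR(victim)) {
                error = PTR_ERR(victim);
        } else {
                /* For Lustre, the unlink RPC is sent under this mutex. */
                error = vfs_unlink(dir, victim);
                dput(victim);
        }
        mutex_unlock(&dir->i_mutex);
        return error;
}
```

With several processes per node unlinking in the same directory, all but one of them sit waiting on this mutex at any time, which is why the penalty shows up even within a single client.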
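Similarly for (b), a rough illustration of the single-RPC-in-flight behaviour of the MDC rpc_lock: one mutex per MDC import is taken around each metadata-modifying request, so modifying RPCs from all processes on a node are serialized regardless of which directory they touch. ptlrpc_queue_wait is the real Lustre send-and-wait call; the structure and function names with a _sketch suffix are illustrative, not actual Lustre symbols:

```c
#include <linux/mutex.h>

struct ptlrpc_request;                          /* Lustre RPC request */
int ptlrpc_queue_wait(struct ptlrpc_request *req);

/* Illustrative stand-in for the per-import rpc_lock. */
struct mdc_rpc_lock_sketch {
        struct mutex rpcl_mutex;                /* one per MDC import */
};

/*
 * Every metadata-modifying RPC (unlink, create, rename, ...) goes
 * through this pattern, so a node running N processes still has at
 * most one such RPC outstanding while the others queue on rpcl_mutex.
 * Keeping a single modifying RPC in flight simplifies replay ordering,
 * which is why removing the restriction (MRP-59) is hard.
 */
static int mdc_modify_rpc_sketch(struct mdc_rpc_lock_sketch *lck,
                                 struct ptlrpc_request *req)
{
        int rc;

        mutex_lock(&lck->rpcl_mutex);           /* serialize with other modifying RPCs */
        rc = ptlrpc_queue_wait(req);            /* send the request and wait for the reply */
        mutex_unlock(&lck->rpcl_mutex);         /* next queued RPC may proceed */
        return rc;
}
```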
Cliff, I was wondering if you have any 1 ppn mdtest results on Hyperion for different client counts? I can't find any on Maloo (and indeed the links above seem to have lost their logs as well). Also, can you tell me whether the results above included pdirops (LU-50, Lustre 2.2.0)?
It looks as though the comments made on 04/Apr/12 were about right. LU-3308 was opened to look into the regression aspect (which is seemingly unrelated to multiple processes per node).