Details
Type: Improvement
Resolution: Fixed
Priority: Critical
Description
OPA driver optimizations are based on the MPI model, in which multiple endpoints are expected between any two given nodes. To enable this optimization for Lustre, we need an LND-specific tunable that makes it possible to create multiple endpoints between two nodes and to balance the traffic over them.
I have already created an experimental patch to test this theory out. I was able to push OPA performance to 12.4 GB/s just by having 2 QPs between the nodes and round-robining messages between them.
This Jira ticket is for productizing my patch and testing it out thoroughly for OPA and IB. Test results will be posted to this ticket.
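To illustrate the round-robin idea described above, here is a minimal sketch in Python. It is not the ko2iblnd code and not the experimental patch: the class `PeerConnections` and its `send` method are hypothetical names made up for this illustration, the per-connection lists merely stand in for QPs, and the constructor argument borrows the name `conns_per_peer` from the parameter mentioned in LUDOC-374.

```python
class PeerConnections:
    """Hypothetical stand-in for the set of connections (QPs) to one peer."""

    def __init__(self, conns_per_peer):
        if conns_per_peer < 1:
            raise ValueError("conns_per_peer must be at least 1")
        # One list per connection; each list stands in for a QP's send queue.
        self.queues = [[] for _ in range(conns_per_peer)]
        # Index of the connection that will carry the next message.
        self._next = 0

    def send(self, msg):
        """Queue msg on the next connection in round-robin order.

        Returns the index of the connection that was used.
        """
        idx = self._next
        self.queues[idx].append(msg)
        # Advance and wrap around so traffic is spread evenly.
        self._next = (idx + 1) % len(self.queues)
        return idx


# Two connections to the peer, as in the experiment described above.
peer = PeerConnections(conns_per_peer=2)
used = [peer.send(f"msg{i}") for i in range(6)]
```

With two connections, successive messages alternate between connection 0 and connection 1, so each carries half of the traffic. A real implementation would also have to handle connection setup, teardown, and failure, which this sketch ignores.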
Issue Links
- has to be finished together with: LUDOC-374 Add notes about conns_per_peer ko2iblnd parameter (Resolved)
That might be the reason. The client will create multiple connections, but the server will have only one, which they are all talking to. When one connection on the client is closed, the connection on the server will be closed as well. I suspect the remaining connections on the client then can't be closed. I'll have to look at the code to see what I can do in this situation.
I suspect if the server has the patch, you would not have a problem.