I have recently been benchmarking newly set up Lustre servers over a 40 Gbps Ethernet connection. Jumbo frames are enabled and the MTU is set to 9000 on the NICs of both the client and the server. The topology is simple: client and server sit under the same TOR switch, with no routers in between.
First I used iperf3 to verify the throughput between client and server, and it is stable at 30~32 Gbit/s in either direction. However, when I run lnet selftest, I usually see lower throughput than with iperf3, about 2500 MiB/s.
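Since the two tools report in different units, it helps to put both figures on the same scale before comparing them. The following is a minimal sketch in Python; it assumes the iperf3 figure is in decimal gigabits per second (10^9 bits, iperf3's default reporting unit) and that the lnet selftest figure is in MiB/s (2^20 bytes). The function names are made up for illustration.

```python
# Unit conversion sketch. Assumptions: iperf3 reports decimal Gbit/s
# (10**9 bits per second) and lnet selftest reports MiB/s (2**20 bytes
# per second). Function names are illustrative only.

def mib_per_s_to_gbit_per_s(mib_per_s: float) -> float:
    """Convert MiB/s to Gbit/s."""
    return mib_per_s * 2**20 * 8 / 1e9


def gbit_per_s_to_mib_per_s(gbit_per_s: float) -> float:
    """Convert Gbit/s to MiB/s."""
    return gbit_per_s * 1e9 / 8 / 2**20


if __name__ == "__main__":
    lnet_gbit = mib_per_s_to_gbit_per_s(2500)
    print(f"lnet selftest: 2500 MiB/s = {lnet_gbit:.2f} Gbit/s")
    for iperf_gbit in (30, 32):
        mib = gbit_per_s_to_mib_per_s(iperf_gbit)
        print(f"iperf3: {iperf_gbit} Gbit/s = {mib:.0f} MiB/s")
```

Under these assumptions, 2500 MiB/s is about 21 Gbit/s, so lnet selftest reaches roughly two thirds of what iperf3 shows on the same link.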
After speaking with Amir and Doug, I monitored the ksocklnd threads on both the client and the server. What we see is that while lnet selftest runs a read test, only one ksocklnd thread consumes 100% CPU time and the other threads take no work at all; the write test is similar, but there only one server-side ksocklnd thread is busy. The workload does not seem to be spread across all the threads in the pool.
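For anyone who wants to reproduce this observation, here is a rough sketch of how per-thread CPU usage could be sampled from /proc on Linux. It assumes the ksocklnd scheduler threads show up as kernel threads whose names start with "socknal_sd"; that prefix is an assumption on my part, so please check the actual names with top or ps on your system. The helper names are made up for illustration, and this is not the exact procedure I used.

```python
import os
import time

# Assumed name prefix of the ksocklnd scheduler threads; verify locally.
THREAD_PREFIX = "socknal_sd"


def parse_stat(line):
    """Return (comm, cpu_ticks) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces,
    # so split on the last ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # After comm, index 11 is utime and index 12 is stime (clock ticks).
    return comm, int(fields[11]) + int(fields[12])


def snapshot(prefix=THREAD_PREFIX):
    """Map (pid, comm) -> accumulated CPU ticks for matching threads."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                comm, used = parse_stat(f.read())
        except OSError:
            continue  # the thread exited between listdir and open
        if comm.startswith(prefix):
            ticks[(int(entry), comm)] = used
    return ticks


def cpu_percent(before, after, interval, hz):
    """CPU usage in percent for every thread present in both snapshots."""
    return {
        key: 100.0 * (after[key] - before[key]) / (hz * interval)
        for key in after
        if key in before
    }


def report(prefix=THREAD_PREFIX, interval=1.0):
    """Sample twice, `interval` seconds apart, and print per-thread usage."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = snapshot(prefix)
    time.sleep(interval)
    usage = cpu_percent(before, snapshot(prefix), interval, hz)
    for (pid, comm), pct in sorted(usage.items()):
        print(f"{pid:>8} {comm:<20} {pct:6.1f}%")
```

Calling report() on the client and on the server while the selftest is running should show whether one thread sits near 100% while the rest stay near 0%.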
It is possible that a single thread is enough to handle all the traffic, so there is no need to hand work to the other threads, but it is also possible that there is a scheduling problem in the ksocklnd implementation. Doug mentioned that o2iblnd spreads the workload well.