Description
During performance testing of a new Lustre file system, we discovered that read/write performance isn't where we would expect. As an example, block-level read performance for the system is just over 65 GB/s, but in scaling tests we can only get to around 30 GB/s for reads. Writes are slightly better, but still in the 35 GB/s range. At single-node scale, we seem to cap out at a few GB/s.
After going through the tunings and everything else we could find, we're slightly better off, but still miles behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer sizes, etc.), but with very little change. Current tunings that may be relevant: credits=2560, peer_credits=63, max_rpcs_in_flight=32.
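For reference, here's roughly where those knobs live on our setup (a sketch; paths and values as we currently have them):

    # /etc/modprobe.d/ksocklnd.conf -- nconnds/nscheds/buffer sizes were
    # varied here too, with little effect
    options ksocklnd credits=2560 peer_credits=63

    # client-side RPC concurrency, set per OSC
    lctl set_param osc.*.max_rpcs_in_flight=32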
Network configuration on the servers is 2x 100G Ethernet, bonded active/active using kernel bonding (not ksocklnd bonding).
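For clarity, LNet just sees the bond as a single interface, so the LNet side of the config is a one-liner (assuming the bond is named bond0):

    # /etc/modprobe.d/lnet.conf -- LNet rides on the kernel bond as one NI
    options lnet networks="tcp(bond0)"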
iperf between two nodes gets nearly line rate at ~98 Gb/s, and iperf from two nodes to a single node can push ~190 Gb/s, consistent with what we'd expect from the kernel bonding.
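Roughly how we ran those (iperf3 here; hostnames/flags illustrative, and assuming the bond hashes per-flow so parallel streams can use both members):

    # single pair: ~98 Gb/s with enough parallel streams
    iperf3 -c server1 -P 8 -t 30

    # two clients -> one server: second iperf3 server on another port
    iperf3 -s -p 5202 &                    # on server1, alongside default :5201
    iperf3 -c server1 -p 5201 -P 8 -t 30   # on client A
    iperf3 -c server1 -p 5202 -P 8 -t 30   # on client B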
lnet_selftest shows rates of only ~2.5 GB/s (20 Gb/s) for node-to-node tests. I'm not sure if this is a bug in lnet_selftest or a real reflection of the performance.
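For anyone who wants to reproduce, the node-to-node test was along these lines (NIDs illustrative):

    modprobe lnet_selftest                  # on both nodes
    export LST_SESSION=$$
    lst new_session rw_test
    lst add_group clients 10.0.0.11@tcp
    lst add_group servers 10.0.0.21@tcp
    lst add_batch bulk
    lst add_test --batch bulk --concurrency 16 --from clients --to servers \
        brw read size=1M
    lst run bulk
    lst stat clients servers                # watch the rates, Ctrl-C to stop
    lst end_session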
We found the following tickets/mailing-list discussions that look very similar to what we're seeing, but none with a resolution:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html
https://jira.whamcloud.com/browse/LU-11415
https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)
Any help or suggestions would be awesome.
Thanks!
- Jeff
Just to toss out a quick update: we tested the multi-rail virtual-interface setup and get much better single-node to single-node rates with lnet_selftest. We can't really test a full file system run without a huge effort to deploy that across the system, so we're shelving it for now.
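For the curious, the multi-rail test was along these lines (assuming VLAN-style virtual interfaces on top of the bond; names illustrative), with each virtual interface added as its own LNet NI so Multi-Rail opens separate TCP connections across them:

    # two NIs on the same LNet network; Multi-Rail stripes traffic across them
    lnetctl lnet configure
    lnetctl net add --net tcp --if bond0.100
    lnetctl net add --net tcp --if bond0.200
    lnetctl net show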
Is this a common problem on 100G Ethernet, or are there just not many 100G Ethernet-based systems deployed?
Path forward: we're going to attempt to move to a 2.14 (2.13.latest, I guess) server with the LU-12815 patch and test the conns-per-peer feature. This is the quickest path forward to test, rather than redeploying without kernel bonding. Will update with how this goes tomorrow.
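If it helps anyone following along, conns_per_peer from LU-12815 is a ksocklnd module parameter, so the plan is something like this (value illustrative):

    # /etc/modprobe.d/ksocklnd.conf -- open N TCP connections per peer, so a
    # single peer conversation can spread across both bond members
    options ksocklnd conns_per_peer=4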