Description
During performance testing of a new Lustre file system, we discovered that read/write performance isn't where we would expect. As an example, the block-level read performance for the system is just over 65 GB/s. In scaling tests, we can only get to around 30 GB/s for reads. Writes are slightly better, but still in the 35 GB/s range. At single-node scale, we seem to cap out at a few GB/s.
After going through the tunings and everything else we could find, we're slightly better, but still miles behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer size, etc.), but to not much effect. Current tunings that may be relevant: credits=2560, peer_credits=63, max_rpcs_in_flight=32.
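For reference, a rough sketch of how those tunings are applied; the modprobe.d path is an assumption and the values are just the ones listed above:

  # ksocklnd LND credits, assumed to be set as module options (e.g. /etc/modprobe.d/lustre.conf)
  options ksocklnd credits=2560 peer_credits=63

  # client-side RPC concurrency per OSC, set at runtime on the clients
  lctl set_param osc.*.max_rpcs_in_flight=32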
Network configuration on the servers is 2x 100G ethernet bonded together (active/active) using kernel bonding (not ksocklnd bonding).
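For anyone comparing setups, the bond mode and slave state can be confirmed through the standard kernel bonding proc interface (bond0 is an assumed interface name):

  # shows the bonding mode and the link status of both 100G ports
  cat /proc/net/bonding/bond0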
iperf between two nodes gets nearly line rate at ~98Gb/s and iperf from two nodes to a single node can push ~190Gb/s, consistent with what would be expected from the kernel bonding.
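For completeness, the iperf runs were along these lines, assuming iperf3, parallel streams, and placeholder hostnames (the original runs may have used different options):

  # on the receiving node
  iperf3 -s

  # from one client node: close to line rate (~98 Gb/s)
  iperf3 -c oss01 -P 8 -t 30

  # running the same client command from two nodes at once pushes ~190 Gb/s
  # through the bond on the receiving side
  iperf3 -c oss01 -P 8 -t 30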
lnet selftest shows roughly 2.5 GB/s (20 Gb/s) for node-to-node tests. I'm not sure if this is a bug in lnet selftest or a real reflection of the performance.
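For reference, the node-to-node tests were in the style of the lst script from the Lustre manual; the NIDs, concurrency, and I/O size below are placeholders rather than the exact invocation:

  #!/bin/bash
  # the lnet_selftest module must be loaded on both nodes (modprobe lnet_selftest)
  export LST_SESSION=$$

  lst new_session read_write
  lst add_group server 192.168.1.10@tcp    # placeholder NIDs
  lst add_group client 192.168.1.20@tcp

  lst add_batch bulk_rw
  # 1M bulk reads and writes, concurrency raised above the default of 1
  lst add_test --batch bulk_rw --concurrency 8 --from client --to server brw read size=1M
  lst add_test --batch bulk_rw --concurrency 8 --from client --to server brw write size=1M

  lst run bulk_rw
  lst stat server & sleep 30; kill $!    # sample throughput for 30 seconds
  lst end_session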
We found the following related tickets/mailing list discussions which seem to be very similar to what we're seeing, but with no resolutions:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html
https://jira.whamcloud.com/browse/LU-11415
https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)
Any help or suggestions would be awesome.
Thanks!
- Jeff
Sort of. It used 12 of the 24 configured threads. I've since reduced this, but wanted to mention what I was seeing in testing.
I performed quite a few more tests today with the LU-12815 patch applied and various tunings, and have some good news. With the patch, we can see nearly line rate with lnet selftest (11.5-12.0 GB/s, up from ~2.5 GB/s). Of the current tunings, conns_per_peer=8 seemed to give the best performance, and nscheds had to be increased because I noticed that the 6 default threads were all 100% pegged during an lnet selftest; a sketch of those settings is below.
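How those two settings look as ksocklnd module options; the file path and the nscheds value are placeholders (the only things fixed above are conns_per_peer=8 and that nscheds was raised above the default of 6):

  # /etc/modprobe.d/lustre.conf (assumed location)
  # conns_per_peer requires the LU-12815 patch or a release that carries it
  options ksocklnd conns_per_peer=8 nscheds=12    # nscheds value is a placeholder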
Unfortunately, this isn't reflected in the single-node IOR numbers. While we saw a ~5x increase in the lnet selftest numbers, we're only seeing a 2x increase in IOR. IOR writes went from ~5 GB/s to 9.8 GB/s and reads went from ~1.3 GB/s to 2.6 GB/s on a file-per-OST test (12 OSTs, 6 OSSs). I'm really trying to understand the brutal read disparity and hoping you all have some thoughts. The writes seem to prove that we can push that bandwidth over the network, but is there something about the read path that's different from a networking perspective?
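For context, the file-per-OST IOR runs were shaped roughly like the following; the path, sizes, and task count are placeholders rather than the exact command line:

  # one file per task with stripe count 1, so each of the 12 files
  # typically lands on its own OST
  lfs setstripe -c 1 /mnt/lustre/ior_test    # placeholder path

  # -F file-per-process, -e fsync after write,
  # -C has each rank read back a different rank's file to avoid the client cache
  mpirun -np 12 ior -a POSIX -F -w -r -e -C \
      -t 1m -b 16g -o /mnt/lustre/ior_test/file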