[LU-14293] Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet Created: 04/Jan/21 Updated: 02/Mar/22 Resolved: 02/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jeff Niles | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | ORNL, ornl | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During performance testing of a new Lustre file system, we discovered that read/write performance isn't where we would expect. As an example, the block-level read performance for the system is just over 65GB/s. In scaling tests, we can only get to around 30GB/s for reads. Writes are slightly better, but still in the 35GB/s range. At single-node scale, we seem to cap out at a few GB/s. After going through all the tunings we could find, we're slightly better, but still miles behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer size, etc.) with very little change. Current tunings that may be relevant: credits 2560, peer credits 63, max_rpcs_in_flight 32.

Network configuration on the servers is 2x 100G ethernet bonded together (active/active) using kernel bonding (not ksocklnd bonding). iperf between two nodes gets nearly line rate at ~98Gb/s, and iperf from two nodes to a single node can push ~190Gb/s, consistent with what would be expected from the kernel bonding. lnet selftest shows only about ~2.5GB/s (20Gb/s) for node-to-node tests. I'm not sure if this is a bug in lnet selftest or a real reflection of the performance.

We found the following related tickets/mailing list discussions which seem very similar to what we're seeing, but with no resolutions:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html
https://jira.whamcloud.com/browse/LU-11415
https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)
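For reference, the node-to-node selftest runs were of this general form (a minimal sketch with placeholder NIDs, not the exact script we used):
modprobe lnet_selftest        # on both nodes
export LST_SESSION=$$
lst new_session read_test
lst add_group servers 10.0.0.1@tcp
lst add_group clients 10.0.0.2@tcp
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M
lst run bulk_read
lst stat clients servers &    # reports bandwidth once per second
sleep 30; kill $!
lst end_session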
Any help or suggestions would be awesome. Thanks!
|
| Comments |
| Comment by Peter Jones [ 04/Jan/21 ] |
|
Amir, could you please advise? Thanks, Peter |
| Comment by Amir Shehata (Inactive) [ 05/Jan/21 ] |
|
Is the test between two nodes? What is your CPT configuration? By default it should be based on the NUMA config of the node. The CPT configuration controls the number of worker thread pools created in the socklnd. When you run your test, do you see the work distributed over all the worker threads or only a subset of them? Can you share top output while running a test? We can try to verify as follows: would you be able to use MR instead of kernel bonding? You'd configure both interfaces on the same LNet:
lnetctl net add --net tcp --if eth0,eth1
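As a quick sanity check (just a sketch), the resulting configuration can be verified with:
lnetctl net show --net tcp
Both eth0 and eth1 should show up as local NIs under the tcp net.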
And then attempt to measure the performance again. If you see an improvement, then try to create multiple logical interfaces per interface and then include them all on the same LNet. Something like:
lnetctl net add --net tcp --if eth0,eth0:1,eth0:2,eth1,eth1:1,eth1:2
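The logical interfaces need to exist before they can be added; one way to create them (a sketch, with placeholder addresses) is with ordinary IP aliases:
ip addr add 192.0.2.11/24 dev eth0 label eth0:1
ip addr add 192.0.2.12/24 dev eth0 label eth0:2
ip addr add 192.0.2.21/24 dev eth1 label eth1:1
ip addr add 192.0.2.22/24 dev eth1 label eth1:2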
Adding multiple logical interfaces on the same LNet like this will create multiple sockets for read/write. It would be interesting to see the results of this experiment. |
| Comment by Andreas Dilger [ 05/Jan/21 ] |
|
Maybe I'm missing something obvious/unstated here, but if you get 30 G*Bytes*/s for reads and 35 G*Bytes*/s for writes (I'm assuming that is client-side performance with something like IOR, but the details would be useful), wouldn't that exceed the 200 G*bits*/s ~= 25 GBytes/s network bandwidth of the server? Are there multiple servers involved in your testing? What kind? How many clients? It would be useful to add this to the Environment section of the ticket. What is the individual CPU usage on the server during the testing (not the average across all cores)? Using TCP can be CPU hungry, and depending on how your TCP bonding is configured, it may put all of the load on a few cores, so "top" may show e.g. a 6% average CPU usage when that is really 100% of 1 of 16 cores. |
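For the per-core view, something like the following works (assuming the sysstat package is installed), or just press "1" inside top to break usage out per core:
mpstat -P ALL 2 5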
| Comment by Andreas Dilger [ 05/Jan/21 ] |
|
PS: I would agree with Amir that |
| Comment by Jeff Niles [ 05/Jan/21 ] |
|
Amir,

The simplest test is between two nodes that reside on the same switch. CPT configuration is the default; in this case two partitions, because we have two sockets on these nodes.

> lctl get_param cpu_partition_table
cpu_partition_table=
0 : 0 2 4 6 8 10 12 14 16 18 20 22
1 : 1 3 5 7 9 11 13 15 17 19 21 23

Top output shows 6 of the 12 threads contributing, all from one socket. We tried playing with the value of nscheds, which seems to default to 6. We attempted to set it to 24 to match the core count, and while we did get 24 threads, it didn't make a difference.

21751 root 20 0 0 0 0 R 20.9 0.0 49:09.76 socknal_sd00_00
21754 root 20 0 0 0 0 S 17.9 0.0 49:17.12 socknal_sd00_03
21756 root 20 0 0 0 0 S 17.5 0.0 49:12.60 socknal_sd00_05
21753 root 20 0 0 0 0 S 16.9 0.0 49:12.37 socknal_sd00_02
21752 root 20 0 0 0 0 S 16.2 0.0 49:09.85 socknal_sd00_01
21755 root 20 0 0 0 0 S 16.2 0.0 49:14.87 socknal_sd00_04

That being said, my plan this morning is to test the system after completely removing the bond. I'm planning on using a single connection rather than both, and will test it both standalone and using MR with logical interfaces.

Andreas,

The 30/35GB/s numbers are from a system-wide IOR, so across more than a single host. I used it as an example, but I shouldn't have, to avoid expanding the scope of the ticket to an entire cluster. To simplify things, single-node IOR sees slightly less than the 2.5GB/s of an lnet selftest, so I've been focusing on single node-to-node performance for debugging. I mentioned the system-wide numbers only to show that scaling doesn't help, even with hundreds of clients.

The individual CPU usage during a node-to-node test is fairly balanced across the cores. We don't seem to utilize any single core more than 35%. The command line for iperf is really basic: 6 TCP connections are needed to fully utilize the 100G link, with -P 1 producing a little over 20Gb/s. That matches the 2.5GB/s number we're seeing out of lnet selftest, but doesn't explain why we still only see 2.5GB/s when running a test with multiple lnet selftest "clients" to a single "server", as that should produce multiple TCP connections. Maybe our understanding here is backwards. I'll be testing with the multiple virtual multirail interfaces today, which I guess will test this theory.

Thanks for all the help!
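P.S. For reference, the single- vs multi-stream iperf comparison would look roughly like this (the host name is a placeholder, not the exact commands we ran):
iperf -c oss-test-node -P 1 -t 30    # single TCP stream, a little over 20Gb/s for us
iperf -c oss-test-node -P 6 -t 30    # six parallel streams, close to line rate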
|
| Comment by James A Simmons [ 05/Jan/21 ] |
|
Note there is a huge difference between 2.12 and master for ksocklnd. The port of LU-12815 is pretty nasty. |
| Comment by James A Simmons [ 05/Jan/21 ] |
|
Do we really only need a port of https://review.whamcloud.com/#/c/41056 or is the whole patch series needed? |
| Comment by Amir Shehata (Inactive) [ 06/Jan/21 ] |
|
These changes come in two parts:
|
| Comment by Jeff Niles [ 06/Jan/21 ] |
|
Just to toss a quick update out: we tested the multirail virtual interface setup and can get much better rates from single node to single node with lnet_selftest. We can't really test a full file system run without a huge effort to deploy that across the system, so we're shelving that for now. Is this a common problem on 100G ethernet, or are there just not many 100G eth based systems deployed? Path forward: we're going to attempt to move to a 2.14 (2.13.latest, I guess) server with a |
| Comment by Andreas Dilger [ 06/Jan/21 ] |
|
Jeff, this is exactly why the socklnd conns_per_peer parameter was being added - because the single-socket performance is just unable to saturate the network on high-speed Ethernet connections. This is not a problem for o2iblnd except for OPA. |
| Comment by Amir Shehata (Inactive) [ 06/Jan/21 ] |
|
Jeff, another data point: when you switched to MR with virtual interfaces, was the load distributed to all the socklnd worker threads? If you could confirm the socklnd worker thread usage, that would be great. Thanks. |
| Comment by Jeff Niles [ 06/Jan/21 ] |
|
Amir, are you talking about the socknal_sd01_xx threads? If so, the work did span all of them. I just swapped to using a patched server/client with |
| Comment by James A Simmons [ 06/Jan/21 ] |
|
I did a back port of the |
| Comment by Amir Shehata (Inactive) [ 06/Jan/21 ] |
|
Jeff, when you say "it only uses half", do you mean there are half as many threads as you configured nscheds to? If so, that's how it's supposed to work. The idea is not to consume all the cores with LND threads, so that other processes can use the system as well. |
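As a quick way to count the scheduler threads actually running (just a sketch):
ps -e -o comm= | grep -c '^socknal_sd'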
| Comment by Jeff Niles [ 07/Jan/21 ] |
|
Sort of. It used 12 of the 24 configured threads. I've since reduced this, but wanted to mention what I was seeing in testing. I performed quite a few more tests today with `options ksocklnd sock_timeout=100 credits=2560 peer_credits=63 conns_per_peer=8 nscheds=12`. conns_per_peer=8 seemed to give the best performance, and nscheds had to be increased because I noticed that the 6 default threads were all pegged at 100% during an lnet selftest. Unfortunately, this isn't reflected in the single-node IOR numbers. While we saw a ~5x increase in the lnet selftest numbers, we're only seeing a 2x increase in IOR numbers: IOR writes went from ~5GB/s to 9.8GB/s and reads went from ~1.3GB/s to 2.6GB/s on a file-per-OST test (12 OSTs, 6 OSSs). I'm really trying to understand the brutal read disparity and hoping you all have some thoughts. The writes seem to prove that we can push that bandwidth over the network at least, but is there something about the read path that's different from a networking perspective? |
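For completeness, a sketch of how those settings are typically applied (the file name here is arbitrary, and the lnet/ksocklnd modules have to be reloaded for the options to take effect):
# /etc/modprobe.d/ksocklnd.conf
options ksocklnd sock_timeout=100 credits=2560 peer_credits=63 conns_per_peer=8 nscheds=12
After a reload, the current values (where exported) can be checked under /sys/module/ksocklnd/parameters/.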
| Comment by Amir Shehata (Inactive) [ 08/Jan/21 ] |
|
I chatted with Andreas about the read performance and he mentioned patch https://review.whamcloud.com/40347: "What the write workload is doing is very important also. If there are large numbers of threads/clients writing to the same file, or the file is very large and there are lots of cached pages, or lots of separate DLM locks, then there is more work for osc_page_gang_lookup() to do. Collecting perf stats for the workload and filing an LU ticket is probably the best way to go." I would suggest trying this patch to see if there is any performance improvement. It would also be good to attach the flamegraphs we captured for both IOR writes and reads to get more eyes on it. |
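A typical recipe for that kind of capture (a sketch assuming perf and Brendan Gregg's FlameGraph scripts are available on the server; not necessarily the exact commands used here):
perf record -F 99 -a -g -- sleep 30    # sample all CPUs with call graphs while the IOR runs
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > ior_read.svg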
| Comment by Andreas Dilger [ 09/Jan/21 ] |
|
nilesj, could you share the performance results for different conns_per_peer values? It would be useful to include a table with this information in the commit message for the patch. As for the patch Amir mentioned, that was speculation regarding high CPU usage in osc_page_gang_lookup(). I can't definitively say whether that patch will help improve performance or not. Getting the flamegraphs for this would be very useful, along with what the test workload/parameters are (I'd assume IOR, but the options used are critical). |
| Comment by Jeff Niles [ 09/Jan/21 ] |
|
Here's a table of performance values using lnet_selftest at various conns_per_peer values. I assume this changes with CPU clock speed and other factors, but it at least shows the scaling for the patch commit message.
conns_per_peer setting - speed
We did some more troubleshooting on our end yesterday and are suspecting some serious ZFS issues. We're currently testing an older ZFS version and will comment again later after some more testing. |
| Comment by Jeff Niles [ 10/Jan/21 ] |
|
Update on where we are:

When we stood up the system we ran some benchmarks on the raw block storage, so we're confident that the block storage can provide ~7GB/s read per LUN, with ~65GB/s read across the 12 LUNs in aggregate. What we did not do, however, was run any benchmarks on ZFS after the zpools were created on top of the LUNs. Since LNET was no longer our bottleneck, we figured it would make sense to verify the stack from the bottom up, starting with the zpools. We set the zpools to `canmount=on` and changed the mountpoints, then mounted them and ran fio on them. Performance is terrible.

Given that we have another file system running with the exact same tunings and general layout, we also checked that file system in the same manner, with much the same results. Since we have past benchmarking results from that file system, we're fairly confident that at some point in the past ZFS was functioning correctly. With that knowledge (and after looking at various ZFS GitHub issues) we decided to roll back from zfs 0.8.5 to 0.7.13 to test the performance there. It seems that 0.7.13 is also providing the same results. There may be value in rolling back our kernel to match what it was when we initialized the other file system, in case there is some odd interaction with the kernel version we're running, but I'm not sure.

Here are the results of our testing on a single LUN with ZFS. Keep in mind this LUN can do ~7GB/s at the block level.
And here's the really simple fio we're running to get these numbers:
fio --rw=read --size 20G --bs=1M --name=something --ioengine=libaio --runtime=60s --numjobs=12
We're also noticing some issues where Lustre is eating into those numbers significantly when layered on top. We're going to hold off on debugging that until ZFS is stable though, as it may just be due to the same ZFS issues. |
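As a sketch of the dataset re-mounting step described above (the pool/dataset name and mountpoint are placeholders):
zfs set canmount=on ostpool0/ost0
zfs set mountpoint=/mnt/zfs_test ostpool0/ost0
zfs mount ostpool0/ost0
fio can then be pointed at /mnt/zfs_test directly, bypassing Lustre.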
| Comment by Jeff Niles [ 10/Jan/21 ] |
|
As a side note, it may make sense for us to close this particular issue and open a new, tailored one for the issues we're seeing now, since the particular issue described in this ticket (slow lnet performance) has been resolved with the `conns_per_peer` patch. Along those lines, since that patch resolved our network performance issue and we'd like to keep running 2.12 (LTS), could we lobby to get James' backport of it included in the next 2.12 point release so that we don't have to keep carrying that patch? |
| Comment by Andreas Dilger [ 11/Jan/21 ] |
|
Jeff, I definitely have some comments related to ZFS performance, but it should really go into a separate ticket. If I file that ticket, it will not be tracked correctly as a customer issue, so it is best if you do that. As for including conns_per_peer into 2.12, that is a bit tricky in the short term since that patch depends on another one that is removing the socklnd-level TCP bonding feature. While the LNet Multi-Rail provides better functionality, use_tcp_bonding may be in use at customer sites and shouldn't be removed in an LTS release without any warning. A patch will go into the next 2.12.7 LTS and 2.14.0 releases to announce that this option is deprecated, which will allow sites to become aware of this change and move over to LNet Multi-Rail. I've asked in |
| Comment by Andreas Dilger [ 11/Jan/21 ] |
|
It might make sense to keep this issue open to track the socklnd conns_per_peer feature for your use in 2.12.x, since |
| Comment by Peter Jones [ 11/Jan/21 ] |
|
Yup I agree - new ticket for the latest issues and we can leave this open until the |
| Comment by Jeff Niles [ 11/Jan/21 ] |
|
Sounds good. The new issue is LU-14320. Thanks everyone! |
| Comment by James A Simmons [ 31/Jan/22 ] |
|
The patches for |
| Comment by James A Simmons [ 02/Mar/22 ] |
|
After talking to Peter Jones, this is being treated as a new feature, so it will not be landed to 2.12 LTS. We can close this ticket. |