[LU-14293] Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Severity: 3

    Description

      During performance testing of a new Lustre file system, we discovered that read/write performance isn't where we would expect. As an example, the block-level read performance for the system is just over 65 GB/s. In scaling tests, we can only get to around 30 GB/s for reads. Writes are slightly better, but still in the 35 GB/s range. At single-node scale, we seem to cap out at a few GB/s.

      After going through the tunings and everything else we could find, we're slightly better, but still miles behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer size, etc.), but with very little change. Current tunings that may be relevant: credits=2560, peer_credits=63, max_rpcs_in_flight=32.
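
      For reference, a rough sketch of how tunings like these are typically applied (module options at load time, RPC concurrency at runtime); the values shown are just the ones listed above, not recommendations:

          # /etc/modprobe.d/ksocklnd.conf (illustrative)
          options ksocklnd credits=2560 peer_credits=63

          # per-OSC RPC concurrency, set at runtime on the clients
          lctl set_param osc.*.max_rpcs_in_flight=32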

      Network configuration on the servers is 2x 100G ethernet bonded together (active/active) using kernel bonding (not ksocklnd bonding).
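
      (For context: with kernel bonding, LNet only sees the single bond device. A minimal sketch of that configuration, assuming the bond is named bond0, would be:)

          lnetctl lnet configure
          lnetctl net add --net tcp --if bond0
          lnetctl net show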

      iperf between two nodes gets nearly line rate at ~98Gb/s and iperf from two nodes to a single node can push ~190Gb/s, consistent with what would be expected from the kernel bonding.
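
      (The iperf test meant here is roughly the following, with iperf3 shown and hostnames as placeholders:)

          # on the receiving node
          iperf3 -s
          # on each sending node; -P adds parallel streams
          iperf3 -c oss-node -P 8 -t 30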

      lnet selftest shows about ~2.5GB/s (20Gb/s) rates for node to node tests. I'm not sure if this is a bug in lnet selftest or a real reflection of the performance.
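
      (For reproducibility, a node-to-node lnet_selftest run of this sort looks roughly like the following; the NIDs are placeholders and the lnet_selftest module must be loaded on both nodes:)

          modprobe lnet_selftest
          export LST_SESSION=$$
          lst new_session rw_test
          lst add_group servers 192.168.1.10@tcp
          lst add_group clients 192.168.1.11@tcp
          lst add_batch bulk
          lst add_test --batch bulk --from clients --to servers brw read size=1M
          lst run bulk
          lst stat clients        # shows transfer rates; interrupt with Ctrl-C
          lst stop bulk
          lst end_session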

      We found the following related tickets/mailing list discussions which seem to be very similar to what we're seeing, but with no resolutions:

      http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html

      https://jira.whamcloud.com/browse/LU-11415

      https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)

       

      Any help or suggestions would be awesome.

      Thanks!

      • Jeff


          Activity


            adilger Andreas Dilger added a comment -

            nilesj could you share the performance results for different conns_per_peer values? It would be useful to include a table with this information in the commit message for the patch.

            As for the patch Amir mentioned, that was speculation regarding high CPU usage in osc_page_gang_lookup(). I can't definitively say whether that patch will help improve performance or not. Getting the flamegraphs for this would be very useful, along with what the test workload/parameters are (I'd assume IOR, but the options used are critical).


            ashehata Amir Shehata (Inactive) added a comment -

            I chatted with Andreas about the read performance and he mentioned this:


            there is patch https://review.whamcloud.com/40347 "LU-9920 vvp: dirty pages with pagevec" that is on master, but not 2.12 yet

            what the write workload is doing is very important also. If there are large numbers of threads/clients writing to the same file, or the file is very large and there are lots of cached pages, or lots of separate DLM locks, then there is more work for osc_page_gang_lookup() to do. collecting perf stats for the workload and filing an LU ticket is probably the best way to go.


            I would suggest trying this patch to see if there are any performance improvements. It would also be good to attach the flamegraphs we captured for both IOR writes and reads, to get more eyes on them.
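
            As an illustrative way to collect the perf data/flamegraphs being suggested here (assuming perf and Brendan Gregg's FlameGraph scripts are available on the client; the output name is just an example):

                # sample all CPUs with call graphs while the IOR read phase is running
                perf record -a -g -F 99 -- sleep 60
                perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > ior_read_flamegraph.svg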

            nilesj Jeff Niles added a comment -

            Sort of. It used 12 of the 24 configured threads. I've since reduced this, but wanted to mention what I was seeing in testing.

            I performed quite a few more tests today with the LU-12815 patch applied and various tunings, and have some good news. With the patch, we can see nearly line rate with lnet selftest (11.5-12.0GB/s, up from ~2.5GB/s). Current tunings:

            options ksocklnd sock_timeout=100 credits=2560 peer_credits=63 conns_per_peer=8 nscheds=12
            

            8 conns_per_peer seemed to give the best performance, and nscheds had to be increased because I noticed that the 6 default threads were all 100% pegged during an lnet selftest.
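
            (The per-thread usage mentioned above can be watched during a test with something like the following; socknal_sd* are ksocklnd's scheduler threads:)

                # one batch snapshot of CPU usage for the ksocklnd scheduler threads
                top -b -H -n 1 | grep socknal_sd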

            Unfortunately, this isn't reflected in the single-node IOR numbers. While we saw a ~5x increase in the lnet selftest numbers, we're only seeing a 2x increase in IOR numbers. IOR writes went from ~5GB/s to 9.8GB/s and reads went from ~1.3GB/s to 2.6GB/s on a file-per-OST test (12 OSTs, 6 OSSs). Really trying to understand the brutal read disparity; hoping you all have some thoughts. The writes seem to prove that we can push that bandwidth over the network at least, but is there something about the read path that's different from a networking perspective?


            ashehata Amir Shehata (Inactive) added a comment -

            Jeff, when you say "it only uses half", do you mean there are half the number of threads you configured nscheds to? If so, that's how it's supposed to work. The idea is not to consume all the cores with LND threads, so that other processes can use the system as well.


            simmonsja James A Simmons added a comment -

            I did a backport of the LU-12815 work for 2.12 and we have full use of our Ethernet network.

            nilesj Jeff Niles added a comment -

            Amir,

            Are you talking about the socknal_sd01_xx threads? If so, the work did span all of them. I just swapped to using a patched server/client with LU-12815 included, and it seems that when I had the default 6, they were all being used, but if I increase `nscheds` to 24 (just matching core count), it only uses half. Really interesting behavior.


            ashehata Amir Shehata (Inactive) added a comment -

            Jeff, another data point: when you switched to MR with virtual interfaces, was the load distributed across all the socklnd worker threads?
            The reason I'm interested in this is that work is assigned to the different CPTs by hashing the NID. The hash function gets us to one of the CPTs, and then we pick one of the threads in that pool. If we have a single NID, we'll always hash into the same CPT, and therefore we will not be utilizing all the worker threads. This could be another factor in the performance issue you're seeing.

            If you could confirm the socklnd worker thread usage, that'll be great.

            thanks

            adilger Andreas Dilger added a comment - edited

            Jeff, this is exactly why the socklnd conns_per_peer parameter was being added - because the single-socket performance is just unable to saturate the network on high-speed Ethernet connections. This is not a problem for o2iblnd except for OPA.

            nilesj Jeff Niles added a comment -

            Just to toss a quick update out: tested the multirail virtual interface setup and can get much better rates from single node -> single node with lnet_selftest. Can't really test a full file system run without huge effort to deploy that across the system, so shelving that for now.
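
            For reference, the multi-rail virtual-interface approach boils down to giving LNet two NIDs instead of one bonded interface, roughly as follows (interface names are placeholders, not the exact config used here):

                lnetctl lnet configure
                lnetctl net add --net tcp --if eth100g0,eth100g1
                lnetctl net show
                lnetctl discover <peer-nid>    # verify the peer sees both NIDs via discovery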

            Is this a common problem on 100G ethernet, or are there just not many 100G eth based systems deployed?

            Path forward: We're going to attempt to move to a 2.14 (2.13.latest I guess) server with a LU-12815 patch and test with the conns-per-peer feature. This is the quickest path forward to test, rather than re-deploying without kernel bonding. Will update with how this goes tomorrow.


            ashehata Amir Shehata (Inactive) added a comment -

            These changes are coming in two parts:
            1) Remove the socklnd bonding code, since it's not really needed and removing it simplifies the code.
            2) Add the LU-12815 patch on top of that, which adds the conns-per-peer feature.

            The LU-12815 changes build on #1, so unfortunately the entire series needs to be ported over once it lands, if you wish to use it in 2.12.


            simmonsja James A Simmons added a comment -

            Do we really only need a port of https://review.whamcloud.com/#/c/41056 or is the whole patch series needed?


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: nilesj Jeff Niles
              Votes: 0
              Watchers: 11
