[LU-14293] Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Won't Fix
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.12.6
Labels:
- ORNL
- ornl

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

During performance testing of a new Lustre file system, we discovered that read/write performance aren't where we would expect. As an example, the block level read performance for the system is just over 65GB/s. In scaling tests, we can only get to around 30 GB/s for reads. Writes are slightly better, but still in the 35GB/s range. At single node scale, we seem to cap out at a few GB/s.

After going through tunings and everything that we can find, we're slightly better, but still miles behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer size, etc), but really to not much change. Current tunings that may be relevant: credits 2560, peer credits 63, max_rpcs_in_flight 32.

Network configuration on the servers is 2x 100G ethernet bonded together (active/active) using kernel bonding (not ksocklnd bonding).

iperf between two nodes gets nearly line rate at ~98Gb/s and iperf from two nodes to a single node can push ~190Gb/s, consistent with what would be expected from the kernel bonding.

lnet selftest shows about ~2.5GB/s (20Gb/s) rates for node to node tests. I'm not sure if this is a bug in lnet selftest or a real reflection of the performance.

We found the following related tickets/mailing list discussions which seem to be very similar to what we're seeing, but with no resolutions:

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html

https://jira.whamcloud.com/browse/LU-11415

https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)

Any help or suggestions would be awesome.

Thanks!

Jeff

Attachments

Issue Links

duplicates

LU-12815 Create multiple TCP sockets per SockLND

Resolved

is related to

LU-14320 Poor zfs performance (particularly reads) with ZFS 0.8.5 on RHEL 7.9

Closed

is related to

LU-14676 Better hash distribution to different CPTs when LNET router is exist

Resolved

Activity

[LU-14293] Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet

Jeff Niles added a comment - 06/Jan/21 4:15 AM

Just to toss a quick update out: tested the multirail virtual interface setup and can get much better rates from single node -> single node with lnet_selftest. Can't really test a full file system run without huge effort to deploy that across the system, so shelving that for now.

Is this a common problem on 100G ethernet, or are there just not many 100G eth based systems deployed?

Path forward: We're going to attempt to move to a 2.14 (2.13.latest I guess) server with a ~~LU-12815~~ patch and test with the conns-per-peer feature. This is the quickest path forward to test, rather than re-deploying without kernel bonding. Will update with how this goes tomorrow.

Jeff Niles added a comment - 06/Jan/21 4:15 AM Just to toss a quick update out: tested the multirail virtual interface setup and can get much better rates from single node -> single node with lnet_selftest. Can't really test a full file system run without huge effort to deploy that across the system, so shelving that for now. Is this a common problem on 100G ethernet, or are there just not many 100G eth based systems deployed? Path forward: We're going to attempt to move to a 2.14 (2.13.latest I guess) server with a LU-12815 patch and test with the conns-per-peer feature. This is the quickest path forward to test, rather than re-deploying without kernel bonding. Will update with how this goes tomorrow.

Amir Shehata (Inactive) added a comment - 06/Jan/21 12:08 AM

these changes are coming into two bits:
1) remove socklnd bonding code since it's not really needed and it simplifies the code
2) Add the ~~LU-12815~~ patch on top of it, which adds the conns-per-peer feature

~~LU-12815~~ changes build on # 1. So unfortunately the entire series need to be ported over once it lands, if you wish to use it in 2.12.

Amir Shehata (Inactive) added a comment - 06/Jan/21 12:08 AM these changes are coming into two bits: 1) remove socklnd bonding code since it's not really needed and it simplifies the code 2) Add the LU-12815 patch on top of it, which adds the conns-per-peer feature LU-12815 changes build on # 1. So unfortunately the entire series need to be ported over once it lands, if you wish to use it in 2.12.

James A Simmons added a comment - 05/Jan/21 7:11 PM

Do we really only need a port of https://review.whamcloud.com/#/c/41056 or is the whole patch series needed?

James A Simmons added a comment - 05/Jan/21 7:11 PM Do we really only need a port of https://review.whamcloud.com/#/c/41056 or is the whole patch series needed?

James A Simmons added a comment - 05/Jan/21 6:32 PM

Note their is a huge difference between 2.12 and master for ksocklnd. The port of LLU-12815 is pretty nasty.

James A Simmons added a comment - 05/Jan/21 6:32 PM Note their is a huge difference between 2.12 and master for ksocklnd. The port of LLU-12815 is pretty nasty.

Jeff Niles added a comment - 05/Jan/21 2:31 PM

Amir,

The simplest test is between two nodes that reside on the same switch. CPT configuration is the default; in this case two partitions because we have two sockets on these.

> lctl get_param cpu_partition_table
cpu_partition_table=
0 : 0 2 4 6 8 10 12 14 16 18 20 22
1 : 1 3 5 7 9 11 13 15 17 19 21 23

Top output shows 6 of the 12 threads contributing, all from one socket. We tried playing with the value of nscheds, which seems to default to 6. We attempted to set it to 24 to match core count, and while we did get 24 threads, it didn't make a difference.

21751 root 20 0 0 0 0 R 20.9 0.0 49:09.76 socknal_sd00_00
 21754 root 20 0 0 0 0 S 17.9 0.0 49:17.12 socknal_sd00_03
 21756 root 20 0 0 0 0 S 17.5 0.0 49:12.60 socknal_sd00_05
 21753 root 20 0 0 0 0 S 16.9 0.0 49:12.37 socknal_sd00_02
 21752 root 20 0 0 0 0 S 16.2 0.0 49:09.85 socknal_sd00_01
 21755 root 20 0 0 0 0 S 16.2 0.0 49:14.87 socknal_sd00_04

I don't believe that ~~LU-12815~~ is the issue because when I run an lnet selftest from two or more nodes to a single node, I still only get ~2.5GB/s. Basically the bandwidth gets split across the two or more nodes and they each only see their portion of the 2.5 GB/s. My understanding of that LU is that it only helps in single connection applications; I would think that running an lnet selftest from multiple nodes to a single node would get me around that issue. Please let me know if this thinking is wrong.

That being said, my plan this morning is to test the system after completely removing the bond. I'm planning on using one single connection rather than both and will test it standalone and using MR with logical interfaces.

Andreas,

The 30/35GB/s numbers are from a system-wide IOR, so across more than a single host. I used it as an example, but to avoid expanding the scope of the ticket to include an entire cluster, I shouldn't have. To simplify things, single node IOR sees slightly less than the 2.5GB/s of an lnet selftest, so I've been focusing on single node to node performance for debugging. I guess I mentioned the system wide numbers just to state that scaling doesn't help, even with hundreds of clients.

The individual CPU usage during a node to node test is fairly balanced across the cores. We don't seem to utilize any single core more than 35%.

Command line for iperf is really basic. 6 TCP connections are needed to fully utilize the 100G link, with -P 1 producing a little over 20Gb/s. This does match with the 2.5GB/s number that we're seeing out of lnet selftest, but doesn't explain why we still only see 2.5GB/s when running a test with multiple lnet selftest "clients" to a single "server", as that should be producing multiple TCP connections. Maybe our understanding here is backwards. I'll be testing with the multiple virtual multirail interfaces today, which I guess will test this theory.

Thanks for all the help!

Jeff

Jeff Niles added a comment - 05/Jan/21 2:31 PM Amir, The simplest test is between two nodes that reside on the same switch. CPT configuration is the default; in this case two partitions because we have two sockets on these. > lctl get_param cpu_partition_table cpu_partition_table= 0 : 0 2 4 6 8 10 12 14 16 18 20 22 1 : 1 3 5 7 9 11 13 15 17 19 21 23 Top output shows 6 of the 12 threads contributing, all from one socket. We tried playing with the value of nscheds, which seems to default to 6. We attempted to set it to 24 to match core count, and while we did get 24 threads, it didn't make a difference. 21751 root 20 0 0 0 0 R 20.9 0.0 49:09.76 socknal_sd00_00 21754 root 20 0 0 0 0 S 17.9 0.0 49:17.12 socknal_sd00_03 21756 root 20 0 0 0 0 S 17.5 0.0 49:12.60 socknal_sd00_05 21753 root 20 0 0 0 0 S 16.9 0.0 49:12.37 socknal_sd00_02 21752 root 20 0 0 0 0 S 16.2 0.0 49:09.85 socknal_sd00_01 21755 root 20 0 0 0 0 S 16.2 0.0 49:14.87 socknal_sd00_04 I don't believe that LU-12815 is the issue because when I run an lnet selftest from two or more nodes to a single node, I still only get ~2.5GB/s. Basically the bandwidth gets split across the two or more nodes and they each only see their portion of the 2.5 GB/s. My understanding of that LU is that it only helps in single connection applications; I would think that running an lnet selftest from multiple nodes to a single node would get me around that issue. Please let me know if this thinking is wrong. That being said, my plan this morning is to test the system after completely removing the bond. I'm planning on using one single connection rather than both and will test it standalone and using MR with logical interfaces. Andreas, The 30/35GB/s numbers are from a system-wide IOR, so across more than a single host. I used it as an example, but to avoid expanding the scope of the ticket to include an entire cluster, I shouldn't have. To simplify things, single node IOR sees slightly less than the 2.5GB/s of an lnet selftest, so I've been focusing on single node to node performance for debugging. I guess I mentioned the system wide numbers just to state that scaling doesn't help, even with hundreds of clients. The individual CPU usage during a node to node test is fairly balanced across the cores. We don't seem to utilize any single core more than 35%. Command line for iperf is really basic. 6 TCP connections are needed to fully utilize the 100G link, with -P 1 producing a little over 20Gb/s. This does match with the 2.5GB/s number that we're seeing out of lnet selftest, but doesn't explain why we still only see 2.5GB/s when running a test with multiple lnet selftest "clients" to a single "server", as that should be producing multiple TCP connections. Maybe our understanding here is backwards. I'll be testing with the multiple virtual multirail interfaces today, which I guess will test this theory. Thanks for all the help! Jeff

Andreas Dilger added a comment - 05/Jan/21 11:11 AM

PS: I would agree with Amir that ~~LU-12815~~ would seem like a likely candidate for fixing this problem (and fortunately has some recently-developed patches), since very high-speed Ethernet interfaces seem to have difficulty saturating a single connection. What were the command-line parameters used for iperf? In particular, was it run with -P to create multiple TCP connections?

Andreas Dilger added a comment - 05/Jan/21 11:11 AM PS: I would agree with Amir that LU-12815 would seem like a likely candidate for fixing this problem (and fortunately has some recently-developed patches), since very high-speed Ethernet interfaces seem to have difficulty saturating a single connection. What were the command-line parameters used for iperf ? In particular, was it run with -P to create multiple TCP connections?

Andreas Dilger added a comment - 05/Jan/21 10:59 AM

Maybe I'm missing something obvious/unstated here, but if you get 30G*Bytes*/s for reads and 35G*Bytes*/s for writes (I'm assuming that is client-side performance with something like IOR, but the details would be useful), that is exceeding the 200 G*bits*/s ~= 25GBytes/s network bandwidth of the server? Are there multiple servers involved in your testing? What kind? How many clients? It would be useful to add this to the Environment section of the ticket.

What is the individual CPU usage of the server during the testing (not the average across all cores)? Using TCP can be CPU hungry, and depending on how your TCP bonding is configured, it may put all of the load on a few cores, so "top" may show a 6% average CPU usage, but that is e.g. 100% of 1 of 16 cores.

Andreas Dilger added a comment - 05/Jan/21 10:59 AM Maybe I'm missing something obvious/unstated here, but if you get 30G*Bytes*/s for reads and 35G*Bytes*/s for writes (I'm assuming that is client-side performance with something like IOR, but the details would be useful), that is exceeding the 200 G*bits*/s ~= 25GBytes/s network bandwidth of the server? Are there multiple servers involved in your testing? What kind? How many clients? It would be useful to add this to the Environment section of the ticket. What is the individual CPU usage of the server during the testing (not the average across all cores)? Using TCP can be CPU hungry, and depending on how your TCP bonding is configured, it may put all of the load on a few cores, so " top " may show a 6% average CPU usage, but that is e.g. 100% of 1 of 16 cores.

Amir Shehata (Inactive) added a comment - 05/Jan/21 2:35 AM

Is the test between two nodes? What is your CPT configuration? By default it should be based on the NUMA config of the node. The CPT configuration controls the number of thread worker pools created in the socklnd. When you run your test do you see the work distributed over all the worker threads or only a subset of them? Can you share top while running a test?

Regarding ~~LU-12815~~, can you explain why you think it's not a likely cause?

We can try to verify as follows:

Would you be able to use MR instead of kernel bonding? You'd configure both the interfaces on the same LNet:

lnetctl net add --net tcp --if eth0,eth1

And then attempt to measure the performance again. If you see an improvement, then try to create multiple logical interfaces per interface and then include them all on the same LNet. Something like:

lnetctl net add --net tcp --if eth0,eth0:1,eth0:2,eth1,eth1:1,eth1:2

Doing that will create multiple sockets for read/write.

Would be interesting to see the results of this experiment.

Amir Shehata (Inactive) added a comment - 05/Jan/21 2:35 AM Is the test between two nodes? What is your CPT configuration? By default it should be based on the NUMA config of the node. The CPT configuration controls the number of thread worker pools created in the socklnd. When you run your test do you see the work distributed over all the worker threads or only a subset of them? Can you share top while running a test? Regarding LU-12815 , can you explain why you think it's not a likely cause? We can try to verify as follows: Would you be able to use MR instead of kernel bonding? You'd configure both the interfaces on the same LNet: lnetctl net add --net tcp -- if eth0,eth1 And then attempt to measure the performance again. If you see an improvement, then try to create multiple logical interfaces per interface and then include them all on the same LNet. Something like: lnetctl net add --net tcp -- if eth0,eth0:1,eth0:2,eth1,eth1:1,eth1:2 Doing that will create multiple sockets for read/write. Would be interesting to see the results of this experiment.

Peter Jones added a comment - 04/Jan/21 11:29 PM

Amir

Could you please advise?

Thanks

Peter

Peter Jones added a comment - 04/Jan/21 11:29 PM Amir Could you please advise? Thanks Peter

People

Assignee:: Amir Shehata (Inactive)

Reporter:: Jeff Niles

Votes:: 0 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 04/Jan/21 10:51 PM

Updated:: 02/Mar/22 5:49 PM

Resolved:: 02/Mar/22 5:49 PM