Liang, thanks for your suggestions. I started working through the options and came up with a solution that should work for us. With what I'm about to describe, I reliably streamed files at 7.2 to 7.4 GB/s to 12 clients, with each client reading 8 files. I think there's room for improvement in the performance, and certainly in reducing the number of clients needed, but this was repeatable and it's a lot of progress.
First, I made a mistake about the placement of the HBAs: two of them are on CPU0 with the NICs. All of this was on the server with dual Intel E5-2650v2 processors (8-core, 2.6 GHz). In ASCII art, the PCI layout looks like this:
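(Rough sketch only; slot and lane details are left out, and CPU1 simply holds the HBAs that are not on CPU0.)

    CPU0 ----QPI---- CPU1
     |                |
     +- NICs (bonded) +- remaining HBAs
     +- HBA
     +- HBA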
We have the freedom to move cards around (somewhat), but not to break the network bonding. The ZFS zpools are configured as raidz2 8+2, with one 10-drive pool spanning the 25-drive HBAs on CPU0 and CPU1.
What I found was that restricting the ksocklnd tasks to CPU0 had the biggest impact, and that it was better to let the other tasks run on both CPU0 and CPU1. Here are the configuration files from the servers:
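In rough form (a minimal sketch, not a verbatim copy of our files; the bond0 interface name, the file path, and the exact core split are placeholders):

    # /etc/modprobe.d/lustre.conf on the OSS (illustrative)
    # Two CPU partitions, one per NUMA node; cores 0-1 on node 0 are kept
    # out of CPT 0 so they stay free for interrupts (see the IRQ note below).
    options libcfs cpu_pattern="0[2-7] 1[8-15]"
    # Bind the TCP LNet interface, and with it the ksocklnd threads,
    # to CPT 0, the partition on the socket that has the NICs.
    options lnet networks="tcp0(bond0)[0]"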
Moving the various OSS tasks to partition 0 or 1 did not help, most likely because the topology does not match what I described originally.
The client configuration is minimal, with the only change being max_rpcs_in_flight set to 16.
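That is the standard per-OSC tunable; on each client it amounts to, for example:

    lctl set_param osc.*.max_rpcs_in_flight=16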
You'll note that the number of credits and RPCs in flight did not need to be very high. I attribute this to a relatively low bandwidth-delay product (10 GB/s x 0.1 ms = 1 MB). I tested larger values for the maximum pages per RPC, but that drove down performance. I need to revisit that, since it could be related to the BDP or the ZFS record size (also 1 MB), or it could be improved by the ZFS tuning I did.
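For context, the arithmetic behind the 1 MB figure and how it lines up with the RPC size (back-of-the-envelope numbers, not measurements):

    # bandwidth-delay product for this setup
    #   10 GB/s * 0.1 ms = 10e9 B/s * 1e-4 s = 1e6 B ~= 1 MB
    # with 4 KiB pages, a 1 MiB RPC corresponds to 1 MiB / 4 KiB = 256 pages
    lctl get_param osc.*.max_pages_per_rpc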
One thing that surprised me was that setting the IRQ affinity for the Mellanox NICs reduced performance. However, it was still better to restrict the CPU partition on NUMA node 0 to cores [2-7].
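For concreteness, this is the kind of IRQ affinity change I mean (illustrative only; the mlx4 driver name and the IRQ numbers depend on the exact NIC, and pinning to cores 0-1 is just one possible mask):

    # see which IRQs belong to the NIC and where they are allowed to run
    grep mlx4 /proc/interrupts
    cat /proc/irq/<irq>/smp_affinity
    # e.g. pin an IRQ to cores 0-1 (mask 0x3); this sort of pinning hurt here
    echo 3 > /proc/irq/<irq>/smp_affinity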
The last thing that helped get the performance up was to improve the chances for ZFS to prefetch data. While testing, I ran an experiment to separate the impact of the networking from that of ZFS: several (~10) clients read the same 64 GiB file from an OST. The file size was chosen to match the maximum size of the ZFS ARC, plus whatever caches Lustre had. When doing this, the server bandwidth was saturated at 10 GB/s, which showed that getting data from the drives into memory was the critical part, even when the data had to cross the QPI link.
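For reference, the ARC ceiling I'm matching against is the zfs_arc_max module parameter (0 means the built-in default), which can be checked at runtime:

    cat /sys/module/zfs/parameters/zfs_arc_max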
The branch of ZFS I'm using sets most of the tuning parameters to 0, and the important one turned out to be zfs_vdev_cache_size. My reading of random blog posts suggests that this impacts prefetch from the DMU.
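Concretely, the change amounts to giving the vdev cache a non-zero size; the 10 MiB value below is just an example, not a recommendation:

    # /etc/modprobe.d/zfs.conf (example value)
    options zfs zfs_vdev_cache_size=10485760
    # or at runtime:
    echo 10485760 > /sys/module/zfs/parameters/zfs_vdev_cache_size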
Regardless, this immediately improved the rate at which the zpools could deliver data.
This is a bit of a long comment because I wanted to capture a lot of the details. If you see anything worth examining given my corrected information, please let me know. Our next step from here is to try incorporating the patches we're using into a stable release, and to retest with the Linux 2.6 kernel or with the 3.10 kernel-lt package from ELRepo.
I don't think that any further work is needed here.