[LU-6228] How to balance network connections across socknal_sd tasks? Created: 10/Feb/15  Updated: 24/Mar/18  Resolved: 24/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Major
Reporter: Rick Wagner (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Won't Fix Votes: 0
Labels: sdsc
Environment:

Linux 3.10


Attachments: File lnet-bandwidth-cdev-single.sh     Text File lnet-results-2cli.txt     Text File lnet-results-alternate-NICs-irqmap.txt     Text File lnet-results-alternate-NICs.txt     File lnet-test-2cli.sh     File lnet-test-alt-nics-irqmap.sh     File lnet-test-alt-nics.sh     Text File lst-1-to-1-conc-1-to-64.txt    
Issue Links:
Related
is related to LU-5278 ZFS - many OST watchdogs with IOR Resolved
Epic/Theme: Performance
Rank (Obsolete): 17437

 Description   

While using the ksocklnd LNET driver, I've noticed uneven load across the socknal_sd* tasks on an OSS. The number of tasks is controllable using combinations of nscheds and cpu_npartitions or cpu_pattern. I've also tried adjusting /proc/sys/lnet/portal_rotor, but this does not appear to be the right thing to try.

On a dual socket, 6 core per processor system with

$ cat ksocklnd.conf 
options ksocklnd nscheds=6 peer_credits=128 credits=1024
$ cat libcfs.conf 
options libcfs cpu_pattern="0[0,1,2,3,4,5] 1[6,7,8,9,10,11]"

there are 12 socknal_sd tasks. However, with up to 60 clients doing the same streaming IO, only 4 of the tasks will be heavily loaded (CPU time over 80%). Oddly, when running an LNET bulk_rw self test, up to 10 of the tasks will be loaded, and the server can consume 9.2 GB/s on its bonded 40GbE links.

What am I missing? I thought it was the mapping of TCP connections to processes, but I can't seem to track them through /proc/*/fd/ and /proc/net/tcp.

I'm working from a recent pull of the master branch.
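
For reference, the imbalance is easy to see with per-thread CPU sampling, and the LND connections can be listed directly (illustrative commands):

[oss] $ ps -eLo pid,psr,pcpu,comm | grep socknal_sd   # CPU core and usage of each scheduler thread
[oss] $ lctl --net tcp conn_list                      # ksocklnd connections on this node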



 Comments   
Comment by Rick Wagner (Inactive) [ 10/Feb/15 ]

Andreas & Gabriele, I've moved my network performance questions to a separate ticket.

Comment by Jodi Levi (Inactive) [ 10/Feb/15 ]

Amir,
Would you be able to have a look at this one and comment?
Thank you!

Comment by Andreas Dilger [ 10/Feb/15 ]

The original discussion of this issue was in LU-5278:
Gabriele wrote:

Hi Rick,
take a look at /proc/sys/lnet/peers and see if your queue is big enough. If you find any negative values, please increase the peer_credits and credits values for LNET.
I can suggest as a "golden" rule:
peer_credits = max_rpc_inflight
credits = 4x peer_credits

Remember to apply these values across the whole cluster.

If you are using Ethernet, you should also tune sysctl.conf. Please refer to your Ethernet vendor's guidance. This is a good starting point from Mellanox, but it applies to other vendors as well.
http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

and Rick replied:

Gabriele, thanks. There are negative numbers in /proc/sys/lnet/peers, and even bumping up the credits on the server gave a 10% or so improvement. I'll have to shift to another set of clients to test both sides, since I'm using production system nodes as clients and can't reload the kernel modules. This would help explain the remaining bottleneck.
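
As a concrete illustration of that rule (example values only): for a client running with max_rpcs_in_flight=32, it would give

options ksocklnd peer_credits=32 credits=128  # peer_credits = max_rpcs_in_flight, credits = 4x peer_credits; set on every node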

Comment by Liang Zhen (Inactive) [ 11/Feb/15 ]

I think the socklnd scheduler is transparent to upper layers. Also, both upper layers (lnet_selftest and Lustre) share the same LND connections, so there should be no difference between them; if lnet_selftest can drive LNet hard enough and get good performance numbers, I tend to think this is not an issue in LNet/LND.
Also, peer_credits=128 is too high in my view; I only know of people needing values like that when running Lustre over a WAN. credits=1024 peer_credits=32 should be a good empirical starting point.

Comment by Rick Wagner (Inactive) [ 12/Feb/15 ]

Thanks for explaining that, Liang. What I'm seeing is that it takes a very large number of clients to get good read bandwidth numbers. Our servers have 6 OSTs, and each OST will deliver 1.5 GB/s using dd and ZFS, and 9 GB/s in aggregate. When mounting over the network, a single client will top out at 1.2 GB/s from a single OSS, no matter how many tasks are running, or whether the files are striped on single or multiple OSTs. It feels like something is holding back the per-client bandwidth. It takes four clients to get 1.5 GB/s from an OST, which should only take one.

Our servers have bonded 40GbE interfaces, and the clients use TCP via IPoIB and Mellanox gateway switches that bridge between Ethernet and InfiniBand. Here are some simple measurements to show the state of the network (I used a single stream Iperf test, because Lustre only connects over individual sockets for reads and writes):

[client] $ ping 192.168.95.158 
...
64 bytes from 192.168.95.158: icmp_seq=4 ttl=62 time=0.106 ms
64 bytes from 192.168.95.158: icmp_seq=5 ttl=62 time=0.108 ms
64 bytes from 192.168.95.158: icmp_seq=6 ttl=62 time=0.106 ms
64 bytes from 192.168.95.158: icmp_seq=7 ttl=62 time=0.103 ms
[client] $ iperf -c 192.168.95.158
------------------------------------------------------------
Client connecting to 192.168.95.158, TCP port 5001
TCP window size: 92.9 KByte (default)
------------------------------------------------------------
[  3] local 192.168.123.110 port 37190 connected with 192.168.95.158 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  34.6 GBytes  29.7 Gbits/sec

When trying to dd 4 files striped on all OSTs of the OSS, 32 peer_credits was not enough.

[oss] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd peer_credits=32 credits=1024
[oss] $ grep 110 /proc/sys/lnet/peers
192.168.123.110@tcp        33    NA    -1    32    32    32     0    -9 33556736

On the client:

max_pages_per_rpc    = 1024
max_rpcs_in_flight     = 16 
[client] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd peer_credits=32 credits=128

Observing brw_stats under /proc/fs/lustre/osd-zfs/*/brw_stats shows that I/O requests are coming in at 4M, as expected. We're running Lustre and ZFS with large block support, which is why we get good streaming performance from single OSTs.
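
(For reference, the RPC size histogram can be read straight out of brw_stats, e.g.:)

[oss] $ grep -A 10 "pages per bulk" /proc/fs/lustre/osd-zfs/*/brw_stats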

After seeing the negative peer_credits, I increased them and reran. Here's an example, where reading in 4 files uses the same amount of bandwidth as a single file, but less than the client and server are capable of.

New server settings

[server] cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd peer_credits=64 credits=1024

Single dd run

[client] $ dd if=stripedblob6-1 of=/dev/null bs=24M count=2048
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 48.6514 s, 1.1 GB/s

Four simultaneous dd tasks

[client] $ for i in 0 1 2 3; do dd if=stripedblob6-$i of=/dev/null bs=24M count=2048  skip=2048 & done
[1] 29932
[2] 29933
[3] 29934
[4] 29935
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 167.059 s, 309 MB/s
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 171.848 s, 300 MB/s
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 179.851 s, 287 MB/s
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 182.335 s, 283 MB/s

Running zpool iostat on the server shows similar bandwidth. Messing with the ZFS ARC doesn't change things, since I'm deliberately blowing through any caches with large file sizes.

Comment by Liang Zhen (Inactive) [ 12/Feb/15 ]

Although socklnd creates three connections between any two nodes, it only uses one as BULK_IN, one as BULK_OUT, and the last one as CONTROL, which means there is only one connection (and one thread) per unidirectional dataflow. This could be the reason that no matter how many tasks/stripes you have from a single client, you always see the same top performance. However, 1.2 GB/sec is kind of low even for a single connection if iperf can get 29.7 Gb/sec. Do you have performance numbers from lnet_selftest between two nodes (1:1, try concurrency from 1, 2, 4 ... 64)?
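
(For reference, a bare-bones 1:1 lst session would look roughly like this, using the NIDs from the outputs above; the attached scripts sweep concurrency from 1 to 64 automatically. Run from a node with lnet_selftest loaded.)

export LST_SESSION=$$
lst new_session read_1to1
lst add_group clients 192.168.123.110@tcp
lst add_group servers 192.168.95.158@tcp
lst add_batch bulk_read
lst add_test --batch bulk_read --concurrency 8 --from clients --to servers brw read check=none size=1M
lst run bulk_read
lst stat servers & sleep 30; kill $!    # sample the bandwidth for 30 seconds
lst end_session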

Comment by Rick Wagner (Inactive) [ 12/Feb/15 ]

LNet self test script and results for client:server ratio of 1:1 and concurrency from 1 to 64.

During the LNet test, writes scaled from 1.5 GB/s to 2.6 GB/s (line speed) from 1 to 8 threads and then held steady. Reads, however, would stay at 1 GB/s until 8 or 16 threads and then jump to 4.5 GB/s, and go back down to 1 GB/s at 32 or 64 threads. I tried additional dd tests with 8, 16, and 32 reading tasks, but they all hit 1 GB/s and stayed there.

During the tests, credits on both the client and server went negative. I need to clear those and see whether that occurred during dd or just lst. If there's a way to do that without reloading the kernel modules, I'd love to know it.

[server] $ cat /proc/sys/lnet/peers 
nid                      refs state  last   max   rtr   min    tx   min queue
0@lo                        1    NA    -1     0     0     0     0     0 0
192.168.95.158@tcp          1    NA    -1    64    64    64    64    62 0
192.168.123.110@tcp         1    NA    -1    64    64    64    64    -6 0
[client] $ cat /proc/sys/lnet/peers 
nid                      refs state  last   max   rtr   min    tx   min queue
192.168.95.158@tcp          1    NA    -1    32    32    32    32   -33 0
Comment by Isaac Huang (Inactive) [ 12/Feb/15 ]

A few suggestions:

  • Please change the lst script so it'd use "check=none" instead of "check=simple".
  • Right after lst test, please do a "lctl --net tcp conn_list" on both the client and the server.
  • Please try increasing the dd bs parameter to see if it makes any difference.
  • If possible, during the lst tests, please run tcpdump to watch for TCP window sizes and MSS (a rough sketch follows below).
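
A rough sketch of those last two checks (interface name and packet count are placeholders; Lustre's TCP traffic uses port 988):

[server] $ lctl --net tcp conn_list
[server] $ tcpdump -i bond0 -nn -v -c 200 'tcp port 988'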
Comment by Liang Zhen (Inactive) [ 13/Feb/15 ]

Because you got different performance numbers in different dataflow directions, I suspect the scheduler threads and softirq are contending on the same cpu core. I did some tests in our lab and saw this on one node while running lnet_selftest and "mpstat -P ALL 2":

10:49:14 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
10:49:16 PM  all    0.00    0.00    2.08    0.00    0.00    1.61    0.00    0.00   96.30
10:49:16 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
......
10:49:16 PM   16    0.00    0.00   44.02    0.00    0.00   55.43    0.00    0.00    0.54

When this happens, lnet_selftest loses some performance.

Although socklnd has zero-copy send, the receiving side has no zero copy and softirq will consume significant cpu time, so you can probably reserve some cores for softirq and see if it helps. For example, on a machine with 12 cores:

options libcfs cpu_pattern="0[2-5] 1[8-11]"
options ksocklnd nscheds=4 peer_credits=32 credits=1024

and on machines with 16 cores:

options libcfs cpu_pattern="0[2-7] 1[10-15]"
options ksocklnd nscheds=6 peer_credits=32 credits=1024

NB: I assume you disabled hyperthreading

and turn off the irq balancer on them by:

/etc/init.d/irqbalance stop
or
service irqbalance stop

Then find the irq of the network interface, either by running /usr/mellanox/mlnx_en/scripts/show_irq_affinity.sh
(if you have this script) or just by checking /proc/interrupts. In my environment it's 138. Then run this on the 16-core machine:

echo 303 > /proc/irq/138/smp_affinity # please replace interrupt number

and this on the 12-core machine:

echo c3 > /proc/irq/138/smp_affinity # please replace interrupt number

With these settings, lustre/lnet will not use the first two cores of each socket, and hopefully the OS will run softirq on those cores.
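
(For reference, those smp_affinity values are just hexadecimal bitmasks of the cores reserved for softirq; they can be derived like this:)

$ printf '%x\n' $(( (1<<0)|(1<<1)|(1<<8)|(1<<9) ))   # cores 0,1,8,9 on the 16-core box -> 303
$ printf '%x\n' $(( (1<<0)|(1<<1)|(1<<6)|(1<<7) ))   # cores 0,1,6,7 on the 12-core box -> c3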

Comment by Rick Wagner (Inactive) [ 13/Feb/15 ]

Liang, I'll try your suggestions this morning, as it seems to match the odd cases where some clients can read faster from a server, and some are limited. One question: is this something I need to consider for both client and server, or just the server?

Comment by Liang Zhen (Inactive) [ 14/Feb/15 ]

Rick, I think you probably need to consider it for both sides, but it's ok to start with the client only and see if it helps read performance, because this tuning is mostly for the receiving side.

Comment by Rick Wagner (Inactive) [ 15/Feb/15 ]

I think you're on the right track, Liang, but there's still something going on. I've attached LNet self test results where single clients read at 1 GB/s, but when two of them are reading they each get 1.6 GB/s or better. This adds to my earlier impression that each socknal_sd task has a limited capacity, and there is some threshold before additional socknal_sd tasks will pick up work.

Simple tests like Iperf do not show this result. Multiple streams of Iperf balance evenly and can saturate the 80 Gbps of bandwidth.

Some notes:

Server

Dual socket E5-2650v2 (8 core, 2.6 GHz)

Our servers have bonded Mellanox 40 GbE adapters. One adapter is attached to CPU 0, and the other to CPU 1. Part of this problem seems to be the relationship between the CPU partitions and where the network adapter is attached. I have some other results I'll post shortly that show a serious imbalance when clients read from single OSTs, depending on which NIC their data is going over.

Since we have two NICs, I wasn't sure what value to write to smp_affinity. Instead, I rebooted the systems after turning off irqbalance. I'm not sure if that was the right thing to try.

[server] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd nscheds=6 peer_credits=32 credits=1024
[server] $ cat /etc/modprobe.d/libcfs.conf  
options libcfs cpu_pattern="0[2,3,4,5,6,7] 1[10,11,12,13,14,15]"
[server] $ service irqbalance status
irqbalance is stopped

Client

Dual socket E5-2680v3 (12 core, 2.5 GHz)

[client] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd nscheds=10 peer_credits=32 credits=1024
[client] $ cat /etc/modprobe.d/libcfs.conf  
options libcfs cpu_pattern="0[2,3,4,5,6,7,8,9,10,11] 1[14,15,16,17,18,19,20,21,22,23]"
[client] $ service irqbalance status
irqbalance is stopped
Comment by Rick Wagner (Inactive) [ 15/Feb/15 ]

Here are LNet self test results using four pairs of clients. Two of the clients use eth2 on the server for reading and writing, and the other two clients use eth3. When two clients connect on the same NIC, both read or write between 1.5 and 2.2 GB/s (a higher than expected variation, but better than it gets otherwise). However, when one client is on eth2 and the other on eth3, the per-client performance ranges from 650 MB/s to 2 GB/s.

Comment by Liang Zhen (Inactive) [ 15/Feb/15 ]

Rick, I think even with iperf, when there are multiple threads between a pair of nodes, different connections may show different performance. For example, when I run iperf on my testing machine, I get:

[ 18]  0.0-10.0 sec   618 MBytes   519 Mbits/sec
[  4]  0.0-10.0 sec   571 MBytes   479 Mbits/sec
[  5]  0.0-10.0 sec   580 MBytes   486 Mbits/sec
[  6]  0.0-10.0 sec   646 MBytes   542 Mbits/sec
[  8]  0.0-10.0 sec   593 MBytes   497 Mbits/sec
[ 10]  0.0-10.0 sec  1.05 GBytes   901 Mbits/sec
[ 14]  0.0-10.0 sec   728 MBytes   610 Mbits/sec
[ 15]  0.0-10.0 sec   631 MBytes   529 Mbits/sec
[ 16]  0.0-10.0 sec   521 MBytes   437 Mbits/sec
[  3]  0.0-10.0 sec   762 MBytes   639 Mbits/sec
[  9]  0.0-10.0 sec   446 MBytes   374 Mbits/sec
[ 11]  0.0-10.0 sec   253 MBytes   212 Mbits/sec
[  7]  0.0-10.0 sec   431 MBytes   361 Mbits/sec
[ 12]  0.0-10.0 sec   606 MBytes   508 Mbits/sec
[ 13]  0.0-10.0 sec   882 MBytes   739 Mbits/sec
[ 17]  0.0-10.0 sec   466 MBytes   391 Mbits/sec
[SUM]  0.0-10.0 sec  9.58 GBytes  8.22 Gbits/sec

I think this is unavoidable on a multi-socket NUMA system. iperf can saturate the link between two nodes because it creates many connections and threads; even if some threads are unluckily scheduled on the wrong CPU (the one the NIC is not attached to), other threads may still run on the CPU the NIC is attached to, so we see good aggregate bandwidth between the two nodes.

This is different for Lustre: we can't create many threads and connections between two nodes (that would consume too much resource), which means we may see varying performance between different node pairs:

  • softirq and the socklnd scheduler run on the same core: bad performance
  • softirq and the socklnd scheduler run on different cores belonging to the same cpu socket and numa node: good performance
  • softirq and the socklnd scheduler run on different cores belonging to different cpu sockets and numa nodes: bad performance.

I doubt there is a perfect solution for this kind of imbalance on a multi-cpu/numa system, but I think we can probably improve it by:

  • Have the client run Lustre only on the cpu the NIC is attached to, for example
    options libcfs cpu_pattern="0[2-7]"

    so the Lustre client runs only on six cores of cpu0, the CPU the NIC is attached to. This is reasonable anyway: if the client node is supposed to run other applications, why assign all CPUs to the Lustre client?

  • The server side is different because you have a bonded device, so you may still see varying write performance from different clients. There is an option, but it requires changing the network configuration of the cluster and will not allow bonding, so I'm not sure it is acceptable for you; this configuration example is just FYI:
    options libcfs cpu_pattern="0[2,3,4,5,6,7] 1[10,11,12,13,14,15]"
    options lnet networks="tcp0(eth2)[0], tcp1(eth3)[1]" # the number in square brackets is the cpu partition number
    

    This way, all data for eth2 is processed only by cpu0, and all data for eth3 only by cpu1.

By the way, I can also work out a patch to make sure socklnd dispatches connections of the same type evenly across schedulers, but I would still count more on configuration changes.
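
(One quick way to confirm which socket each NIC is attached to before picking the partition numbers above, using standard sysfs paths:)

[server] $ cat /sys/class/net/eth2/device/numa_node   # NUMA node of the NIC, -1 if unknown
[server] $ cat /sys/class/net/eth3/device/numa_node
[server] $ cat /sys/devices/system/node/node0/cpulist # cores local to that node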

Comment by Rick Wagner (Inactive) [ 15/Feb/15 ]

Thanks, Liang. I had a similar thought about limiting the socklnd scheduler to the processor with the NICs attached, so that's clearly the optimal solution.

Breaking the bonded interface is not an option, but one of our servers has both NICs attached to CPU 0, and the HBAs all on CPU 1. The combination of manually setting the IRQ affinity and placing the socknal_sd tasks in a single partition on CPU 0 has greatly improved the LNet balance. With 4 clients and a concurrency of 16, I can saturate the full 10 GB/s of the network.

This OSS has dual E5-2643v2 (3.5 GHz, 6 cores) processors. I used the Mellanox set_irq_affinity_cpulist.sh script to map one NIC to core 0, and the other to core 1.

[server] $ set_irq_affinity_cpulist.sh 0 eth0
[server] $ set_irq_affinity_cpulist.sh 1 eth1
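
(For reference, the rough hand-rolled equivalent without the Mellanox helper would be something like:)

[server] $ for irq in $(grep eth0 /proc/interrupts | cut -d: -f1); do echo 1 > /proc/irq/$irq/smp_affinity; done  # mask 0x1 = core 0
[server] $ for irq in $(grep eth1 /proc/interrupts | cut -d: -f1); do echo 2 > /proc/irq/$irq/smp_affinity; done  # mask 0x2 = core 1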

I created a single CPU partition on cores 2, 3, 4, and 5, with 4 scheduler tasks and enough credits to drive the network (I may be able to lower the peer_credits).

[server] $  cat /etc/modprobe.d/libcfs.conf
options libcfs cpu_pattern="0[2,3,4,5]"
[server] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd nscheds=4 peer_credits=64 credits=1024

The only configuration on the client is for the credits.

[client] $ cat /etc/modprobe.d/ksocklnd.conf 
options ksocklnd peer_credits=32 credits=1024

While this is running, all 4 scheduler tasks are active, and evenly balanced.

 16189 root      20   0     0    0    0 R 63.0  0.0  19:34.37 socknal_sd00_02                              
 16187 root      20   0     0    0    0 R 62.3  0.0  27:18.65 socknal_sd00_00                              
 16190 root      20   0     0    0    0 S 62.3  0.0  24:15.69 socknal_sd00_03                              
 16188 root      20   0     0    0    0 R 62.0  0.0  20:49.41 socknal_sd00_01    

If this is the correct hardware and LNet configuration, we can adjust the other server. The next step will be getting the real data performance to the clients. I've started testing that, but haven't hit the limit of the storage.

I will follow up with some example results for feedback on tuning and setting up the performance test.

Comment by Liang Zhen (Inactive) [ 16/Feb/15 ]

Rick, it's good to see you can saturate the network with this configuration, but I'd suggest doing more tests before changing the other servers.
When the NICs and HBAs are on different CPUs, I think remote NUMA memory access is unfortunately unavoidable, either for the backend filesystem or for the network; please check these slides for more details:

Lustre 2.0 and NUMIOA architectures
High Performance I/O with NUMA Systems in Linux

From these slides, the optimal case needs to have two subsets:

 {CPU0, eth0, target[0, 2, ...]}
 {CPU1, eth1, targets[1, 3, ...]}. 

However, because you have to use bonding and can't separate the NICs, you may have to try these options (all cases assume both NICs are on CPU0):

  • (Please ignore this one if it's impossible to change the HW configuration this way.) Is it possible to attach all NICs and HBAs to CPU0 and configure Lustre to run only non-IO-intensive threads on CPU1? That way the whole IO data path is local to CPU0; the concern is that CPU0 could become the performance bottleneck. Just in case you want to try it, here is an example:
    options libcfs cpu_pattern="0[2-5] 1[6-12]" # use all cores of the second CPU because both NICs are on CPU0
    options lnet networks="tcp(bond0)[0]"  # all network requests are handled on CPU0
    options ost oss_io_cpts="[0]" oss_cpts="[1]" # IO-intensive services on CPU0, non-IO-intensive services on CPU1
    
  • NICs attached to CPU0, HBAs on CPU1, IO service running on CPU0; remote numa memory access for the IO service. Configuration example:
    options libcfs cpu_pattern="0[2-5] 1[6-11]" # use all cores of the second CPU because both NICs are on CPU0
    options lnet networks="tcp(bond0)[0]"  # all network requests are handled on CPU0
    options ost oss_io_cpts="[0]" oss_cpts="[1]" # IO-intensive services on CPU0, non-IO-intensive services on CPU1
    
  • NICs attached to CPU0, HBAs on CPU1, IO service running on CPU1; remote numa memory access for LNet:
    options libcfs cpu_pattern="0[2-5] 1[6-11]" # use all cores of the second CPU because both NICs are on CPU0
    options lnet networks="tcp(bond0)[0]"  # all network requests are handled on CPU0
    options ost oss_io_cpts="[1]" oss_cpts="[0]" # IO-intensive services on CPU1, non-IO-intensive services on CPU0
    
  • NICs attached to CPU0, HBAs on CPU1, services not bound, but the portal rotor turned on, which will dispatch requests to service threads on different CPUs:
    options libcfs cpu_pattern="0[2-5] 1[6-11]" # use all cores of the second CPU because both NICs are on CPU0
    options lnet networks="tcp(bond0)[0]"  portal_rotor=1 # all network requests are handled on CPU0, but they will be dispatched to upper layer threads on all CPUs
    

I think all these configurations should give the same lnet performance as you get now, but they may have different Lustre IO performance.
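
(After loading with any of these options, the actual thread placement can be spot-checked; socknal_sd and ll_ost_io are the scheduler and OSS I/O thread name prefixes:)

[server] $ for t in $(pgrep socknal_sd); do taskset -pc $t; done
[server] $ for t in $(pgrep ll_ost_io); do taskset -pc $t; done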

Comment by Rick Wagner (Inactive) [ 17/Feb/15 ]

Liang, thanks for your suggestions. I started working through the options and came up with a solution that should work for us. With what I'm about to describe, I reliably streamed files at 7.2 to 7.4 GB/s to 12 clients, with each client reading 8 files. I think there's room for improvement in the performance, and certainly in reducing the number of clients, but this was repeatable and it's a lot of progress.

First, I made a mistake about the placement of the HBAs: two of them are on CPU0 with the NICs. All of this was on the server with dual Intel E5-2650v2 processors (8 core, 2.6 GHz). In ASCII art, the PCI layout looks like this:

CPU0
  |---> 40GbE
  |---> 40GbE
  |---> HBA (10 drives)
  |---> HBA (25 drives)

CPU1
  |---> HBA (25 drives)

We have the freedom to move cards around (somewhat), but not to break the network bonding. The ZFS zpools are configured as raidz2 8+2, with one 10 drive pool spanning the 25 drive HBAs on CPU0 and CPU1.

What I found was that restricting the ksocklnd tasks to CPU0 had the biggest impact, and that it was better to let the other tasks run on both CPU0 and CPU1. Here are the configuration files from the servers:

[server] $ cat /etc/modprobe.d/libcfs.conf
options libcfs cpu_pattern="0[2-7] 1[8-15]"
[server] $ cat /etc/modprobe.d/lnet.conf
options lnet networks="tcp(bond0)[0]"
[server] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd nscheds=6 peer_credits=24 credits=1024

Moving the various oss tasks to partition 0 or 1 did not help, more than likely because the topology does not match what I described originally.

The client configuration is minimal, with the only change being setting max_rpcs_in_flight to 16.

[client] $ cat lnet.conf
options lnet networks="tcp(ib0)"
[client] $ cat ksocklnd.conf
options ksocklnd peer_credits=32 credits=1024
[client] $ cat /proc/fs/lustre/osc/ddragon-OST0000-osc-*/max_rpcs_in_flight 
16
[client] $ cat /proc/fs/lustre/osc/ddragon-OST0000-osc-*/max_pages_per_rpc  
256

You'll note that the number of credits and RPCs in flight did not need to be very high. I attribute this to a relatively low bandwidth-delay product (10 GB/s x 0.1 ms = 1 MB). I tested a larger maximum number of pages per RPC, but it drove down performance. I need to revisit that, since it could be related to the BDP or the ZFS record size (also 1 MB), or it could be improved with the ZFS tuning I did.

One thing that surprised me was that setting the IRQ affinity for the Mellanox NICs reduced performance. However, it was still better to restrict the CPU partition on NUMA node 0 to cores [2-7].

[server] $ show_irq_affinity.sh eth2
126: 000000,00000000,00000000,000000ff
127: 000000,00000000,00000000,000000ff
128: 000000,00000000,00000000,000000ff
...

The last thing that helped get the performance up was improving the chances for ZFS to prefetch data. While testing, I did an experiment to differentiate between the impact of the networking and ZFS, and had several (~10) clients read the same 64 GiB file from an OST. The file size was chosen to match the maximum size of the ZFS ARC, plus whatever caches Lustre had. When doing this, the server bandwidth was saturated at 10 GB/s, which showed that getting data from the drives to memory was critical, even if the data then crossed the QPI link.

The branch of ZFS I'm using sets most of the tuning parameters to 0, and the important one was zfs_vdev_cache_size. My reading of random blog posts indicates that this impacts prefetch from the DMU.

[server] $ cat /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_cache_size=1310720
options zfs zfs_vdev_cache_max=131072
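
(The live values can be verified after reloading the module via its parameters under sysfs:)

[server] $ grep . /sys/module/zfs/parameters/zfs_vdev_cache_size /sys/module/zfs/parameters/zfs_vdev_cache_max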

Regardless, this immediately improved the rate at which the zpools could deliver data.

This is a bit of a long comment because I wanted to capture a lot of the details. If you see anything worth examining given my corrected information, please let me know. Our next step from here is to try incorporating the patches we're using into a stable release, and retesting with the Linux 2.6 kernel, or with the EPEL 3.10 kernel-lt package.

Comment by Liang Zhen (Inactive) [ 18/Feb/15 ]

Rick, sounds good. I have only one suggestion: because you are now binding the network to cpu0, cpu0 could be overloaded, so it would still be nice if you could offload cpu0 by binding some non-IO services to cpu1 and see if it performs well.

# please keep your current libcfs and lnet options here
options ost oss_cpts="[1]"
options ptlrpc ldlm_cpts="[1]"

Unless you lose some performance with this setting, I'd suggest using it, because this way cpu1 can take over some workload from cpu0.

Comment by Rick Wagner (Inactive) [ 19/Feb/15 ]

Liang, I tested these oss and ptlrpc options together and separately, and doing so took the performance from over 7 GB/s down to 4 GB/s or less. My guess is that CPU0 has the capacity to handle some of these tasks, and it's better to let it do that when it can.

Comment by Peter Jones [ 24/Mar/18 ]

I don't think that any further work is needed here
