  Lustre / LU-6228

How to balance network connections across socknal_sd tasks?

Details

    • Type: Question/Request
    • Resolution: Won't Fix
    • Priority: Major
    • Environment: Linux 3.10

    Description

      While using the ksocklnd LNET driver, I've noticed uneven load across the socknal_sd* tasks on an OSS. The number of tasks is controllable using combinations of nscheds and cpu_npartitions or cpu_pattern. I've also tried adjusting /proc/sys/lnet/portal_rotor, but this does not appear to be the right thing to try.
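
      For example, a quick snapshot of per-task CPU load can be taken with something like this (the socknal_sd threads are ordinary kernel threads, so they appear directly in ps and top; the commands are only an illustration):

      # snapshot CPU usage and CPU placement of the ksocklnd scheduler threads
      ps -eo pid,psr,pcpu,comm | grep socknal_sd | sort -k3 -rn
      # or take a one-shot view from top
      top -b -n 1 | grep socknal_sd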

      On a dual-socket system (6 cores per processor) with

      $ cat ksocklnd.conf 
      options ksocklnd nscheds=6 peer_credits=128 credits=1024
      $ cat libcfs.conf 
      options libcfs cpu_pattern="0[0,1,2,3,4,5] 1[6,7,8,9,10,11]"
      

      there are 12 socknal_sd tasks. However, with up to 60 clients doing the same streaming IO, only 4 of the tasks will be heavily loaded (CPU time over 80%). Oddly, when running an LNET bulk_rw self test, up to 10 of the tasks will be loaded, and can consume 9.2 GB/s on the server's bonded 40GbE links.

      What am I missing? I thought it was the mapping of TCP connections to processes, but I can't seem to track them through /proc/*/fd/ and /proc/net/tcp.
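
      For what it's worth, the ksocklnd sockets are created in the kernel and are not attached to any process, which is why /proc/*/fd/ comes up empty. They can still be listed by port; 988 is the default LNet acceptor port, and this is only a sketch:

      # kernel-owned LNet TCP connections, listed by port
      ss -tni '( sport = :988 or dport = :988 )'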

      I'm working from a recent pull of the master branch.

      Attachments

        1. lnet-bandwidth-cdev-single.sh
          1 kB
        2. lnet-results-2cli.txt
          8 kB
        3. lnet-results-alternate-NICs.txt
          3 kB
        4. lnet-results-alternate-NICs-irqmap.txt
          12 kB
        5. lnet-test-2cli.sh
          1 kB
        6. lnet-test-alt-nics.sh
          1 kB
        7. lnet-test-alt-nics-irqmap.sh
          1 kB
        8. lst-1-to-1-conc-1-to-64.txt
          17 kB


          Activity

            [LU-6228] How to balance network connections across socknal_sd tasks?

            isaac Isaac Huang (Inactive) added a comment:

            A few suggestions:

            • Please change the lst script so it'd use "check=none" instead of "check=simple".
            • Right after lst test, please do a "lctl --net tcp conn_list" on both the client and the server.
            • Please try increasing the dd bs parameter to see if it makes any difference.
            • If possible, during the lst tests, please run tcpdump to watch for TCP window sizes and MSS.
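
            A rough sketch of what the second and fourth suggestions might look like in practice (the interface name and capture count below are assumptions):

            # LNet's view of the socklnd connections, on both client and server
            lctl --net tcp conn_list

            # capture the handshake plus some traffic to inspect MSS (in the SYN
            # options) and advertised window sizes; bond0 and -c 200 are placeholders
            tcpdump -i bond0 -nn -v -c 200 'tcp port 988'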
            rpwagner Rick Wagner (Inactive) added a comment (edited):

            LNet self test script and results for client:server ratio of 1:1 and concurrency from 1 to 64.

            During the LNet test, writes scaled from 1.5 GB/s to 2.6 GB/s (line speed) from 1 to 8 threads and then held steady. Reads, however, would stay at 1 GB/s until 8 or 16 threads and then jump to 4.5 GB/s, and go back down to 1 GB/s at 32 or 64 threads. I tried additional dd tests with 8, 16, and 32 reading tasks, but they all hit 1 GB/s and stayed there.

            During the tests, credits on both the client and server went negative. I need to clear those and see whether that occurred during dd or just lst. If there's a way to do that without reloading the kernel modules, I'd love to know it.

            [server] $ cat /proc/sys/lnet/peers 
            nid                      refs state  last   max   rtr   min    tx   min queue
            0@lo                        1    NA    -1     0     0     0     0     0 0
            192.168.95.158@tcp          1    NA    -1    64    64    64    64    62 0
            192.168.123.110@tcp         1    NA    -1    64    64    64    64    -6 0
            [client] $ cat /proc/sys/lnet/peers 
            nid                      refs state  last   max   rtr   min    tx   min queue
            192.168.95.158@tcp          1    NA    -1    32    32    32    32   -33 0
            

            liang Liang Zhen (Inactive) added a comment:

            Although socklnd creates three connections between any two nodes, it only uses one as BULK_IN, one as BULK_OUT, and the last one as CONTROL. That means there is only one connection (and one thread) per unidirectional data flow, which could be why you always see the same top performance no matter how many tasks/stripes you run from a single client. However, 1.2 GB/s is rather low even for a single connection if iperf can reach 29.7 Gb/s. Do you have lnet_selftest numbers between two nodes (1:1, trying concurrency from 1, 2, 4 ... 64)?

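
            A sketch of that kind of 1:1 sweep, using the NIDs from this ticket (sizes and sampling intervals are illustrative; the scripts actually used are attached to this ticket):

            #!/bin/bash
            # 1:1 lnet_selftest read sweep over concurrency 1..64 (sketch)
            export LST_SESSION=$$
            lst new_session --force brw_sweep
            lst add_group clients 192.168.123.110@tcp
            lst add_group servers 192.168.95.158@tcp
            for conc in 1 2 4 8 16 32 64; do
                lst add_batch bulk_$conc
                lst add_test --batch bulk_$conc --concurrency $conc \
                    --from clients --to servers brw read check=none size=1M
                lst run bulk_$conc
                lst stat servers & sleep 30; kill $!    # sample bandwidth for ~30 s
                lst stop bulk_$conc
            done
            lst end_session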

            rpwagner Rick Wagner (Inactive) added a comment:

            Thanks for explaining that, Liang. What I'm seeing is that it takes a very large number of clients to get good read bandwidth numbers. Our servers have 6 OSTs, and each will deliver 1.5 GB/s per OST using dd and ZFS, and 9 GB/s in aggregate. When mounting over the network, a single client will top out at 1.2 GB/s from a single OSS, no matter how many tasks are running, or whether the files are striped on single or multiple OSTs. It feels like something is holding back the per-client bandwidth. It takes four clients to get 1.5 GB/s from an OST, when it should only take one.

            Our servers have bonded 40GbE interfaces, and the clients use TCP via IPoIB and Mellanox gateway switches that bridge between Ethernet and InfiniBand. Here are some simple measurements to show the state of the network (I used a single-stream iperf test, because Lustre only connects over individual sockets for reads and writes):

            [client] $ ping 192.168.95.158 
            ...
            64 bytes from 192.168.95.158: icmp_seq=4 ttl=62 time=0.106 ms
            64 bytes from 192.168.95.158: icmp_seq=5 ttl=62 time=0.108 ms
            64 bytes from 192.168.95.158: icmp_seq=6 ttl=62 time=0.106 ms
            64 bytes from 192.168.95.158: icmp_seq=7 ttl=62 time=0.103 ms
            [client] $ iperf -c 192.168.95.158
            ------------------------------------------------------------
            Client connecting to 192.168.95.158, TCP port 5001
            TCP window size: 92.9 KByte (default)
            ------------------------------------------------------------
            [  3] local 192.168.123.110 port 37190 connected with 192.168.95.158 port 5001
            [ ID] Interval       Transfer     Bandwidth
            [  3]  0.0-10.0 sec  34.6 GBytes  29.7 Gbits/sec
            

            When trying to dd 4 files striped on all OSTs of the OSS, 32 peer_credits was not enough.

            [oss] $ cat /etc/modprobe.d/ksocklnd.conf
            options ksocklnd peer_credits=32 credits=1024
            [oss] $ grep 110 /proc/sys/lnet/peers
            192.168.123.110@tcp        33    NA    -1    32    32    32     0    -9 33556736
            
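
            Credit exhaustion like this is easiest to catch while the transfer is running, for example by sampling the peers file in a loop (sketch):

            # watch LNet peer credits during a dd/lst run; a negative "min" or a
            # growing "queue" means the peer ran out of credits at some point
            watch -n 1 'grep 110 /proc/sys/lnet/peers'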

            On the client:

            max_pages_per_rpc    = 1024
            max_rpcs_in_flight     = 16 
            [client] $ cat /etc/modprobe.d/ksocklnd.conf
            options ksocklnd peer_credits=32 credits=128
            
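
            These per-OSC settings can be checked and changed at runtime with lctl, along these lines:

            # read and adjust the client-side RPC settings (applies to all OSCs here)
            lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight
            lctl set_param osc.*.max_rpcs_in_flight=16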

            Observing brw_stats under /proc/fs/lustre/osd-zfs/*/brw_stats shows that I/O requests are coming in at 4M, as expected. We're running Lustre and ZFS with large block support, which is why we get good streaming performance from single OSTs.
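
            The relevant histogram can be pulled out with something like this (the grep range is arbitrary):

            # RPC size distribution at the OSDs; 1K pages per bulk r/w = 4 MB RPCs
            grep -A 16 "pages per bulk" /proc/fs/lustre/osd-zfs/*/brw_stats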

            After seeing the negative peer_credits, I increased them and reran. Here's an example, where reading in 4 files uses the same amount of bandwidth as a single file, but less than the client and server are capable of.

            New server settings

            [server] cat /etc/modprobe.d/ksocklnd.conf
            options ksocklnd peer_credits=64 credits=1024
            

            Single dd run

            [client] $ dd if=stripedblob6-1 of=/dev/null bs=24M count=2048
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 48.6514 s, 1.1 GB/s
            

            Four simultaneous dd tasks

            [client] $ for i in 0 1 2 3; do dd if=stripedblob6-$i of=/dev/null bs=24M count=2048  skip=2048 & done
            [1] 29932
            [2] 29933
            [3] 29934
            [4] 29935
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 167.059 s, 309 MB/s
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 171.848 s, 300 MB/s
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 179.851 s, 287 MB/s
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 182.335 s, 283 MB/s
            

            Running zpool iostat on the server shows similar bandwidth. Messing with the ZFS ARC doesn't change things, since I'm deliberately blowing through any caches with large file sizes.
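
            For reference, the server-side check is along these lines (the pool name is a placeholder):

            # per-second pool bandwidth on the OSS while the dd tasks run
            zpool iostat -v ost0pool 1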


            liang Liang Zhen (Inactive) added a comment:

            I think the socklnd scheduler is transparent to the upper layers. Also, both upper layers (lnet_selftest and Lustre) share the same LND connections, so there should be no difference between them; if lnet_selftest can drive LNet hard enough to get good performance numbers, I tend to think this is not an issue in LNet/LND.
            Also, peer_credits=128 is too high in my opinion; I only know of people needing values like that when running Lustre over a WAN. credits=1024 peer_credits=32 should be a good empirical starting point.


            adilger Andreas Dilger added a comment:

            The original discussion of this issue was in LU-5278.
            Gabriele wrote:

            Hi Rick,
            take a look at /proc/sys/lnet/peers and see if your queue is big enough. If you find negative values there, please increase the peer_credits and credits values for LNet.
            I can suggest, as a "golden" rule:
            peer_credits = max_rpcs_in_flight
            credits = 4 x peer_credits

            Remember to apply these values across the whole cluster.

            If you are using Ethernet, you should also tune sysctl.conf. Please refer to your Ethernet vendor; this guide from Mellanox is a good starting point, but it applies to other vendors as well.
            http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

            and Rick replied:

            Gabriele, thanks. There are negative numbers in /proc/sys/lnet/peers, and even just bumping up the credits on the server gave a 10% or so improvement. I'll have to shift to another set of clients to test both sides, since I'm using production system nodes as clients and can't reload the kernel modules. This would help explain the remaining bottleneck.

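
            Applying Gabriele's rule of thumb quoted above to the client settings mentioned earlier in this ticket (max_rpcs_in_flight = 16) would give something like the following; the numbers are only an illustration and the right values are site-specific:

            # /etc/modprobe.d/ksocklnd.conf on the clients (illustrative values):
            # peer_credits = max_rpcs_in_flight, credits = 4 * peer_credits
            options ksocklnd peer_credits=16 credits=64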

            jlevi Jodi Levi (Inactive) added a comment:

            Amir,
            Would you be able to have a look at this one and comment?
            Thank you!


            rpwagner Rick Wagner (Inactive) added a comment:

            Andreas & Gabriele, I've moved my network performance questions to a separate ticket.


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: rpwagner Rick Wagner (Inactive)
              Votes: 0
              Watchers: 10
