  Lustre / LU-6228

How to balance network connections across socknal_sd tasks?

Details

    • Type: Question/Request
    • Resolution: Won't Fix
    • Priority: Major
    • Environment: Linux 3.10

    Description

      While using the ksocklnd LNET driver, I've noticed uneven load across the socknal_sd* tasks on an OSS. The number of tasks is controllable using combinations of nscheds and cpu_npartitions or cpu_pattern. I've also tried adjusting /proc/sys/lnet/portal_rotor, but this does not appear to be the right thing to try.
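
      For example, a quick snapshot of per-task CPU load can be taken with something like this (the socknal_sd threads are ordinary kernel threads, so they appear directly in ps and top; the commands are only an illustration):

      # snapshot CPU usage and CPU placement of the ksocklnd scheduler threads
      ps -eo pid,psr,pcpu,comm | grep socknal_sd | sort -k3 -rn
      # or take a one-shot view from top
      top -b -n 1 | grep socknal_sd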

      On a dual-socket system (6 cores per processor) with

      $ cat ksocklnd.conf 
      options ksocklnd nscheds=6 peer_credits=128 credits=1024
      $ cat libcfs.conf 
      options libcfs cpu_pattern="0[0,1,2,3,4,5] 1[6,7,8,9,10,11]"
      

      there are 12 socknal_sd tasks. However, with up to 60 clients doing the same streaming IO, only 4 of the tasks will be heavily loaded (CPU time over 80%). Oddly, when running an LNET bulk_rw self test, up to 10 of the tasks will be loaded, and can consume 9.2 GB/s on the server's bonded 40GbE links.

      What am I missing? I thought it was the mapping of TCP connections to processes, but I can't seem to track them through /proc/*/fd/ and /proc/net/tcp.
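
      For what it's worth, the ksocklnd sockets are created in the kernel and are not attached to any process, which is why /proc/*/fd/ comes up empty. They can still be listed by port; 988 is the default LNet acceptor port, and this is only a sketch:

      # kernel-owned LNet TCP connections, listed by port
      ss -tni '( sport = :988 or dport = :988 )'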

      I'm working from a recent pull of the master branch.

      Attachments

        1. lnet-bandwidth-cdev-single.sh
          1 kB
        2. lnet-results-2cli.txt
          8 kB
        3. lnet-results-alternate-NICs.txt
          3 kB
        4. lnet-results-alternate-NICs-irqmap.txt
          12 kB
        5. lnet-test-2cli.sh
          1 kB
        6. lnet-test-alt-nics.sh
          1 kB
        7. lnet-test-alt-nics-irqmap.sh
          1 kB
        8. lst-1-to-1-conc-1-to-64.txt
          17 kB


          Activity

            [LU-6228] How to balance network connections across socknal_sd tasks?

            isaac Isaac Huang (Inactive) added a comment:

            A few suggestions:

            • Please change the lst script so it'd use "check=none" instead of "check=simple".
            • Right after lst test, please do a "lctl --net tcp conn_list" on both the client and the server.
            • Please try increasing the dd bs parameter to see if it makes any difference.
            • If possible, during the lst tests, please run tcpdump to watch for TCP window sizes and MSS.
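
            A rough sketch of what the second and fourth suggestions might look like in practice (the interface name and capture count below are assumptions):

            # LNet's view of the socklnd connections, on both client and server
            lctl --net tcp conn_list

            # capture the handshake plus some traffic to inspect MSS (in the SYN
            # options) and advertised window sizes; bond0 and -c 200 are placeholders
            tcpdump -i bond0 -nn -v -c 200 'tcp port 988'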
            rpwagner Rick Wagner (Inactive) added a comment (edited):

            LNet self test script and results for client:server ratio of 1:1 and concurrency from 1 to 64.

            During the LNet test, writes scaled from 1.5 GB/s to 2.6 GB/s (line speed) from 1 to 8 threads and then held steady. Reads, however, would stay at 1 GB/s until 8 or 16 threads and then jump to 4.5 GB/s, and go back down to 1 GB/s at 32 or 64 threads. I tried additional dd tests with 8, 16, and 32 reading tasks, but they all hit 1 GB/s and stayed there.

            During the tests, credits on both the client and server went negative. I need to clear those and see whether that occurred during dd or just lst. If there's a way to do that without reloading the kernel modules, I'd love to know it.

            [server] $ cat /proc/sys/lnet/peers 
            nid                      refs state  last   max   rtr   min    tx   min queue
            0@lo                        1    NA    -1     0     0     0     0     0 0
            192.168.95.158@tcp          1    NA    -1    64    64    64    64    62 0
            192.168.123.110@tcp         1    NA    -1    64    64    64    64    -6 0
            [client] $ cat /proc/sys/lnet/peers 
            nid                      refs state  last   max   rtr   min    tx   min queue
            192.168.95.158@tcp          1    NA    -1    32    32    32    32   -33 0
            

            liang Liang Zhen (Inactive) added a comment:

            Although socklnd creates three connections between any two nodes, it only uses one as BULK_IN, one as BULK_OUT, and the last one as CONTROL. That means there is only one connection (and one thread) per unidirectional data flow, which could be why you always see the same top performance no matter how many tasks/stripes you run from a single client. However, 1.2 GB/s is rather low even for a single connection if iperf can reach 29.7 Gb/s. Do you have lnet_selftest numbers between two nodes (1:1, trying concurrency from 1, 2, 4 ... 64)?

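
            A sketch of that kind of 1:1 sweep, using the NIDs from this ticket (sizes and sampling intervals are illustrative; the scripts actually used are attached to this ticket):

            #!/bin/bash
            # 1:1 lnet_selftest read sweep over concurrency 1..64 (sketch)
            export LST_SESSION=$$
            lst new_session --force brw_sweep
            lst add_group clients 192.168.123.110@tcp
            lst add_group servers 192.168.95.158@tcp
            for conc in 1 2 4 8 16 32 64; do
                lst add_batch bulk_$conc
                lst add_test --batch bulk_$conc --concurrency $conc \
                    --from clients --to servers brw read check=none size=1M
                lst run bulk_$conc
                lst stat servers & sleep 30; kill $!    # sample bandwidth for ~30 s
                lst stop bulk_$conc
            done
            lst end_session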

            rpwagner Rick Wagner (Inactive) added a comment:

            Thanks for explaining that, Liang. What I'm seeing is that it takes a very large number of clients to get good read bandwidth numbers. Our servers have 6 OSTs, and each will deliver 1.5 GB/s per OST using dd and ZFS, and 9 GB/s in aggregate. When mounting over the network, a single client will top out at 1.2 GB/s from a single OSS, no matter how many tasks are running, or whether the files are striped on single or multiple OSTs. It feels like something is holding back the per-client bandwidth. It takes four clients to get 1.5 GB/s from an OST, when it should only take one.

            Our servers have bonded 40GbE interfaces, and the clients use TCP via IPoIB and Mellanox gateway switches that bridge between Ethernet and InfiniBand. Here are some simple measurements to show the state of the network (I used a single-stream iperf test, because Lustre only connects over individual sockets for reads and writes):

            [client] $ ping 192.168.95.158 
            ...
            64 bytes from 192.168.95.158: icmp_seq=4 ttl=62 time=0.106 ms
            64 bytes from 192.168.95.158: icmp_seq=5 ttl=62 time=0.108 ms
            64 bytes from 192.168.95.158: icmp_seq=6 ttl=62 time=0.106 ms
            64 bytes from 192.168.95.158: icmp_seq=7 ttl=62 time=0.103 ms
            [client] $ iperf -c 192.168.95.158
            ------------------------------------------------------------
            Client connecting to 192.168.95.158, TCP port 5001
            TCP window size: 92.9 KByte (default)
            ------------------------------------------------------------
            [  3] local 192.168.123.110 port 37190 connected with 192.168.95.158 port 5001
            [ ID] Interval       Transfer     Bandwidth
            [  3]  0.0-10.0 sec  34.6 GBytes  29.7 Gbits/sec
            

            When trying to dd 4 files striped on all OSTs of the OSS, 32 peer_credits was not enough.

            [oss] $ cat /etc/modprobe.d/ksocklnd.conf
            options ksocklnd peer_credits=32 credits=1024
            [oss] $ grep 110 /proc/sys/lnet/peers
            192.168.123.110@tcp        33    NA    -1    32    32    32     0    -9 33556736
            
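
            Credit exhaustion like this is easiest to catch while the transfer is running, for example by sampling the peers file in a loop (sketch):

            # watch LNet peer credits during a dd/lst run; a negative "min" or a
            # growing "queue" means the peer ran out of credits at some point
            watch -n 1 'grep 110 /proc/sys/lnet/peers'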

            On the client:

            max_pages_per_rpc    = 1024
            max_rpcs_in_flight     = 16 
            [client] $ cat /etc/modprobe.d/ksocklnd.conf
            options ksocklnd peer_credits=32 credits=128
            
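
            These per-OSC settings can be checked and changed at runtime with lctl, along these lines:

            # read and adjust the client-side RPC settings (applies to all OSCs here)
            lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight
            lctl set_param osc.*.max_rpcs_in_flight=16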

            Observing brw_stats under /proc/fs/lustre/osd-zfs/*/brw_stats shows that I/O requests are coming in at 4M, as expected. We're running Lustre and ZFS with large block support, which is why we get good streaming performance from single OSTs.
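
            The relevant histogram can be pulled out with something like this (the grep range is arbitrary):

            # RPC size distribution at the OSDs; 1K pages per bulk r/w = 4 MB RPCs
            grep -A 16 "pages per bulk" /proc/fs/lustre/osd-zfs/*/brw_stats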

            After seeing the negative peer_credits, I increased them and reran. Here's an example, where reading in 4 files uses the same amount of bandwidth as a single file, but less than the client and server are capable of.

            New server settings

            [server] cat /etc/modprobe.d/ksocklnd.conf
            options ksocklnd peer_credits=64 credits=1024
            

            Single dd run

            [client] $ dd if=stripedblob6-1 of=/dev/null bs=24M count=2048
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 48.6514 s, 1.1 GB/s
            

            Four simultaneous dd tasks

            [client] $ for i in 0 1 2 3; do dd if=stripedblob6-$i of=/dev/null bs=24M count=2048  skip=2048 & done
            [1] 29932
            [2] 29933
            [3] 29934
            [4] 29935
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 167.059 s, 309 MB/s
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 171.848 s, 300 MB/s
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 179.851 s, 287 MB/s
            2048+0 records in
            2048+0 records out
            51539607552 bytes (52 GB) copied, 182.335 s, 283 MB/s
            

            Running zpool iostat on the server shows similar bandwidth. Messing with the ZFS ARC doesn't change things, since I'm deliberately blowing through any caches with large file sizes.
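
            For reference, the server-side check is along these lines (the pool name is a placeholder):

            # per-second pool bandwidth on the OSS while the dd tasks run
            zpool iostat -v ost0pool 1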


            liang Liang Zhen (Inactive) added a comment:

            I think the socklnd scheduler is transparent to the upper layers. Also, both upper layers (lnet_selftest and Lustre) share the same LND connections, so there should be no difference between them; if lnet_selftest can drive LNet hard enough to get good performance numbers, I tend to think this is not an issue in LNet/LND.
            Also, peer_credits=128 is too high in my opinion; I only know of people needing values like that when running Lustre over a WAN. credits=1024 peer_credits=32 should be a good empirical starting point.


            adilger Andreas Dilger added a comment:

            The original discussion of this issue was in LU-5278.
            Gabriele wrote:

            Hi Rick,
            take a look at /proc/sys/lnet/peers and see if your queue is big enough. If you find negative values there, please increase the peer_credits and credits values for LNet.
            I can suggest, as a "golden" rule:
            peer_credits = max_rpcs_in_flight
            credits = 4 x peer_credits

            Remember to apply these values across the whole cluster.

            If you are using Ethernet, you should also tune sysctl.conf. Please refer to your Ethernet vendor; this guide from Mellanox is a good starting point, but it applies to other vendors as well.
            http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

            and Rick replied:

            Gabriele, thanks. There are negative numbers in /proc/sys/lnet/peers, and even just bumping up the credits on the server gave a 10% or so improvement. I'll have to shift to another set of clients to test both sides, since I'm using production system nodes as clients and can't reload the kernel modules. This would help explain the remaining bottleneck.

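
            Applying Gabriele's rule of thumb quoted above to the client settings mentioned earlier in this ticket (max_rpcs_in_flight = 16) would give something like the following; the numbers are only an illustration and the right values are site-specific:

            # /etc/modprobe.d/ksocklnd.conf on the clients (illustrative values):
            # peer_credits = max_rpcs_in_flight, credits = 4 * peer_credits
            options ksocklnd peer_credits=16 credits=64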

            jlevi Jodi Levi (Inactive) added a comment:

            Amir,
            Would you be able to have a look at this one and comment?
            Thank you!


            rpwagner Rick Wagner (Inactive) added a comment:

            Andreas & Gabriele, I've moved my network performance questions to a separate ticket.


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: rpwagner Rick Wagner (Inactive)
              Votes: 0
              Watchers: 10
