
[LU-14293] Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Severity: 3

    Description

      During performance testing of a new Lustre file system, we discovered that read/write performance isn't where we would expect. As an example, the block-level read performance for the system is just over 65 GB/s. In scaling tests, we can only get to around 30 GB/s for reads. Writes are slightly better, but still in the 35 GB/s range. At single-node scale, we seem to cap out at a few GB/s.

      After going through all the tunings and everything else we can find, we're slightly better, but still far behind where performance should be. We've played with various ksocklnd parameters (nconnds, nscheds, tx/rx buffer sizes, etc.), but with little effect. Current tunings that may be relevant: credits 2560, peer_credits 63, max_rpcs_in_flight 32.
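      (For reference, a minimal sketch of how tunables like these are typically applied: the LND settings go in a modprobe.d file read before the lnet/ksocklnd modules are loaded, while max_rpcs_in_flight is set per OSC on the clients. The values shown are just the ones quoted above, not recommendations.)

      # /etc/modprobe.d/ksocklnd.conf -- sketch using the values quoted above
      options ksocklnd credits=2560 peer_credits=63

      # on the clients: per-OSC RPC concurrency
      lctl set_param osc.*.max_rpcs_in_flight=32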

      Network configuration on the servers is 2x 100G ethernet bonded together (active/active) using kernel bonding (not ksocklnd bonding).
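      (For completeness, the LNet side of a setup like this is presumably a single tcp network declared over the bond device; a sketch, assuming the bond interface is named bond0:)

      lnetctl lnet configure
      lnetctl net add --net tcp --if bond0
      lnetctl net show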

      iperf between two nodes gets nearly line rate at ~98Gb/s and iperf from two nodes to a single node can push ~190Gb/s, consistent with what would be expected from the kernel bonding.

      lnet_selftest shows rates of about 2.5GB/s (20Gb/s) for node-to-node tests. I'm not sure if this is a bug in lnet_selftest or a real reflection of the performance.
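      (For reference, a node-to-node lnet_selftest run along these lines; a sketch with placeholder NIDs and a 1M bulk write test:)

      export LST_SESSION=$$
      lst new_session bulk_test
      lst add_group clients 10.0.0.1@tcp
      lst add_group servers 10.0.0.2@tcp
      lst add_batch bulk
      lst add_test --batch bulk --from clients --to servers brw write check=simple size=1M
      lst run bulk
      lst stat servers & sleep 30; kill $!
      lst end_session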

      We found the following related tickets/mailing list discussions which seem to be very similar to what we're seeing, but with no resolutions:

      http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2019-August/016630.html

      https://jira.whamcloud.com/browse/LU-11415

      https://jira.whamcloud.com/browse/LU-12815 (maybe performance limiting, but I doubt it for what we're seeing)

       

      Any help or suggestions would be awesome.

      Thanks!

      - Jeff

    Activity

            simmonsja James A Simmons added a comment -

            Talking to Peter Jones, this is treated as a new feature, so it will not be landed to 2.12 LTS. We can close this ticket.

            simmonsja James A Simmons added a comment -

            The patches for LU-12815 have been backported to 2.12 LTS. Will they land, or should we close this ticket?

            nilesj Jeff Niles added a comment -

            Sounds good. The new issue is LU-14320.

            Thanks everyone!

            pjones Peter Jones added a comment -

            Yup, I agree - new ticket for the latest issues, and we can leave this open until the LU-12815 patches are landed to b2_12.


            adilger Andreas Dilger added a comment -

            It might make sense to keep this issue open to track the socklnd conns_per_peer feature for your use in 2.12.x, since LU-12815 will be closed once the patches are landed on master for 2.15 (though Peter may have other methods for tracking this). In the meantime, pending final review, testing, and landing of the LU-12815 patch series, there isn't a particular reason for you not to use the conns_per_peer patch on your system, since you are presumably not using the use_tcp_bonding feature yourself.

            adilger Andreas Dilger added a comment -

            Jeff, I definitely have some comments related to ZFS performance, but they should really go into a separate ticket. If I file that ticket, it will not be tracked correctly as a customer issue, so it is best if you do that.

            As for including conns_per_peer into 2.12, that is a bit tricky in the short term since that patch depends on another one that is removing the socklnd-level TCP bonding feature. While the LNet Multi-Rail provides better functionality, use_tcp_bonding may be in use at customer sites and shouldn't be removed in an LTS release without any warning. A patch will go into the next 2.12.7 LTS and 2.14.0 releases to announce that this option is deprecated, which will allow sites to become aware of this change and move over to LNet Multi-Rail. I've asked in LU-12815 for an email to be sent out to lustre-discuss and lustre-devel asking if anyone is using this feature, and maybe it can be removed from 2.12.8 if there is no feedback on its usage.

            nilesj Jeff Niles added a comment -

            As a side note, it may make sense for us to close this particular issue and open a new, tailored one for the problems we're seeing now, since the issue described in this ticket (slow lnet performance) has been resolved with the `conns_per_peer` patch.

            Along those lines, since that patch resolved our network performance issue and we'd like to keep running 2.12 (LTS), could we lobby to get James' backport of it included in the next 2.12 point release so that we don't have to keep carrying that patch?

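            (For anyone following along: with the LU-12815 patches applied, conns_per_peer is a ksocklnd-level tunable, so a sketch of enabling it is simply a module option; the value 4 is only an example, see the scaling table in a later comment below.)

            # /etc/modprobe.d/ksocklnd.conf -- sketch; value is an example only
            options ksocklnd conns_per_peer=4
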
            nilesj Jeff Niles added a comment -

            Update on where we are:

            When we stood up the system we ran some benchmarks on the raw block storage, so we're confident that the block storage can provide ~7GB/s read per LUN, with ~65GB/s read across the 12 LUNs in aggregate. What we did not do, however, was run any benchmarks on ZFS after the zpools were created on top of the LUNs. Since LNET was no longer our bottleneck, we figured it would make sense to verify the stack from the bottom up, starting with the zpools. We set the zpools to `canmount=on`, changed the mountpoints, then mounted them and ran fio on them. Performance is terrible.
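            (The canmount/mountpoint change described above would look roughly like the following; the pool/dataset name and mountpoint are placeholders.)

            # sketch: temporarily mount an OST dataset directly for fio testing
            zfs set canmount=on ostpool/ost0
            zfs set mountpoint=/mnt/zfs-test ostpool/ost0
            zfs mount ostpool/ost0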

            Given that we have another file system running with the exact same tunings and general layout, we also checked that file system in the same manner, with much the same results. Since we have past benchmarking results from that file system, we're fairly confident that at some point in the past ZFS was functioning correctly. With that knowledge (and after looking at various ZFS GitHub issues) we decided to roll back from ZFS 0.8.5 to 0.7.13 to test the performance there. It seems that 0.7.13 also gives the same results.

            There may be value in rolling back our kernel to match what it was when we initialized the other file system, in case there's some odd interaction with the kernel version we're running, but I'm not sure.

            Here are the results of our testing on a single LUN with ZFS. Keep in mind this LUN can do ~7GB/s at the block level.

              files    | read     | write
              1 file   | 396 MB/s | 4.2 GB/s
              4 files  | 751 MB/s | 4.7 GB/s
              12 files | 1.6 GB/s | 4.7 GB/s

            And here's the really simple fio we're running to get these numbers:

            fio --rw=read --size 20G --bs=1M --name=something --ioengine=libaio --runtime=60s --numjobs=12
            

            We're also noticing some issues where Lustre is eating into those numbers significantly when layered on top. We're going to hold off on debugging that until ZFS is stable, though, as it may just be due to the same ZFS issues.

            nilesj Jeff Niles added a comment -

            Here's a table of performance values using lnet_selftest at various conns_per_peer values. I assume that this changes with CPU clock speed and other factors, but it at least shows the scaling for the patch commit message.

            conns_per_peer | speed
            1              | 1.7 GiB/s
            2              | 3.3 GiB/s
            4              | 6.4 GiB/s
            8              | 11.5 GiB/s
            16             | 11.5 GiB/s

            We did some more troubleshooting on our end yesterday and are suspecting some serious zfs issues. Currently testing an older ZFS version, and will comment again later after some more testing.


            adilger Andreas Dilger added a comment -

            nilesj could you share the performance results for different conns_per_peer values? It would be useful to include a table with this information in the commit message for the patch.

            As for the patch Amir mentioned, that was speculation regarding high CPU usage in osc_page_gang_lookup(). I can't definitively say whether that patch will help improve performance or not. Getting the flamegraphs for this would be very useful, along with what the test workload/parameters are (I'd assume IOR, but the options used are critical).
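            (A typical way to capture such a flamegraph on a server during a test run, assuming Brendan Gregg's FlameGraph scripts are available, is roughly:)

            # sketch: sample all CPUs at 99 Hz for 60 s, then render a flamegraph
            perf record -F 99 -a -g -- sleep 60
            perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > oss-profile.svg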


            People

              Assignee: Amir Shehata (Inactive)
              Reporter: Jeff Niles
              Votes: 0
              Watchers: 11
