Liang, thanks for your suggestions. I started working through the options and came up with a solution that should work for us. With what I'm about to describe, I reliably streamed files at 7.2 to 7.4 GB/s to 12 clients, with each client reading 8 files. I think there's room for improvement in the performance, and certainly in reducing the number of clients needed, but this was repeatable and it's a lot of progress.
First, I made a mistake about the placement of the HBAs: two of them are on CPU0 with the NICs. All of this was on the server with dual Intel E5-2650v2 processors (8-core, 2.6 GHz). In ASCII art, the PCI layout looks like this:
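(Rough sketch only; slot and lane details are left out, and CPU1 simply holds the HBAs that are not on CPU0.)

    CPU0 ----QPI---- CPU1
     |                |
     +- NICs (bonded) +- remaining HBAs
     +- HBA
     +- HBA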
We have the freedom to move cards around (somewhat), but not to break the network bonding. The ZFS zpools are configured as raidz2 8+2, with one 10-drive pool spanning the 25-drive HBAs on CPU0 and CPU1.
What I found was that restricting the ksocklnd tasks to CPU0 had the biggest impact, and that it was better to let the other tasks run on both CPU0 and CPU1. Here are the configuration files from the servers:
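In rough form (a minimal sketch, not a verbatim copy of our files; the bond0 interface name, the file path, and the exact core split are placeholders):

    # /etc/modprobe.d/lustre.conf on the OSS (illustrative)
    # Two CPU partitions, one per NUMA node; cores 0-1 on node 0 are kept
    # out of CPT 0 so they stay free for interrupts (see the IRQ note below).
    options libcfs cpu_pattern="0[2-7] 1[8-15]"
    # Bind the TCP LNet interface, and with it the ksocklnd threads,
    # to CPT 0, the partition on the socket that has the NICs.
    options lnet networks="tcp0(bond0)[0]"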
Moving the various OSS tasks to partition 0 or 1 did not help, most likely because the topology does not match what I described originally.
The client configuration is minimal, with the only change being max_rpcs_in_flight set to 16.
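That is the standard per-OSC tunable; on each client it amounts to, for example:

    lctl set_param osc.*.max_rpcs_in_flight=16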
You'll note that the number of credits and RPCs in flight did not need to be very high. I attribute this to a relatively low bandwidth-delay product (10 GB/s x 0.1 ms = 1 MB). I tested larger values for the maximum pages per RPC, but that drove down performance. I need to revisit that, since it could be related to the BDP or the ZFS record size (also 1 MB), or it could be improved by the ZFS tuning I did.
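For context, the arithmetic behind the 1 MB figure and how it lines up with the RPC size (back-of-the-envelope numbers, not measurements):

    # bandwidth-delay product for this setup
    #   10 GB/s * 0.1 ms = 10e9 B/s * 1e-4 s = 1e6 B ~= 1 MB
    # with 4 KiB pages, a 1 MiB RPC corresponds to 1 MiB / 4 KiB = 256 pages
    lctl get_param osc.*.max_pages_per_rpc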
One thing that surprised me was that setting the IRQ affinity for the Mellanox NICs reduced performance. However, it was still better to restrict the CPU partition on NUMA node 0 to cores [2-7].
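For concreteness, this is the kind of IRQ affinity change I mean (illustrative only; the mlx4 driver name and the IRQ numbers depend on the exact NIC, and pinning to cores 0-1 is just one possible mask):

    # see which IRQs belong to the NIC and where they are allowed to run
    grep mlx4 /proc/interrupts
    cat /proc/irq/<irq>/smp_affinity
    # e.g. pin an IRQ to cores 0-1 (mask 0x3); this sort of pinning hurt here
    echo 3 > /proc/irq/<irq>/smp_affinity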
The last thing that helped get the performance up was to improve the chances for ZFS to prefetch data. While testing, I ran an experiment to separate the impact of the networking from that of ZFS: several (~10) clients read the same 64 GiB file from an OST. The file size was chosen to match the maximum size of the ZFS ARC, plus whatever caches Lustre had. When doing this, the server bandwidth was saturated at 10 GB/s, which showed that getting data from the drives into memory was the critical part, even when the data had to cross the QPI link.
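For reference, the ARC ceiling I'm matching against is the zfs_arc_max module parameter (0 means the built-in default), which can be checked at runtime:

    cat /sys/module/zfs/parameters/zfs_arc_max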
The branch of ZFS I'm using sets most of the tuning parameters to 0, and the important one turned out to be zfs_vdev_cache_size. My reading of random blog posts suggests that this impacts prefetch from the DMU.
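Concretely, the change amounts to giving the vdev cache a non-zero size; the 10 MiB value below is just an example, not a recommendation:

    # /etc/modprobe.d/zfs.conf (example value)
    options zfs zfs_vdev_cache_size=10485760
    # or at runtime:
    echo 10485760 > /sys/module/zfs/parameters/zfs_vdev_cache_size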
Regardless, this immediately improved the rate at which the zpools could deliver data.
This is a bit of a long comment because I wanted to capture a lot of the details. If you see anything worth examining given my corrected information, please let me know. Our next step from here is to try incorporating the patches we're using into a stable release, and to retest with the Linux 2.6 kernel or with the 3.10 kernel-lt package from ELRepo.
I don't think that any further work is needed here.