Thanks for explaining that, Liang. What I'm seeing is that it takes a very large number of clients to get good read bandwidth numbers. Our servers have 6 OSTs, each of which delivers 1.5 GB/s locally using dd and ZFS, for 9 GB/s in aggregate. When mounting over the network, a single client tops out at 1.2 GB/s from a single OSS, no matter how many tasks are running or whether the files are striped over one OST or several. It feels like something is holding back per-client bandwidth: it takes four clients to get 1.5 GB/s out of an OST, when one should be enough.
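For reference, the local baseline was measured along these lines, with one dd per OST dataset (the pool/dataset paths and test file names here are illustrative, not our actual layout):
[oss] $ dd if=/tank/ost0/testfile of=/dev/null bs=16M count=4096
[oss] $ for i in 0 1 2 3 4 5; do dd if=/tank/ost$i/testfile of=/dev/null bs=16M count=4096 & done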
Our servers have bonded 40GbE interfaces, and the clients use TCP via IPoIB and Mellanox gateway switches that bridge between Ethernet and InfiniBand. Here are some simple measurements showing the state of the network (I used a single-stream iperf test, since Lustre only moves reads and writes over individual sockets):
[client] $ ping 192.168.95.158
...
64 bytes from 192.168.95.158: icmp_seq=4 ttl=62 time=0.106 ms
64 bytes from 192.168.95.158: icmp_seq=5 ttl=62 time=0.108 ms
64 bytes from 192.168.95.158: icmp_seq=6 ttl=62 time=0.106 ms
64 bytes from 192.168.95.158: icmp_seq=7 ttl=62 time=0.103 ms
[client] $ iperf -c 192.168.95.158
------------------------------------------------------------
Client connecting to 192.168.95.158, TCP port 5001
TCP window size: 92.9 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.123.110 port 37190 connected with 192.168.95.158 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 34.6 GBytes 29.7 Gbits/sec
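That single stream works out to roughly 3.7 GB/s (29.7 Gbit/s divided by 8), so the raw network path has about three times the headroom of the 1.2 GB/s I see through Lustre, with latency around 0.1 ms.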
When trying to dd 4 files striped across all OSTs of the OSS, 32 peer_credits was not enough:
[oss] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd peer_credits=32 credits=1024
[oss] $ grep 110 /proc/sys/lnet/peers
192.168.123.110@tcp 33 NA -1 32 32 32 0 -9 33556736
On the client:
max_pages_per_rpc = 1024
max_rpcs_in_flight = 16
[client] $ cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd peer_credits=32 credits=128
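For reference, those two client values are the OSC tunables; assuming a reasonably recent Lustre they can be read and set with lctl roughly like this (the wildcard covers all OSCs on the client):
[client] $ lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight
[client] $ lctl set_param osc.*.max_rpcs_in_flight=16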
Observing brw_stats under /proc/fs/lustre/osd-zfs/*/brw_stats shows that I/O requests are coming in at 4M, as expected. We're running Lustre and ZFS with large block support, which is why we get good streaming performance from single OSTs.
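If anyone wants to reproduce the check, this is roughly what I looked at on the OSS (the pool/dataset name is a placeholder):
[oss] $ cat /proc/fs/lustre/osd-zfs/*/brw_stats
[oss] $ zfs get recordsize tank/ost0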
After seeing the negative peer_credits, I increased them and reran. Here's an example where reading 4 files uses the same amount of bandwidth as reading a single file, and less than either the client or the server is capable of.
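If I have the LNet accounting right, a socklnd message carries at most 1 MB of bulk data, so each 4 MB RPC ties up about 4 peer credits, and 16 RPCs in flight would want around 64 of them, which would explain the negative credit counts at peer_credits=32 and is why I went to 64 below.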
New server settings
[server] cat /etc/modprobe.d/ksocklnd.conf
options ksocklnd peer_credits=64 credits=1024
Single dd run
[client] $ dd if=stripedblob6-1 of=/dev/null bs=24M count=2048
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 48.6514 s, 1.1 GB/s
Four simultaneous dd tasks
[client] $ for i in 0 1 2 3; do dd if=stripedblob6-$i of=/dev/null bs=24M count=2048 skip=2048 & done
[1] 29932
[2] 29933
[3] 29934
[4] 29935
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 167.059 s, 309 MB/s
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 171.848 s, 300 MB/s
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 179.851 s, 287 MB/s
2048+0 records in
2048+0 records out
51539607552 bytes (52 GB) copied, 182.335 s, 283 MB/s
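The four streams add up to 309 + 300 + 287 + 283, roughly 1.18 GB/s in aggregate, essentially the same as the 1.1 GB/s single dd.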
Running zpool iostat on the server shows similar bandwidth. Messing with the ZFS ARC doesn't change things, since I'm deliberately blowing through any caches with large file sizes.
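For completeness, the server-side observation came from something along these lines (the pool name is a placeholder):
[oss] $ zpool iostat -v tank 1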
A few suggestions: