[LU-58] poor LNet performance over QLogic HCAs Created: 02/Feb/11  Updated: 21/Sep/11  Resolved: 13/Jun/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Kit Westneat (Inactive) Assignee: Lai Siyao
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: File lnet_selftest.sh     File opreport-l-p-2.out     File opreport-l-p-3.out     File opreport-l-p.out     File opreport-l.out     File opreport.out    
Severity: 3
Rank (Obsolete): 10244

 Description   

We have been testing QLogic HCAs for several customers and have run into an issue at our lab where rdma_bw is able to get 2.5GB/s or so, but lnet_selftest only gets 1GB/s. Actually I have gotten as much as ~1200MB/s, which leads me to believe it's capping out at 10Gb/s.

Have you ever seen this? Is there anything we can do to debug this from a ko2iblnd point of view? We have already engaged QLogic and they can't find anything wrong.



 Comments   
Comment by Cliff White (Inactive) [ 02/Feb/11 ]

How many CPUs does the system have?

Comment by Kit Westneat (Inactive) [ 02/Feb/11 ]

2 sockets, 8 cores total

model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz

Comment by Liang Zhen (Inactive) [ 02/Feb/11 ]

Could you post your test script here? I would like to see the details of the test.

Comment by Kit Westneat (Inactive) [ 02/Feb/11 ]

Here is the rdma_bw test I ran:

[root@oss0 ~]# rdma_bw oss1-ib0
22891: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 |
22891: Local address: LID 0x06, QPN 0x0035, PSN 0x31620c RKey 0x7fdfe00 VAddr 0x002acb8070e000
22891: Remote address: LID 0x02, QPN 0x005d, PSN 0xc4e9d0, RKey 0x4191a00 VAddr 0x002b70d50bf000

22891: Bandwidth peak (#19 to #999): 2633.54 MB/sec
22891: Bandwidth average: 2606.78 MB/sec
22891: Service Demand peak (#19 to #999): 838 cycles/KB
22891: Service Demand Avg : 847 cycles/KB

For the LNet test, I'm using a wrapper script to call lst; I'll attach it:

[root@oss1 ~]# lnet_selftest.sh -c "192.168.99.10[1,2]@o2ib" -s 192.168.99.103@o2ib -w
You need to manually load lnet_selftest on all nodes
modprobe lnet_selftest
LST_SESSION=8760
SESSION: read/write TIMEOUT: 300 FORCE: No
192.168.99.103@o2ib are added to session
192.168.99.10[1,2]@o2ib are added to session
Test was added successfully
batch is running now
[LNet Rates of servers]
[R] Avg: 1145 RPC/s Min: 1145 RPC/s Max: 1145 RPC/s
[W] Avg: 2294 RPC/s Min: 2294 RPC/s Max: 2294 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 0.17 MB/s Min: 0.17 MB/s Max: 0.17 MB/s
[W] Avg: 1147.23 MB/s Min: 1147.23 MB/s Max: 1147.23 MB/s
session is ended

[root@oss1 ~]# lnet_selftest.sh -c "192.168.99.101@o2ib" -s 192.168.99.103@o2ib -w
You need to manually load lnet_selftest on all nodes
modprobe lnet_selftest
LST_SESSION=8815
SESSION: read/write TIMEOUT: 300 FORCE: No
192.168.99.103@o2ib are added to session
192.168.99.101@o2ib are added to session
Test was added successfully
batch is running now
[LNet Rates of servers]
[R] Avg: 1242 RPC/s Min: 1242 RPC/s Max: 1242 RPC/s
[W] Avg: 2484 RPC/s Min: 2484 RPC/s Max: 2484 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 0.19 MB/s Min: 0.19 MB/s Max: 0.19 MB/s
[W] Avg: 1241.94 MB/s Min: 1241.94 MB/s Max: 1241.94 MB/s
session is ended

[root@oss1 ~]# lnet_selftest.sh -c "192.168.99.102@o2ib" -s 192.168.99.103@o2ib -w
You need to manually load lnet_selftest on all nodes
modprobe lnet_selftest
LST_SESSION=8837
SESSION: read/write TIMEOUT: 300 FORCE: No
192.168.99.103@o2ib are added to session
192.168.99.102@o2ib are added to session
Test was added successfully
batch is running now
[LNet Rates of servers]
[R] Avg: 1087 RPC/s Min: 1087 RPC/s Max: 1087 RPC/s
[W] Avg: 2175 RPC/s Min: 2175 RPC/s Max: 2175 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 0.17 MB/s Min: 0.17 MB/s Max: 0.17 MB/s
[W] Avg: 1087.11 MB/s Min: 1087.11 MB/s Max: 1087.11 MB/s

Comment by Kit Westneat (Inactive) [ 02/Feb/11 ]

driver script for lst

Comment by Liang Zhen (Inactive) [ 03/Feb/11 ]

Kit,

We do have an SMP performance issue with lnet_selftest (we will have a patch for this in a few weeks), but I'm not sure whether 2 x 4 cores would hit it.
If possible, could you please try the following to help us investigate:

  • disable one socket to see whether it helps lnet_selftest performance
  • disable two cores on each socket, and measure performance with selftest
  • run it with 2 clients and 1 server, and run "lst stat" on the server to see performance (see the sketch after this list)
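
For reference, a minimal lst sketch of that 2-client/1-server run, sampling the server side with "lst stat" (the NIDs are placeholders copied from Kit's earlier runs, and lnet_selftest must be loaded on every node first):

  modprobe lnet_selftest                  # on every node
  export LST_SESSION=$$
  lst new_session smp_check
  lst add_group clients 192.168.99.10[1,2]@o2ib
  lst add_group servers 192.168.99.103@o2ib
  lst add_batch bulk
  lst add_test --batch bulk --from clients --to servers brw write size=1M
  lst run bulk
  lst stat servers & sleep 30; kill $!    # server-side LNet rates/bandwidth for ~30s
  lst end_session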

Thanks
Liang

Comment by Lai Siyao [ 08/Feb/11 ]

Peter, I will talk with Liang and work on this.

Comment by Liang Zhen (Inactive) [ 09/Feb/11 ]

Kit, another question here is about NUMA: is NUMA enabled on your system (2 nodes or 1 node)?

Thanks
Liang

Comment by Kit Westneat (Inactive) [ 09/Feb/11 ]

Hi, sorry I haven't had a lot of time to do testing recently. It looks like NUMA is enabled (I don't know very much about NUMA yet):

available: 2 nodes (0-1)
node 0 size: 12120 MB
node 0 free: 11446 MB
node 1 size: 12090 MB
node 1 free: 11668 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Thanks,
Kit

Comment by Christopher Morrone [ 11/Feb/11 ]

FYI, LLNL also had trouble getting good performance out of our QLogic cards with LNet. The main trouble we found is that while they implement the verbs interface for RDMA calls, the operations are not actually RDMA.

In other words, "RDMA" operations with the qlogic cards are not zero-copy.
The qlogic cards can't write directly into the destination buffer on the node; they need to do a memory copy.

There were other tweaks we made that got performance a little higher, but ultimately the lack of true RDMA support on the card was the limiting factor.

Our IB guy is out today, and I don't remember the details of what he did to tweak the qlogic performance. I think that the in-kernel verbs interface only has a single qlogic ring buffer by default, and I believe that he increased that to 4 and we saw some benefit.

Comment by Ira Weiny (Inactive) [ 11/Feb/11 ]

Disclaimer: We are running the 7340 card so if you have another card I don't know if this will apply or not.

I have forgotten some of the details but check your driver for the following options. Here are the settings we are using.

options ib_qib krcvqs=4
options ib_qib rcvhdrcnt=32768

The krcvqs option increases the number of receive queues used by the driver. We have 12 cores per node and the card has 18 contexts; one of those is used for something I don't remember. For the rest, QLogic recommends allocating one per core, which left us with 5 (you will have more). We played around and 4 seemed to give the best performance. However, this required a patch to the module to make it actually use all 4 contexts; QLogic has the final patch and should be able to provide it.

The rcvhdrcnt option increases a header descriptor count (again, I would have to dig up the details). Regardless, this option was another patch to the driver and is now in the upstream kernel. We came across the need for it when we got hangs from the card; QLogic fixed the hang with another patch, so you may need to make sure that is available as well. Anyway, during all that testing we found performance was a bit better with rcvhdrcnt set higher, so we left it.
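
For reference, a minimal sketch of applying and verifying these settings (assuming ib_qib exposes them as standard module parameters under /sys/module and that the driver can be reloaded while IB is idle; paths may differ by distribution):

  # /etc/modprobe.conf (or a file under /etc/modprobe.d/)
  options ib_qib krcvqs=4
  options ib_qib rcvhdrcnt=32768

  # reload the driver, then confirm the running values
  modprobe -r ib_qib && modprobe ib_qib
  cat /sys/module/ib_qib/parameters/krcvqs
  cat /sys/module/ib_qib/parameters/rcvhdrcnt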

I will check with QLogic and make sure but I don't see why you could not pull our version of the driver. I have a git tree which will build stand alone against a current RHEL5 kernel. (It may need other modifications for other kernels). Let me know if you would like that.

Hope this helps,
Ira

Comment by Liang Zhen (Inactive) [ 11/Feb/11 ]

Chris, Ira,

Thanks for your information.
We do have performance issues on NUMA systems (selftest, obdfilter-survey, and the whole Lustre stack), and as you said, QLogic doesn't have true RDMA support on the card, so the memory copy might make it worse, especially since Lustre/LNet has many threads context-switching across the stack...

I've got a branch to support NUMA better, but the patch for lnet_selftest is still in progress (actually I have an old patch for lnet_selftest, but it no longer applies to any branch). Most of the work in the other modules is done; I will post it here when I finish the lnet_selftest patch.

Kit,
If you have a chance to run those tests, please also collect the output of numastat before and after each test (I don't know whether there is any way to reset the counters...); one way to capture this is sketched below.
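
A minimal way to capture that (numastat has no reset, so diffing two snapshots is the usual workaround; the file names here are arbitrary):

  numastat > /tmp/numastat.before
  # ... run the lnet_selftest / IOR test ...
  numastat > /tmp/numastat.after
  diff -u /tmp/numastat.before /tmp/numastat.after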

Thanks again
Liang

Comment by Kit Westneat (Inactive) [ 14/Feb/11 ]

Unfortunately I don't currently have access to the systems, so I'm unable to do more testing. Here is the modprobe.conf line I was using:
options ib_qib singleport=1 krcvqs=8 rcvhdrcnt=4096

We're using the latest engineering build of the qib driver. I had the QLogic folks on the system looking at it, but they couldn't see anything particularly wrong. The severe performance difference between rdma_bw and lst is what makes me think it's an issue at the LNet level; I've never seen such a large difference.

I'll let you know how the NUMA testing goes when I'm able to get back on the system.

Comment by Kit Westneat (Inactive) [ 15/Feb/11 ]

Ira, Chris,

In your tests, did lnet_selftest performance match rdma_bw performance or did you see a mismatch? Did you see Lustre performance more on the level of lnet_selftest? I just want to see if my experience matches yours.

Thanks,
Kit

Comment by Shuichi Ihara (Inactive) [ 20/Feb/11 ]

Here are various benchmarks on InfiniBand comparing Mellanox and QLogic.
We know that Lustre performance with Mellanox QDR is good and close to wire speed. However, with QLogic QDR we can only see 2.2-2.5 GB/sec with the RDMA benchmark, and 1.4 GB/sec on the OSS and 700 MB/sec on the client with LNet.
MPI on QLogic QDR performs well, but everything else is really not good compared with Mellanox.

Comment by Shuichi Ihara (Inactive) [ 25/Feb/11 ]

Ira,

Have you tried measuring rdma_bw or LNet performance before?
We are also working with QLogic and tried to run LNet selftest with the QLogic HCA, but are only getting around 2 GB/sec per server. Please see the lnet_selftest numbers below for one server and 4 clients.

 
[LNet Bandwidth of s1]
[R] Avg: 1891.21  MB/s  Min: 1891.21  MB/s  Max: 1891.21  MB/s
[W] Avg: 0.29     MB/s  Min: 0.29     MB/s  Max: 0.29     MB/s
[LNet Bandwidth of c1]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 460.37   MB/s  Min: 460.37   MB/s  Max: 460.37   MB/s
[LNet Bandwidth of c2]
[R] Avg: 0.06     MB/s  Min: 0.06     MB/s  Max: 0.06     MB/s
[W] Avg: 387.04   MB/s  Min: 387.04   MB/s  Max: 387.04   MB/s
[LNet Bandwidth of c3]
[R] Avg: 0.08     MB/s  Min: 0.08     MB/s  Max: 0.08     MB/s
[W] Avg: 533.66   MB/s  Min: 533.66   MB/s  Max: 533.66   MB/s
[LNet Bandwidth of c4]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 466.42   MB/s  Min: 466.42   MB/s  Max: 466.42   MB/s
[LNet Bandwidth of s1]
[R] Avg: 2201.17  MB/s  Min: 2201.17  MB/s  Max: 2201.17  MB/s
[W] Avg: 0.34     MB/s  Min: 0.34     MB/s  Max: 0.34     MB/s
[LNet Bandwidth of c1]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 485.16   MB/s  Min: 485.16   MB/s  Max: 485.16   MB/s
[LNet Bandwidth of c2]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 471.09   MB/s  Min: 471.09   MB/s  Max: 471.09   MB/s
[LNet Bandwidth of c3]
[R] Avg: 0.10     MB/s  Min: 0.10     MB/s  Max: 0.10     MB/s
[W] Avg: 668.38   MB/s  Min: 668.38   MB/s  Max: 668.38   MB/s
[LNet Bandwidth of c4]
[R] Avg: 0.10     MB/s  Min: 0.10     MB/s  Max: 0.10     MB/s
[W] Avg: 627.77   MB/s  Min: 627.77   MB/s  Max: 627.77   MB/s

Regarding the RDMA benchmark, we did some QLogic tuning and could get 3 GB/sec as a peak, but it is still low when the message size is big (e.g. 512K, 1M, 2M...) compared with the Mellanox HCA. So I wonder how much RDMA and LNet performance you are getting.

 
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 2          5000           0.68               0.67   
 4          5000           1.39               1.36   
 8          5000           2.77               2.73   
 16         5000           5.61               5.44   
 32         5000           11.19              10.88  
 64         5000           22.40              21.88  
 128        5000           45.89              43.70  
 256        5000           92.18              89.20  
 512        5000           194.19             187.29 
 1024       5000           397.34             370.98 
 2048       5000           798.60             776.63 
 4096       5000           1343.20            1281.17
 8192       5000           1920.83            1865.76
 16384      5000           2588.50            2537.42
 32768      5000           3159.68            3153.84
 65536      5000           3162.81            3162.80
 131072     5000           3075.97            3056.87
 262144     5000           3011.65            2432.95
 524288     5000           2948.59            2757.53
 1048576    5000           2910.32            2754.89
 2097152    5000           2884.98            2761.72
 4194304    5000           2860.59            2769.83
 8388608    5000           2764.13            2667.08

Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ]

Hello Liang,

After some QLogic tuning, we got 9.6 GB/sec write from 4 OSSs (2.4-2.5 GB/sec per OSS). This is not perfect, but given the QLogic RDMA benchmark (~2.7 GB/sec), the number is reasonable.

But the read is still slow on Lustre. The problem seems to be that the kiblnd_sd_XX threads are consuming a lot of CPU. Please see the "top" output below, taken during the benchmark. I'm also collecting oprofile data and will post it. Could you have a look at them, please?

top - 22:38:03 up 2 days, 35 min,  2 users,  load average: 48.77, 42.98, 31.63
Tasks: 512 total,  51 running, 461 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us, 84.7%sy,  0.0%ni,  0.0%id,  0.0%wa, 13.8%hi,  0.7%si,  0.0%st
Mem:  24545172k total, 24328468k used,   216704k free,     3188k buffers
Swap:  2096376k total,      612k used,  2095764k free, 23775256k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                         
19702 root      25   0     0    0    0 R 80.0  0.0 129:29.76 kiblnd_sd_06                                                     
19698 root      25   0     0    0    0 R 79.4  0.0 129:37.71 kiblnd_sd_02                                                     
19701 root      25   0     0    0    0 R 78.4  0.0 129:43.95 kiblnd_sd_05                                                     
19703 root      25   0     0    0    0 R 77.8  0.0 129:28.93 kiblnd_sd_07                                                     
19700 root      25   0     0    0    0 R 77.5  0.0 130:07.11 kiblnd_sd_04                                                     
19697 root      25   0     0    0    0 R 77.2  0.0 129:00.45 kiblnd_sd_01                                                     
19699 root      25   0     0    0    0 R 66.3  0.0 129:06.17 kiblnd_sd_03                                                     
19696 root      25   0     0    0    0 R 49.6  0.0 129:31.55 kiblnd_sd_00                                                     
29257 root      15   0 24052 4744 1576 S 16.0  0.0   0:47.91 oprofiled                                                        
  564 root      10  -5     0    0    0 S  5.4  0.0  91:42.85 kswapd0                                                          
  565 root      10  -5     0    0    0 S  2.9  0.0  25:56.07 kswapd1                                                          
19965 root      15   0     0    0    0 R  2.6  0.0   7:54.08 ll_ost_io_08                                                     
20022 root      15   0     0    0    0 S  2.6  0.0   7:17.86 ll_ost_io_65                                                     
20037 root      15   0     0    0    0 R  2.6  0.0   7:24.11 ll_ost_io_80                                                     
20055 root      15   0     0    0    0 S  2.6  0.0   7:12.28 ll_ost_io_98   

Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ]

opreport output; it seems ib_qib (the QLogic driver) is the highest.

Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ]

"opreport -l" output.

Comment by Liang Zhen (Inactive) [ 27/Feb/11 ]

Ihara, could you run opreport like this: "opreport -l -p /lib/modules/`uname -r` > output_file", so we can see the symbols of the Lustre modules?

Thanks
Liang

Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ]

Liang, attached is output of "opreport -l -p /lib/modules/`uname -r`".

Comment by Liang Zhen (Inactive) [ 28/Feb/11 ]

Ihara, thanks, I have a few more questions:

  • how many clients are in your tests?
  • as you said, read performance is "slow"; do you have any numbers at hand?
  • I assume the opreport output is from the OSS and is for the "read" tests, right?

I think we probably can't help much with the high CPU load if QLogic doesn't have true RDMA. Also, would it be possible for you to run lnet_selftest (read and write separately, 2 or more clients with one server, concurrency=8, and brw test size=1M) to see LNet performance?
I'm digging into the QLogic driver; at the same time, it would be very helpful if you could also try ko2iblnd map_on_demand=32 (sorry, but this has to be set on all nodes) to see whether it helps performance. A sketch of such a run follows below.
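
For reference, relative to the lst sketch in the 03/Feb comment above, the requested run would look something like this (group names are placeholders; map_on_demand must be set before ko2iblnd is loaded, on every node):

  # on every node, e.g. in modprobe.conf, then reload the Lustre/LNet modules:
  options ko2iblnd map_on_demand=32

  # read and write measured in separate runs, 1M bulk, concurrency 8:
  lst add_test --batch bulk --concurrency 8 --from clients --to servers brw read size=1M
  lst add_test --batch bulk --concurrency 8 --from clients --to servers brw write size=1M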

Thanks
Liang

Comment by Shuichi Ihara (Inactive) [ 28/Feb/11 ]

Hi Liang,

We have 40 clients (AMD, 48 cores, 128 GB memory per client). The server is an Intel Westmere, 8 cores, 24 GB memory. Each client's I/O throughput is also slow (500 MB/sec per client); I think that is a separate problem (many-cores related). However, we are not focusing on that problem; we just need aggregate server throughput for write/read.

Here are the IOR results with 40 clients and 4 OSSs. We could get 9.6 GB/sec for write, but only 6.7 GB/sec for read.
Max Write: 9197.14 MiB/sec (9643.90 MB/sec)
Max Read: 6408.91 MiB/sec (6720.23 MB/sec)

The opreport I sent is what I got on the OSS during the read I/O testing, so your assumption is correct.

Here is LNet write/read testing from 8 clients to a single server. I haven't tried map_on_demand=32 yet; let me try it soon and I will update you.

  • Write
    [LNet Bandwidth of s]
    [R] Avg: 3191.25 MB/s Min: 3191.25 MB/s Max: 3191.25 MB/s
    [W] Avg: 0.49 MB/s Min: 0.49 MB/s Max: 0.49 MB/s
    [LNet Bandwidth of c]
    [R] Avg: 0.06 MB/s Min: 0.06 MB/s Max: 0.06 MB/s
    [W] Avg: 399.11 MB/s Min: 398.03 MB/s Max: 401.14 MB/s
    [LNet Bandwidth of s]
    [R] Avg: 3188.86 MB/s Min: 3188.86 MB/s Max: 3188.86 MB/s
    [W] Avg: 0.49 MB/s Min: 0.49 MB/s Max: 0.49 MB/s
    [LNet Bandwidth of c]
    [R] Avg: 0.06 MB/s Min: 0.06 MB/s Max: 0.06 MB/s
    [W] Avg: 399.04 MB/s Min: 397.43 MB/s Max: 400.66 MB/s
    [LNet Bandwidth of s]
    [R] Avg: 3194.23 MB/s Min: 3194.23 MB/s Max: 3194.23 MB/s
    [W] Avg: 0.49 MB/s Min: 0.49 MB/s Max: 0.49 MB/s
    [LNet Bandwidth of c]
    [R] Avg: 0.06 MB/s Min: 0.06 MB/s Max: 0.06 MB/s
    [W] Avg: 399.31 MB/s Min: 398.11 MB/s Max: 401.14 MB/s
  • Read
    [LNet Bandwidth of s]
    [R] Avg: 0.25 MB/s Min: 0.25 MB/s Max: 0.25 MB/s
    [W] Avg: 1598.90 MB/s Min: 1598.90 MB/s Max: 1598.90 MB/s
    [LNet Bandwidth of c]
    [R] Avg: 199.59 MB/s Min: 196.99 MB/s Max: 203.74 MB/s
    [W] Avg: 0.03 MB/s Min: 0.03 MB/s Max: 0.03 MB/s
    [LNet Bandwidth of s]
    [R] Avg: 0.25 MB/s Min: 0.25 MB/s Max: 0.25 MB/s
    [W] Avg: 1600.68 MB/s Min: 1600.68 MB/s Max: 1600.68 MB/s
    [LNet Bandwidth of c]
    [R] Avg: 200.03 MB/s Min: 198.05 MB/s Max: 204.79 MB/s
    [W] Avg: 0.03 MB/s Min: 0.03 MB/s Max: 0.03 MB/s
    [LNet Bandwidth of s]
    [R] Avg: 0.24 MB/s Min: 0.24 MB/s Max: 0.24 MB/s
    [W] Avg: 1552.72 MB/s Min: 1552.72 MB/s Max: 1552.72 MB/s
    [LNet Bandwidth of c]
    [R] Avg: 194.65 MB/s Min: 190.99 MB/s Max: 198.14 MB/s
    [W] Avg: 0.03 MB/s Min: 0.03 MB/s Max: 0.03 MB/s

Comment by Shuichi Ihara (Inactive) [ 28/Feb/11 ]

Liang,

With map_on_demand=32 it shows better performance.

Max Write: 9723.80 MiB/sec (10196.15 MB/sec)
Max Read: 7797.74 MiB/sec (8176.52 MB/sec)

Is it worth trying smaller values for map_on_demand (e.g. map_on_demand=16) to see? We want 9.4 GB/sec for read/write.

btw, what does map_on_demand mean?

Thanks
Ihara

Comment by Liang Zhen (Inactive) [ 28/Feb/11 ]

Ihara, map_on_demand means that we will enable FMR; map_on_demand=32 will use FMR for any RDMA larger than 32 * 4K (128K). So I suspect a smaller map_on_demand will not help unless the I/O request size is < 128K.

ko2iblnd doesn't use FMR by default; it just creates a global MR and premaps all memory. That is quick on some HCAs, especially for small I/O requests, because we don't need to map again before RDMA. However, I took a quick look at the QIB source code, and it looks heavy to send fragments one by one in qib_post_send->qib_post_one_send: for a 1M I/O request, qib_post_one_send is called 256 times. So I think that if we enable FMR and map the pages into one fragment, it will reduce a lot of overhead.

Could you please gather oprofile data with map_on_demand enabled, so we can try to find out whether there is more we can improve?

Regards
Liang

Comment by Shuichi Ihara (Inactive) [ 28/Feb/11 ]

Liang,

Thanks for the explanation. I've just attached the output of "opreport -l -p /lib/modules/`uname -r`" after setting ko2iblnd map_on_demand=32. I collected this data during the read testing.

Comment by Liang Zhen (Inactive) [ 01/Mar/11 ]

I suspect ib_post_send does a memory copy for QIB. If so, probably the only way is to move ib_post_send out from under the o2iblnd spinlock (conn:ibc_lock). It's not very easy, because the o2iblnd credits system relies on this spinlock. I need some time to think about it.

Thanks
Liang

Comment by Liang Zhen (Inactive) [ 01/Mar/11 ]

Ihara,

I guess I was wrong in the previous comment; there is no reason to call ib_post_send while holding ibc_lock. Although I'm not 100% sure, I think we only have this because o2iblnd inherited this piece of code from an older LND (iiblnd), which didn't allow re-entry on the same QP; that is not the case with OFED.
I've posted a patch here: http://review.whamcloud.com/#change,285
This patch releases ibc_lock before calling ib_post_send, which will avoid a lot of contention when other threads want to post more data on the same connection.

If possible, could you please try this patch with and without map_on_demand and collect oprofile data?
It's still an experimental patch, so please don't be surprised if you hit any problems with it...

Liang

Comment by Shuichi Ihara (Inactive) [ 01/Mar/11 ]

Liang,

Thanks. Attached is the oprofile data I collected on the OSS when I ran the read benchmark without the map_on_demand setting, after applying your patch.

Comment by Liang Zhen (Inactive) [ 01/Mar/11 ]

Hi Ihara, could you provide performance data as well? thanks
Liang

Comment by Shuichi Ihara (Inactive) [ 01/Mar/11 ]

Liang,

Nope, it was slower than map_on_demand=32 without the patch. I'm going to run the benchmark again with map_on_demand=32 and the patch.

Comment by Shuichi Ihara (Inactive) [ 02/Mar/11 ]

map_on_demand=32 with the patch was slower than map_on_demand=32 without the patch.
Max Write: 8752.85 MiB/sec (9178.03 MB/sec)
Max Read: 6623.76 MiB/sec (6945.51 MB/sec)

Even though the functions in ko2iblnd are reduced, performance doesn't go up. Instead, qib_sdma_verbs_send() now goes really high...

So, is fixing qib the only way to improve performance?

Comment by Liang Zhen (Inactive) [ 03/Mar/11 ]

Ihara,
when we run lnet_selftest with the "read" test, o2iblnd has one more message than with the "write" test:

READ
(selftest req)
server: <-- PUT_REQ <-- client
server: --> PUT_NOACK --> client
(selftest bulk)
server: --> PUT_REQ --> client
server: <-- PUT_ACK <-- client
server: --> PUT_DONE --> client
(selftest reply)
server: --> PUT_REQ --> client
server: <-- PUT_NOACK <-- client

WRITE
(selftest req)
server: <-- PUT_REQ <-- client
server: --> PUT_NOACK --> client
(selftest bulk)
server: --> GET_REQ --> client
server: --> GET_DONE --> client
(selftest reply)
server: --> PUT_REQ --> client
server: <-- PUT_NOACK <-- client

So we normally see "read" performance a little lower than "write" performance with the same tuning parameters.
But this can't explain why "read" performance dropped significantly as the number of clients increased:

Data from your email:
-----------------------------------------------------
Intel (server) <-> Intel (client)
(2 x Intel E5620 2.4GHz, 8 core CPU, 24GB memory)
#client write(GB/sec) read(GB/sec)
1 2.0 2.4
2 2.6 2.4
3 3.2 2.2
4 3.2 2.0
5 3.2 2.0

Intel (server) <-> AMD (client)
(2 x Intel E5620 2.4GHz, 24GB memory) - ( 4 x AMD Opteron 6174, 8 core CPU, 128GB memory)
#client write(GB/sec) read(GB/sec)
1 1.2 0.9
2 1.4 1.9
3 2.9 2.3
4 3.1 2.2
5 3.1 2.1
6 3.1 2.0
7 3.1 1.9
8 3.1 1.8
9 3.1 1.8
10 3.1 1.7

I noticed the default "credits" of o2iblnd is a little low, so it might be worth trying higher values on both client and server:
ko2iblnd map_on_demand=16 ntx=1024 credits=512 peer_credits=32
Though I'm not sure how much it can help.
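
Applied on every node, that would look something like the following (a sketch; assuming the usual /sys/module path is available to check the loaded values):

  # /etc/modprobe.conf on both clients and servers, before loading ko2iblnd
  options ko2iblnd map_on_demand=16 ntx=1024 credits=512 peer_credits=32

  # after reloading the modules, confirm the running values
  cat /sys/module/ko2iblnd/parameters/credits
  cat /sys/module/ko2iblnd/parameters/peer_credits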

As I said in my mail, I think qib_sdma_verbs_send is still suspicious because it holds spin_lock_irqsave the whole time, which could have an impact on performance. I hope we can get some help from the QLogic engineers.

Regards
Liang

Comment by Peter Jones [ 13/Jun/11 ]

Ihara

Any update on this?

Thanks

Peter

Comment by Shuichi Ihara (Inactive) [ 13/Jun/11 ]

We finally replaced the QLogic HCAs with Mellanox, so at this moment it's OK to close this. We still can't achieve the same numbers on the QLogic HCA that I got on Mellanox, though.

Comment by Peter Jones [ 13/Jun/11 ]

ok, thanks Ihara
