[LU-58] poor LNet performance over QLogic HCAs Created: 02/Feb/11 Updated: 21/Sep/11 Resolved: 13/Jun/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10244 |
| Description |
|
We have been testing QLogic HCAs for several customers and have run into an issue at our lab where rdma_bw is able to get 2.5GB/s or so, but lnet_selftest only gets 1GB/s. Actually I have gotten as much as ~1200MB/s, which leads me to believe it's capping out at 10Gb/s. Have you ever seen this? Is there anything we can do to debug this from a ko2iblnd point of view? We have already engaged QLogic and they can't find anything wrong. |
| Comments |
| Comment by Cliff White (Inactive) [ 02/Feb/11 ] |
|
How many CPUs does the system have? |
| Comment by Kit Westneat (Inactive) [ 02/Feb/11 ] |
|
2 sockets, 8 cores total.
model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz |
| Comment by Liang Zhen (Inactive) [ 02/Feb/11 ] |
|
Could you post your test script here? I would like to see the details of the test. |
| Comment by Kit Westneat (Inactive) [ 02/Feb/11 ] |
|
Here is the rdma_bw test I ran:
[root@oss0 ~]# rdma_bw oss1-ib0
22891: Bandwidth peak (#19 to #999): 2633.54 MB/sec
For the LNET test, I'm using a wrapper script to call lst; I'll attach it. The invocations were:
[root@oss1 ~]# lnet_selftest.sh -c "192.168.99.10[1,2]@o2ib" -s 192.168.99.103@o2ib -w
[root@oss1 ~]# lnet_selftest.sh -c "192.168.99.101@o2ib" -s 192.168.99.103@o2ib -w
[root@oss1 ~]# lnet_selftest.sh -c "192.168.99.102@o2ib" -s 192.168.99.103@o2ib -w |
| Comment by Kit Westneat (Inactive) [ 02/Feb/11 ] |
|
driver script for lst |
| Comment by Liang Zhen (Inactive) [ 03/Feb/11 ] |
|
Kit, we do have an SMP performance issue with lnet_selftest (we will have a patch for this in a few weeks), but I'm not sure whether 2 x 4 cores would hit it.
Thanks |
| Comment by Lai Siyao [ 08/Feb/11 ] |
|
Peter, I will talk with Liang and work on this. |
| Comment by Liang Zhen (Inactive) [ 09/Feb/11 ] |
|
Kit, another question here is about NUMA: is NUMA enabled on your system (2 nodes or 1 node)? Thanks |
| Comment by Kit Westneat (Inactive) [ 09/Feb/11 ] |
|
Hi, sorry I haven't had a lot of time to do testing recently. It looks like NUMA is enabled (I don't know very much about NUMA yet):
available: 2 nodes (0-1)
Thanks,
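(The "available: 2 nodes (0-1)" line above looks like the first line of numactl output; assuming numactl is installed, the full NUMA layout can be dumped with the command below.)
numactl --hardware
|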
| Comment by Christopher Morrone [ 11/Feb/11 ] |
|
FYI, LLNL also had trouble getting good performance out of our QLogic cards with LNet. The main trouble we found is that while they implement the verbs interface for RDMA calls, the operations are not actually RDMA; in other words, "RDMA" operations with the QLogic cards are not zero-copy. There were other tweaks we made that got performance a little higher, but ultimately the lack of true RDMA support on the card was the limiting factor.
Our IB guy is out today, and I don't remember the details of what he did to tweak the QLogic performance. I think that the in-kernel verbs interface only has a single QLogic ring buffer by default, and I believe that he increased that to 4 and we saw some benefit. |
| Comment by Ira Weiny (Inactive) [ 11/Feb/11 ] |
|
Disclaimer: we are running the 7340 card, so if you have another card I don't know whether this will apply. I have forgotten some of the details, but check your driver for the following options. Here are the settings we are using:
options ib_qib krcvqs=4
The krcvqs option increases the number of receive queues used by the driver. We have 12 cores/node and the card has 18 contexts; 1 of those is used for something I don't remember. For the rest, QLogic recommends allocating 1 per core, so that left us with 5 (you will have more). We played around and 4 seemed to give the best performance. However, this required a patch to the module to make it actually use all 4 contexts; QLogic has the final patch and should be able to provide it.
The rcvhdrcnt option increases a header descriptor count (again, I would have to dig up the details). Regardless, this option was another patch to the driver and is now in the upstream kernel. We came across the need for it when we got hangs from the card. QLogic fixed the hang with another patch, so you might need to make sure that is available as well. Anyway, during all that testing we found performance was a bit better with rcvhdrcnt set higher, so we left it.
I will check with QLogic to make sure, but I don't see why you could not pull our version of the driver. I have a git tree which builds stand-alone against a current RHEL5 kernel (it may need other modifications for other kernels). Let me know if you would like that.
Hope this helps,
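As a reference point, the driver settings Ira describes would end up as a modprobe options line roughly like the sketch below (only krcvqs=4 is taken from the comment above; rcvhdrcnt was also raised in their setup, but its value is not given here, so it is omitted rather than guessed):
# ib_qib options line, e.g. in modprobe.conf
# krcvqs=4 -> 4 kernel receive queues, as described above
options ib_qib krcvqs=4
|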
| Comment by Liang Zhen (Inactive) [ 11/Feb/11 ] |
|
Chris, Ira, thanks for your information. I've got a branch to support NUMA better, but the patch for lnet_selftest is still in progress (I have an old patch for lnet_selftest, but it no longer applies to any branch). Most of the work in the other modules has been done; I will post it here when I finish the lnet_selftest patch. Kit, thanks again. |
| Comment by Kit Westneat (Inactive) [ 14/Feb/11 ] |
|
Unfortunately I don't currently have access to the systems, so I'm unable to do more testing. Here is the modprobe.conf line I was using:
We're using the latest engineering build of the qib driver. I had the QLogic folks on the system looking at it, but they couldn't see anything particularly wrong. The severe performance difference between rdma_bw and lst is what makes me think it's an issue at the LNet level; I've never seen such a large difference. I'll let you know how the NUMA testing goes when I'm able to get back on the system. |
| Comment by Kit Westneat (Inactive) [ 15/Feb/11 ] |
|
Ira, Chris, In your tests, did lnet_selftest performance match rdma_bw performance or did you see a mismatch? Did you see Lustre performance more on the level of lnet_selftest? I just want to see if my experience matches yours. Thanks, |
| Comment by Shuichi Ihara (Inactive) [ 20/Feb/11 ] |
|
These are various InfiniBand benchmarks comparing Mellanox and QLogic. |
| Comment by Shuichi Ihara (Inactive) [ 25/Feb/11 ] |
|
Ira, have you tried rdma_bw or LNET performance tests before?

[LNet Bandwidth of s1]
[R] Avg: 1891.21 MB/s Min: 1891.21 MB/s Max: 1891.21 MB/s
[W] Avg: 0.29 MB/s Min: 0.29 MB/s Max: 0.29 MB/s
[LNet Bandwidth of c1]
[R] Avg: 0.07 MB/s Min: 0.07 MB/s Max: 0.07 MB/s
[W] Avg: 460.37 MB/s Min: 460.37 MB/s Max: 460.37 MB/s
[LNet Bandwidth of c2]
[R] Avg: 0.06 MB/s Min: 0.06 MB/s Max: 0.06 MB/s
[W] Avg: 387.04 MB/s Min: 387.04 MB/s Max: 387.04 MB/s
[LNet Bandwidth of c3]
[R] Avg: 0.08 MB/s Min: 0.08 MB/s Max: 0.08 MB/s
[W] Avg: 533.66 MB/s Min: 533.66 MB/s Max: 533.66 MB/s
[LNet Bandwidth of c4]
[R] Avg: 0.07 MB/s Min: 0.07 MB/s Max: 0.07 MB/s
[W] Avg: 466.42 MB/s Min: 466.42 MB/s Max: 466.42 MB/s

[LNet Bandwidth of s1]
[R] Avg: 2201.17 MB/s Min: 2201.17 MB/s Max: 2201.17 MB/s
[W] Avg: 0.34 MB/s Min: 0.34 MB/s Max: 0.34 MB/s
[LNet Bandwidth of c1]
[R] Avg: 0.07 MB/s Min: 0.07 MB/s Max: 0.07 MB/s
[W] Avg: 485.16 MB/s Min: 485.16 MB/s Max: 485.16 MB/s
[LNet Bandwidth of c2]
[R] Avg: 0.07 MB/s Min: 0.07 MB/s Max: 0.07 MB/s
[W] Avg: 471.09 MB/s Min: 471.09 MB/s Max: 471.09 MB/s
[LNet Bandwidth of c3]
[R] Avg: 0.10 MB/s Min: 0.10 MB/s Max: 0.10 MB/s
[W] Avg: 668.38 MB/s Min: 668.38 MB/s Max: 668.38 MB/s
[LNet Bandwidth of c4]
[R] Avg: 0.10 MB/s Min: 0.10 MB/s Max: 0.10 MB/s
[W] Avg: 627.77 MB/s Min: 627.77 MB/s Max: 627.77 MB/s

Regarding the RDMA benchmark, we did some QLogic tuning and could get 3GB/sec as a peak, but it is still low when the message size is big (e.g. 512K, 1M, 2M...) compared with the Mellanox HCA. So, I wonder what RDMA and LNET performance you are getting.

#bytes   #iterations  BW peak[MB/sec]  BW average[MB/sec]
2        5000         0.68             0.67
4        5000         1.39             1.36
8        5000         2.77             2.73
16       5000         5.61             5.44
32       5000         11.19            10.88
64       5000         22.40            21.88
128      5000         45.89            43.70
256      5000         92.18            89.20
512      5000         194.19           187.29
1024     5000         397.34           370.98
2048     5000         798.60           776.63
4096     5000         1343.20          1281.17
8192     5000         1920.83          1865.76
16384    5000         2588.50          2537.42
32768    5000         3159.68          3153.84
65536    5000         3162.81          3162.80
131072   5000         3075.97          3056.87
262144   5000         3011.65          2432.95
524288   5000         2948.59          2757.53
1048576  5000         2910.32          2754.89
2097152  5000         2884.98          2761.72
4194304  5000         2860.59          2769.83
8388608  5000         2764.13          2667.08 |
| Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ] |
|
Hello Liang,
After some QLogic tuning, we got 9.6GB/sec write from 4 OSSs (2.4-2.5GB/sec per OSS). This is not perfect, but according to the QLogic RDMA benchmark (~2.7GB/sec), the number is reasonable. However, the read is still slow on Lustre. The problem seems to be that the kiblnd_sd_XX threads are consuming a lot of CPU. Please see the "top" output below, taken during the benchmark. I'm also collecting oprofile data and will post it. Could you have a look at them, please?

top - 22:38:03 up 2 days, 35 min, 2 users, load average: 48.77, 42.98, 31.63
Tasks: 512 total, 51 running, 461 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.8%us, 84.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 13.8%hi, 0.7%si, 0.0%st
Mem: 24545172k total, 24328468k used, 216704k free, 3188k buffers
Swap: 2096376k total, 612k used, 2095764k free, 23775256k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19702 root      25   0     0    0    0 R 80.0  0.0 129:29.76 kiblnd_sd_06
19698 root      25   0     0    0    0 R 79.4  0.0 129:37.71 kiblnd_sd_02
19701 root      25   0     0    0    0 R 78.4  0.0 129:43.95 kiblnd_sd_05
19703 root      25   0     0    0    0 R 77.8  0.0 129:28.93 kiblnd_sd_07
19700 root      25   0     0    0    0 R 77.5  0.0 130:07.11 kiblnd_sd_04
19697 root      25   0     0    0    0 R 77.2  0.0 129:00.45 kiblnd_sd_01
19699 root      25   0     0    0    0 R 66.3  0.0 129:06.17 kiblnd_sd_03
19696 root      25   0     0    0    0 R 49.6  0.0 129:31.55 kiblnd_sd_00
29257 root      15   0 24052 4744 1576 S 16.0  0.0   0:47.91 oprofiled
  564 root      10  -5     0    0    0 S  5.4  0.0  91:42.85 kswapd0
  565 root      10  -5     0    0    0 S  2.9  0.0  25:56.07 kswapd1
19965 root      15   0     0    0    0 R  2.6  0.0   7:54.08 ll_ost_io_08
20022 root      15   0     0    0    0 S  2.6  0.0   7:17.86 ll_ost_io_65
20037 root      15   0     0    0    0 R  2.6  0.0   7:24.11 ll_ost_io_80
20055 root      15   0     0    0    0 S  2.6  0.0   7:12.28 ll_ost_io_98 |
| Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ] |
|
opreport output; it seems ib_qib (the QLogic driver) has the highest sample count. |
| Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ] |
|
"opreport -l" output. |
| Comment by Liang Zhen (Inactive) [ 27/Feb/11 ] |
|
Ihara, could you run opreport like this: "opreport -l -p /lib/modules/`uname -r` > output_file" so we can see symbols of Lustre modules? Thanks |
| Comment by Shuichi Ihara (Inactive) [ 27/Feb/11 ] |
|
Liang, attached is output of "opreport -l -p /lib/modules/`uname -r`". |
| Comment by Liang Zhen (Inactive) [ 28/Feb/11 ] |
|
Ihara, thanks, I have a few more questions:
I think we probably can't help too much with the high CPU load if QLogic doesn't have true RDMA. Also, would it be possible for you to run lnet_selftest (read and write separately, 2 or more clients with one server, concurrency=8 and brw test size=1M) to see LNet performance? Thanks
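For reference, a minimal lst sequence along the lines Liang is asking for might look like the sketch below (a sketch only, not the attached wrapper script; the NIDs are the ones quoted earlier in this ticket, and the session/group/batch names are arbitrary; run it once with "brw read" and once with "brw write" for the two directions):
# the lnet_selftest module must be loaded on all participating nodes first
export LST_SESSION=$$
lst new_session brw_1m
lst add_group servers 192.168.99.103@o2ib
lst add_group clients 192.168.99.101@o2ib 192.168.99.102@o2ib
lst add_batch bulk
lst add_test --batch bulk --concurrency 8 --from clients --to servers brw read size=1M
lst run bulk
lst stat servers clients     # Ctrl-C after enough samples
lst stop bulk
lst end_session
|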
| Comment by Shuichi Ihara (Inactive) [ 28/Feb/11 ] |
|
Hi Liang,
We have 40 clients (AMD, 48 cores, 128GB memory per client). The servers are Intel Westmere, 8 cores, 24GB memory. Each client's I/O throughput is also slow (500MB/sec per client); I think this is another problem (many-cores related). However, we are not focusing on that problem, we just need aggregate server throughput for write/read.
Here are IOR results with 40 clients and 4 OSSs: we got 9.6GB/sec for write, but only 6.7GB/sec for read. The opreport I sent is what I got on the OSS during the read IO testing, so your assumption is correct.
Here is the LNET write/read testing from 8 clients and a single server. I didn't try map_on_demand=32 yet; let me try it soon and I will update you.
|
| Comment by Shuichi Ihara (Inactive) [ 28/Feb/11 ] |
|
Liang, with map_on_demand=32 it shows better performance:
Max Write: 9723.80 MiB/sec (10196.15 MB/sec)
Is it worth trying smaller values for map_on_demand (e.g. map_on_demand=16) to see what happens? We want 9.4GB/sec for read/write. By the way, what does map_on_demand mean? Thanks |
| Comment by Liang Zhen (Inactive) [ 28/Feb/11 ] |
|
Ihara, map_on_demand means that we enable FMR; map_on_demand=32 will use FMR for any RDMA > 32 * 4K (128K). So I suspect a smaller map_on_demand will help nothing unless the IO request size is < 128K. ko2iblnd doesn't use FMR by default; it just creates a global MR and premaps all memory, which is quick on some HCAs, especially for small IO requests, because we don't need to map again before the RDMA. However, I took a quick look at the QIB source code, and it looks heavy to send fragments one by one in qib_post_send->qib_post_one_send: for a 1M IO request, qib_post_one_send will be called 256 times. So I think if we enable FMR and map the pages into one fragment, it will reduce a lot of overhead. Could you please gather oprofiles with map_on_demand enabled, so we can try to find out whether there is more we can improve?
Regards
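For reference, the corresponding module setting is a one-line sketch, using the value already tried in this ticket:
# enable FMR for RDMA larger than 32 * 4K = 128K
options ko2iblnd map_on_demand=32
|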
| Comment by Shuichi Ihara (Inactive) [ 28/Feb/11 ] |
|
Liang, thanks for the explanation. I've just attached the output of "opreport -l -p /lib/modules/`uname -r`" after setting ko2iblnd map_on_demand=32. I collected this data during the read testing. |
| Comment by Liang Zhen (Inactive) [ 01/Mar/11 ] |
|
I suspect ib_post_send does a memory copy for QIB; if so, probably the only way is to move ib_post_send out from under the o2iblnd spinlock (conn:ibc_lock). It's not very easy because the credits system of o2iblnd relies on this spinlock. I need some time to think about it. Thanks |
| Comment by Liang Zhen (Inactive) [ 01/Mar/11 ] |
|
Hi Ihara, I guess I was wrong in the previous comment: there is no reason to call ib_post_send while holding ibc_lock. Although I'm not 100% sure, I think we only have this because o2iblnd inherited this piece of code from an old LND (iiblnd), which didn't allow re-entrant use of the same QP; that is not the case with OFED. If possible, could you please try this patch with and without map_on_demand and collect oprofiles? Liang |
| Comment by Shuichi Ihara (Inactive) [ 01/Mar/11 ] |
|
Liang, thanks. Attached is the oprofile I collected on the OSS when I ran the read benchmark without the map_on_demand setting, after applying your patch. |
| Comment by Liang Zhen (Inactive) [ 01/Mar/11 ] |
|
Hi Ihara, could you provide performance data as well? thanks |
| Comment by Shuichi Ihara (Inactive) [ 01/Mar/11 ] |
|
Liang, nope, it was slower than map_on_demand=32 without the patch. I'm going to run the benchmark again with map_on_demand=32 and the patch. |
| Comment by Shuichi Ihara (Inactive) [ 02/Mar/11 ] |
|
map_on_demand=32 with the patch was slower than map_on_demand=32 without the patch. Even though the ko2iblnd functions are reduced, the performance doesn't go up; instead, qib_sdma_verbs_send() now goes up really high. So, is fixing qib the only way to improve the performance? |
| Comment by Liang Zhen (Inactive) [ 03/Mar/11 ] |
|
Ihara,
READ WRITE
We normally see "read" performance a little lower than "write" performance with the same tuning parameters.
Data from your email: Intel (server) <-> AMD (client)
I noticed the default "credits" of o2iblnd is a little low, so it might be worth trying a higher value on both client & server:
As I said in my mail, I think qib_sdma_verbs_send is still suspicious because it holds spin_lock_irqsave all the time, which could have an impact on performance; I hope we can get some help from the QLogic engineers.
Regards
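A sketch of what such a tuning line could look like; the specific values Liang suggested did not survive in this ticket text, so the numbers below are placeholders only (credits and peer_credits are standard ko2iblnd module parameters):
# placeholder values, not the recommendation from this ticket
options ko2iblnd credits=256 peer_credits=16
|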
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Ihara, any update on this? Thanks, Peter |
| Comment by Shuichi Ihara (Inactive) [ 13/Jun/11 ] |
|
We finally replaced the QLogic HCA with Mellanox, so at this moment it's OK to close this. We still can't achieve the same numbers (which I got on Mellanox) on the QLogic HCA, though. |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
ok, thanks Ihara |