[LU-3570] Implement CRC32C with PCLMULQDQ instructions Created: 09/Jul/13 Updated: 04/Nov/13 Resolved: 04/Nov/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Any processor with the pclmulqdq instruction. |
||
| Rank (Obsolete): | 8998 |
| Description |
|
Today's modern processors contain instruction sets that can be used to accelerate CRC32C. SSE4.2 introduced a new crc32c instruction; not all processors have that instruction, but many do support pclmulqdq. This work |
| Comments |
| Comment by James A Simmons [ 09/Jul/13 ] | |||||||||||||||
| Comment by Andreas Dilger [ 09/Jul/13 ] | |||||||||||||||
|
Color me confused, but this should already be in the Lustre code. The Lustre implementation predates that in the upstream kernel. | |||||||||||||||
| Comment by James A Simmons [ 09/Jul/13 ] | |||||||||||||||
|
That is the crc32 implementation, not the crc32c. Each uses a different polynomial. | |||||||||||||||
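To make the polynomial difference concrete, here is a minimal bit-at-a-time sketch in C. The function name `crc_bitwise` and the macro names are illustrative and do not come from the Lustre or kernel sources; the two constants are the commonly published reflected forms of the CRC-32 (IEEE 802.3) and CRC-32C (Castagnoli) polynomials.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected (LSB-first) forms of the two generator polynomials. */
#define CRC32_POLY_REFLECTED  0xEDB88320u /* crc32  (IEEE 802.3)  */
#define CRC32C_POLY_REFLECTED 0x82F63B78u /* crc32c (Castagnoli)  */

/*
 * Bit-at-a-time CRC. Far slower than table or hardware versions, but it
 * shows that the polynomial is the only thing that differs between the two.
 */
static uint32_t crc_bitwise(uint32_t poly, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;            /* conventional initial value */

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
    return ~crc;                           /* conventional final inversion */
}
```

With the standard check string "123456789", `crc_bitwise(CRC32_POLY_REFLECTED, ...)` gives 0xCBF43926 and `crc_bitwise(CRC32C_POLY_REFLECTED, ...)` gives 0xE3069283, so an accelerated crc32 routine cannot stand in for crc32c.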
| Comment by Andreas Dilger [ 09/Jul/13 ] | |||||||||||||||
|
How does the performance of this PCLMULQDQ crc32c implementation compare to using the crc32c instruction added in Nehalem CPUs? That is what we used originally in 2.2 (http://review.whamcloud.com/1009), but it was replaced by the CryptoAPI code in 2.3 (http://review.whamcloud.com/2586). The performance information is printed to the Lustre debug log with "D_INFO" priority at startup time (though I wouldn't object to it being moved to "D_CONFIG" since it is only printed at startup). | |||||||||||||||
| Comment by Alexander Boyko [ 10/Jul/13 ] | |||||||||||||||
|
>SSE4.2 introduced a new crc32c instruction but not all processors contain that instruction but many do support pclmulqdq. This is wrong for the presented patch, because it uses both the crc32 instruction and pclmulqdq, so processors that do not support crc32 cannot use this patch. I also do not see a CPU that supports pclmulqdq but not crc32; could you point me to one? The main goal of this code is to process 3 data blocks in parallel with the crc32q instruction and then combine the results with pclmulqdq (the Intel docs had a different implementation of combining the results). A year ago I played with this code and did not see the 2.4 times speedup I expected in user mode. In the kernel it was slower than the generic crc32c. Andreas is right; could you compare the performance of the newest crc32c implementation against the one previously used by Lustre (modprobe crc32c)? | |||||||||||||||
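The "three blocks, then combine" idea can be sketched in portable C. This is an illustration only, not the patch's code: all function names are hypothetical, the per-block CRC is a plain bitwise loop instead of crc32q, and the combine step advances each partial CRC over zero bytes in a loop, where an accelerated version would presumably use a carry-less multiply by a precomputed constant. It relies on the CRC register update being linear over GF(2).

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Raw CRC-32C register update: no initial value or final inversion. */
static uint32_t crc32c_raw(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
    }
    return crc;
}

/* Advance a register as if `len` zero bytes followed (software stand-in
 * for the multiply-by-constant step). */
static uint32_t crc32c_shift(uint32_t crc, size_t len)
{
    const uint8_t zero = 0;

    while (len--)
        crc = crc32c_raw(crc, &zero, 1);
    return crc;
}

/* Reference: one pass over the whole buffer. */
static uint32_t crc32c_single_pass(const uint8_t *buf, size_t len)
{
    return ~crc32c_raw(0xFFFFFFFFu, buf, len);
}

/* Three independent block CRCs, combined afterwards. */
static uint32_t crc32c_three_way(const uint8_t *buf, size_t len)
{
    size_t n0 = len / 3, n1 = len / 3, n2 = len - n0 - n1;

    /* Only the first block carries the initial value; the others start at 0. */
    uint32_t c0 = crc32c_raw(0xFFFFFFFFu, buf, n0);
    uint32_t c1 = crc32c_raw(0, buf + n0, n1);
    uint32_t c2 = crc32c_raw(0, buf + n0 + n1, n2);

    /* Shift each partial result past the bytes that follow it, then XOR. */
    uint32_t raw = crc32c_shift(c0, n1 + n2) ^ crc32c_shift(c1, n2) ^ c2;

    return ~raw;
}
```

The three `crc32c_raw` calls have no data dependency on each other, which is what lets three crc32q streams run in parallel on real hardware; `crc32c_three_way` and `crc32c_single_pass` return the same value for any input.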
| Comment by James A Simmons [ 10/Jul/13 ] | |||||||||||||||
|
Alex, it is the AMD Interlagos (Bulldozer) processors that lack the crc32c instruction but do have pclmulqdq. Before my patch I get AMD Opteron(TM) Processor 6274 @ 2.2GHz stepping 2 Lustre: Crypto hash algorithm null speed = -1 MB/s After my patch I get Lustre: Crypto hash algorithm null speed = -1 MB/s Yes, the next patch will test if crc32c is present and use that instead. You may also be wondering why not use crc32 instead of crc32c. We have 19K clients but only 96 OSSes. On the OSS hardware crc32c is much faster. | |||||||||||||||
| Comment by Keith Mannthey (Inactive) [ 10/Jul/13 ] | |||||||||||||||
Lustre: Crypto hash algorithm sha384 speed = -1 MB/s ? -1 MB/s is quite a result. Is this just not tested? | |||||||||||||||
| Comment by Alexander Boyko [ 11/Jul/13 ] | |||||||||||||||
|
James, this is
the table implementation; to use hw crc32c you need to load the crc32c-intel module before modprobe libcfs. Crypto hash algorithm adler32 speed = 2124 MB/s Crypto hash algorithm crc32 speed = 2057 MB/s ..... Crypto hash algorithm crc32c speed = 2203 MB/s If your patch can run with the crc32 instruction, you can load crc32c-intel, because both are based on the same CPU instruction: crc32 (for the crc32c polynomial). | |||||||||||||||
| Comment by James A Simmons [ 11/Jul/13 ] | |||||||||||||||
|
There is no SSE4.2 on the Bulldozer processor. I have loaded crc32c-intel and it does not work. The crc32c instruction does not exist on the Bulldozer. For static int __init crc32c_intel_mod_init(void) -ENODEV is returned. | |||||||||||||||
| Comment by Alexander Boyko [ 11/Jul/13 ] | |||||||||||||||
|
From your patch .rept 128-1 196 .altmacro 197 LABEL crc_ %i 198 .noaltmacro 199 » crc32q -i*8(block_0), crc_init 200 » crc32q -i*8(block_1), crc1 201 » crc32q -i*8(block_2), crc2 202 » i=(i-1) 203 .endr
I think you mean crc32 instruction. So I don`t understand how this patch works on your CPU, probably the next check is invalid if (cpu_has_xmm4_2) and Bulldozer does not support xmm4_2 instr. set, and have support crc32 instruction. | |||||||||||||||
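One way to settle what a given CPU reports is to read the CPUID feature bits directly from user space. The sketch below is a hedged illustration, not the kernel's cpu_has_xmm4_2 macro: the helper names are made up, and the bit positions (CPUID leaf 1, ECX bit 1 for PCLMULQDQ and bit 20 for SSE4.2, which includes the crc32 instruction) follow the vendor CPUID documentation. `__get_cpuid` comes from the GCC/Clang `<cpuid.h>` header and is compiled only on x86 targets.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 1, ECX feature bits. */
#define CPUID_ECX_PCLMULQDQ (1u << 1)
#define CPUID_ECX_SSE4_2    (1u << 20)  /* SSE4.2 includes the crc32 instruction */

static bool ecx_has_pclmulqdq(uint32_t ecx)
{
    return (ecx & CPUID_ECX_PCLMULQDQ) != 0;
}

static bool ecx_has_sse4_2(uint32_t ecx)
{
    return (ecx & CPUID_ECX_SSE4_2) != 0;
}

/* Returns ECX of CPUID leaf 1, or 0 on non-x86 builds or if the leaf
 * is unavailable. */
static uint32_t read_cpuid_leaf1_ecx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return ecx;
#endif
    return 0;
}
```

Calling `ecx_has_sse4_2(read_cpuid_leaf1_ecx())` and `ecx_has_pclmulqdq(read_cpuid_leaf1_ecx())` on the machine in question shows whether both features are advertised, independent of which kernel module ends up loaded.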
| Comment by Alexander Boyko [ 11/Jul/13 ] | |||||||||||||||
|
There is no crc32c instruction, only crc32 for crc32c polynomial. | |||||||||||||||
| Comment by Alexander Boyko [ 11/Jul/13 ] | |||||||||||||||
|
From wiki | |||||||||||||||
| Comment by James A Simmons [ 11/Jul/13 ] | |||||||||||||||
|
Interesting. You are right, I overlooked the crc32q instruction in the assembly code. I wonder why the crc32c-intel module fails to load then? I will try commenting out the cpu_has_xmm4_2 check in crc32c-intel.c and see if it runs. | |||||||||||||||
| Comment by James A Simmons [ 11/Jul/13 ] | |||||||||||||||
|
Okay, I found a Sandy Bridge system that also has this problem. I put a debug message in crc32c-intel. As you can see, it does detect and load the module. For some reason the software implementation is used over the hardware one. [ 2591.132551] Loaded crc32c-intel module | |||||||||||||||
| Comment by James A Simmons [ 12/Jul/13 ] | |||||||||||||||
|
Alex, you were right. The AMD Bulldozer does have the crc32 instruction. I got it to work the same way I managed to get it to work on the Sandy Bridge machine. I found from debugging that both the crc32c and crc32c-intel modules were being loaded, but there is a bug in the way the RHEL6 kernel handles which module to use. In the case of my Sandy Bridge machine, the software crc32c was being picked over the Intel one. By default RHEL6 builds the crc32c code into the kernel and crc32c-intel as a module. | |||||||||||||||
| Comment by Alexander Boyko [ 15/Jul/13 ] | |||||||||||||||
|
Could you show performance results crc32c-intel vs | |||||||||||||||
| Comment by James A Simmons [ 16/Jul/13 ] | |||||||||||||||
|
Here are the results I got. AMD Bulldozer, with patch: Lustre: Crypto hash algorithm crc32c speed = 1937 MB/s; with crc32c-intel: Lustre: Crypto hash algorithm crc32c speed = 1458 MB/s. Intel Xeon E5-2603 with patch with crc32c-intel: Give it a try. To make it work you need to remove the crc32c-intel | |||||||||||||||
| Comment by Alexander Boyko [ 17/Jul/13 ] | |||||||||||||||
|
James, I think the loading problem was caused by the Intel module and your patch having the same .cra_priority value, 200. I have combined the results of the checksum algorithms.
| |||||||||||||||
| Comment by Alexander Boyko [ 17/Jul/13 ] | |||||||||||||||
|
This is very strange: pclmul for the crc32 polynomial is faster than crc32c (h/w crc32c + pclmul) on the AMD CPU, but on the Intel CPU it is twice as slow. In theory, crc32c could reach the same result as crc32-pclmul (on AMD) if it were implemented using only the pclmul instruction, like crc32-pclmul. | |||||||||||||||
| Comment by Alexander Boyko [ 17/Jul/13 ] | |||||||||||||||
|
Should we add this patch to Lustre if it gives 30% faster crc32c only for AMD? | |||||||||||||||
| Comment by James A Simmons [ 18/Jul/13 ] | |||||||||||||||
|
I think it could be worthwhile to add this patch. We could make it a module option that is turned off by default. Alex, Andreas, what do you think? | |||||||||||||||
| Comment by Andreas Dilger [ 19/Jul/13 ] | |||||||||||||||
|
Are the crc32c-intel and crc32c-pclmul implementations independent, or is it one or the other? The relative performance difference shouldn't matter, since the OST and client will pick the fastest checksum that is available. | |||||||||||||||
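The "pick the fastest available checksum" behaviour can be pictured with a small C sketch. Everything here is hypothetical (the struct, the field names, the `pick_fastest` helper) and it is not Lustre's negotiation code. The example speeds are the adler32/crc32/crc32c figures quoted earlier in this thread, and treating a negative speed as "no measurement" is an assumption based on the -1 MB/s lines in the logs above.

```c
#include <stddef.h>

/* Hypothetical per-algorithm record; not a Lustre structure. */
struct hash_speed {
    const char *name;
    int speed_mbs;       /* MB/s; negative is assumed to mean "not measured" */
    int peer_supports;   /* nonzero if the other side also offers it */
};

/* Example table using speeds reported earlier in this ticket. */
static const struct hash_speed example_speeds[] = {
    { "adler32", 2124, 1 },
    { "crc32",   2057, 1 },
    { "crc32c",  2203, 1 },
};

/* Return the fastest measured algorithm both sides support, or NULL. */
static const struct hash_speed *pick_fastest(const struct hash_speed *tbl,
                                             size_t n)
{
    const struct hash_speed *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].speed_mbs < 0 || !tbl[i].peer_supports)
            continue;
        if (best == NULL || tbl[i].speed_mbs > best->speed_mbs)
            best = &tbl[i];
    }
    return best;
}
```

Under this model a slower crc32c provider does no harm: if another checksum measures faster on a given node, that one is selected instead.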
| Comment by James A Simmons [ 22/Jul/13 ] | |||||||||||||||
|
The Linux crypto layer picks either crc32c-intel or crc32c-pclmul, but not both at the same time. The choice is based on the cra_priority field rather than on measured performance. That is why I suggested a module flag to turn crc32c pclmul handling on. In the 3.10+ kernels a hybrid approach is used, with the Intel crc32 instruction for smaller data sets and pclmul for larger data sets. I still like the module flag idea better. It can be up to the deployment team to determine which is better on the hardware they are running. | |||||||||||||||
| Comment by Alexander Boyko [ 22/Jul/13 ] | |||||||||||||||
|
From my point of view, it is better to have a separate module as a workaround for old kernels, and not to add this patch to the Lustre sources. | |||||||||||||||
| Comment by James A Simmons [ 25/Sep/13 ] | |||||||||||||||
|
Just updated the patch again to address several concerns. First, we need this patch in the Lustre source base, since most people will be running patchless clients, which would otherwise lack this performance enhancement. The patch now tests whether the kernel is version 3.10 or above and, if so, does not build any of Lustre's pclmulqdq code, since it is already included upstream. Second, I changed the cra_priority value to 150, which is higher than the software crc32c value of 100 but lower than the crc32c-intel value of 200. This way crc32c-intel will always be preferred. |
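The effect of the three priority values can be modelled with a short C sketch. This is a simplified illustration, not the kernel crypto API: the struct, the table and `lookup_alg` are hypothetical, the driver names are labels for the three providers discussed in this ticket, and the tie-breaking rule (the earlier entry wins on equal priority) is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical registry entry; a stand-in for a registered algorithm. */
struct alg_entry {
    const char *alg_name;     /* what callers ask for, e.g. "crc32c" */
    const char *driver_name;  /* which provider implements it */
    int priority;             /* models cra_priority */
};

/* Priorities as described in the comment above: 100, 150, 200. */
static const struct alg_entry crc32c_providers[] = {
    { "crc32c", "crc32c-generic", 100 },  /* software table implementation */
    { "crc32c", "crc32c-pclmul",  150 },  /* this patch */
    { "crc32c", "crc32c-intel",   200 },  /* hardware crc32 instruction */
};

/* Return the highest-priority provider of alg_name, or NULL if none. */
static const struct alg_entry *lookup_alg(const struct alg_entry *tbl,
                                          size_t n, const char *alg_name)
{
    const struct alg_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(tbl[i].alg_name, alg_name) != 0)
            continue;
        /* Strictly greater: on a tie the earlier entry is kept. */
        if (best == NULL || tbl[i].priority > best->priority)
            best = &tbl[i];
    }
    return best;
}
```

With these values a request for "crc32c" resolves to crc32c-intel whenever it is registered, and falls back to crc32c-pclmul and then the software version otherwise. Two providers sharing priority 200, as in the earlier revision of the patch, would leave the outcome dependent on ordering.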