[LU-3570] Implement CRC32C with PCLMULQDQ instructions Created: 09/Jul/13  Updated: 04/Nov/13  Resolved: 04/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0

Type: New Feature Priority: Major
Reporter: James A Simmons Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Any processor with the pclmulqdq instruction.


Rank (Obsolete): 8998

 Description   

Today's processors contain instruction sets that can be used to accelerate CRC32C. SSE4.2 introduced a new crc32c instruction but not all processors contain that instruction but many do support pclmulqdq. This work backports the Linux 3.10 kernel's support for crc32c accelerated by pclmulqdq.



 Comments   
Comment by James A Simmons [ 09/Jul/13 ]

Patch at http://review.whamcloud.com/#/c/6927/

Comment by Andreas Dilger [ 09/Jul/13 ]

Color me confused, but this should already be in the Lustre code. The Lustre implementation predates that in the upstream kernel.

Comment by James A Simmons [ 09/Jul/13 ]

That is the crc32 implementation, not the crc32c. Each uses a different polynomial.
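To make the distinction concrete, here is a minimal bit-at-a-time sketch in portable C. It is not the Lustre or kernel code; the function name crc_bitwise and the macro names are made up for illustration. The only thing that changes between the two checksums is the polynomial constant.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected (LSB-first) forms of the two generator polynomials. */
#define CRC32_POLY_REFLECTED  0xEDB88320u /* CRC-32 (IEEE 802.3) */
#define CRC32C_POLY_REFLECTED 0x82F63B78u /* CRC-32C (Castagnoli) */

/* Hypothetical helper: bit-at-a-time CRC, parameterised by polynomial. */
static uint32_t crc_bitwise(uint32_t poly, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t crc = 0xFFFFFFFFu;	/* conventional initial value */

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;	/* conventional final XOR */
}
```

For the standard check string "123456789" this yields 0xCBF43926 with the crc32 polynomial and 0xE3069283 with the crc32c polynomial, so an accelerated crc32 cannot stand in for crc32c.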

Comment by Andreas Dilger [ 09/Jul/13 ]

How does the performance of this PCLMULQDQ crc32c implementation compare to using the crc32c instruction added in Nehalem CPUs? That is what we used originally in 2.2 (http://review.whamcloud.com/1009), but it was replaced by the CryptoAPI code in 2.3 (http://review.whamcloud.com/2586).

The performance information is printed to the Lustre debug log with "D_INFO" priority at startup time (though I wouldn't object to it being moved to "D_CONFIG" since it is only printed at startup).

Comment by Alexander Boyko [ 10/Jul/13 ]

>SSE4.2 introduced a new crc32c instruction but not all processors contain that instruction but many do support pclmulqdq.

This is wrong for the presented patch, because it contains both the crc32 instruction and pclmulqdq. So processors that do not support crc32 cannot use this patch. And I don't know of a CPU that supports pclmulqdq but does not support crc32. Could you point one out to me?

The main goal of this code is to process 3 data blocks in parallel with the crc32q instruction and then combine them with pclmulqdq (the Intel docs had a different implementation for combining the results). A year ago I played with this code and did not see the 2.4 times speedup I expected in user mode. In the kernel it was slower than generic crc32c. Andreas is right: could you compare the performance of the newest crc32c implementation with the one previously used by Lustre (modprobe crc32c)?
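The split-and-combine idea described above can be sketched in portable C. This is an illustration only, not the patch: the real code issues three independent crc32q streams and merges them with pclmulqdq, while this sketch uses a bitwise software CRC and a slow "append zero bytes" shift for the merge. All function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u /* reflected Castagnoli polynomial */

/* Raw register update: no initial value or final XOR applied. */
static uint32_t crc32c_raw(uint32_t state, const unsigned char *p, size_t len)
{
	while (len--) {
		state ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			state = (state >> 1) ^ (CRC32C_POLY & (0u - (state & 1u)));
	}
	return state;
}

/* Standard CRC32C of one buffer (init and final XOR of 0xFFFFFFFF). */
static uint32_t crc32c_sw(const unsigned char *p, size_t len)
{
	return crc32c_raw(0xFFFFFFFFu, p, len) ^ 0xFFFFFFFFu;
}

/*
 * Advance a CRC as if nbytes zero bytes followed it. This loop is
 * O(nbytes); it stands in for the carry-less multiply step.
 */
static uint32_t crc32c_shift(uint32_t crc, size_t nbytes)
{
	while (nbytes--)
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
	return crc;
}

/* crc(A || B) computed from crc(A), crc(B) and len(B). */
static uint32_t crc32c_combine(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
	return crc32c_shift(crc_a, len_b) ^ crc_b;
}

/* Checksum three equal-sized blocks independently, then merge them. */
static uint32_t crc32c_three_way(const unsigned char *p, size_t block_len)
{
	uint32_t c0 = crc32c_sw(p, block_len);
	uint32_t c1 = crc32c_sw(p + block_len, block_len);
	uint32_t c2 = crc32c_sw(p + 2 * block_len, block_len);

	return crc32c_combine(crc32c_combine(c0, c1, block_len), c2, block_len);
}
```

The three block checksums have no data dependency on each other, which is what lets a CPU overlap three crc32q instruction streams. The merge works because the CRC register update is linear over GF(2), so shifting crc(A) past len(B) zero bytes and XORing with crc(B) gives crc(A || B).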

Comment by James A Simmons [ 10/Jul/13 ]

Alex, it is the AMD Interlagos (Bulldozer) processors that lack the crc32c instruction but do have pclmulqdq. Before my patch I get

AMD Opteron(TM) Processor 6274 @ 2.2GHz stepping 2

Lustre: Crypto hash algorithm null speed = -1 MB/s
Lustre: Crypto hash algorithm adler32 speed = 670 MB/s
Lustre: Crypto hash algorithm crc32 speed = 2532 MB/s (crc32-pclmul)
Lustre: Crypto hash algorithm md5 speed = 309 MB/s
Lustre: Crypto hash algorithm sha1 speed = 118 MB/s
Lustre: Crypto hash algorithm sha256 speed = -1 MB/s
Lustre: Crypto hash algorithm sha384 speed = -1 MB/s
Lustre: Crypto hash algorithm sha512 speed = -1 MB/s
Lustre: Crypto hash algorithm crc32c speed = 292 MB/s

After my patch I get

Lustre: Crypto hash algorithm null speed = -1 MB/s
Lustre: Crypto hash algorithm adler32 speed = 671 MB/s
Lustre: Crypto hash algorithm crc32 speed = 2530 MB/s (crc32-pclmul)
Lustre: Crypto hash algorithm md5 speed = 311 MB/s
Lustre: Crypto hash algorithm sha1 speed = 118 MB/s
Lustre: Crypto hash algorithm sha256 speed = -1 MB/s
Lustre: Crypto hash algorithm sha384 speed = -1 MB/s
Lustre: Crypto hash algorithm sha512 speed = -1 MB/s
Lustre: Crypto hash algorithm crc32c speed = 1937 MB/s

Yes, the next patch will test if crc32c is present and use that instead. Also, you may be wondering why we do not use crc32 instead of crc32c. We have 19K clients but only 96 OSSes. On the OSS hardware crc32c is much faster.

Comment by Keith Mannthey (Inactive) [ 10/Jul/13 ]
Lustre: Crypto hash algorithm sha384 speed = -1 MB/s  ?

-1 MB/s is quite a result. Is this just not tested?

Comment by Alexander Boyko [ 11/Jul/13 ]

James, this is

Lustre: Crypto hash algorithm crc32c speed = 292 MB/s

the table implementation. To use hardware crc32c you need to load the crc32c-intel module before running modprobe libcfs.
The following results are from LU-1339, where crc32c uses the crc32c_intel module (crypto API).

Crypto hash algorithm adler32 speed = 2124 MB/s
Crypto hash algorithm crc32 speed = 2057 MB/s
.....
Crypto hash algorithm crc32c speed = 2203 MB/s

If your patch can run with the crc32 instruction, you can load crc32c-intel, because they are based on the same CPU instruction: crc32 (for the crc32c polynomial).

Comment by James A Simmons [ 11/Jul/13 ]

There is no SSE4.2 on the Bulldozer processor. I have loaded crc32c-intel and it does not work. The crc32c instruction does not exist on the Bulldozer. For

static int __init crc32c_intel_mod_init(void)
{
	if (cpu_has_xmm4_2)
		return crypto_register_shash(&alg);
	else
		return -ENODEV;
}

-ENODEV is returned.
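For readers who want to check what the CPU itself reports, the cpu_has_xmm4_2 test corresponds to the SSE4.2 feature bit in CPUID leaf 1. The following userspace sketch decodes the SSE4.2 and PCLMULQDQ bits from ECX. The helper names are hypothetical, and read_cpuid1_ecx assumes a GCC-compatible compiler on x86 that provides <cpuid.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 reports these features in ECX. */
#define CPUID1_ECX_PCLMULQDQ (1u << 1)
#define CPUID1_ECX_SSE4_2    (1u << 20)

/* Hypothetical helpers: decode feature bits from a leaf 1 ECX value. */
static bool ecx_has_pclmulqdq(uint32_t ecx)
{
	return (ecx & CPUID1_ECX_PCLMULQDQ) != 0;
}

static bool ecx_has_sse4_2(uint32_t ecx)
{
	return (ecx & CPUID1_ECX_SSE4_2) != 0;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Read ECX of CPUID leaf 1; returns 0 if the leaf is unavailable. */
static uint32_t read_cpuid1_ecx(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;
	return ecx;
}
#endif
```

On an x86 machine, passing read_cpuid1_ecx() to both helpers shows whether the processor advertises the crc32 instruction (via SSE4.2) and pclmulqdq independently of which kernel modules loaded.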

Comment by Alexander Boyko [ 11/Jul/13 ]

From your patch
libcfs/libcfs/linux/crc32c-pcl-intel-asm_64.S

.rept 128-1
.altmacro
LABEL crc_ %i
.noaltmacro
	crc32q   -i*8(block_0), crc_init
	crc32q   -i*8(block_1), crc1
	crc32q   -i*8(block_2), crc2
	i=(i-1)
.endr

>The crc32c instruction does not exist on the Bulldozer.

I think you mean the crc32 instruction. So I don't understand how this patch works on your CPU. Probably the following check is invalid

if (cpu_has_xmm4_2)

and Bulldozer does not support the xmm4_2 instruction set but does support the crc32 instruction.

Comment by Alexander Boyko [ 11/Jul/13 ]

There is no crc32c instruction, only the crc32 instruction, which uses the crc32c polynomial.

Comment by Alexander Boyko [ 11/Jul/13 ]

From wiki
The Bulldozer cores support most of the instruction sets implemented by Intel processors available at its introduction (including SSE4.1, SSE4.2, AES, CLMUL, and AVX) as well as new instruction sets proposed by AMD: ABM (Advanced Bit Manipulation), XOP and FMA4.

Comment by James A Simmons [ 11/Jul/13 ]

Interesting. You are right, I overlooked the crc32q instruction in the assembly code. I wonder why the crc32c-intel module fails to load then? I will try commenting out the cpu_has_xmm4_2 check in crc32c-intel.c and see if it runs.

Comment by James A Simmons [ 11/Jul/13 ]

Okay, I found a Sandy Bridge system that also has this problem. I put a debug message in crc32c-intel. As you can see, it does detect and load the module. For some reason the software implementation is used over the hardware one.

[ 2591.132551] Loaded crc32c-intel module
[ 2601.554615] LNet: HW CPU cores: 8, npartitions: 2
[ 2601.569640] alg: No test for crc32 (crc32-table)
[ 2601.574527] alg: No test for adler32 (adler32-zlib)
[ 2601.579638] alg: No test for crc32 (crc32-pclmul)
[ 2601.591562] Lustre: Crypto hash algorithm null speed = -1 MB/s
[ 2602.595505] Lustre: Crypto hash algorithm adler32 speed = 1478 MB/s
[ 2603.599936] Lustre: Crypto hash algorithm crc32 speed = 1426 MB/s
[ 2604.604344] Lustre: Crypto hash algorithm md5 speed = 277 MB/s
[ 2605.608237] Lustre: Crypto hash algorithm sha1 speed = 118 MB/s
[ 2606.612630] Lustre: Crypto hash algorithm sha256 speed = 77 MB/s
[ 2607.616921] Lustre: Crypto hash algorithm sha384 speed = 113 MB/s
[ 2608.621182] Lustre: Crypto hash algorithm sha512 speed = 113 MB/s
[ 2609.625538] Lustre: Crypto hash algorithm crc32c speed = 266 MB/s

Comment by James A Simmons [ 12/Jul/13 ]

Alex, you were right. The AMD Bulldozer does have the crc32 instruction. I got it to work the same way I managed to get it to work on the Sandy Bridge machine. I found from debugging that both the crc32c and crc32c-intel modules were being loaded, but there is a bug in the way the RHEL6 kernel decides which module to use. In the case of my Sandy Bridge machine the software crc32c was being picked over the intel one. By default RHEL6 builds the crc32c code into the kernel and crc32c-intel as a module.
When I built both crc32c implementations into the kernel the problem went away and crc32c-intel was correctly picked. That is the workaround, but this needs to be reported to Red Hat.

Comment by Alexander Boyko [ 15/Jul/13 ]

Could you show performance results for crc32c-intel vs. the LU-3570 patch?

Comment by James A Simmons [ 16/Jul/13 ]

Here are the results I got.

AMD bulldozer
---------------------------------------------------------

with patch

Lustre: Crypto hash algorithm crc32c speed = 1937 MB/s

with crc32c-intel

Lustre: Crypto hash algorithm crc32c speed = 1458 MB/s

Intel Xeon E5-2603
----------------------------------------------------------

with patch
Lustre: Crypto hash algorithm crc32c speed = 3089 MB/s

with crc32c-intel:
Lustre: Crypto hash algorithm crc32c speed = 3061 MB/s

Give it a try. To make it work you need to remove the crc32c-intel
kernel module from your image. Also, in the latest patch I placed
a test for XMM4_2 in linux-crypto-crc32c-pclmul.c, which can be
removed since crc32 always exists with XMM4_2.

Comment by Alexander Boyko [ 17/Jul/13 ]

James, I think the loading problem was caused by the intel module and your patch having the same .cra_priority value, 200. I have combined the results for the checksum algorithms.

Algorithm (MB/s)   AMD Bulldozer   Intel Xeon E5-2603
adler              670             1478
crc32-pclmul       2532            1426
crc32c-intel       1458            3061
crc32c-pclmul      1937            3089

Comment by Alexander Boyko [ 17/Jul/13 ]

This is very strange: pclmul with the crc32 polynomial is faster than crc32c (h/w crc32c + pclmul) on the AMD CPU, but on the Intel CPU it is twice as slow. In theory, we could get the same result for crc32c as for crc32-pclmul (on AMD) if it were implemented using only the pclmul instruction, like crc32-pclmul.

Comment by Alexander Boyko [ 17/Jul/13 ]

Should we add this patch to Lustre if it gives 30% faster crc32c only for AMD?
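The relative gain can be worked out from the figures posted earlier in this ticket. A small sketch of the arithmetic, with a hypothetical helper name:

```c
/* Relative gain of candidate over baseline throughput, in percent. */
static double speedup_percent(double candidate_mbs, double baseline_mbs)
{
	return (candidate_mbs / baseline_mbs - 1.0) * 100.0;
}
```

With the AMD numbers (1937 MB/s with the patch against 1458 MB/s with crc32c-intel) this gives roughly 33 percent. With the Intel numbers (3089 against 3061) it gives under 1 percent.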

Comment by James A Simmons [ 18/Jul/13 ]

I think it could be worthwhile to add this patch. We could make it a module option that is turned off by default. Alex, Andreas, what do you think?

Comment by Andreas Dilger [ 19/Jul/13 ]

Are the crc32c-intel and crc32c-pclmul implementations independent, or is it one or the other? The relative performance difference shouldn't matter, since the OST and client will pick the fastest checksum that is available.

Comment by James A Simmons [ 22/Jul/13 ]

The Linux crypto layer picks either crc32c-intel or crc32c-pclmul but not both at the same time. The choice is not made by measuring performance but by the cra_priority flag. That is why I suggested a module flag to turn crc32c pclmul handling on. In the 3.10+ kernels a hybrid approach is used: the crc32 intel instruction for smaller data sets and pclmul for larger data sets. I still like the module flag idea better. It can be up to the deployment team to determine which is better on the hardware they are running.

Comment by Alexander Boyko [ 22/Jul/13 ]

From my point of view, it is better to have a separate module, as a workaround for old kernels, and not to add this patch to the Lustre sources.
Andreas, the libcfs crypto hash test compares different algorithms but does not test different implementations of the same algorithm. The implementation choice is made by the Linux kernel crypto API, as James already said, based on cra_priority.

Comment by James A Simmons [ 25/Sep/13 ]

Just updated the patch again to address several concerns. First, we need this patch in the Lustre source base since most people will be running patchless clients, which will lack this performance enhancement. The patch now tests whether the kernel is version 3.10 or above and, if so, will not build any of the pclmulqdq code Lustre has, since it is already included upstream. Second, I changed the value of the cra_priority flag to 150, which is higher than the software crc32c value of 100 but lower than the crc32c-intel value of 200. This way crc32c-intel will always be preferred.
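The effect of those priority values can be illustrated with a small userspace model. This is a sketch of priority-based selection only, not the kernel crypto API: struct hash_impl, pick_impl and the driver name strings are made up for this example, and how the kernel resolves two implementations with equal priority is not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one registered hash implementation. */
struct hash_impl {
	const char *alg_name;		/* algorithm, e.g. "crc32c" */
	const char *driver_name;	/* specific implementation */
	int priority;			/* plays the role of cra_priority */
};

/*
 * Return the implementation of alg with the highest priority, or NULL
 * if none is registered. On a tie this sketch keeps the earlier entry.
 */
static const struct hash_impl *pick_impl(const struct hash_impl *impls,
					 size_t n, const char *alg)
{
	const struct hash_impl *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(impls[i].alg_name, alg) != 0)
			continue;
		if (best == NULL || impls[i].priority > best->priority)
			best = &impls[i];
	}
	return best;
}

/* Priorities quoted in the comment above; names are labels only. */
static const struct hash_impl crc32c_impls[] = {
	{ "crc32c", "crc32c-software", 100 },
	{ "crc32c", "crc32c-pclmul",   150 },
	{ "crc32c", "crc32c-intel",    200 },
};
```

With all three entries present the lookup returns crc32c-intel. If only the first two are registered, for example because crc32c-intel is not loaded, it returns crc32c-pclmul, which still beats the software table implementation.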
