[LU-1201] Lustre crypto hash cleanup Created: 09/Mar/12  Updated: 21/Nov/12  Resolved: 01/Jul/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.3.0

Type: Improvement Priority: Minor
Reporter: Alexander Boyko Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Attachments: Microsoft Word lustre-LU1201.xlsx     Microsoft Word lustre-singleclient-comparison.xlsx    
Issue Links:
Related
is related to LU-744 Single client's performance degradati... Resolved
Rank (Obsolete): 4572

 Description   

Lustre use crypto hashes algo in two ways: PTLRPC (ptlrpc/sec_bulk.c), OST(ost/ost_handler.c)/OSC (osc/osc_request.c).
OST,OSC use crc32, crc32c, adler for checksumming (compute_checksum() function)
PTLRPC uses crc32, adler, md5, sha1-512 ( kernel crypto api for all, excluding crc32 adler)
All subsystems go through bulk pages and update checksum.
To resolve conflicts with different implementation of checksumming, a new crypto hash interface is needed at libcfs. It should use kernel crypto api for hash calculation for kernel modules, and lustre implementation for user mode. Previus checksum calculation should be changed to the new libcfs crypto hash api. And adding new hash algo would be a simple task.



 Comments   
Comment by Alexander Boyko [ 09/Mar/12 ]

http://review.whamcloud.com/2276

Comment by Andreas Dilger [ 09/Mar/12 ]

In addition to the questions in the patch, I was wondering why all of the different hash functions were being added?

I don't quite understand how "ptlrpc" hash functions differ from "ost" hash functions? Are you referring to sptlrpc (ie Kerberos)? To confirm, is there a protocol change associated with this patch or not?

If there isn't a good reason to add a hash function, I don't necessarily want to just add them because we can. One reason against this is because there are only a limited number of hash functions we can use in the protocol, and also because I'd like to move to "piece-wise" hashes that could be computed on the page and then aggregated in the RPC without having to recompute the checksum again.

Comment by Shuichi Ihara (Inactive) [ 09/Mar/12 ]

I'm really want to see the performance numbers by this changes. (any improvements with kernel crypto APIs, instead of current crc32c codes?) The reason of what I made a patch crc32c/w h/w accelerator checksum patch a couple years ago was that for single client's performance improvements. (there was only single thread ptlrpc, checksum on the client was calculated on that thread without in parallel. So, needed off-load checksum calculation)
The server side was not much performance difference regardless the checksum=on (alder or crc32c) or checksum=off. Because, the servers can calculate the checksum in parallel.

That's true that single client's performance was improved by 40-50% (900MB/sec -> 1.5GB/sec) in 1.8.x code with crc32c patch, but still hitting 1.5GB/s performance limitation due to single threaded ptlrpc. The recent 2.2 branch supports multiple ptlrpc, and it's breaking 1.6GB/s limitation right now and really higher than previous 1.8's numbers.

the single client performance on 2.2 branch, crc32c algorithm still is faster than alder, but crc32c, adler and no checksum's performance number are getting close on 2.2 now..

Comment by Alexander Boyko [ 12/Mar/12 ]

This is logs from libcfs cfs_crypto_test_hashes()
Crypto hash algorithm null speed = -1 MB/s
Crypto hash algorithm adler32 speed = 2241 MB/s
Crypto hash algorithm crc32 speed = 360 MB/s
Crypto hash algorithm md5 speed = 428 MB/s
Crypto hash algorithm sha1 speed = 178 MB/s
Crypto hash algorithm sha256 speed = 117 MB/s
Crypto hash algorithm sha384 speed = 174 MB/s
Crypto hash algorithm sha512 speed = 175 MB/s
Crypto hash algorithm crc32c speed = 2602 MB/s (for sw table ~330 MB/s)

VMware, cpu i5-2415M, for crc32c used crc32c_intel module
Libcfs use 1MB buffer for testing, and calculate hash during 1sec with the same buffer.

This patch provide libcfs interface for crypto hashes.
>>In addition to the questions in the patch, I was wondering why all of the different hash functions were being added?
because SPTLRPC compute hash for this num of algo. No protocol changes, hash calculation only.

This patch drop lustre crc32c implementation because linux kernel crypto have two implementation of crc32c 1) table sw 2) hw (crc32c_intel same code like lustre have). One more - supporting of crc32 instruction should be detected at run time, not in compile (lustre #ifdef X86_FEATURE_XMM4_2).
LU-241 add crc32c algo for lustre osc/ost only, no sptlrpc.
Now(this patch), all crypto hashes algorithms are at one place, same interface, and all parts of lustre can use it.

Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ]

In the the current implementation, crc32c (and fastest algorithm) should be detected automatically at the run time. I've tested same rpm packages on the clients with both of Nehalem based CPU and more old CPU(Core2 Duo) which doesn't have sse4_2 feature.
e.g) mounted a lustre filesystem from both clients. Here is the deafult checksum_type after just mounted.

  • Nehalem based CPU's client
    # cat /proc/fs/lustre/osc/*/checksum_type
    crc32 adler [crc32c]
    crc32 adler [crc32c]
  • Core2 Due CPU's client
    # cat /proc/fs/lustre/osc/*/checksum_type
    crc32 [adler]
    crc32 [adler]
Comment by Alexander Boyko [ 12/Mar/12 ]

yes you are right
#ifdef X86_FEATURE_XMM4_2
if (cpu_has_xmm4_2)
ret |= OBD_CKSUM_CRC32C;
#endif

This patch don`t improve any checksum calculation. It present libcfs crypto hash interface for checksum calculation and fix all lustre checksums to it. Base on kernel crypto api we got crc32c sw implementation, so clients wich don`t support sse_4.2 can use crc32c.

Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ]

OK, but, why we need to use crc32c sw implementation, instead of adler when if the client doesn't have Nehalem processor? (btw, Nehalem processor was released out 3 years ago...) Your numbers showed us adler (2241 MB/s) is faster than crc32 sw table (~330 MB/s). Also, my recent checksum performance testing on 2.2 branch, with adler, crc32c's and no-checksum's performance are very similar on the single client.

Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ]

Here is qucik performance results on between single OSS and single client over the QDR Infiniband. I tested this on ramdisk (created 8 x OST on ramdisk) instead of general storage to show real checkum performance. Tested server are 2 x HP DL360 (2.93GHz x 12 CPU cores, 48GB memory)

WRITE(2.1.1)
#proc	nochksum crc32	adler	crc32c
1	 759 	  289 	 550 	
2	1367 	  567 	1063 	
4	2663 	 1146 	2023 	
8	3084 	 2183 	1309 	

WRITE(2.2 branch)				
1	 748 	  747 	 748 	 749 
2	1427 	 1401 	1378 	1387 
4	2669 	 2465 	2594 	2621 
8	3152 	 2470 	3073 	3148 
					
READ(2.1.1)
	nochksum crc32	adler	crc32c
1	1031 	  323 	 829 	
2	1785 	  360 	 921 	
4	2456 	  373 	 963 	
8	2145 	  376 	 954 	

READ(2.2 branch)				
1	 998 	  925 	 954 	 938 
2	1826 	 1703 	1716 	1691 
4	2986 	 2653 	2850 	2836 
8	2875 	 2724 	2659 	2855

Comment by Andreas Dilger [ 12/Mar/12 ]

Alexander, thank you for the checksum performance results. Based on these numbers, the adler32 algorithm should be used for nodes that do not have hardware crc32c support, but h/w crc32c is slightly faster so it should be preferred if available. Does this match your test results? What is the output of "lctl get_param osc.*.checksum_type" (as Ihara showed in his recent comment)?

Is the printing of "Crypto hash algorithm null speed = -1 MB/s" a defect?

Ihara, thank you also for the benchmark results. What test is being run? Is it a single-threaded or multi-threaded load? The multi-threaded ptlrpcd change, along with moving the bulk RPC checksums to be in ptlrpcd context is what improved this performance.

Comment by Andreas Dilger [ 12/Mar/12 ]

Alexander, also note that if the OST does not support crc32c in hardware, then it should not reply to the client that crc32c is supported. It is much more efficient for the OSS in this case to use adler32, since it uses much less CPU, and the speed on the client is approximately the same. It is more important that the server CPU usage is as efficient as possible.

Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ]

Andreas, the numbers were tested on IOR with multiple processes

{1,2,4 and 8}

on the single client.
IOR -a POSIX -k -t 1m -b 1g -F -w -o /mnt/lustre/file
sync; echo 1 > /proc/sys/vm/drop_caches
IOR -a POSIX -t 1m -b 1g -F -r -o /mnt/lustre/file

The numbers were on 2.1.1 is pretty same as what we saw 1.8.x. And, yes, the checksum with any algorithms are improved on 2.2.

Comment by Alexander Boyko [ 13/Mar/12 ]

>>Does this match your test results? What is the output of "lctl get_param osc.*.checksum_type" (as Ihara showed in his recent comment)?
Yes, the fastest algo would be used by default, if server support it.
>>Is the printing of "Crypto hash algorithm null speed = -1 MB/s" a defect?
No, upper layer disable checksumming for ost/osc. -1 means no algo, or unsupported.

>>Alexander, also note that if the OST does not support crc32c in hardware, then it should not reply to the client that crc32c is supported. It is much more efficient for the OSS in this case to use adler32, since it uses much less CPU, and the speed on the client is approximately the same. It is more important that the server CPU usage is as efficient as possible.

In this point of view the server should not reply to the client that crc32 is supported too. Because adler based on simple add instruction and all hardware support it. One reason to support crc32 is very old client. With this patch the code always support crc32c, may be this should be changed for sw implementation in the server reply.

I think, that not only speed performance is the main parameter for algo, need to look at cpu load on real hardware.

Comment by Alexander Boyko [ 13/Mar/12 ]
WRITE(2.1.1)
#proc	nochksum crc32	adler	crc32c
1	 759 	  289 	 550 	
2	1367 	  567 	1063 	
4	2663 	 1146 	2023 	
8	3084 	 2183 	1309

Why adler for 8 clients is slower than the crc32 ?

Comment by Alexander Boyko [ 13/Mar/12 ]

The same question for read

READ(2.2 branch)				
1	 998 	  925 	 954 	 938 
2	1826 	 1703 	1716 	1691 
4	2986 	 2653 	2850 	2836 
8	2875 	 2724 	2659 	2855

Base on checksum performance adler32 is 6-7 times faster than crc32 table implementation. But raw IOR lustre result show different value.

Comment by Shuichi Ihara (Inactive) [ 13/Mar/12 ]

> Why adler for 8 clients is slower than the crc32 ?
don't know why, still, but it was same results on 2 x OSS, 8 x OST(4 OSTs per OSS) configuration, instead of 1 x OSS, 8 OSTs.

This was tested on ramdisk /dev/ram, it should be also tested with real disks, instead of ram disk.

Comment by Andreas Dilger [ 14/Mar/12 ]

In this point of view the server should not reply to the client that crc32 is supported too. Because adler based on simple add instruction and all hardware support it. One reason to support crc32 is very old client. With this patch the code always support crc32c, may be this should be changed for sw implementation in the server reply.

The need to support old crc32-only clients has passed. According to https://bugzilla.lustre.org/show_bug.cgi?id=13805, the adler32 support was landed in Lustre 1.6.5, and while 1.8 clients and servers need to maintain interoperability with 1.6 clients and servers (these apparently still exist in the world), the 2.3 servers where this patch will land only need interoperability with 1.8 clients.

I would suggest that the OST use the cfs_crypto_test_hashes() results, and eliminate any results which are slower than, say, 1/2 of the adler32 value. It would be worthwhile to include a comment referencing b=13805 that added adler32 in 1.6.5 to make it clear that all 1.8.x clients will be able to support this.

Comment by Nathan Rutman [ 03/Apr/12 ]

Andreas - you marked the patch "I would prefer that you didn't submit this" – is this only because of the default algorithm selection issue? Would it make more sense here to send an explicit OSS preference back to the clients instead of "faking" the connect flags?
The broader cleanup and unification of the sptlrpc and bulk checksum is acceptable to you?

Shuichi - your reported 2.2 numbers are using this patch?

Comment by Andreas Dilger [ 03/Apr/12 ]

Nathan, it is only because of the default algorithm selection. The rest of the patch is generally OK I think. My only remaining

As for sending the OSS preference back - that would be better, but more complex. It would need a speed/ratio to be sent back for each checksum so that the client could pick the best combination of local and remote checksums. However, I tend to think that this decision is better for the server to make based on the algorithms that the client understands, and the speed of the algorithms that the server supports.

It would also be possible for the client to bias the checksum selection by excluding algorithms like software CRC32/CRC32c if they are extremely slow and then letting the OST chose from the remainder based on performance. In essence, this is what is done in the existing code in a much less flexible manner - the client excludes CRC32c if it is not implemented in hardware, as does the OST, and the remaining choices are sent back to the client (ADLER normally, and CRC32 as a fallback for 1.6 clients). Having actual performance metrics could makes this selection less "binary", which will become increasingly important as new algorithms are added.

Essentially, what I'm proposing is an improvement of the existing selection semantics that take both client and server performance into account, rather than ignoring the server performance entirely as the proposed patch does. It could be implemented in two ways, and I think my original proposal of having the client chose the fastest algorithm is not as good as having the server do it. New proposal below:

  • client computes checksum speeds
  • client discards algorithms below some performance threshold
    • e.g. 3/4 of (geometric mean, average, fastest), ...
    • this will keep all algorithms if they are reasonably close in performance, otherwise using just the mean/median would always exclude some algorithm(s) needlessly
    • and always includes ADLER for interoperability
  • we do not need to include CRC32
    • this can be very slow for the client, even if it is fast on the server
    • ADLER is faster in software and is universally supported
    • CEC32 may be included if it is fast (e.g. implemented in hardware)
  • client sends (possibly reduced) list of acceptable checksums (bits) to server in connect
  • server discards client algorithms it doesn't understand, and those below the performance threshold calculated based on remaining algorithms
  • the presence of ADLER would ensure that there is always some algorithm which is common to both the client and server
  • only the fastest server algorithm(s) will be returned to the client, if there is a big difference
  • client can make a final choice if there is one
Comment by Nathan Rutman [ 03/Apr/12 ]

My only concern is that this removes the ability of the users/admin to make the decision. The other approach would be to just negotiate raw capabilities, always default to adler, and let the admin make the final selection with lctl conf_param.

Comment by Andreas Dilger [ 03/Apr/12 ]

Nathan, note that this is only describing the behaviour of selecting the default checksum algorithm. This isn't really any different than what is done today, with the exception that instead of using the actual checksum performance the old code used hard-coded defaults based on the order that the checksums were listed in obd_cksum_pack() (or similar, can't check right now).

If there is a checksum algorithm listed via conf_param it would override the one chosen by default. I guess the caveat is that if the algorithm is excluded because it is much slower than the others, then it may not be available for selection. That is true of the code today also - if crc32c is not in hardware on either the client or server, then it is excluded and couldn't be selected by the administrator either.

Comment by Shuichi Ihara (Inactive) [ 04/Apr/12 ]

I've never tested this patches on 2.2. All results were on vanilla-2.2 branch.
Here is single client's performance ratio(tested with 12 IOzone threads). This is tested on real disks, instead of ramdisk. The results overall, it seems to be reasonable.

Performance ratio(Write)
            v2.2
nochecksum  1.00
adler       0.88
crc32       0.68
crc32c      0.93

Performance ratio(Read)
            v2.2
nochecksum  1.00
adler       0.88
crc32       0.71
crc32c      0.91
Comment by Nathan Rutman [ 04/Apr/12 ]

I guess the caveat is that if the algorithm is excluded because it is much slower than the others, then it may not be available for selection. That is true of the code today also - if crc32c is not in hardware on either the client or server, then it is excluded and couldn't be selected by the administrator either.

Exactly my concern. If there are tons of clients and few servers, it may make sense to use the fastest server algorithm regardless of the client algo speed to optimize the overall system rate. That's the reason we wanted to be able to use the SW crc-32c algo (on the clients) in the first place.

Comment by Andreas Dilger [ 04/Apr/12 ]

If the concern is the client eliminating the checksum algorithm choices prematurely, then the client could send all of the known algorithms to the OST. Only the OST would drop the algorithms that are totally underperforming (e.g. crc32 or crc32c in s/w). The final performance/config selection would be in the hands of the client/administrator. That would ensure that the client doesn't pick a locally fast algorithm (e.g. crc32c h/w) that would significantly impact the OST CPU usage (e.g. crc32c s/w), without reducing the (sensible) choices available to the user.

Comment by Nathan Rutman [ 04/Apr/12 ]

That sounds like a good solution:
1. Client sends all available local algos
2. server drops non-client-supported from list, then replies with remaining algos that perform at 50% or better of the fastest algo remaining on server
3. client chooses locally fastest of what's left as the default.

Comment by Shuichi Ihara (Inactive) [ 20/Apr/12 ]

A quick question.
Do we keep RHE5 patchless client on lustre-2.3?
If so, RHEL5's client may see some performance drops on 2.3 from 2.2 if they upgrade.
Because, crc32c is implemented in the lustre itself and crc32c-intel is not available on that kernel, so alder will be selected on 2.3.

Comment by Andreas Dilger [ 20/Apr/12 ]

Ihara, while it is true that there may be some minor degradation in the case of RHEL 5 clients, based on the test results you posted this performance loss will be minor (or still an improvement over earlier versions of Lustre) due to the multi-threaded ptlrpcd speedups.

Also, the chance of users wanting to stick with RHEL5 for stanility, but moving to a new development version of Lustre is not very likely.

Comment by Alexander Boyko [ 22/Apr/12 ]

I do test on RH5 with kernel version 2.6.18-238.19.1, it support crc32c hw (kernel compile), and have better crc32c performance than 2.6.32-... kernels.

Comment by Andreas Dilger [ 30/Apr/12 ]

Alexander, Ihara,
have you done any performance comparison with checksums enabled on 1.8 for comparison?

Comment by Shuichi Ihara (Inactive) [ 01/May/12 ]

I had couple of benchmarks on 1.8 and 2.2 for comparison, but not yet with this patches. I'm thinking to do same benchmarking on 1.8 and 2.2 with this patches in a couple of days.
btw, we are seeing single client's performance regressions on 2.x due to LU-744. So, at this moment, we might need to run with file size < client's memory size for more fair comparison.. However, In the case of file size > client's memory size, we still see less client's performance regardless checksum=on/off.

Comment by Shuichi Ihara (Inactive) [ 05/May/12 ]

Hit kernel panic with this patches on Sandybridge server. I just filed on LU-1379.

Comment by Alexander Boyko [ 05/May/12 ]

Can you attach your config.h? Looks like invalid configure, I can`t reproduce issue on the same kernel.

Comment by Shuichi Ihara (Inactive) [ 05/May/12 ]

ok, the problem is fixed with correct config.h.

Comment by Shuichi Ihara (Inactive) [ 07/May/12 ]

Andreas,
attached is an checksum comparison on various lustre version. most of 2.2 numbers (weather checksum is enabled or not) were lower than 2.1 and 1.8 except read performance. we might be having some regressions on 2.2 client. need to open new ticket of this?

Comment by Andreas Dilger [ 07/May/12 ]

Ihara, yes this spreadsheet should be attached to a new bug, along with details of the test being run. Are these results from testing with 24x ramdisk OSTs, as before? Jinshan is looking at this problem, please assign it to him.

The new results are contrary to the previous test results in this bug, which showed 2.2 being faster than 2.1 for 8x OSTs.

Comment by Andreas Dilger [ 10/May/12 ]

It looks like there is already a bug LU-744 for tracking the 2.x performance degradation.

Comment by Alexander Boyko [ 11/May/12 ]

Ihara, have you checksumming benchmark results without this patch(for the same hw and configuration)?

Comment by Shuichi Ihara (Inactive) [ 13/May/12 ]

No big performance differences between 2.2 and 2.2/w this patches, but still lower than 2.1.2.

LU-744, I've filed before, but I think this is another regression compared to what we saw here.
As far as I tested on 2.2, the performance goes down when write/read size exceed client's memory size. This is LU-744 and same problem happens on 2.1.x.

In order to avoid this regression for this checksum comparison among on 1.8.7, 2.1.2 and 2.2, I used file size < client's memory size.

Anyway, I don't think this regression is related to LU-1201, I will open the new ticket.

Comment by Jinshan Xiong (Inactive) [ 14/May/12 ]

Hi Ihara, in LU-744 you saw 2.2 clients performed better than 2.1, however, here you have seen the opposite. Is this only because the file size is less than memory size?

Comment by Shuichi Ihara (Inactive) [ 16/May/12 ]

Hi Jay,

Yes, I didn't write/read the larger file size than client's memory size to avoid LU-744 issue.

I opened new ticket on LU-1408 for single client's performance regression on 2.2. Also, during my testing for LU-1408, I saw another big performance difference between the latest b2_1 branch and 2.1.2RC0 tag. This is filed on LU-1413 as another regression.

So, please move foward to on LU-744, LU-1408 and LU-1413 for single client performance regression and different behaviors on various lustre version.

Comment by Peter Jones [ 11/Jun/12 ]

Landed for 2.3

Comment by Peter Jones [ 11/Jun/12 ]

patch reverted as breaks build for external OFED

Comment by Alexander Boyko [ 15/Jun/12 ]

new review req at http://review.whamcloud.com/#change,3098
two lines code was changed. Build succcessful.

Comment by Peter Jones [ 01/Jul/12 ]

Landed for 2.3

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-337
Xyratex-bug-id: MRP-471

Generated at Sat Feb 10 01:14:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.