[LU-1201] Lustre crypto hash cleanup Created: 09/Mar/12 Updated: 21/Nov/12 Resolved: 01/Jul/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.3.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Alexander Boyko | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Rank (Obsolete): | 4572 | ||||||||
| Description |
|
Lustre uses crypto hash algorithms in two places: PTLRPC (ptlrpc/sec_bulk.c) and the OST (ost/ost_handler.c)/OSC (osc/osc_request.c) bulk I/O checksums. |
| Comments |
| Comment by Alexander Boyko [ 09/Mar/12 ] |
| Comment by Andreas Dilger [ 09/Mar/12 ] |
|
In addition to the questions in the patch, I was wondering why all of the different hash functions were being added? I don't quite understand how "ptlrpc" hash functions differ from "ost" hash functions? Are you referring to sptlrpc (ie Kerberos)? To confirm, is there a protocol change associated with this patch or not? If there isn't a good reason to add a hash function, I don't necessarily want to just add them because we can. One reason against this is because there are only a limited number of hash functions we can use in the protocol, and also because I'd like to move to "piece-wise" hashes that could be computed on the page and then aggregated in the RPC without having to recompute the checksum again. |
| Comment by Shuichi Ihara (Inactive) [ 09/Mar/12 ] |
|
I really want to see performance numbers for these changes (any improvement with the kernel crypto APIs over the current crc32c code?). The reason I made the crc32c hardware-accelerated checksum patch a couple of years ago was to improve single-client performance: there was only a single ptlrpc thread, so the checksum on the client was calculated on that thread with no parallelism, and the calculation needed to be off-loaded. It's true that single-client performance improved by 40-50% (900 MB/sec -> 1.5 GB/sec) in the 1.8.x code with the crc32c patch, but it still hit a 1.5 GB/s performance limit due to the single-threaded ptlrpc. The recent 2.2 branch supports multiple ptlrpc threads, breaks the 1.6 GB/s limit, and is now well above the earlier 1.8 numbers. For single-client performance on the 2.2 branch, the crc32c algorithm is still faster than adler, but the crc32c, adler, and no-checksum numbers are getting close on 2.2 now. |
| Comment by Alexander Boyko [ 12/Mar/12 ] |
|
These are logs from libcfs cfs_crypto_test_hashes() on VMware, CPU i5-2415M; the crc32c_intel module was used for crc32c. This patch provides a libcfs interface for crypto hashes. It drops the Lustre crc32c implementation, because the Linux kernel crypto layer already has two crc32c implementations: 1) a software table version and 2) a hardware version (crc32c_intel, the same code Lustre has). One more point: support for the crc32 instruction should be detected at run time, not at compile time (Lustre currently uses #ifdef X86_FEATURE_XMM4_2). |
| Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ] |
|
In the current implementation, crc32c (and, more generally, the fastest algorithm) should be detected automatically at run time. I've tested the same RPM packages on clients with both a Nehalem-based CPU and an older CPU (Core2 Duo) that doesn't have the sse4_2 feature.
|
| Comment by Alexander Boyko [ 12/Mar/12 ] |
|
Yes, you are right. This patch doesn't improve any checksum calculation; it introduces the libcfs crypto hash interface for checksum calculation and converts all Lustre checksums to it. Based on the kernel crypto API we get a software crc32c implementation, so clients which don't support sse4.2 can still use crc32c. |
| Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ] |
|
OK, but why do we need to use the software crc32c implementation instead of adler when the client doesn't have a Nehalem processor? (BTW, the Nehalem processor was released about 3 years ago...) Your numbers showed that adler (2241 MB/s) is faster than the software crc32 table (~330 MB/s). Also, in my recent checksum performance testing on the 2.2 branch, the adler, crc32c, and no-checksum performance numbers are very similar on a single client. |
| Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ] |
|
Here are quick performance results between a single OSS and a single client over QDR InfiniBand. I tested on ramdisk (8 x OSTs created on ramdisk) instead of general storage to show the real checksum performance. The test server is 2 x HP DL360 (2.93GHz x 12 CPU cores, 48GB memory).

WRITE (2.1.1), MB/s
#proc  nochksum  crc32  adler  crc32c
1      759       289    550    -
2      1367      567    1063   -
4      2663      1146   2023   -
8      3084      2183   1309   -

WRITE (2.2 branch), MB/s
#proc  nochksum  crc32  adler  crc32c
1      748       747    748    749
2      1427      1401   1378   1387
4      2669      2465   2594   2621
8      3152      2470   3073   3148

READ (2.1.1), MB/s
#proc  nochksum  crc32  adler  crc32c
1      1031      323    829    -
2      1785      360    921    -
4      2456      373    963    -
8      2145      376    954    -

READ (2.2 branch), MB/s
#proc  nochksum  crc32  adler  crc32c
1      998       925    954    938
2      1826      1703   1716   1691
4      2986      2653   2850   2836
8      2875      2724   2659   2855 |
| Comment by Andreas Dilger [ 12/Mar/12 ] |
|
Alexander, thank you for the checksum performance results. Based on these numbers, the adler32 algorithm should be used for nodes that do not have hardware crc32c support, but hardware crc32c is slightly faster, so it should be preferred when available. Does this match your test results? What is the output of "lctl get_param osc.*.checksum_type" (as Ihara showed in his recent comment)? Is the printing of "Crypto hash algorithm null speed = -1 MB/s" a defect?

Ihara, thank you also for the benchmark results. What test is being run? Is it a single-threaded or multi-threaded load? The multi-threaded ptlrpcd change, along with moving the bulk RPC checksums into ptlrpcd context, is what improved this performance. |
| Comment by Andreas Dilger [ 12/Mar/12 ] |
|
Alexander, also note that if the OST does not support crc32c in hardware, then it should not reply to the client that crc32c is supported. It is much more efficient for the OSS in this case to use adler32, since it uses much less CPU, and the speed on the client is approximately the same. It is more important that the server CPU usage is as efficient as possible. |
| Comment by Shuichi Ihara (Inactive) [ 12/Mar/12 ] |
|
Andreas, the numbers were from IOR with multiple processes {1, 2, 4, and 8} on a single client. The 2.1.1 numbers are pretty much the same as what we saw on 1.8.x. And yes, checksum performance with all algorithms is improved on 2.2. |
| Comment by Alexander Boyko [ 13/Mar/12 ] |
|
>> Does this match your test results? What is the output of "lctl get_param osc.*.checksum_type" (as Ihara showed in his recent comment)?
>> Alexander, also note that if the OST does not support crc32c in hardware, then it should not reply to the client that crc32c is supported. It is much more efficient for the OSS in this case to use adler32, since it uses much less CPU, and the speed on the client is approximately the same. It is more important that the server CPU usage is as efficient as possible.

From this point of view, the server should not reply to the client that crc32 is supported either, because adler is based on simple add instructions and all hardware supports it. The one reason to keep crc32 is very old clients. With this patch the code always reports crc32c as supported; maybe this should be changed to reflect the software implementation in the server reply. I think raw speed is not the only parameter for choosing an algorithm; we also need to look at CPU load on real hardware. |
| Comment by Alexander Boyko [ 13/Mar/12 ] |
WRITE (2.1.1), MB/s
#proc  nochksum  crc32  adler  crc32c
1      759       289    550    -
2      1367      567    1063   -
4      2663      1146   2023   -
8      3084      2183   1309   -

Why is adler at 8 processes slower than crc32? |
| Comment by Alexander Boyko [ 13/Mar/12 ] |
|
The same question for read:

READ (2.2 branch), MB/s
#proc  nochksum  crc32  adler  crc32c
1      998       925    954    938
2      1826      1703   1716   1691
4      2986      2653   2850   2836
8      2875      2724   2659   2855

Based on the checksum microbenchmark, adler32 is 6-7 times faster than the crc32 table implementation, but the raw Lustre IOR results show a different picture. |
| Comment by Shuichi Ihara (Inactive) [ 13/Mar/12 ] |
|
> Why is adler at 8 processes slower than crc32?

This was tested on ramdisk (/dev/ram); it should also be tested with real disks instead of ramdisk. |
| Comment by Andreas Dilger [ 14/Mar/12 ] |
The need to support old crc32-only clients has passed. According to https://bugzilla.lustre.org/show_bug.cgi?id=13805, the adler32 support was landed in Lustre 1.6.5, and while 1.8 clients and servers need to maintain interoperability with 1.6 clients and servers (these apparently still exist in the world), the 2.3 servers where this patch will land only need interoperability with 1.8 clients. I would suggest that the OST use the cfs_crypto_test_hashes() results, and eliminate any results which are slower than, say, 1/2 of the adler32 value. It would be worthwhile to include a comment referencing b=13805 that added adler32 in 1.6.5 to make it clear that all 1.8.x clients will be able to support this. |
| Comment by Nathan Rutman [ 03/Apr/12 ] |
|
Andreas - you marked the patch "I would prefer that you didn't submit this" – is this only because of the default algorithm selection issue? Would it make more sense here to send an explicit OSS preference back to the clients instead of "faking" the connect flags? Shuichi - your reported 2.2 numbers are using this patch? |
| Comment by Andreas Dilger [ 03/Apr/12 ] |
|
Nathan, it is only because of the default algorithm selection; the rest of the patch is generally OK, I think.

As for sending the OSS preference back - that would be better, but more complex. It would need a speed/ratio to be sent back for each checksum so that the client could pick the best combination of local and remote checksums. However, I tend to think this decision is better made by the server, based on the algorithms that the client understands and the speed of the algorithms that the server supports. It would also be possible for the client to bias the checksum selection by excluding algorithms like software CRC32/CRC32c if they are extremely slow, and then letting the OST choose from the remainder based on performance.

In essence, this is what the existing code does in a much less flexible manner: the client excludes CRC32c if it is not implemented in hardware, as does the OST, and the remaining choices are sent back to the client (ADLER normally, and CRC32 as a fallback for 1.6 clients). Having actual performance metrics would make this selection less "binary", which will become increasingly important as new algorithms are added.

Essentially, what I'm proposing is an improvement of the existing selection semantics that takes both client and server performance into account, rather than ignoring the server performance entirely as the proposed patch does. It could be implemented in two ways, and I think my original proposal of having the client choose the fastest algorithm is not as good as having the server do it. New proposal below:
|
| Comment by Nathan Rutman [ 03/Apr/12 ] |
|
My only concern is that this removes the ability of the users/admin to make the decision. The other approach would be to just negotiate raw capabilities, always default to adler, and let the admin make the final selection with lctl conf_param. |
| Comment by Andreas Dilger [ 03/Apr/12 ] |
|
Nathan, note that this is only describing the behaviour of selecting the default checksum algorithm. This isn't really any different than what is done today, with the exception that instead of using the actual checksum performance the old code used hard-coded defaults based on the order that the checksums were listed in obd_cksum_pack() (or similar, can't check right now). If there is a checksum algorithm listed via conf_param it would override the one chosen by default. I guess the caveat is that if the algorithm is excluded because it is much slower than the others, then it may not be available for selection. That is true of the code today also - if crc32c is not in hardware on either the client or server, then it is excluded and couldn't be selected by the administrator either. |
| Comment by Shuichi Ihara (Inactive) [ 04/Apr/12 ] |
|
I've never tested these patches on 2.2; all results were on the vanilla 2.2 branch.

Performance ratio (write), v2.2:
nochecksum  1.00
adler       0.88
crc32       0.68
crc32c      0.93

Performance ratio (read), v2.2:
nochecksum  1.00
adler       0.88
crc32       0.71
crc32c      0.91
|
| Comment by Nathan Rutman [ 04/Apr/12 ] |
Exactly my concern. If there are tons of clients and few servers, it may make sense to use the fastest server algorithm regardless of the client algo speed to optimize the overall system rate. That's the reason we wanted to be able to use the SW crc-32c algo (on the clients) in the first place. |
| Comment by Andreas Dilger [ 04/Apr/12 ] |
|
If the concern is the client eliminating the checksum algorithm choices prematurely, then the client could send all of the known algorithms to the OST. Only the OST would drop the algorithms that are totally underperforming (e.g. crc32 or crc32c in s/w). The final performance/config selection would be in the hands of the client/administrator. That would ensure that the client doesn't pick a locally fast algorithm (e.g. crc32c h/w) that would significantly impact the OST CPU usage (e.g. crc32c s/w), without reducing the (sensible) choices available to the user. |
| Comment by Nathan Rutman [ 04/Apr/12 ] |
|
That sounds like a good solution: |
| Comment by Shuichi Ihara (Inactive) [ 20/Apr/12 ] |
|
A quick question. |
| Comment by Andreas Dilger [ 20/Apr/12 ] |
|
Ihara, while it is true that there may be some minor degradation in the case of RHEL 5 clients, based on the test results you posted this performance loss will be minor (or still an improvement over earlier versions of Lustre) due to the multi-threaded ptlrpcd speedups. Also, it is not very likely that users will stick with RHEL 5 for stability yet move to a new development version of Lustre. |
| Comment by Alexander Boyko [ 22/Apr/12 ] |
|
I did a test on RHEL 5 with kernel version 2.6.18-238.19.1; it supports hardware crc32c (compiled into the kernel) and has better crc32c performance than the 2.6.32-... kernels. |
| Comment by Andreas Dilger [ 30/Apr/12 ] |
|
Alexander, Ihara, |
| Comment by Shuichi Ihara (Inactive) [ 01/May/12 ] |
|
I have a couple of benchmarks on 1.8 and 2.2 for comparison, but not yet with these patches. I'm planning to do the same benchmarking on 1.8 and 2.2 with these patches in a couple of days. |
| Comment by Shuichi Ihara (Inactive) [ 05/May/12 ] |
|
Hit a kernel panic with these patches on a Sandy Bridge server. I just filed it on |
| Comment by Alexander Boyko [ 05/May/12 ] |
|
Can you attach your config.h? It looks like an invalid configure; I can't reproduce the issue on the same kernel. |
| Comment by Shuichi Ihara (Inactive) [ 05/May/12 ] |
|
ok, the problem is fixed with correct config.h. |
| Comment by Shuichi Ihara (Inactive) [ 07/May/12 ] |
|
Andreas, |
| Comment by Andreas Dilger [ 07/May/12 ] |
|
Ihara, yes this spreadsheet should be attached to a new bug, along with details of the test being run. Are these results from testing with 24x ramdisk OSTs, as before? Jinshan is looking at this problem, please assign it to him. The new results are contrary to the previous test results in this bug, which showed 2.2 being faster than 2.1 for 8x OSTs. |
| Comment by Andreas Dilger [ 10/May/12 ] |
|
It looks like there is already a bug |
| Comment by Alexander Boyko [ 11/May/12 ] |
|
Ihara, do you have checksum benchmark results without this patch (for the same hardware and configuration)? |
| Comment by Shuichi Ihara (Inactive) [ 13/May/12 ] |
|
No big performance difference between 2.2 and 2.2 with these patches, but still lower than 2.1.2.
To avoid that regression in the checksum comparison among 1.8.7, 2.1.2, and 2.2, I used a file size smaller than the client's memory size. Anyway, I don't think this regression is related to |
| Comment by Jinshan Xiong (Inactive) [ 14/May/12 ] |
|
Hi Ihara, in |
| Comment by Shuichi Ihara (Inactive) [ 16/May/12 ] |
|
Hi Jay, yes, I didn't write/read a file larger than the client's memory size, to avoid  I opened a new ticket on  So, please move forward on |
| Comment by Peter Jones [ 11/Jun/12 ] |
|
Landed for 2.3 |
| Comment by Peter Jones [ 11/Jun/12 ] |
|
Patch reverted as it breaks the build for external OFED |
| Comment by Alexander Boyko [ 15/Jun/12 ] |
|
new review req at http://review.whamcloud.com/#change,3098 |
| Comment by Peter Jones [ 01/Jul/12 ] |
|
Landed for 2.3 |
| Comment by Nathan Rutman [ 21/Nov/12 ] |