[LU-13393] t10crc4K/512 algorithm in rhel8.1 kernel is slower than rhel7.7 Created: 26/Mar/20  Updated: 30/Apr/20  Resolved: 30/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

master, rhel8.1 (4.18.0-147.el8.x86_64)


Issue Links:
Related
is related to LU-13391 no print checksum speed for t10pi che... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

t10crc4K/512 algorithm in rhel8.1 kernel is slower than rhel7.7

Performance with the T10PI checksum algorithms t10crc4K/512 regresses badly on the rhel8.1 kernel.
A client running the rhel8.1 kernel with the t10crc4K/512 checksum enabled is much slower than a client running the rhel7.7 kernel with the same checksum enabled.
The test configuration and results follow.

Configuration

1 x client
1 x Platinum 8160, 96GB memory, 1 x IB-EDR
(lctl set_param osc.*.max_pages_per_rpc=16M osc.*.max_rpcs_in_flight=16 osc.*.max_dirty_mb=512 llite.*.max_read_ahead_mb=2048 osc.*.checksum_type=t10crc4K)

Test result on RHEL7.7 (3.10.0-1062.el7.x86_64)

PPN=1
mpirun  --allow-run-as-root -np 1 ior -w -r -t 1m -b 256g -e -F -o /testfs/s/file
Max Write: 1981.81 MiB/sec (2078.07 MB/sec)
Max Read:  2685.01 MiB/sec (2815.44 MB/sec)

PPN=16
mpirun  --allow-run-as-root -np 16 ior -w -r -t 1m -b 16g -e -F -o /testfs/file
Max Write: 9887.55 MiB/sec (10367.84 MB/sec)
Max Read:  11212.37 MiB/sec (11757.03 MB/sec)

Test result on RHEL8.1 (4.18.0-147.el8.x86_64)

PPN=1
mpirun  --allow-run-as-root -np 1 ior -w -r -t 1m -b 256g -e -F -o /testfs/s/file
Max Write: 1703.20 MiB/sec (1785.94 MB/sec)
Max Read:  758.24 MiB/sec (795.07 MB/sec)

PPN=16
mpirun  --allow-run-as-root -np 16 ior -w -r -t 1m -b 16g -e -F -o /testfs/file
Max Write: 6741.36 MiB/sec (7068.83 MB/sec)
Max Read:  5821.17 MiB/sec (6103.94 MB/sec)

The checksum algorithm performance test also shows that the t10crc4K/512 algorithms are much slower on rhel8.1 than on the rhel7.7 kernel (about 27x slower for t10crc4K and about 7.5x slower for t10crc512).

RHEL7.7 (3.10.0-1062.el7.x86_64)

obd_t10_performance_test() T10 checksum algorithm t10ip512 speed = 13015 MB/s
obd_t10_performance_test() T10 checksum algorithm t10ip4K speed = 16855 MB/s
obd_t10_performance_test() T10 checksum algorithm t10crc512 speed = 2551 MB/s
obd_t10_performance_test() T10 checksum algorithm t10crc4K speed = 9231 MB/s

RHEL8.1 (4.18.0-147.el8.x86_64)

obd_t10_performance_test() T10 checksum algorithm t10ip512 speed = 13395 MB/s
obd_t10_performance_test() T10 checksum algorithm t10ip4K speed = 19267 MB/s
obd_t10_performance_test() T10 checksum algorithm t10crc512 speed = 339 MB/s
obd_t10_performance_test() T10 checksum algorithm t10crc4K speed = 342 MB/s
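
The size of the regression can be worked out directly from the two result sets above. The following is a minimal sketch that uses only the speeds reported in this ticket; the dictionary and function names are made up for illustration.

```python
# Speeds in MB/s as reported by obd_t10_performance_test() in this ticket.
rhel77 = {"t10ip512": 13015, "t10ip4K": 16855, "t10crc512": 2551, "t10crc4K": 9231}
rhel81 = {"t10ip512": 13395, "t10ip4K": 19267, "t10crc512": 339, "t10crc4K": 342}


def slowdown(old, new):
    """Return old/new per algorithm; a value above 1 means the new kernel is slower."""
    return {name: old[name] / new[name] for name in old}


ratios = slowdown(rhel77, rhel81)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Only the t10crc* entries come out well above 1; the t10ip* entries are at or below 1, which points at the CRC path rather than at the checksum framework as a whole.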


 Comments   
Comment by Li Xi [ 27/Mar/20 ]

When benchmarking, Lustre uses the crc_t10dif() function to calculate the t10crc512/t10crc4K checksums. Comparing the performance of t10crc* and t10ip*, crc_t10dif() looks much slower than expected. There must be something wrong with it.
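
For readers unfamiliar with the checksum itself: the T10 DIF CRC is a 16-bit CRC with polynomial 0x8BB7, zero initial value, and no bit reflection. The sketch below is a deliberately slow bitwise reference in Python, written only to show what is being computed. It is not the kernel's crc_t10dif() code, which uses a lookup table or PCLMULQDQ instructions, and the function name is invented for this example.

```python
T10DIF_POLY = 0x8BB7  # generator polynomial of the T10 DIF CRC


def crc_t10dif_bitwise(data, crc=0):
    """Compute the 16-bit T10 DIF CRC of a bytes object, one bit at a time."""
    for byte in data:
        crc ^= byte << 8  # feed the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


print(hex(crc_t10dif_bitwise(b"123456789")))
```

A per-bit loop like this does eight shift/XOR steps for every input byte, which is why a hardware-assisted implementation matters so much for throughput.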

Comment by Li Xi [ 27/Mar/20 ]

Ihara found that on RHEL7 there is a kernel module crc_t10dif, but on RHEL8 it is gone.

RHEL7:

[root@es400nv-vm1 ~]# lsmod | grep crc
crc32_pclmul 13133 0
crc_t10dif 12912 2 obdclass,sd_mod
crct10dif_generic 12647 0
crct10dif_pclmul 14307 1
crct10dif_common 12595 3 crct10dif_pclmul,crct10dif_generic,crc_t10dif
crc32c_intel 22094 0

[root@es400nv-vm1 ~]# grep -i crct10 /boot/config-*
CONFIG_CRYPTO_CRCT10DIF=m
CONFIG_CRYPTO_CRCT10DIF_PCLMUL=m

RHEL8:

[root@ec01 ~]# lsmod | grep crc
crct10dif_pclmul 16384 0
crc32_pclmul 16384 0
crc32c_intel 24576 4

[root@ec01 ~]# grep -i crct10 /boot/config-4.18.0-147.5.1.el8_1.x86_64
CONFIG_CRYPTO_CRCT10DIF=y
CONFIG_CRYPTO_CRCT10DIF_PCLMUL=m

I think building crc_t10dif into the kernel instead of as a module is the root cause. When the crc_t10dif module is inserted, crc_t10dif_mod_init() chooses the fastest crct10dif algorithm available at that moment. On RHEL7.7 the module can be inserted after crct10dif_pclmul, so it chooses the faster crct10dif_pclmul rather than crct10dif_common.

However, on RHEL8, crc_t10dif_mod_init() is called too early, before crct10dif_pclmul is loaded, so it has no choice but to use the slow one.
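
The ordering problem described above can be illustrated with a toy model. This is not kernel code: the CryptoRegistry class, the boot() helper, and the step names are hypothetical, and the priority values are illustrative. The only assumption carried over from the comment is that the choice is made once, at init time, among whatever is registered at that moment.

```python
# Illustrative priorities: higher wins. Values are assumptions for this sketch.
DRIVERS = {"crct10dif-generic": 100, "crct10dif-pclmul": 200}


class CryptoRegistry:
    """Toy stand-in for a registry of checksum drivers with priorities."""

    def __init__(self):
        self.registered = {}

    def register(self, driver):
        self.registered[driver] = DRIVERS[driver]

    def best(self):
        return max(self.registered, key=self.registered.get)


def boot(steps):
    """Replay init steps in order; return the driver bound by crc_t10dif init."""
    registry = CryptoRegistry()
    bound = None
    for step in steps:
        if step == "crc_t10dif_init":
            bound = registry.best()  # chosen once, never revisited
        else:
            registry.register(step)
    return bound


# Module case: crc_t10dif is inserted after the accelerated driver.
rhel7_order = ["crct10dif-generic", "crct10dif-pclmul", "crc_t10dif_init"]
# Built-in case: crc_t10dif initialises before the accelerated module loads.
rhel8_order = ["crct10dif-generic", "crc_t10dif_init", "crct10dif-pclmul"]

print(boot(rhel7_order))
print(boot(rhel8_order))
```

In the second ordering the faster driver is registered eventually, but nothing rebinds the earlier choice, which matches the behaviour reported for the unpatched RHEL8.1 kernel.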

Comment by Li Xi [ 27/Mar/20 ]

I think two things need to be done:

1) Send a patch to Red Hat, or open a ticket with them, so that they change CONFIG_CRYPTO_CRCT10DIF back to a module. crc_t10dif needs to be removable and re-insertable later, because other modules might want to register better, faster crc_t10dif algorithms.

2) Change the Lustre code to select the fastest currently available crc_t10dif algorithm itself rather than calling the crc_t10dif() function. That would solve the problem for Lustre clients on all kernels.
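
Option 2 depends on knowing which crct10dif implementations are registered when Lustre loads. From user space, the registered implementations and their priorities can be read from /proc/crypto. The sketch below parses that format and picks the highest-priority driver. The helper names are invented for this example, and the SAMPLE text is illustrative rather than captured from the systems in this ticket. Note that it shows what is registered, not which implementation crc_t10dif() actually bound to at init.

```python
def parse_proc_crypto(text):
    """Parse /proc/crypto-style text into a list of dicts, one per algorithm."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current entry
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def best_driver(entries, name="crct10dif"):
    """Return the highest-priority driver for the algorithm, or None if absent."""
    candidates = [e for e in entries if e.get("name") == name]
    if not candidates:
        return None
    return max(candidates, key=lambda e: int(e.get("priority", 0)))["driver"]


# Illustrative sample only; real input would come from open("/proc/crypto").read().
SAMPLE = """\
name         : crct10dif
driver       : crct10dif-pclmul
module       : crct10dif_pclmul
priority     : 200
type         : shash

name         : crct10dif
driver       : crct10dif-generic
module       : kernel
priority     : 100
type         : shash
"""

print(best_driver(parse_proc_crypto(SAMPLE)))
```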

Comment by Dongyang Li [ 30/Mar/20 ]

I think an easy fix would be to change our kernel config to either

CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y or CONFIG_CRYPTO_CRCT10DIF back to m,

since the Lustre patched kernel needs to be built anyway.

Comment by Li Xi [ 30/Mar/20 ]

I think an easy fix would be to change our kernel config to either

CONFIG_CRYPTO_CRCT10DIF_PCLMUL=y or CONFIG_CRYPTO_CRCT10DIF back to m,

since the Lustre patched kernel needs to be built anyway.

I am not sure that is enough. This is a client-side problem, and we can't control the kernel of Lustre clients.

Comment by Dongyang Li [ 30/Mar/20 ]

Yes, this is a client-side problem as well, and that part we have no control over.

However, the server-side kernel config is maintained by us, so we should at least fix that.

Comment by Andreas Dilger [ 30/Apr/20 ]

In theory, even for the client, we could build and insert a replacement t10crc module as part of the Lustre client that installs an "accelerated" checksum code to be used, even if it is just the same kernel code rebuilt with some small name changes? That would avoid patching the kernel, and it could be dropped once RHEL8 is fixed (unless it is already fixed in RHEL8.2, and we should just use that).

Comment by Shuichi Ihara [ 30/Apr/20 ]

Andreas, I was trying your suggestion, but there is good news: Red Hat has already fixed the t10crc problem in the latest RHEL8.1 update kernel, 4.18.0-147.8.1.el8_1.x86_64.

  • kernel: T10 CRC not using hardware-accelerated version from crct10dif_pclmul (BZ#1797961)

https://access.redhat.com/errata/RHSA-2020:1372

I just checked the checksum speed with the t10crc algorithms and confirmed that t10crc now performs as expected.

4.18.0-147.5.1.el8_1.x86_64

T10 checksum algorithm t10ip512 speed = 20811 MB/s
T10 checksum algorithm t10ip4K speed = 21971 MB/s
T10 checksum algorithm t10crc512 speed = 329 MB/s
T10 checksum algorithm t10crc4K speed = 333 MB/s

4.18.0-147.8.1.el8_1.x86_64

T10 checksum algorithm t10ip512 speed = 20727 MB/s
T10 checksum algorithm t10ip4K speed = 21967 MB/s
T10 checksum algorithm t10crc512 speed = 9215 MB/s
T10 checksum algorithm t10crc4K speed = 15647 MB/s
 
Comment by Peter Jones [ 30/Apr/20 ]

I think that we can just close this as Will Not Fix then because it will certainly be fixed in RHEL 8.2 too
