[LU-12667] Read doesn't perform well in complex NUMA configuration

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment: master branch, AMD EPYC CPU
    • Severity: 3

    Description

      For instance, the AMD EPYC 7551 (32 CPU cores) has 4 dies per CPU socket; each die consists of 8 CPU cores and forms its own NUMA node.
      With two CPU sockets per client, that gives 64 CPU cores in total (128 logical CPUs with SMT) and 8 NUMA nodes.

      # numactl -H
      available: 8 nodes (0-7)
      node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
      node 0 size: 32673 MB
      node 0 free: 31561 MB
      node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
      node 1 size: 32767 MB
      node 1 free: 31930 MB
      node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
      node 2 size: 32767 MB
      node 2 free: 31792 MB
      node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
      node 3 size: 32767 MB
      node 3 free: 31894 MB
      node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
      node 4 size: 32767 MB
      node 4 free: 31892 MB
      node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
      node 5 size: 32767 MB
      node 5 free: 30676 MB
      node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
      node 6 size: 32767 MB
      node 6 free: 30686 MB
      node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
      node 7 size: 32767 MB
      node 7 free: 32000 MB
      node distances:
      node   0   1   2   3   4   5   6   7 
        0:  10  16  16  16  32  32  32  32 
        1:  16  10  16  16  32  32  32  32 
        2:  16  16  10  16  32  32  32  32 
        3:  16  16  16  10  32  32  32  32 
        4:  32  32  32  32  10  16  16  16 
        5:  32  32  32  32  16  10  16  16 
        6:  32  32  32  32  16  16  10  16 
        7:  32  32  32  32  16  16  16  10 
      

      Also, first-generation EPYC (Naples) has a PCIe controller per die (NUMA node), and the IB HCA is connected to one of those PCIe controllers, as shown below.

      # cat /sys/class/infiniband/mlx5_0/device/numa_node 
      5
      

      The mlx5_0 adapter is connected to CPU1's NUMA node 1, which is NUMA node 5 in the two-socket configuration.
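
      The same affinity can be cross-checked from the device's local CPU list (a quick sanity check, assuming the standard sysfs layout; given the numactl output above, this should report node 5's CPUs, i.e. 40-47,104-111):

      # cat /sys/class/infiniband/mlx5_0/device/local_cpulist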

      In this case, LNet does not perform well with the default CPT setting; it requires manual CPT configuration, and performance still depends heavily on which CPT configuration and CPU cores are involved.
      Here are quick LNet selftest results with the default CPT setting and with a NUMA-aware CPT configuration.

      default CPT setting(cpu_npartitions=8)
      client:server   PUT(GB/s)  GET(GB/s)
           1:1          7.0        6.8 
           1:2         11.3        3.2
           1:4         11.4        3.4
      
      1 CPT (cpu_npartitions=1 cpu_pattern="0[40-47,104-111]")
      client:server   PUT(GB/s)  GET(GB/s)
           1:1         11.0       11.0
           1:2         11.4       11.4
           1:4         11.4       11.4
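
      For reference, numbers like the above come from lnet_selftest; a minimal run along these lines (the NIDs, concurrency and transfer size are placeholders, not values from this ticket) measures bulk throughput between a client group and a server group:

      # modprobe lnet_selftest
      # export LST_SESSION=$$
      # lst new_session read_write
      # lst add_group clients 10.0.0.10@o2ib
      # lst add_group servers 10.0.0.1@o2ib 10.0.0.2@o2ib
      # lst add_batch bulk
      # lst add_test --batch bulk --concurrency 16 --from clients --to servers brw read size=1M
      # lst run bulk
      # lst stat clients servers
      # lst stop bulk
      # lst end_session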
      

      The NUMA-aware CPT configuration gives much better LNet performance, but CPTs are used not only by LNet but also by all the other Lustre client threads. In general, we want more CPU cores and more CPTs involved, but LNet needs to be aware of the CPT and the NUMA node where the network interface is installed.
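
      On this kind of topology, the NUMA-aware setup used in the test above can be made persistent through module options; a minimal sketch (the interface and network names are placeholders, the CPT pattern corresponds to node 5 in the numactl output, and the trailing "[0]" is LNet's standard syntax for binding the NI to CPT 0):

      # cat /etc/modprobe.d/lustre.conf
      options libcfs cpu_npartitions=1 cpu_pattern="0[40-47,104-111]"
      options lnet networks="o2ib0(ib0)[0]"

      This pins LNet, and everything else that runs per CPT, to the HCA's NUMA node, which is exactly the trade-off described above.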


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Shuichi Ihara (sihara)
