Details
Type: Bug
Resolution: Unresolved
Priority: Major
Environment: master branch, AMD EPYC CPU
Severity: 3
Description
For instance, the AMD EPYC 7551 (32 CPU cores) has 4 dies per CPU socket; each die consists of 8 CPU cores and is also its own NUMA node.
With two CPU sockets per client, that gives 64 CPU cores in total (128 logical CPUs with SMT) and 8 NUMA nodes.
# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 32673 MB
node 0 free: 31561 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 32767 MB
node 1 free: 31930 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 32767 MB
node 2 free: 31792 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 32767 MB
node 3 free: 31894 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 32767 MB
node 4 free: 31892 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 32767 MB
node 5 free: 30676 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 32767 MB
node 6 free: 30686 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 32767 MB
node 7 free: 32000 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10
Also, first-generation EPYC (Naples) has a PCIe controller per die (NUMA node), and the IB HCA is connected to one of those PCIe controllers, as shown below.
# cat /sys/class/infiniband/mlx5_0/device/numa_node
5
The mlx5_0 adapter is connected to CPU1's NUMA node 1, which is NUMA node 5 in the 2-socket configuration.
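For reference (commands only, output not captured from this system), the same locality can be cross-checked from standard sysfs attributes; node5 here is just the node number reported above:

# CPU cores belonging to the HCA's NUMA node
cat /sys/devices/system/node/node5/cpulist
# or directly from the PCI device behind mlx5_0
cat /sys/class/infiniband/mlx5_0/device/local_cpulist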
In this case, LNET does not perform well with the default CPT configuration; it requires a manual CPT setting, and performance still depends heavily on which CPT configuration and CPU cores are involved.
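For reference, the manual CPT setting is applied through the libcfs module parameters, e.g. in a modprobe config file; the file name is illustrative and the core list is simply the pattern used in the test below:

# /etc/modprobe.d/lustre.conf (illustrative; cores taken from NUMA node 5 where mlx5_0 is attached)
options libcfs cpu_npartitions=1 cpu_pattern="0[40-47,104,111]"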
Here are quick LNET selftest results with the default CPT configuration and with a NUMA-aware CPT configuration.
default CPT setting (cpu_npartitions=8)
client:server   PUT (GB/s)   GET (GB/s)
1:1             7.0          6.8
1:2             11.3         3.2
1:4             11.4         3.4

1 CPT (cpu_npartitions=1 cpu_pattern="0[40-47,104,111]")
client:server   PUT (GB/s)   GET (GB/s)
1:1             11.0         11.0
1:2             11.4         11.4
1:4             11.4         11.4
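For completeness, a minimal sketch of how such PUT/GET numbers can be collected with lnet_selftest; the NIDs are placeholders and the transfer size is illustrative, not the exact parameters of the runs above:

# on the console node, with the lnet_selftest module loaded on all nodes
export LST_SESSION=$$
lst new_session brw_bench
lst add_group clients CLIENT_NID@o2ib
lst add_group servers SERVER_NID@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
# repeat with "brw read" to measure the GET direction
lst run bulk
lst stat clients servers
lst end_session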
The NUMA-aware CPT configuration gives much better LNET performance, but CPTs are needed not only by LNET but also by all other Lustre client threads. In general, we want more CPU cores and CPTs involved, but LNET needs to be aware of the CPT and NUMA node where the network interface is installed.
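One existing way to express that awareness is to bind the o2ib NI to specific CPT(s) in the lnet "networks" option. This is only a sketch: it assumes a CPT layout where CPT 5 covers NUMA node 5, which should be verified against the cpu_partition_table parameter before relying on it:

# bind the NI on ib0 to CPT 5 (the HCA-local partition in this assumed layout)
options lnet networks="o2ib0(ib0)[5]"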
Attachments
Issue Links
- is related to: LU-12194 clients getting soft lockups on 2.10.7 (Open)
Comments
Actually, I understand that there is a PCIe bandwidth limitation on dual-port HCAs, but 3-4 GB/s is much lower than the expected single EDR bandwidth, and I suspect a NUMA, NUMA/IO, or CPT-related problem behind it. I am not trying to get higher bandwidth by adding more HCAs here; rather, I am trying a number of configurations (e.g. increasing peers, pinning the CPT to the interface, using a dedicated CPT, etc.), since, as I said before, we get better performance on a single-EPYC client, but once another CPU is added, read performance drops.
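For the record, the kind of knobs I mean above, as a sketch only (the values are illustrative, not recommendations, and "increasing peers" is my shorthand for raising per-peer and selftest concurrency):

# more in-flight RDMA per peer on the o2ib LND
options ko2iblnd peer_credits=32 concurrent_sends=64
# higher concurrency inside lnet_selftest
lst add_test --batch bulk --from clients --to servers --concurrency 16 brw read size=1M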