[LU-14584] LNet: 2 CPTs on a single NUMA node instead of one Created: 05/Apr/21 Updated: 07/Apr/21
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | CentOS 7 |
| Severity: | 3 |
| Description |
By default, Lustre 2.14 LNet routers detect 2 CPTs instead of 1 CPT on a single NUMA node server. If not discovered, this can lead to very unbalanced routers:

[root@sh02-oak01 ~]# cat /sys/kernel/debug/lnet/nis
nid                 status alive refs   peer rtr max tx  min
0@lo                up     0     2      0    0   0   0   0
0@lo                up     0     0      0    0   0   0   0
10.50.0.131@o2ib2   up     0     122544 8    0   128 127 40
10.50.0.131@o2ib2   up     0     1      8    0   128 127 57
10.0.2.214@o2ib5    up     0     2      8    0   128 128 75
10.0.2.214@o2ib5    up     0     2      8    0   128 128 70

The expected behavior is that LNet would instantiate only a single CPT when there is a single NUMA node available. More details about this single NUMA node LNet router are available below.

[root@sh02-oak01 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
Stepping:              1
CPU MHz:               1387.481
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              6999.30
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

[root@sh02-oak01 ~]# ls -ald /sys/devices/system/node/node*
drwxr-xr-x 4 root root 0 Apr  5 14:10 /sys/devices/system/node/node0

[root@sh02-oak01 ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65314 MB
node 0 free: 42722 MB
node distances:
node   0
  0:  10

[root@sh02-oak01 ~]# lctl get_param cpu_partition_table
cpu_partition_table=0 : 0 1 4 5
1 : 2 3 6 7

This is happening with a default libcfs configuration (no libcfs module tuning):

[root@sh02-oak01 ~]# cat /sys/module/libcfs/parameters/cpu_npartitions
0
[root@sh02-oak01 ~]# cat /sys/module/libcfs/parameters/cpu_pattern
N
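As a cross-check independent of Lustre, the same topology numbers can be read back from generic Linux interfaces. A minimal C sketch, assuming only the standard sysfs node layout and sysconf(); there is nothing Lustre-specific here:

#include <glob.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long ncpu = sysconf(_SC_NPROCESSORS_ONLN);	/* online CPUs */
	glob_t g;
	size_t nnode = 0;

	/* Each NUMA node appears as /sys/devices/system/node/nodeX */
	if (glob("/sys/devices/system/node/node[0-9]*", 0, NULL, &g) == 0) {
		nnode = g.gl_pathc;
		globfree(&g);
	}
	printf("NUMA nodes: %zu, online CPUs: %ld\n", nnode, ncpu);
	return 0;
}

On the routers above this reports 1 node and 8 CPUs, yet libcfs still creates 2 partitions.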
| Comments |
| Comment by Amir Shehata (Inactive) [ 07/Apr/21 ] |
What's the output of lnetctl net show -v 4 on the routers? Also, can you share how you configure your routers? Do you configure them as services? How do you load the configuration, from modprobe.d or from /etc/lnet.conf? Technically, LNet doesn't create the NUMA binding; it queries it from libcfs.
| Comment by Stephane Thiell [ 07/Apr/21 ] |
Hi Amir!
We use lnet.conf and the lnet.service. /etc/lnet.conf:

global:
- health_sensitivity: 0
net:
- net type: o2ib1
local NI(s):
- nid:
interfaces:
0: ib0
- net type: o2ib5
local NI(s):
- nid:
interfaces:
0: ib1
routing:
- enable: 1
[root@sh01-oak01 ~]# systemctl status lnet.service
● lnet.service - lnet management
Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/lnet.service.d
└─deps.conf, ibdev.conf
Active: active (exited) since Fri 2021-04-02 13:58:26 PDT; 4 days ago
Process: 79596 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, status=0/SUCCESS)
Process: 79592 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, status=0/SUCCESS)
Process: 79585 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
Process: 79443 ExecStartPre=/bin/sh -c sleep 5 (code=exited, status=0/SUCCESS)
Process: 78830 ExecStartPre=/usr/bin/systemctl restart openibd (code=exited, status=0/SUCCESS)
Main PID: 79596 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/lnet.service
Apr 02 13:57:58 sh01-oak01.int systemd[1]: Starting lnet management...
Apr 02 13:58:26 sh01-oak01.int systemd[1]: Started lnet management.
lnet service overrides:

[root@sh01-oak01 ~]# cat /etc/systemd/system/lnet.service.d/deps.conf
[Unit]
After=dkms.service

[root@sh01-oak01 ~]# cat /etc/systemd/system/lnet.service.d/ibdev.conf
[Service]
ExecStartPre=/usr/bin/systemctl restart openibd
ExecStartPre=/bin/sh -c 'sleep 5'

We have the default /etc/modprobe.d/ko2iblnd.conf from the lustre-client RPM, untouched.

Finally, to force the use of 1 CPT, we use the following configuration in /etc/modprobe.d/lnet.conf:

options libcfs cpu_pattern="0[0-7]"
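For clarity on that workaround: cpu_pattern assigns an explicit CPU list to each CPT, so "0[0-7]" means a single partition holding CPUs 0-7. The hypothetical snippet below is only a userspace illustration of the "<cpt>[<cpu-list>]" grammar (comma-separated entries with dash ranges), not the actual libcfs parser:

#include <stdio.h>
#include <stdlib.h>

/* Expand a cpu_pattern-style string and print each CPT's CPU list. */
static void expand(const char *pattern)
{
	const char *p = pattern;

	while (*p) {
		char *end;
		long cpt = strtol(p, &end, 10);

		if (end == p || *end != '[')
			break;			/* malformed: stop quietly */
		printf("CPT %ld:", cpt);
		p = end + 1;
		while (*p && *p != ']') {
			long lo = strtol(p, &end, 10), hi = lo;

			p = end;
			if (*p == '-') {	/* range "lo-hi" */
				hi = strtol(p + 1, &end, 10);
				p = end;
			}
			for (long c = lo; c <= hi; c++)
				printf(" %ld", c);
			if (*p == ',')
				p++;
		}
		if (*p == ']')
			p++;
		while (*p == ' ')
			p++;
		printf("\n");
	}
}

int main(void)
{
	/* The workaround from this ticket: one CPT spanning all 8 cores. */
	expand("0[0-7]");
	/* The layout these routers ended up with by default instead: */
	expand("0[0,1,4,5] 1[2,3,6,7]");
	return 0;
}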
| Comment by Stephane Thiell [ 07/Apr/21 ] |
BTW, the high refs counts that I mentioned in this ticket are probably not related to this CPT issue at all; for that I opened LU-14589.
| Comment by Amir Shehata (Inactive) [ 07/Apr/21 ] |
When you have cpu_pattern=N (or undefined), can you show me the output of cat /sys/kernel/debug/lnet/cpu_partition_distance?
| Comment by Stephane Thiell [ 07/Apr/21 ] |
[root@sh01-oak02 ~]# cat /sys/module/libcfs/parameters/cpu_pattern
N
[root@sh01-oak02 ~]# cat /sys/kernel/debug/lnet/cpu_partition_distance
0 : 0:10 1:10
1 : 0:10 1:10
[root@sh01-oak02 ~]# cat /sys/kernel/debug/lnet/cpu_partition_table
0 : 0 1 4 5
1 : 2 3 6 7

And just to make sure, this one has a single NUMA node:

[root@sh01-oak02 ~]# ls -ald /sys/devices/system/node/node*
drwxr-xr-x 4 root root 0 Apr  5 15:54 /sys/devices/system/node/node0
[root@sh01-oak02 ~]#
| Comment by Amir Shehata (Inactive) [ 07/Apr/21 ] |
Yes, so that's the issue (I guess you included that at the top, but I missed it). For some reason libcfs is creating two CPTs even though there is only 1 NUMA node. I couldn't reproduce this issue on my VM. We will need to look at the code around this area and see if there have been any recent changes.
| Comment by Stephane Thiell [ 07/Apr/21 ] |
OK. We looked at some old Splunk logs, since LNet prints a message when loading, and it looks like this is not a new issue! Even with previous versions of Lustre (prior to 2.13), these routers were initializing 2 CPTs. From 2019:

kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 8, npartitions: 2

From the Lustre manual:

Maybe it is a remnant of pre-Lustre 2.9 behavior, e.g. if HW NUMA = 1 and core count > 4, then 1 CPT is used for every four cores?
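For what it's worth, a 2-CPT result on this topology is consistent with the default partition-count heuristic in libcfs. The sketch below re-implements that heuristic in standalone C as I understand cfs_cpt_num_estimate() in libcfs_cpu.c to work; the CPT_WEIGHT_MIN constant and the exact rounding steps are reconstructed from memory rather than quoted, so treat the details as assumptions:

#include <stdio.h>

#define CPT_WEIGHT_MIN 4	/* assumed minimum CPUs per partition */

static int cpt_num_estimate(int nnode, int ncpu)
{
	int ncpt;

	if (ncpu <= CPT_WEIGHT_MIN)
		return 1;

	/* Prefer N partitions such that 2*(N-1)^2 < NCPU <= 2*N^2 */
	for (ncpt = 2; ncpu > 2 * ncpt * ncpt; ncpt++)
		;

	if (ncpt <= nnode) {		/* fat NUMA system */
		while (nnode > ncpt)
			nnode >>= 1;
	} else {			/* more CPTs than nodes */
		while ((nnode << 1) <= ncpt)
			nnode <<= 1;
	}
	ncpt = nnode;

	while (ncpu % ncpt != 0)	/* keep partitions even-sized */
		ncpt--;
	return ncpt;
}

int main(void)
{
	/* The routers in this ticket: 1 NUMA node, 8 logical CPUs. */
	printf("npartitions: %d\n", cpt_num_estimate(1, 8));
	return 0;
}

Compiled and run, this prints npartitions: 2, matching the "HW NUMA nodes: 1, HW CPU cores: 8, npartitions: 2" log line above. If the reconstruction is right, the count is driven primarily by the total CPU count and only rounded toward the NUMA node count afterwards, which is why a single-node machine with more than 4 CPUs can still end up with two partitions.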