[LU-14584] LNet: 2 CPTs on a single NUMA node instead of one Created: 05/Apr/21  Updated: 07/Apr/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7


Attachments: Text File sh01-oak01-numa1cpt1_forced.txt, Text File sh01-oak01-numa1cpt2.txt
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

By default, Lustre 2.14 LNet routers detect 2 CPTs instead of 1 CPT on a single-NUMA-node server. If this goes unnoticed, it could lead to very unbalanced routers:

[root@sh02-oak01 ~]# cat /sys/kernel/debug/lnet/nis
nid                      status alive refs peer  rtr   max    tx   min
0@lo                         up     0    2    0    0     0     0     0
0@lo                         up     0    0    0    0     0     0     0
10.50.0.131@o2ib2            up     0 122544    8    0   128   127    40
10.50.0.131@o2ib2            up     0    1    8    0   128   127    57
10.0.2.214@o2ib5             up     0    2    8    0   128   128    75
10.0.2.214@o2ib5             up     0    2    8    0   128   128    70

The expected behavior is that LNet would instantiate only a single CPT when there is a single NUMA node available. Details about this single-NUMA-node LNet router (lscpu, numactl, libcfs parameters) follow the quick check sketched below.
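As a quick check (a sketch, assuming each local NI is listed once per CPT in the nis debugfs file, so that the repeat count per NID equals the CPT count):

awk 'NR>1 {print $1}' /sys/kernel/debug/lnet/nis | sort | uniq -c

In the listing above every NID appears twice, i.e. a count of 2 per NID; with a single CPT each NID should appear only once.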

[root@sh02-oak01 ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
Stepping:              1
CPU MHz:               1387.481
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              6999.30
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d


[root@sh02-oak01 ~]# ls -ald  /sys/devices/system/node/node*
drwxr-xr-x 4 root root 0 Apr  5 14:10 /sys/devices/system/node/node0


[root@sh02-oak01 ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65314 MB
node 0 free: 42722 MB
node distances:
node   0 
  0:  10 


[root@sh02-oak01 ~]# lctl get_param cpu_partition_table
cpu_partition_table=0	: 0 1 4 5
1	: 2 3 6 7

This is happening with a default libcfs configuration (no libcfs module tuning):

[root@sh02-oak01 ~]# cat /sys/module/libcfs/parameters/cpu_npartitions 
0
[root@sh02-oak01 ~]# cat /sys/module/libcfs/parameters/cpu_pattern 
N
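
With these defaults, cpu_pattern="N" should mean "one CPT per NUMA node" (the Lustre manual says the 2.9+ default is one CPT per NUMA node; see the excerpt in the comments below), so on this box the expected partition table would be a single CPT covering all eight cores, roughly:

lctl get_param cpu_partition_table
# expected on a 1-NUMA-node, 8-core router (assumption, not the actual output seen here):
# cpu_partition_table=0 : 0 1 2 3 4 5 6 7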


 Comments   
Comment by Amir Shehata (Inactive) [ 07/Apr/21 ]

What's the output of

lnetctl net show -v 4

on the routers?

Also, can you share how you configure your routers? Do you configure them as services? How do you load the configuration? From modprobe.d or from /etc/lnet.conf?

Technically, LNet doesn't create the NUMA binding; it queries it from libcfs.

Comment by Stephane Thiell [ 07/Apr/21 ]

Hi Amir!

  • Attaching the output of lnetctl net show -v 4 on a router with 1 NUMA node and the default libcfs config, which shows 2 CPTs: see sh01-oak01-numa1cpt2.txt
  • Attaching the output of lnetctl net show -v 4 on a router with 1 NUMA node and the forced 1-CPT config, which shows 1 CPT: see sh01-oak01-numa1cpt1_forced.txt

We use lnet.conf and the lnet.service:

/etc/lnet.conf:

global:
    - health_sensitivity: 0
net:
    - net type: o2ib1
      local NI(s):
        - nid:
          interfaces:
            0: ib0
    - net type: o2ib5
      local NI(s):
        - nid:
          interfaces:
            0: ib1
routing:
    - enable: 1
[root@sh01-oak01 ~]# systemctl status lnet.service
● lnet.service - lnet management
   Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/lnet.service.d
           └─deps.conf, ibdev.conf
   Active: active (exited) since Fri 2021-04-02 13:58:26 PDT; 4 days ago
  Process: 79596 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, status=0/SUCCESS)
  Process: 79592 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, status=0/SUCCESS)
  Process: 79585 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
  Process: 79443 ExecStartPre=/bin/sh -c sleep 5 (code=exited, status=0/SUCCESS)
  Process: 78830 ExecStartPre=/usr/bin/systemctl restart openibd (code=exited, status=0/SUCCESS)
 Main PID: 79596 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/lnet.service

Apr 02 13:57:58 sh01-oak01.int systemd[1]: Starting lnet management...
Apr 02 13:58:26 sh01-oak01.int systemd[1]: Started lnet management.

lnet service overrides:

[root@sh01-oak01 ~]# cat /etc/systemd/system/lnet.service.d/deps.conf 
[Unit]
After=dkms.service
[root@sh01-oak01 ~]# cat /etc/systemd/system/lnet.service.d/ibdev.conf 
[Service]
ExecStartPre=/usr/bin/systemctl restart openibd
ExecStartPre=/bin/sh -c 'sleep 5'

We have the default /etc/modprobe.d/ko2iblnd.conf from lustre-client RPM, untouched.

Finally, to force the use of 1 CPT, we use the following configuration in /etc/modprobe.d/lnet.conf:

options libcfs cpu_pattern="0[0-7]"
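
An alternative sketch that should achieve the same thing on this box (not what we deployed; cpu_npartitions is the libcfs parameter shown earlier in the description) would be to pin the partition count directly:

options libcfs cpu_npartitions=1
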
Comment by Stephane Thiell [ 07/Apr/21 ]

BTW, the high refs numbers that I mentioned in this ticket are probably not related to this CPT issue at all; for that I opened LU-14589.

Comment by Amir Shehata (Inactive) [ 07/Apr/21 ]

When you have cpu_pattern=N (or undefined), can you show me the output of

cat /sys/kernel/debug/lnet/cpu_partition_distance
cat /sys/kernel/debug/lnet/cpu_partition_table

Comment by Stephane Thiell [ 07/Apr/21 ]
[root@sh01-oak02 ~]# cat /sys/module/libcfs/parameters/cpu_pattern 
N
[root@sh01-oak02 ~]# cat /sys/kernel/debug/lnet/cpu_partition_distance
0	: 0:10 1:10
1	: 0:10 1:10
[root@sh01-oak02 ~]# cat /sys/kernel/debug/lnet/cpu_partition_table
0	: 0 1 4 5
1	: 2 3 6 7

And just to make sure, this one has a single NUMA node:

[root@sh01-oak02 ~]# ls -ald  /sys/devices/system/node/node*
drwxr-xr-x 4 root root 0 Apr  5 15:54 /sys/devices/system/node/node0
[root@sh01-oak02 ~]# 
Comment by Amir Shehata (Inactive) [ 07/Apr/21 ]

Yes. So that's the issue (I guess you added that at the top, but I missed it). For some reason libcfs is creating two CPTs even though there is only 1 NUMA node. I couldn't reproduce this issue on my VM. We will need to look at the code around this area and see if there have been any recent changes.

Comment by Stephane Thiell [ 07/Apr/21 ]

OK. We looked at some old Splunk logs (LNet prints a message when loading), and it looks like this is not a new issue! Even with previous versions of Lustre (prior to 2.13), these routers were initializing 2 CPTs. From 2019:

kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 8, npartitions: 2

From the Lustre manual:

Introduced in Lustre 2.9
In Lustre 2.9 and later the default is to use one CPT per NUMA node. In earlier versions of Lustre, by default there was a single CPT if the online CPU core count was four or fewer, and additional CPTs would be created depending on the number of CPU cores, typically with 4-8 cores per CPT.

Maybe it is a remnant of pre-Lustre 2.9 behavior, e.g. if the HW NUMA node count is 1 and the core count is greater than 4, then one CPT is created for every four cores?
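
A back-of-the-envelope check of that hypothesis (purely illustrative shell, not the actual libcfs heuristic):

cores=8; cores_per_cpt=4
echo $(( cores / cores_per_cpt ))   # -> 2, which matches "npartitions: 2" in the 2019 log above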
