Parent: socklnd needs improved interface selection and configuration (LU-14064)

[LU-14676] Better hash distribution to different CPTs when an LNET router exists Created: 07/May/21  Updated: 22/Mar/23  Resolved: 31/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Technical task Priority: Minor
Reporter: Shuichi Ihara Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11454 Allow switching off CPT binding for P... Resolved
is related to LU-56 Finish SMP scalability work Resolved
is related to LU-13621 LNET peer doesn't distribute well to ... Resolved
is related to LU-7245 Improve SMP scaling support for LND d... Resolved
is related to LU-14293 Poor lnet/ksocklnd(?) performance on ... Resolved
is related to LU-12815 Create multiple TCP sockets per SockLND Resolved
Rank (Obsolete): 9223372036854775807

 Description   

When the server receives messages from clients, each message goes into a CPT (CPU partition) and is then passed to the upper layer. The CPT ID is decided by a hash based on the client's NID.

However, if there are LNET routers between the clients and the servers, the hash is based on the router's NID, not the clients' NIDs.
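
As a rough, self-contained sketch of this behaviour (this is not the real lnet_cpt_of_nid() code; the hash and the address handling are placeholders for illustration only), the CPT is derived from the NID of the immediate peer the server sees, so behind a router every message hashes to the router's NID and lands on a single CPT:

/*
 * Simplified model: CPT selection depends only on the NID of the
 * immediate peer. Directly-connected clients spread across CPTs;
 * behind a router everything maps to the router's CPT.
 */
#include <stdio.h>

static unsigned cpt_of_addr(unsigned addr, unsigned ncpt)
{
	return (addr * 2654435761u >> 16) % ncpt;	/* placeholder hash */
}

int main(void)
{
	unsigned ncpt = 20;
	unsigned router = 10u << 24 | 12u << 16 | 11u << 8 | 135;	/* 10.12.11.135 */
	unsigned i;

	for (i = 31; i <= 40; i++) {	/* direct case: client NIDs 10.0.0.[31-40] */
		unsigned client = 10u << 24 | 0u << 16 | 0u << 8 | i;
		printf("client 10.0.0.%u -> CPT %u\n", i, cpt_of_addr(client, ncpt));
	}
	/* routed case: every message is hashed on the router's NID */
	printf("via router 10.12.11.135 -> CPT %u, always\n", cpt_of_addr(router, ncpt));
	return 0;
}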
Let's assume the following configuration.
1 x server (20 CPU cores; ncpt=20, i.e. one CPU core per CPT)
1 x lnet router
10 x client

Without an LNET router
All client NIDs are active.

nid                      refs state  last   max   rtr   min    tx   min queue
0@lo                        1    NA    -1     0     0     0     0     0 0
10.0.0.34@o2ib12            7    NA    -1     8     8     8     2   -20 3616
10.0.11.226@o2ib12          1    NA    -1     8     8     8     8    -8 0  
10.0.0.39@o2ib12            5    NA    -1     8     8     8     4   -18 2560
10.0.0.31@o2ib12            5    NA    -1     8     8     8     4   -20 1984
10.0.0.35@o2ib12            5    NA    -1     8     8     8     4   -18 1752
10.0.0.36@o2ib12            6    NA    -1     8     8     8     3   -19 2544
10.0.0.32@o2ib12            1    NA    -1     8     8     8     8   -18 0   
10.0.0.33@o2ib12            6    NA    -1     8     8     8     3   -17 2312
10.0.11.225@o2ib12          1    NA    -1     8     8     8     8    -8 0   
10.0.0.40@o2ib12            6    NA    -1     8     8     8     3   -19 3056
10.0.0.38@o2ib12            1    NA    -1     8     8     8     8   -21 0   
10.0.11.227@o2ib12          1    NA    -1     8     8     8     8    -8 0  
10.0.0.37@o2ib12            6    NA    -1     8     8     8     3   -18 3248

Those messages are handled by LNET threads in different CPTs because the hash is taken over the client's NID.

top - 01:05:44 up 1 day, 16:09,  2 users,  load average: 39.70, 18.43, 7.46
Tasks: 1442 total,  75 running, 1367 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 50.6 sy,  0.0 ni, 48.3 id,  1.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 15369398+total, 13057227+free, 18096748 used,  5024956 buff/cache
KiB Swap: 11075580 total, 11075580 free,        0 used. 13475987+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                          
17303 root      20   0       0      0      0 S  17.8  0.0   0:03.16 kworker/u40:1                                    
17601 root      20   0       0      0      0 S   9.2  0.0   0:00.47 ll_ost19_004                                     
17642 root      20   0       0      0      0 S   7.9  0.0   0:00.33 ll_ost19_007                                     
16187 root      20   0       0      0      0 R   7.3  0.0   0:07.56 kiblnd_sd_03_00                                  
16192 root      20   0       0      0      0 R   7.3  0.0   0:07.57 kiblnd_sd_08_00                                  
16198 root      20   0       0      0      0 R   7.3  0.0   0:11.14 kiblnd_sd_14_00                                  
16201 root      20   0       0      0      0 R   7.3  0.0   0:07.70 kiblnd_sd_17_00                                  
16632 root      20   0       0      0      0 R   7.3  0.0   0:07.10 mdt03_000                                        
16634 root      20   0       0      0      0 R   7.3  0.0   0:06.95 mdt03_002                                        
16647 root      20   0       0      0      0 R   7.3  0.0   0:07.24 mdt08_000                                        
16649 root      20   0       0      0      0 R   7.3  0.0   0:07.06 mdt08_002   

With an LNET router

It's the same test from 10 clients, but the messages go through the LNET router.
There is only a single active NID on the server, which is the router's NID.

nid                      refs state  last   max   rtr   min    tx   min queue
0@lo                        1    NA    -1     0     0     0     0     0 0
192.168.11.35@o2ib10        2    NA    -1     0     0     0     0     0 0
10.0.11.226@o2ib12          1    NA    -1     8     8     8     8    -8 0   
192.168.11.36@o2ib10        2    NA    -1     0     0     0     0     0 0
10.12.11.135@o2ib12        13    up    -1     8     8     8     3   -94 3248
192.168.11.40@o2ib10        2    NA    -1     0     0     0     0     0 0
192.168.11.32@o2ib10        2    NA    -1     0     0     0     0     0 0
192.168.11.33@o2ib10        2    NA    -1     0     0     0     0     0 0
192.168.11.37@o2ib10        2    NA    -1     0     0     0     0     0 0
192.168.11.34@o2ib10        2    NA    -1     0     0     0     0     0 0
10.0.11.225@o2ib12          1    NA    -1     8     8     8     8    -7 0  
192.168.11.39@o2ib10        2    NA    -1     0     0     0     0     0 0
10.0.11.227@o2ib12          1    NA    -1     8     8     8     8    -8 0  
192.168.11.38@o2ib10        2    NA    -1     0     0     0     0     0 0
192.168.11.31@o2ib10        2    NA    -1     0     0     0     0     0 0

All of that traffic goes into CPT=2, and the other 19 CPTs (19 CPU cores) are idle.

Tasks: 1067 total,   3 running, 1064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  5.0 sy,  0.0 ni, 94.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 15369398+total, 14010728+free, 12679048 used,   907648 buff/cache
KiB Swap: 11075580 total, 11075580 free,        0 used. 14017849+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                          
13044 root      20   0       0      0      0 R   7.0  0.0   0:01.81 kiblnd_sd_02_00                                  
14146 root      20   0       0      0      0 S   6.6  0.0   0:00.93 mdt02_005                                        
13489 root      20   0       0      0      0 R   6.3  0.0   0:01.37 mdt02_002   

It would be nice to have better hashing that distributes messages to different CPTs on the server, to improve metadata performance and IOPS when an LNET router is in the path.



 Comments   
Comment by Andreas Dilger [ 07/May/21 ]

It would be good to understand what the best way to distribute RPC handling across CPTs is. Should these be evenly distributed to balance load, or is there a benefit to make some affinity based on the OST/MDT target and/or object FID? Does the initial CPT selection also affect later RPC processing?

Comment by Shuichi Ihara [ 07/May/21 ]

As far as I have observed, if threads are CPT-aware (e.g. mdtXX_YYY, mdt_ioXX_YYY, ll_ostXX_YYY, ptlrpcd_XX_YY, etc., where XX is the CPT ID), those upper-layer threads rely on the CPT that was first selected by LNET.
Here are "top" results while mdtest is running on 10 clients through an LNET router; all threads are associated with CPT=2 and the CPU core that belongs to it.
Yes, this is the real problem in the end. There is no chance to re-distribute across other CPTs after LNET.

# cat /sys/kernel/debug/lnet/cpu_partition_table 
0	: 0
1	: 1
2	: 2
3	: 3
4	: 4
5	: 5
6	: 6
7	: 7
8	: 8
9	: 9
10	: 10
11	: 11
12	: 12
13	: 13
14	: 14
15	: 15
16	: 16
17	: 17
18	: 18
19	: 19
Tasks: 1245 total,   3 running, 1242 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us, 99.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu7  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 15369398+total, 13857009+free, 12370992 used,  2752888 buff/cache
KiB Swap: 11075580 total, 11075580 free,        0 used. 14048668+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                        
19443 root      20   0       0      0      0 S   7.6  0.0   0:10.23 ptlrpc_hr02_000                                
19476 root      20   0       0      0      0 R   5.6  0.0   0:15.99 kiblnd_sd_02_00                                
20577 root      20   0       0      0      0 R   5.6  0.0   0:07.73 mdt02_003                                      
20586 root      20   0       0      0      0 S   5.0  0.0   0:03.68 mdt02_009                                      
20826 root      20   0       0      0      0 S   5.0  0.0   0:02.87 mdt02_058                                      
20813 root      20   0       0      0      0 S   4.0  0.0   0:00.88 mdt02_045                                      
20823 root      20   0       0      0      0 S   4.0  0.0   0:02.90 mdt02_055                                      
20663 root      20   0       0      0      0 S   3.6  0.0   0:06.75 mdt02_017                                      
20667 root      20   0       0      0      0 S   3.6  0.0   0:04.53 mdt02_021                                      
20816 root      20   0       0      0      0 S   3.6  0.0   0:02.77 mdt02_048                                      
20591 root      20   0       0      0      0 S   3.3  0.0   0:00.66 mdt_rdpg02_002                                 
20670 root      20   0       0      0      0 S   3.3  0.0   0:05.15 mdt02_024                                      
20786 root      20   0       0      0      0 S   3.3  0.0   0:00.31 mdt_rdpg02_004                                 
20795 root      20   0       0      0      0 S   3.3  0.0   0:00.29 mdt_rdpg02_013                                 
20814 root      20   0       0      0      0 S   3.3  0.0   0:03.30 mdt02_046                                      
20792 root      20   0       0      0      0 S   3.0  0.0   0:00.28 mdt_rdpg02_010                                 
20250 root      20   0       0      0      0 S   2.6  0.0   0:00.59 ll_ost02_002                                   
20660 root      20   0       0      0      0 S   2.6  0.0   0:02.86 mdt02_014                                      
20822 root      20   0       0      0      0 S   2.6  0.0   0:02.46 mdt02_054                                      
19906 root      20   0       0      0      0 S   2.0  0.0   0:07.22 jbd2/dm-4-8                                    
20585 root      20   0       0      0      0 S   2.0  0.0   0:06.23 mdt02_008                                      
20611 root      20   0       0      0      0 S   2.0  0.0   0:00.06 ll_ost_io02_025                                
20589 root      20   0       0      0      0 S   1.7  0.0   0:00.10 ll_ost_io02_005                                
20599 root      20   0       0      0      0 S   1.7  0.0   0:00.06 ll_ost_io02_013                                
20604 root      20   0       0      0      0 S   1.7  0.0   0:00.05 ll_ost_io02_018                                
20797 root      20   0       0      0      0 S   1.7  0.0   0:00.07 mdt_rdpg02_015                                 
20587 root      20   0       0      0      0 S   1.3  0.0   0:00.07 ll_ost_io02_003                                
20593 root      20   0       0      0      0 S   1.3  0.0   0:00.09 ll_ost_io02_007                                
20600 root      20   0       0      0      0 S   1.3  0.0   0:00.06 ll_ost_io02_014                                
20602 root      20   0       0      0      0 S   1.3  0.0   0:00.05 ll_ost_io02_016                                
20606 root      20   0       0      0      0 S   1.3  0.0   0:00.04 ll_ost_io02_020                                
20610 root      20   0       0      0      0 S   1.3  0.0   0:00.04 ll_ost_io02_024 
Comment by Andreas Dilger [ 08/May/21 ]

If the same CPT is used for the entire lifetime of the RPC, this may be a factor why we've had performance limits for single-client IO in the past? It probably makes sense to have CPT affinity based not (only?) on the source NID, but rather the target MDT/OST and FID (+object offset/hash?) so that load is distributed across CPTs even if only a single client is doing IO.

One option would be for Lustre to register a callback with LNet to map the RPC to decide CPT affinity, so that it could be mapped to a specific core for a particular OST and object and offset so that there is page/cache locality for each object. We already have request_in_callback() that is decoding the request, but I think that is only called after the CPT is selected.

Comment by Shuichi Ihara [ 10/May/21 ]

If the client sends RPCs to multiple servers, I don't think the client performance limit comes from here.
However, if the client sends RPCs to a single server and there is only one CPU core per CPT on the server, there is a performance limit because only a single CPU core participates; but we can add more CPU cores into each CPT.

LU-13621 is a related ticket. Patch https://review.whamcloud.com/39113 adds a helper tool to print the CPT associated with a NID.

Let's assume 20 CPTs and 40 clients whose IP addresses are 192.[1-40].1.1.

# for i in `seq 1 40`; do lnetctl cpt-of-nid --nid 192.$i.1.1@tcp9 --ncpt 20; done | grep value | sort | uniq -c
      8     value: 0
      8     value: 12
      8     value: 16
      8     value: 4
      8     value: 8

The server distributes these NIDs to only 5 CPTs, which means it uses only #CPU x 5/#CPT in this case.

Here is a different case. It's still 20 CPTs and 40 clients, but the IP address range is 192.1.1.[1-40].

[root@sky06 ~]# for i in `seq 1 40`; do lnetctl cpt-of-nid --nid 192.1.1.$i@tcp9 --ncpt 20; done | grep value | sort | uniq -c
      2     value: 0
      2     value: 1
      3     value: 10
      2     value: 12
      3     value: 13
      2     value: 14
      3     value: 15
      1     value: 16
      1     value: 17
      2     value: 18
      2     value: 19
      3     value: 2
      1     value: 3
      2     value: 4
      2     value: 5
      2     value: 6
      1     value: 7
      2     value: 8
      4     value: 9

In the best case we would expect 2 clients per CPT, but it shows a very unbalanced CPT distribution (CPT=11 is not used at all).

Since there is no chance of CPT re-distribution after LNET, there is a limitation here; we need to consider which NID goes into which CPT and how to distribute them well across all CPTs.

Comment by Andreas Dilger [ 11/May/21 ]

Definitely it seems like the existing hash function used does not result in a good distribution of NIDs across CPTs. Since most clusters will have (fairly) linear NID assignments to clients, I'm wondering if it makes sense to have something simple like (sum of NID bytes % ncpts) so that we get a very uniform distribution across CPTs? That would get good results up to 256 CPTs on the server, so a little while yet (I'm not sure if there is some other CPT limit).
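
A minimal sketch of this idea, assuming IPv4-style addresses treated simply as four bytes (cpt_by_byte_sum() is an illustrative name, not an existing LNet function): with 40 sequential client addresses 192.1.1.[1-40] and 20 CPTs, every CPT gets exactly two NIDs.

#include <stdio.h>

static unsigned cpt_by_byte_sum(const unsigned char ip[4], unsigned ncpt)
{
	return (ip[0] + ip[1] + ip[2] + ip[3]) % ncpt;	/* sum of NID bytes % ncpt */
}

int main(void)
{
	unsigned counts[20] = { 0 };
	unsigned ncpt = 20, i;

	for (i = 1; i <= 40; i++) {
		unsigned char ip[4] = { 192, 1, 1, (unsigned char)i };
		counts[cpt_by_byte_sum(ip, ncpt)]++;
	}
	for (i = 0; i < ncpt; i++)
		printf("CPT %2u: %u NIDs\n", i, counts[i]);	/* exactly 2 per CPT */
	return 0;
}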

Separately, there is a question whether a 1:1 mapping of NID->CPT makes sense? Limiting a single-rail client to a single CPT on the server seems like it can cause bottlenecks for clients (we've previously seen reports with e.g. 3GB/s limit for a single client, no matter what we do to optimize it). The client is working hard to submit IO in parallel from multiple threads/cores (e.g. ptlrpcd and osc.*.max_rpcs_in_flight) so it doesn't make sense to funnel this back into a single CPT on the server again.

Amir, Shuichi, what is the typical number of CPTs and cores per CPT on AI400 and ES18XKE systems for OSTs and MDTs? Maybe it is not such a big deal if there are only a typically small number of CPTs configured, or is the "one CPT per core" described here the normal server configuration? Also, it seems uncommon that there would be a single LNet router for a system? The clusters I've seen typically have tens of routers to get good bandwidth.

One option would be to stick with the current initial 1:1 NID->CPT mapping (possibly with a better mapping function), but then implement a "work stealing" mechanism at the ptlrpc level, so that in a heavily loaded case there is good CPU affinity for processing incoming requests, but if clients < CPTs (also router case) then the work is still distributed across cores.

Comment by Shuichi Ihara [ 12/May/21 ]

This is one reason (not in all cases) we sometimes saw a long tail in IOR or mdtest with specific client threads.
For instance, in io500's 10-node challenge, if those 10 clients are not distributed across all CPTs in a well-balanced way on the servers, some mdtest/ior threads may take longer in the end.

Today's CPT setting is a bit awkward. In order to reduce contention and context switching between CPU cores, one core per CPT was the best configuration as far as I have tested IOPS and metadata performance on an 8-CPU-core server (provided all client NIDs hash well across all CPTs).
However, this is best for aggregate performance. If the client has only one network interface and we look at a single client's IOPS or metadata performance, a single CPT (and single CPU core) is definitely not enough.
If the client has many NIDs, it can bump up the performance regardless of whether they are physical or logical, but in many cases there is a single network interface per client.
I thought conns_per_peer=X in ko2iblnd would help, but it didn't because the number of NIDs doesn't change.

So, one CPU core per CPT is good today to maximize CPU utilization with less contention, but we also need to figure out how many CPU cores (CPTs) can participate for a single client if one CPU per CPT is configured on the server.

Ideally, more dynamic load balancing (per RPC, per LNET message, or per object?) would be preferable to a 1:1 mapping of NID<->CPT...

Comment by Amir Shehata (Inactive) [ 13/May/21 ]

just to chime in with some of the details. We have to look at the send and receive paths regarding CPT distribution.

Send Path

On a call to LNetGet()/LNetPut() we just lock the current cpt, so that's whatever CPT the current core we're executing on belongs to. I believe this is determined by the CPT the ptlrpc thread making the LNetGet()/LNetPut() calls is bound to. While under that lock we perform local NI and remote NID selection for multi-rail. For local interface selection that's subject to multiple criteria: NUMA, Credit availability, round robin, etc. For the remote NID selection that criteria is based on credit availability and round robin.

If the final destination is remote then we select a gateway NID using the same remote selection criteria.

Once the selection is completed, we calculate a new CPT based on the NID of the next hop destination selected. That could be the final destination NID or a gateway NID. This could result in us switching the CPT lock. The new CPT is then used by the LND to queue the work to the correct LND scheduler thread. LNDs spawn multiple thread pools per CPT partition.

If we're always sending to the same final destination NID, or we have a single-interface router as the next hop, then all work to that destination will be funnelled to the same CPT. Resources at the LND level are partitioned by CPT, which means we could run into some resource restrictions in that case.

Receive Path

When a connection is established, we determine the CPT of the connection based on the NID of the immediate next hop we're receiving the message from, either the original source or a router. All subsequent messages received on that connection are processed by a scheduler bound to that CPT (there can be a pool of them).

In the receive path we're subject to the same issue described in this ticket.

Consideration

As a short-term fix, we can always hash the NID of the final destination or the original source of the message. I don't believe there should be any drawbacks to doing this, and it's a relatively easy change to make. It will avoid the router-in-the-middle problem we're seeing, but it doesn't completely resolve the problem.

As far as I understand, the reason for hashing the NID is to distribute the workload across different partitions in order to avoid using one lock. The hashing function is strictly an LNet thing; it's implemented in lnet_cpt_of_nid(). I have a fundamental question regarding the decision to go with NID hashing for work distribution. Why was it implemented that way? Is there an advantage to always having messages from/to some peer be processed by the same set of threads locally? Why not change the implementation to a simple round-robin selection, such that work is evenly distributed across all CPTs without regard to the NID?

Is there an advantage to keeping all messages of a specific RPC locked to the same CPT partition? I know that for NUMA we want to lock the local interface used for the entire RPC duration to make sure we use the correct interface for the bulk RDMA transfer, but as far as I understand we don't have the same restriction on a per-LND-message basis.
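
To make the comparison concrete, here is a hypothetical sketch (the names and the hash constant are illustrative, not existing LNet symbols) contrasting the two strategies: per-NID hashing is sticky, so behind a router every message lands on the same CPT, while a plain round-robin counter spreads messages evenly but gives up any peer affinity.

#include <stdatomic.h>
#include <stdio.h>

static unsigned cpt_by_nid_hash(unsigned long nid, unsigned ncpt)
{
	return (nid * 2654435761ul) % ncpt;	/* placeholder hash: same peer, same CPT */
}

static unsigned cpt_round_robin(unsigned ncpt)
{
	static atomic_uint next;

	return atomic_fetch_add(&next, 1) % ncpt;	/* even spread, no peer affinity */
}

int main(void)
{
	unsigned long router_nid = 0x0a0c0b87;	/* 10.12.11.135 packed as 32 bits */
	int i;

	for (i = 0; i < 5; i++)
		printf("msg %d: nid-hash -> CPT %u, round-robin -> CPT %u\n",
		       i, cpt_by_nid_hash(router_nid, 20), cpt_round_robin(20));
	return 0;
}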

 

Comment by Andreas Dilger [ 13/May/21 ]

On the one hand, keeping RPCs from a single client on one CPT improves the CPU cache/object/locking locality if different clients are each working on different objects (typical file-per-process workload). However, as Shuichi pointed out, the hash function is poor for typical use, and using a simple round-robin by NID (sum NID bytes % ncpt) would likely be more uniform for common configurations.

If the CPT was based on the object FID then the CPU cache locality would likely be the same or better for the many client/many object case, but would be significantly improved for the few client/many object case. To avoid congestion for the many client/single object case, the FID-based CPT mapping should also take file offset into account (eg. ((sum FID bytes+(offset/1GB % 256)) % ncpt), so that RPCs (and NUMA page cache allocations) are distributed across cores. However, there are potentially considerations at the LNet level for keeping the peer state on a single CPT (I see lnet_cpt_of_nid() is used when locking) so it may not be possible/practical to change the CPT of a NID arbitrarily.
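
A minimal sketch of that FID+offset mapping, using an illustrative stand-in for the FID structure (not the actual struct lu_fid or any existing Lustre helper): RPCs for the same object spread across CPTs as the I/O offset advances in 1GiB steps.

#include <stdint.h>
#include <stdio.h>

struct fid_example {	/* illustrative stand-in for the real FID structure */
	uint64_t sequence;
	uint32_t objid;
	uint32_t version;
};

static unsigned cpt_of_fid_offset(const struct fid_example *fid,
				  uint64_t offset, unsigned ncpt)
{
	const unsigned char *b = (const unsigned char *)fid;
	unsigned sum = 0;
	size_t i;

	for (i = 0; i < sizeof(*fid); i++)
		sum += b[i];			/* sum of FID bytes */
	sum += (offset >> 30) % 256;		/* plus (offset / 1GiB) % 256 */
	return sum % ncpt;
}

int main(void)
{
	struct fid_example fid = { 0x200000401ULL, 0x17, 0 };
	uint64_t off;

	for (off = 0; off < (5ULL << 30); off += (1ULL << 30))
		printf("offset %lluGiB -> CPT %u\n",
		       (unsigned long long)(off >> 30),
		       cpt_of_fid_offset(&fid, off, 20));
	return 0;
}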

As for the exact details behind this, spelunking in Git history ("git log -S lnet_cpt_of_nid") shows LU-56 was the origin of this code, so hopefully that ticket and the patch comments can explain the motivation.

Comment by Andreas Dilger [ 13/May/21 ]

I think the first thing to test here is just a simple change to lnet_cpt_of_nid() to use the "(sum of nid bytes % ncpt)" method instead of the hash function. This will make CPT distribution flat for consecutive NIDs, and is low risk. It will not help the router/single client case. Anything beyond that needs an understanding of LNet locking to determine if eg. "source NID" could be used instead of the router NID, or there are specifics of peer state that depend on the router NID.

Comment by Amir Shehata (Inactive) [ 13/May/21 ]

We can test changing lnet_cpt_of_nid(). However, I think it's worth trying to use the source/final destination NID as well. Implementation-wise, as long as we use a consistent NID for hashing it will work. The locking code/strategy in LNet will not change.

We can compare the two approaches and assess which one will give the most benefit.

Comment by Shuichi Ihara [ 13/Jun/21 ]

I see. We have an option to disable strict CPT binding with <service>_cpu_bind=0, which was introduced by LU-11454.
At least the upper-layer services could keep more CPUs busy even when there is a limited set of destination NIDs.
Now the question is disabling CPT binding vs. binding to well-distributed CPTs by NID. We need some benchmarks to compare.

Comment by Serguei Smirnov [ 16/Jun/21 ]

This has exposed issues in the LND design. 

Currently in the LND, the CPT is allocated for the connection to the router and doesn't change while the connection is up, even though the connection is handling txs to multiple remote peers. It is proposed to change the mechanism of scheduler thread selection such that the final destination NID can be used to select a CPT, and a scheduler from the pool for that CPT, on a per-tx basis. In other words, cpt_of_nid determines the CPT based on the NID. Today the scheduler is assigned to the connection on creation and doesn't change for the duration of the connection's lifetime; in the case of a router, the same connection to the router is used to send to multiple final destinations, which causes the same scheduler to be used regardless of the final destination. We need to change the scheduler selection for each final destination, and the best way to do that is to assign the scheduler at the time of sending the message. Both socklnd and o2iblnd will be redesigned as a part of this effort to achieve a more even distribution of load across CPTs. The level of design and development effort is going to be at the level of a feature request.
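
A purely illustrative sketch of that per-tx selection (all names are hypothetical; this is not the actual o2iblnd/socklnd code): the CPT, and therefore the scheduler pool, is chosen from the final destination NID when the message is sent, rather than being fixed on the router connection when it is created.

#include <stdio.h>

struct tx_example {			/* hypothetical per-tx descriptor */
	unsigned long final_dst_nid;	/* original target, not the next hop */
	unsigned      cpt;		/* selected per tx, not per connection */
};

static unsigned cpt_of_nid_example(unsigned long nid, unsigned ncpt)
{
	return (nid * 2654435761ul) % ncpt;	/* placeholder hash */
}

static void assign_scheduler(struct tx_example *tx, unsigned ncpt)
{
	/* per-tx: different final destinations behind the same router
	 * connection now land on different CPTs/scheduler pools */
	tx->cpt = cpt_of_nid_example(tx->final_dst_nid, ncpt);
}

int main(void)
{
	unsigned long dsts[] = { 0x0a00001f, 0x0a000020, 0x0a000021 };	/* 10.0.0.[31-33] */
	struct tx_example tx;
	unsigned i;

	for (i = 0; i < 3; i++) {
		tx.final_dst_nid = dsts[i];
		assign_scheduler(&tx, 20);
		printf("tx to 10.0.0.%lu -> CPT %u\n", dsts[i] & 0xff, tx.cpt);
	}
	return 0;
}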

Comment by Andreas Dilger [ 09/Dec/21 ]

Assuming that the last NID in the above list should be 172.16.171.67 (or 172.16.172.67) instead of 172.16.170.67 twice, the simple "sum of NID bytes mod 20" gives a uniform distribution across the 20 CPTs:

for N in 172.16.167.67 172.16.173.67 172.16.174.67 172.16.175.67 172.16.168.67 \
          172.16.176.67 172.16.177.67 172.16.169.67 172.16.170.67 172.16.171.67; do
        echo $(( (${N//./+}) % 20 ));
done  | sort -n
2
3
4
5
6
8
9
10
11
12

It wouldn't quite be perfect for 10 CPTs, with one duplicate CPT and one unused CPT, but still better than the current algorithm.

Comment by Patrick Farrell [ 09/Dec/21 ]

For what it's worth, I'm sure we can do better, but for example, a perfectly random "choose 10 from 20 without removal" generates 2.1 collisions on average.  The result of our hash algorithm in the example above - which I agree is badly written and should probably be replaced - is 3.  That's hardly far out - if 2 is most common, 3 would happen quite often.

The problem is that what we want is a uniform distribution where the client NIDs are spread evenly across the CPTs.  We use pseudo-random as a proxy for uniform because it's quick to calculate and for large node counts it works fine in practice.  The problem becomes acute when node counts and CPT counts are of similar magnitudes.

Andreas,

Yours generates a uniform distribution because of the modulus and the relatively small range of values, I think.  Given random inputs, the modulus would generate a random distribution just mapped into the CPT space.  Given ordered inputs, it would generate a uniform output.  This input is basically ordered - they all differ only in the third octet; the other three octets are the same.

(FWIW, you accidentally wrote mod 256 instead of mod 20, but you used 20 in your script.)

I mean perhaps (NID_value) % (num_CPTs) is a reasonable thing.  The issue is if someone had NIDs that differed by a value of num_cpts, they'd all pile up on the same CPT.  (Like, literally every single one.)  That's presumably unacceptable.
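
A quick arithmetic demonstration of that pile-up, assuming 20 CPTs and client addresses whose last octet differs by exactly 20: under a plain modulo (of the raw value or of a byte sum, since only one byte changes) every one of them maps to the same CPT.

#include <stdio.h>

int main(void)
{
	unsigned ncpt = 20;
	unsigned last;

	for (last = 1; last <= 81; last += ncpt) {
		unsigned sum = 192 + 1 + 1 + last;	/* bytes of 192.1.1.<last> */
		printf("192.1.1.%-3u -> CPT %u\n", last, sum % ncpt);	/* always the same CPT */
	}
	return 0;
}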

I have a thought on this I'll put in a separate comment.

Comment by Patrick Farrell [ 09/Dec/21 ]

Here's a simple way to guarantee a uniform distribution that should also be perfectly scalable (ie, no lookups or anything), but it requires a protocol change, I think.

If each client connection (at the LNet level) is assigned a unique number at the time of connection*, we can just do % that number to determine CPT.  It would be simple, fast, reliable, and would generate a uniform distribution regardless of what the client NIDs looked like.  Note the key here - uniform.  If we have 10 clients and 20 CPTs, we are guaranteed 1 client per CPT on 10 distinct CPTs.

Same if we have 10,000 clients and 20 CPTs - the distribution is perfectly uniform, and mapping client to CPT requires a single modulus operation on the number presented by the client.

*(since it's part of the original connection, it's OK that this would require the overhead of managing a global (or per-NIC or whatever) counter - it would just be at connection time)
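
A minimal sketch of the counter idea under those assumptions (a global counter touched only at connection time; the protocol change needed to carry the number, and any per-NIC variant, are not shown; the names are illustrative): the resulting CPT assignment is uniform by construction.

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint conn_counter;	/* incremented once per new connection */

static unsigned cpt_of_new_connection(unsigned ncpt)
{
	return atomic_fetch_add(&conn_counter, 1) % ncpt;
}

int main(void)
{
	unsigned i;

	/* 10 clients, 20 CPTs: one connection per CPT on 10 distinct CPTs */
	for (i = 0; i < 10; i++)
		printf("client %2u -> CPT %u\n", i, cpt_of_new_connection(20));
	return 0;
}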

Comment by Andreas Dilger [ 10/Dec/21 ]

I definitely understand that my "simple hash" generates a uniform distribution because of the way the input NIDs change by uniform increments, but my expectation is that this is actually the most common case in an HPC cluster - clients either having sequential NIDs on the same subnet (single interface case), or in this case where multiple NIDs on the same client have the same low octet ("node number") but the second octet is on different sequential subnets.

Playing the devil's advocate for a minute - the assignment of a sequential integer to the connection is perfectly uniform as long as no clients disconnect. After that, depending on the client connection pattern the NIDs may become non-uniform, but it should only be an issue in pathological cases. Even with disconnect/reconnect, each group of new connections should be uniformly distributed across the CPTs, so the only way this could become problematic is if there is some systematic reason clients are disconnecting from one CPT and connecting to a new one. That would typically be attributed to a problem with the CPT (i.e. it is overloaded or something), so that isn't a bad thing if it happens.

The other drawback of the counter is that the CPT selection is non-deterministic (it depends on the order of peer connections and previous connection history), so in some cases e.g. 8 NIDs would use CPTs 0-7 and then, if those nodes reboot and reconnect, they will use CPTs 8-15, which may have performance pathologies (NUMA socket affinity vs. NVMe PCI bus, etc.) that could be complex to debug, while the simple hash is at least deterministic for each client.

I'm not against your approach, just bringing up some things to ponder. I suspect that the difference between the two methods is only of interest when client connections <= CPTs, and will be largely the same in the majority of cases.

Comment by Gerrit Updater [ 20/Jan/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46233
Subject: LU-14676 lnet: improve hash distribution across CPTs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 30961583a93f4f3261d91991ac1abfcd5cabed47

Comment by Gerrit Updater [ 31/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46233/
Subject: LU-14676 lnet: improve hash distribution across CPTs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9b6e27755507b9bb47a1d7b4aede6302a876a14d

Comment by Peter Jones [ 31/Jan/22 ]

Landed for 2.15

Comment by Gerrit Updater [ 22/Mar/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50382
Subject: LU-14676 lnet: improve hash distribution across CPTs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 569cdcefca46d48007e2639921f420ca8d71a3df
