socklnd needs improved interface selection and configuration (LU-14064)

[LU-13621] LNET peer doesn't distribute well to different CPT Created: 02/Jun/20  Updated: 22/Mar/23  Resolved: 05/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.15.0

Type: Technical task Priority: Minor
Reporter: Shuichi Ihara Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None
Environment:

A server (1 x IB-EDR) and a client (2 x IB-HDR100), with MR enabled


Issue Links:
Related
is related to LU-14676 Better hash distribution to different... Resolved

 Description   

If a server has more than one CPT, peer connections should be distributed across different CPTs for load balancing.
The CPT is chosen by a hash function of the peer NID's address, but in some cases the hash returns the same value for both peers, and both end up on the same CPT.
This causes a critical performance problem: each CPT owns a fixed set of CPU cores, so if both peers are handled by a single CPT on the server, half of the CPUs are always busy while the other half are idle.

Here is an example.

server# cat /sys/kernel/debug/lnet/cpu_partition_table
0	: 0 1 2 3 4 5 6 7 8 9
1	: 10 11 12 13 14 15 16 17 18 19

server# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib10
      local NI(s):
        - nid: 10.0.11.224@o2ib10
          status: up
          interfaces:
              0: ib0

client # cat /sys/kernel/debug/lnet/cpu_partition_table
0	: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1	: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
2	: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
3	: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
4	: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
5	: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
6	: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
7	: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

client # lnetctl net show -v
    - net type: o2ib10
      local NI(s):
        - nid: 10.0.11.81@o2ib10
          status: up
          interfaces:
              0: ib0
 - snip -
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1,2,3]"

        - nid: 10.4.11.71@o2ib10
          status: up
          interfaces:
              0: ib4
- snip -
          lnd tunables:
          dev cpt: 4
          tcp bonding: 0
          CPT: "[4,5,6,7]"

On the client:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                          
 20263 root      20   0       0      0      0 R  98.3   0.0   0:29.85 kiblnd_sd_06_01                                  
 20264 root      20   0       0      0      0 R  98.3   0.0   0:29.85 kiblnd_sd_06_02                                  
 20265 root      20   0       0      0      0 R  98.3   0.0   0:29.85 kiblnd_sd_06_03                                  
 20262 root      20   0       0      0      0 R  98.0   0.0   0:29.84 kiblnd_sd_06_00                                  
 20247 root      20   0       0      0      0 R  89.1   0.0   1:19.11 kiblnd_sd_02_01                                  
 20248 root      20   0       0      0      0 R  88.7   0.0   1:19.20 kiblnd_sd_02_02                                  
 20249 root      20   0       0      0      0 R  88.7   0.0   1:19.15 kiblnd_sd_02_03                                  
 20246 root      20   0       0      0      0 R  87.7   0.0   1:19.24 kiblnd_sd_02_00    

Two CPTs are busy because of the two interfaces.

On the server:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                            
27651 root      20   0       0      0      0 R  86.0  0.0   2:22.27 kiblnd_sd_00_00                                    
27652 root      20   0       0      0      0 R  86.0  0.0   2:22.30 kiblnd_sd_00_01                                    
27653 root      20   0       0      0      0 R  86.0  0.0   2:22.27 kiblnd_sd_00_02                                    
27654 root      20   0       0      0      0 R  85.4  0.0   2:22.28 kiblnd_sd_00_03  

Only one CPT is busy even though two peers are connected to the server.
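
To make the imbalance easier to see, the top listing can be summed per CPT. The following is a minimal sketch, not part of any Lustre tooling: it assumes that the first number in a kiblnd_sd_XX_YY thread name is the CPT index (inferred from the listings in this ticket) and that top uses its default column layout. The function name cpu_by_cpt is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: scheduler threads are named kiblnd_sd_<cpt>_<thread>, with the
# first number being the CPT index (inferred from the listings in this ticket).
THREAD_RE = re.compile(r"kiblnd_sd_(\d+)_(\d+)")


def cpu_by_cpt(top_output):
    """Sum the %CPU column per CPT for kiblnd scheduler threads."""
    totals = defaultdict(float)
    for line in top_output.splitlines():
        fields = line.split()
        # Default top layout:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12:
            continue
        match = THREAD_RE.fullmatch(fields[11])
        if match:
            totals[int(match.group(1))] += float(fields[8])
    return dict(totals)


# The server-side listing from this ticket.
SERVER_TOP = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27651 root      20   0       0      0      0 R  86.0  0.0   2:22.27 kiblnd_sd_00_00
27652 root      20   0       0      0      0 R  86.0  0.0   2:22.30 kiblnd_sd_00_01
27653 root      20   0       0      0      0 R  86.0  0.0   2:22.27 kiblnd_sd_00_02
27654 root      20   0       0      0      0 R  85.4  0.0   2:22.28 kiblnd_sd_00_03
"""

for cpt, cpu in sorted(cpu_by_cpt(SERVER_TOP).items()):
    print(f"CPT {cpt}: {cpu:.1f}% CPU")
```

For the server listing this reports a single entry, CPT 0, which matches the observation that the second CPT is doing no scheduler work.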

Amir added a debug patch and confirmed that both peers went to the first CPT.

00000800:00000200:18.0:1591055201.186835:0:20660:0:(o2iblnd.c:795:kiblnd_create_conn()) peer_ni = 10.0.11.81@o2ib10, ni = 10.0.11.224@o2ib10, cpt = 0
00000800:00000200:18.0:1591055201.189343:0:20660:0:(o2iblnd.c:795:kiblnd_create_conn()) peer_ni = 10.4.11.81@o2ib10, ni = 10.0.11.224@o2ib10, cpt = 0

The problem is that the hash function returns the same value even though the client IP address differs, as shown below, so both peers eventually go to the same CPT on the server if the server has only a single interface.

1407418001001297 nid1 of client, 64-bit representation
1407418001263431 nid2 of client, 64-bit representation
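
The two 64-bit values above can be reproduced from the client NIDs, which helps when checking other addresses by hand. This is a sketch under assumptions: the layout (LND type in the top 16 bits, network number in the next 16, IPv4 address in the low 32) and the type number 5 for o2ib are not stated in this ticket; they are used here because they reproduce both numbers exactly. The helper name nid_to_u64 is hypothetical.

```python
import ipaddress

# Assumption: o2ib corresponds to LND type number 5. This is not stated in
# the ticket; it is used because it reproduces the two values listed above.
O2IB_LND_TYPE = 5


def nid_to_u64(ip, lnd_type, net_num):
    """Pack a NID as (lnd_type << 48) | (net_num << 32) | IPv4 address."""
    addr = int(ipaddress.IPv4Address(ip))
    return (lnd_type << 48) | (net_num << 32) | addr


nid1 = nid_to_u64("10.0.11.81", O2IB_LND_TYPE, 10)  # 10.0.11.81@o2ib10
nid2 = nid_to_u64("10.4.11.71", O2IB_LND_TYPE, 10)  # 10.4.11.71@o2ib10

print(nid1, hex(nid1))  # 1407418001001297 0x5000a0a000b51
print(nid2, hex(nid2))  # 1407418001263431 0x5000a0a040b47
print(hex(nid1 ^ nid2))  # bits in which the two NIDs differ
```

Under this layout the two NIDs differ only in a few bits of the address half (the second and fourth octets), so any hash that effectively ignores those bits will place both peers on the same CPT.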


 Comments   
Comment by Gerrit Updater [ 19/Jun/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39113
Subject: LU-13621 lnet: utility to print cpt number
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c0ead63ee1edec2680656d2e1593cf56a637c222

Comment by Amir Shehata (Inactive) [ 20/Jun/20 ]

I added a command to print the cpt number (or index of the cpt if the NI is bound to a set of CPTs). I think it would be useful to be able to pull this information out without having to dive into the kernel.

This utility shows that varying the first 2 octets of the IP address and the net name/number does not change the cpt value the NID is hashed to. This is something to be aware of on existing installations. Depending on the addressing scheme the site uses, we could end up with a situation where all the NIDs are hashed into the same CPT. This will create a problem with CPT locking and will create a problem at the LND, since we'll be picking a scheduler thread from the same CPT pool.

Comment by Andreas Dilger [ 21/Jun/21 ]

Shuichi, is it true that the CPT hash function is imbalanced even if there are multiple CPTs and multiple clients connecting (e.g. 32 clients connecting to a server with 4 CPTs)? There are always going to be cases where two clients will map to a single CPT (in this case 10.4.11.71 and 10.4.11.81) no matter which mapping function is used. However, it is a much bigger problem if, say, 32 clients with sequential NIDs are not uniformly distributed across the CPTs on the server, or within 1 of an even split.
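
The "within 1 of an even split" criterion can be checked mechanically for any candidate mapping. The sketch below is illustrative only: modulo_mapping is a hypothetical stand-in and is NOT the LNet hash, and the starting address 10.0.11.1 is an arbitrary example. To evaluate a real mapping, pass a different function as nid_to_cpt.

```python
import ipaddress


def cpt_counts(nids, ncpts, nid_to_cpt):
    """Count how many NIDs a mapping function assigns to each CPT."""
    counts = {cpt: 0 for cpt in range(ncpts)}
    for nid in nids:
        counts[nid_to_cpt(nid, ncpts)] += 1
    return counts


def within_one_of_even(counts):
    """True if the busiest CPT has at most one more peer than the least busy."""
    return max(counts.values()) - min(counts.values()) <= 1


def modulo_mapping(nid, ncpts):
    # Hypothetical stand-in for illustration; NOT the LNet hash function.
    return nid % ncpts


# 32 clients with sequential addresses, mapped onto 4 CPTs.
base = int(ipaddress.IPv4Address("10.0.11.1"))
nids = [base + i for i in range(32)]
counts = cpt_counts(nids, 4, modulo_mapping)
print(counts, within_one_of_even(counts))
```

With the stand-in mapping, 32 sequential addresses land 8 per CPT. The same harness would flag a mapping that sends every NID to one CPT, since the spread would then be 32 rather than at most 1.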

Comment by Gerrit Updater [ 31/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/39113/
Subject: LU-13621 lnet: utility to print cpt number
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df6f17ee97ac47c949c1963ff8d57fb2d4becd06

Comment by Peter Jones [ 05/May/22 ]

Seems to be landed for 2.15

Comment by Gerrit Updater [ 22/Mar/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50381
Subject: LU-13621 lnet: utility to print cpt number
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 82a00420f7d45e68dcf57ae7979d17c1a5085b66
