[LU-14875] LNet multirail and interface binding Created: 21/Jul/21  Updated: 23/Sep/22  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Lustre Bull
Assignee: Cyril Bordage
Resolution: Fixed
Votes: 0
Labels: None
Environment:

RedHat 8.3
kernel 4.18.0-240.10.1.el8_3.x86_64
lustre 2.12.6


Severity: 3

 Description   

On a machine with 4 IB interfaces, I would like to create an LNet multi-rail configuration that takes into account the NUMA location of each interface, in order to get the highest LNet performance.

I have tried several LNet configurations, but none of them allows each interface to be bound to its local NUMA node.

 

Here is the NUMA description of the machine. The IB devices ib0, ib1, ib2, and ib3 are located on NUMA nodes 1, 3, 5, and 7, respectively.

 

# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 48 49 50 51 52 53
node 0 size: 63832 MB
node 0 free: 60103 MB
node 1 cpus: 6 7 8 9 10 11 54 55 56 57 58 59
node 1 size: 64268 MB
node 1 free: 39220 MB
node 2 cpus: 12 13 14 15 16 17 60 61 62 63 64 65
node 2 size: 64317 MB
node 2 free: 61323 MB
node 3 cpus: 18 19 20 21 22 23 66 67 68 69 70 71
node 3 size: 64281 MB
node 3 free: 61558 MB
node 4 cpus: 24 25 26 27 28 29 72 73 74 75 76 77
node 4 size: 64269 MB
node 4 free: 60741 MB
node 5 cpus: 30 31 32 33 34 35 78 79 80 81 82 83
node 5 size: 64305 MB
node 5 free: 62450 MB
node 6 cpus: 36 37 38 39 40 41 84 85 86 87 88 89
node 6 size: 64275 MB
node 6 free: 63133 MB
node 7 cpus: 42 43 44 45 46 47 90 91 92 93 94 95
node 7 size: 64337 MB
node 7 free: 62429 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10

# grep . /sys/class/net/ib*/device/numa_node
/sys/class/net/ib0/device/numa_node:1
/sys/class/net/ib1/device/numa_node:3
/sys/class/net/ib2/device/numa_node:5
/sys/class/net/ib3/device/numa_node:7
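
To see which CPUs sit next to each interface, the sysfs lookup above can be extended to print the CPU list of each interface's NUMA node. The sketch below is illustrative only: the function name `show_ib_numa` and its two optional directory arguments are hypothetical (they exist so the function can be pointed at a fake tree), and it assumes the standard Linux sysfs layout.

```shell
# Print each IB interface with its NUMA node and that node's CPU list.
# Both arguments are hypothetical overrides; by default the real sysfs
# locations are used.
show_ib_numa() {
    net_dir="${1:-/sys/class/net}"
    node_dir="${2:-/sys/devices/system/node}"
    for dev in "$net_dir"/ib*; do
        # Skip entries without NUMA information (also covers an unmatched glob).
        [ -r "$dev/device/numa_node" ] || continue
        node=$(cat "$dev/device/numa_node")
        cpus=unknown
        if [ -r "$node_dir/node$node/cpulist" ]; then
            cpus=$(cat "$node_dir/node$node/cpulist")
        fi
        echo "${dev##*/} node=$node cpus=$cpus"
    done
}
```

Calling `show_ib_numa` with no arguments reads the live system; the output can then be compared against the `cpu_partition_table` reported by `lctl`.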

 

By default, the libcfs module configures 8 CPTs (CPU partitions), one per NUMA node:

 

# modprobe -v libcfs
insmod /lib/modules/4.18.0-240.10.1.el8_3.x86_64/weak-updates/lustre-client/net/libcfs.ko

# lctl get_param cpu_partition_table
cpu_partition_table=
0       : 0 1 2 3 4 5 48 49 50 51 52 53
1       : 6 7 8 9 10 11 54 55 56 57 58 59
2       : 12 13 14 15 16 17 60 61 62 63 64 65
3       : 18 19 20 21 22 23 66 67 68 69 70 71
4       : 24 25 26 27 28 29 72 73 74 75 76 77
5       : 30 31 32 33 34 35 78 79 80 81 82 83
6       : 36 37 38 39 40 41 84 85 86 87 88 89
7       : 42 43 44 45 46 47 90 91 92 93 94 95

 

With configuration 1, no LNet binding is specified, and we observe that each interface is bound to every CPT:

 

# modprobe -v lnet
insmod /lib/modules/4.18.0-240.10.1.el8_3.x86_64/weak-updates/lustre-client/net/lnet.ko networks=o2ib(ib0,ib1,ib2,ib3)

# lctl net up
LNET configured

# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 14.128.0.45@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 14.128.0.46@o2ib
          status: up
          interfaces:
              0: ib1
        - nid: 14.128.0.47@o2ib
          status: up
          interfaces:
              0: ib2
        - nid: 14.128.0.48@o2ib
          status: up
          interfaces:
              0: ib3

# lnetctl net show --verbose | grep -E 'ib|CPT|dev'
          dev cpt: 0
          CPT: "[0,1,2,3,4,5,6,7]"
    - net type: o2ib
        - nid: 14.128.0.45@o2ib
              0: ib0
          dev cpt: 1
          CPT: "[0,1,2,3,4,5,6,7]"
        - nid: 14.128.0.46@o2ib
              0: ib1
          dev cpt: 3
          CPT: "[0,1,2,3,4,5,6,7]"
        - nid: 14.128.0.47@o2ib
              0: ib2
          dev cpt: 5
          CPT: "[0,1,2,3,4,5,6,7]"
        - nid: 14.128.0.48@o2ib
              0: ib3
          dev cpt: 7
          CPT: "[0,1,2,3,4,5,6,7]"

 

With configuration 2, the LNet binding is specified as [1,3,5,7], and we observe that each interface is bound to CPTs 1, 3, 5, and 7. This is better, but still not optimal for performance, since each interface should ideally be bound only to the CPT of its own NUMA node:

 

# modprobe -v lnet
insmod /lib/modules/4.18.0-240.10.1.el8_3.x86_64/weak-updates/lustre-client/net/lnet.ko networks=o2ib(ib0,ib1,ib2,ib3)[1,3,5,7]

# lctl net up
LNET configured

# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 14.128.0.45@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 14.128.0.46@o2ib
          status: up
          interfaces:
              0: ib1
        - nid: 14.128.0.47@o2ib
          status: up
          interfaces:
              0: ib2
        - nid: 14.128.0.48@o2ib
          status: up
          interfaces:
              0: ib3

# lnetctl net show --verbose | grep -E 'ib|CPT|dev'
          dev cpt: 0
          CPT: "[0,1,2,3,4,5,6,7]"
    - net type: o2ib
        - nid: 14.128.0.45@o2ib
              0: ib0
          dev cpt: 1
          CPT: "[1,3,5,7]"
        - nid: 14.128.0.46@o2ib
              0: ib1
          dev cpt: 3
          CPT: "[1,3,5,7]"
        - nid: 14.128.0.47@o2ib
              0: ib2
          dev cpt: 5
          CPT: "[1,3,5,7]"
        - nid: 14.128.0.48@o2ib
              0: ib3
          dev cpt: 7
          CPT: "[1,3,5,7]"

 

Finally, with configuration 3, a fine-grained NUMA binding is specified through an lnetctl YAML import, but it does not seem to be taken into account:

# modprobe -v lnet
insmod /lib/modules/4.18.0-240.10.1.el8_3.x86_64/weak-updates/lustre-client/net/lnet.ko networks=""

# lctl net up
LNET configured

# lnetctl net del --net tcp
# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up

# cat lnetctl.config.txt
net:
    - net type: o2ib
      local NI(s):
        - nid: 14.128.0.45@o2ib
          interfaces:
              0: ib0
          CPT: "[1]"
        - nid: 14.128.0.46@o2ib
          interfaces:
              0: ib1
          CPT: "[3]"
        - nid: 14.128.0.47@o2ib
          interfaces:
              0: ib2
          CPT: "[5]"
        - nid: 14.128.0.48@o2ib
          interfaces:
              0: ib3
          CPT: "[7]"

# lnetctl import lnetctl.config.txt
# echo $?
0

# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 14.128.0.45@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 14.128.0.46@o2ib
          status: up
          interfaces:
              0: ib1
        - nid: 14.128.0.47@o2ib
          status: up
          interfaces:
              0: ib2
        - nid: 14.128.0.48@o2ib
          status: up
          interfaces:
              0: ib3

# lnetctl net show --verbose
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7]"
    - net type: o2ib
      local NI(s):
        - nid: 14.128.0.45@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 1
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7]"
        - nid: 14.128.0.46@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 3
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7]"
        - nid: 14.128.0.47@o2ib
          status: up
          interfaces:
              0: ib2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 5
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7]"
        - nid: 14.128.0.48@o2ib
          status: up
          interfaces:
              0: ib3
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 7
          tcp bonding: 0
          CPT: "[0,1,2,3,4,5,6,7]"

Why has the CPT specified for each interface of the multi-rail LNet network not been taken into account?
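
The comparison made by eye in the listings above (`dev cpt` versus `CPT` for each NI) can be automated. The following is a minimal sketch, assuming the `lnetctl net show --verbose` output keeps the field layout shown in this ticket; the function name `check_ni_binding` and the `local` / `not-local` labels are hypothetical and chosen only for illustration.

```shell
# Read "lnetctl net show --verbose" output on stdin and report, for each
# non-loopback NI, whether its CPT list is exactly the CPT of its device.
check_ni_binding() {
    awk '
        /- nid:/   { nid = $3 }
        /dev cpt:/ { dev = $3 }
        /CPT:/ {
            if (nid ~ /@lo$/) next        # ignore the loopback NI
            cpt = $2
            gsub(/"/, "", cpt)            # strip the YAML quotes
            want = "[" dev "]"
            status = (cpt == want) ? "local" : "not-local"
            print nid, "dev_cpt=" dev, "CPT=" cpt, status
        }
    '
}
```

On a live node this would be used as `lnetctl net show --verbose | check_ni_binding`. Applied to the configuration 3 output above, every o2ib NI is reported as `not-local`, because each CPT list is still `[0,1,2,3,4,5,6,7]`.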


 Comments   
Comment by Peter Jones [ 21/Jul/21 ]

Cyril

Could you please advise here?

Thanks

Peter

Comment by Andreas Dilger [ 22/Jul/21 ]

Does the recently landed "LU-9121 lnet: User Defined Selection Policy (UDSP)" feature in 2.15 provide the requested functionality? The UDSP design documentation is currently available, but there is not yet any update to the Lustre Manual (tracked under LUDOC-438).

Comment by Lustre Bull [ 23/Jul/21 ]

I read through the UDSP design documentation mentioned above. It seems to me that it does not provide the requested functionality.

As I understand it, the UDSP feature allows policies to be set up that affect LNet interface selection at runtime. The requested functionality, however, concerns the CPT configuration and binding of LNet interfaces in a multi-rail LNet network.

Comment by Gregoire Pichon [ 23/Jul/21 ]

I have just realized that I was using the impersonal "Lustre Bull" account to create the Jira ticket and post comments. I am sorry about that. From now on, I will use my own Jira account to continue the discussion.

Grégoire.

Comment by Cyril Bordage [ 27/Jul/21 ]

Hello Grégoire!

You are right: it does not seem that you can achieve what you want with UDSP. I will look into how we can provide a solution for your goal.

Comment by Stephen Champion [ 15/Aug/21 ]

Restricting the CPT of the LNDs used for an interface via YAML has never worked as expected. lnetctl shows the CPT correctly, but does not apply it on import. Adding interfaces individually with lnetctl and using the --cpt option to set the binding works as expected.

I think that fixing the yaml import function will address Gregoire's needs.

Comment by Cyril Bordage [ 10/Sep/21 ]

pichong, can you confirm that you can achieve your goal by adding the interfaces individually?

Comment by Gregoire Pichon [ 16/Sep/21 ]

Yes, I confirm that adding interfaces individually with lnetctl and the --cpt option sets the binding as expected. Thanks, Stephen, for the suggestion. Nevertheless, I would have expected the YAML import to do the same.
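
For reference, the per-interface workaround confirmed above could be scripted along the following lines. This is a sketch under stated assumptions, not a tested procedure: it assumes that CPT numbers match NUMA node numbers (as in the `cpu_partition_table` shown in the description), that `--cpt` accepts the same bracketed list syntax as the YAML `CPT` field, and the function name `gen_net_add_cmds` with its optional arguments is hypothetical. The function only prints the commands so they can be reviewed before being run.

```shell
# Print one "lnetctl net add" command per IB interface, binding each
# interface to the CPT numbered like its NUMA node.
# Assumptions: CPT number == NUMA node number; --cpt accepts "[N]".
# $1 (sysfs net directory) and $2 (LNet network) are hypothetical overrides.
gen_net_add_cmds() {
    sysfs_net="${1:-/sys/class/net}"
    net="${2:-o2ib}"
    for dev in "$sysfs_net"/ib*; do
        [ -r "$dev/device/numa_node" ] || continue
        node=$(cat "$dev/device/numa_node")
        # numa_node is -1 when no NUMA information is available; skip those.
        [ "$node" -ge 0 ] 2>/dev/null || continue
        echo "lnetctl net add --net $net --if ${dev##*/} --cpt '[$node]'"
    done
}
```

On the machine described in this ticket, the function would be expected to print four commands, one each for ib0 to ib3 with CPTs 1, 3, 5, and 7. The resulting binding can then be checked with `lnetctl net show --verbose`.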

Comment by Cyril Bordage [ 16/Sep/21 ]

Sure, it will be fixed. I just wanted to be sure you had another working way to do that.

Comment by Gerrit Updater [ 17/Feb/22 ]

"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46541
Subject: LU-14875 import: fix bad CPT read
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 50f2cf5649ad5e5e172e74a9f283c53a7ff8dbf7

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46541/
Subject: LU-14875 import: fix bad CPT read
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ad5c43f4a53f8679cfa1a60f8161b08d3dcfa66

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.16
