Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 (3.10.0-957.5.1.el7.x86_64), Lustre 2.12.0
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      On clients, we're using lnet.service with the following config:

      [root@sh-112-12 ~]# cat /etc/lnet.conf 
      net:
          - net type: o2ib4
            local NI(s):
              - nid:
                interfaces:
                    0: ib0
      route: 
          - net: o2ib1
            gateway: 10.9.0.[31-32]@o2ib4
          - net: o2ib5
            gateway: 10.9.0.[41-42]@o2ib4
          - net: o2ib7
            gateway: 10.9.0.[21-24]@o2ib4
      [root@sh-112-12 ~]# lctl list_nids
      10.10.112.12@tcp
      10.9.112.12@o2ib4
      
      [root@sh-112-12 ~]# dmesg | grep -i lnet
      [  397.762804] LNet: HW NUMA nodes: 2, HW CPU cores: 20, npartitions: 2
      [  398.995449] LNet: 13837:0:(socklnd.c:2655:ksocknal_enumerate_interfaces()) Ignoring interface enp4s0f1 (down)
      [  399.005708] LNet: Added LNI 10.10.112.12@tcp [8/256/0/180]
      [  399.011316] LNet: Accept secure, port 988
      [  399.060725] LNet: Using FastReg for registration
      [  399.075936] LNet: Added LNI 10.9.112.12@o2ib4 [8/256/0/180]
      

      It is unclear why LNet brings up the 10.10.112.12@tcp NID at this point: /etc/lnet.conf only configures the o2ib4 network on ib0, yet lctl list_nids and dmesg both show a tcp LNI being added.
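
      For reference, the unwanted NI can be removed by hand once it has been added (this is the same command used in the drop-in workaround at the bottom of this ticket; shown here only as a sketch):

      # remove the dynamically added tcp network, leaving the o2ib4 NI in place
      lnetctl net del --net tcp
      lctl list_nids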

       

      client network config:

      [root@sh-112-12 ~]# ip addr
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host 
             valid_lft forever preferred_lft forever
      2: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether 0c:c4:7a:dc:96:ae brd ff:ff:ff:ff:ff:ff
          inet 10.10.112.12/16 brd 10.10.255.255 scope global enp4s0f0
             valid_lft forever preferred_lft forever
          inet6 fe80::ec4:7aff:fedc:96ae/64 scope link 
             valid_lft forever preferred_lft forever
      3: enp4s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 0c:c4:7a:dc:96:af brd ff:ff:ff:ff:ff:ff
      4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
          link/infiniband 20:00:10:8b:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a0:9e:20 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
          inet 10.9.112.12/16 brd 10.9.255.255 scope global ib0
             valid_lft forever preferred_lft forever
          inet6 fe80::268a:703:a0:9e20/64 scope link 
             valid_lft forever preferred_lft forever
      

       

      lnet.service origin:

      [root@sh-112-12 ~]# rpm -qf /usr/lib/systemd/system/lnet.service 
      lustre-client-2.12.0-1.el7.x86_64
      [root@sh-112-12 ~]# rpm -q --info lustre-client
      Name        : lustre-client
      Version     : 2.12.0
      Release     : 1.el7
      Architecture: x86_64
      Install Date: Wed 06 Feb 2019 10:13:52 AM PST
      Group       : System Environment/Kernel
      Size        : 2007381
      License     : GPL
      Signature   : (none)
      Source RPM  : lustre-client-2.12.0-1.el7.src.rpm
      Build Date  : Fri 21 Dec 2018 01:53:18 PM PST
      Build Host  : trevis-307-el7-x8664-3.trevis.whamcloud.com
      Relocations : (not relocatable)
      URL         : https://wiki.whamcloud.com/
      Summary     : Lustre File System
      Description :
      Userspace tools and files for the Lustre file system.
      [root@sh-112-12 ~]# cat /usr/lib/systemd/system/lnet.service 
      [Unit]
      Description=lnet management
      
      Requires=network-online.target
      After=network-online.target openibd.service rdma.service
      
      ConditionPathExists=!/proc/sys/lnet/
      
      [Service]
      Type=oneshot
      RemainAfterExit=true
      ExecStart=/sbin/modprobe lnet
      ExecStart=/usr/sbin/lnetctl lnet configure
      ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
      ExecStop=/usr/sbin/lustre_rmmod ptlrpc
      ExecStop=/usr/sbin/lnetctl lnet unconfigure
      ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs
      
      [Install]
      WantedBy=multi-user.target
      

      This leads to many issues server-side with 2.12, as reported in LU-11888 and LU-11936.

      Thanks!
      Stephane

      Attachments

        Activity

          [LU-11937] lnet.service randomly loads tcp NIDs
          pjones Peter Jones added a comment -

          ok - thanks

          sthiell Stephane Thiell added a comment - - edited

          Yes, good news! We appreciate the help, thanks! Sorry it took us so much time to figure that out. We've also added some "rogue NID monitoring" on the server side, just in case some clients continue to be misconfigured. We prefer to leave the default lnet discovery enabled for now, but it's good to know that we do have the option to disable it if we want to. I'm OK with considering this ticket resolved at this point.

          pjones Peter Jones added a comment -

          Good news - so can we consider this ticket resolved?


          sthiell Stephane Thiell added a comment -

          Hi Amir,

          Thanks for your help! This was useful. I confirm that lnetctl lnet configure does not configure any networks, my bad!

          I guess we've just figured out what was wrong in our setup, and you were very close: a service that mounts our Lustre filesystems was doing a modprobe lustre, and in some (rare) cases the filesystem mount that followed ran at the same time as lnet.service, leading to a tcp NID being propagated to the servers and causing the trouble I described on the server side. We have fixed the dependencies of the boot-time services on our clients, so this hopefully should not happen anymore!

          modprobe lustre seems to do the same as lctl net up and does configure a default tcp network if lnet is not configured yet.

          Sorry for the noise, after all, it looks like this was never a problem of lnet.service.
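
          As an illustration of such a dependency fix (a sketch only, not the exact configuration used here), a systemd drop-in that forces the mount service to start after lnet.service could look like the following; the shine.service name is taken from the logs later in this ticket and the drop-in path is an assumption:

          # /etc/systemd/system/shine.service.d/after-lnet.conf  (hypothetical path)
          [Unit]
          # do not start the Lustre mount service until lnet.service has finished
          # running "lnetctl import /etc/lnet.conf"
          Requires=lnet.service
          After=lnet.service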


          ashehata Amir Shehata (Inactive) added a comment -

          Hi Stephane,

          lnetctl lnet configure should not configure any networks. The default tcp would get configured if somewhere you're doing lctl net up. That would load the default tcp network.

          To disable discovery you can add

          options lnet lnet_peer_discovery_disabled=1

          on all the nodes.
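
          For example, as a persistent setting this module option would typically live in a modprobe configuration file (a sketch; the exact file name is an assumption):

          # /etc/modprobe.d/lnet.conf  (example placement)
          # disable LNet peer discovery so only explicitly configured peer NIDs are used
          options lnet lnet_peer_discovery_disabled=1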

          My hunch at the moment is that there are some nodes which are using lctl net up or lnetctl lnet configure --all. This would lead to the tcp network being loaded, especially if you don't have an "options lnet networks=..." line in your "modprobe.d/lnet.conf" file.

          Would you be able to check that?
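
          For reference, the "options lnet networks" line mentioned above might look like the following (a sketch using the o2ib4/ib0 names from the client config in this ticket; the values are illustrative):

          # /etc/modprobe.d/lnet.conf  (example)
          # pin LNet to the o2ib4 network on ib0 so an implicit "net up" does not
          # fall back to the default tcp network
          options lnet networks="o2ib4(ib0)"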


          sthiell Stephane Thiell added a comment -

          Amir, our lnet.conf on the clients is as follows:

          [root@sh-101-01 ~]# cat /etc/lnet.conf 
          global:
              - retry_count: 0
              - health_sensitivity: 0
              - transaction_timeout: 10
          net:
              - net type: o2ib4
                local NI(s):
                  - nid:
                    interfaces:
                        0: ib0
          route: 
              - net: o2ib5
                gateway: 10.9.0.[41-42]@o2ib4
              - net: o2ib7
                gateway: 10.9.0.[21-24]@o2ib4
          

          But when lnet.service runs, it does lnetctl lnet configure before importing that file, which I think might propagate a tcp NID in some rare cases.

          How can we be sure to disable discovery everywhere without any race condition? How do you do that? We really don't use multi-rail at all in our case. Thanks!!
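
          For a node that is already up, discovery can also be toggled at runtime (a sketch; assuming the lnetctl "set" and "global show" sub-commands behave as documented for 2.12):

          # disable peer discovery on a running node (runtime only, not persistent)
          lnetctl set discovery 0
          # confirm the current global settings, including the discovery flag
          lnetctl global show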


          ashehata Amir Shehata (Inactive) added a comment -

          If the TCP network is configured on the node, then it'll be propagated due to the discovery feature. There are three solutions:

          1. Remove the tcp nid from all the nodes if you don't need it.
          2. Turn off discovery on all the nodes.
          3. Explicitly configure the peers (but that would be a lot of config). I believe you can use ip2nets syntax though.
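
          For reference, ip2nets is an lnet module option that selects which local network(s) to bring up based on the node's IP addresses; a sketch using the subnet from this ticket (the values are illustrative, not a recommendation):

          # /etc/modprobe.d/lnet.conf  (example)
          # only bring up o2ib4 on ib0 for nodes whose address matches 10.9.*.*;
          # no tcp network is listed, so no tcp NID is ever created
          options lnet ip2nets="o2ib4(ib0) 10.9.*.*"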

          sthiell Stephane Thiell added a comment -

          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
          peer:
              - primary nid: 10.10.23.14@tcp
                Multi-Rail: True
                peer ni:
                  - nid: 10.8.23.14@o2ib6
                    state: NA
                  - nid: 10.10.23.14@tcp
                    state: NA
          [root@fir-md1-s1 fir-MDT0000]# lctl ping 10.10.23.14@tcp
          failed to ping 10.10.23.14@tcp: Input/output error
          [root@fir-md1-s1 fir-MDT0000]# lctl ping 10.8.23.14@o2ib6
          12345-0@lo
          12345-10.8.23.14@o2ib6
          

          I was able to manually remove the TCP nid with this:

          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.23.14@tcp --nid 10.10.23.14@tcp
          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
          show:
              - peer:
                    errno: -2
                    descr: "cannot get peer information: No such file or directory"
          
          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
          show:
              - peer:
                    errno: -2
                    descr: "cannot get peer information: No such file or directory"
          
          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
          peer:
              - primary nid: 10.8.23.14@o2ib6
                Multi-Rail: True
                peer ni:
                  - nid: 10.8.23.14@o2ib6
                    state: NA
          
          sthiell Stephane Thiell added a comment - - edited

          Hello Peter,

          This problem is still there and caused some trouble for us last weekend. Apparently, despite our lnet.service workaround on the clients, a tcp NID was able to make its way to the Fir servers (2.12.3), which after a few hours caused an MDT deadlock.

          NOTE: For us, this is a major blocker for migrating Oak from 2.10 to 2.12, as multi-rail can cause this kind of issue, especially when storage and compute are separated and managed by different teams. A misconfigured client can cause this kind of trouble on the server side. Note that in our case we don't want any tcp NID at all, but as far as I know there is no way to avoid that in 2.12. In 2.10, there is no risk of having this situation on the servers.

          We tracked down the problem today to the lnet.service script, which induces a race between lnet configure and lnet import:

           

          # cat /usr/lib/systemd/system/lnet.service
          [Unit]
          Description=lnet management
          
          Requires=network-online.target
          After=network-online.target openibd.service rdma.service opa.service
          
          ConditionPathExists=!/proc/sys/lnet/
          
          [Service]
          Type=oneshot
          RemainAfterExit=true
          ExecStart=/sbin/modprobe lnet
          ExecStart=/usr/sbin/lnetctl lnet configure
          ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
          ExecStop=/usr/sbin/lustre_rmmod ptlrpc
          ExecStop=/usr/sbin/lnetctl lnet unconfigure
          ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs
          
          [Install]
          WantedBy=multi-user.target
           

          Even with our workaround, which works in most cases, some clients can still show up with a tcp0 NID at lnet configure, and thus there is a risk of them announcing themselves with a tcp NID when the filesystem tries to mount:

          2019-11-05T22:10:26-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
          2019-11-05T22:10:26-08:00 sh-117-11 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.10.117.11@tcp [8/256/0/180]
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Accept secure, port 988
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Using FastReg for registration
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.9.117.11@o2ib4 [8/256/0/180]
          2019-11-05T22:10:29-08:00 sh-117-11 kernel: LNet: Removed LNI 10.10.117.11@tcp
          2019-11-05T22:10:43-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 38 seconds
          2019-11-05T22:11:06-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 61 seconds
          2019-11-05T22:11:08-08:00 sh-117-11 shine: Starting shine:  WARNING: Nothing to mount on sh-117-11 for `regal'
          2019-11-05T22:11:08-08:00 sh-117-11 shine: WARNING: Nothing was done for `regal'.
          2019-11-05T22:11:08-08:00 sh-117-11 shine: Mount of fir on /scratch failed
          2019-11-05T22:11:08-08:00 sh-117-11 shine: >> mount.lustre: mount 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir at /scratch failed: Input/output error
          2019-11-05T22:11:08-08:00 sh-117-11 shine: Is the MGS running?
          2019-11-05T22:11:08-08:00 sh-117-11 shine: [FAILED]
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service: control process exited, code=exited status=16
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: Failed to start SYSV: Lustre shine mounting script.
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: Unit shine.service entered failed state.
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service failed.
          2019-11-05T22:11:33-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
          

          Then, the servers keep trying to contact these clients via the erroneous tcp NID, even though they have neither a tcp interface nor a route for it, and we end up with problems like these:

          [Sat Nov  9 23:58:14 2019][461981.327905] LustreError: 80884:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0fe1e7f7400 ns: mdt-fir-MDT0000_UUID lock: ffffa11943671200/0x675684f65f9baf7 lrc: 3/0,0 mode: CR/CR res: [0x200038966:0x418:0x0].0x0 bits 0x9/0x0 rrc: 2 type: IBT flags: 0x50200000000000 nid: 10.10.117.11@tcp remote: 0x541e831b11b117da expref: 250 pid: 80884 timeout: 0 lvb_type: 0^M
          

          We also think that the backtraces below are due to MDT threads being stuck with tcp NIDs:

          [Sun Nov 10 06:24:59 2019][485187.377753] Pid: 67098, comm: mdt03_047 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019^M
          [Sun Nov 10 06:25:00 2019][485187.388020] Call Trace:^M
          [Sun Nov 10 06:25:00 2019][485187.390566]  [<ffffffffc10ccb75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.397597]  [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.404884]  [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.411808]  [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.418892]  [<ffffffffc15da40d>] mdt_getattr_name_lock+0x101d/0x1c30 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.425989]  [<ffffffffc15e1d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.432638]  [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.439230]  [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.446073]  [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.453272]  [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.459514]  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.466549]  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.474357]  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.480780]  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0^M
          [Sun Nov 10 06:25:00 2019][485187.485775]  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21^M
          [Sun Nov 10 06:25:00 2019][485187.492327]  [<ffffffffffffffff>] 0xffffffffffffffff^M
          

          Yesterday, MDT0 on Fir went down and was completely hung, with thousands of earlier messages showing tcp NIDs and backtraces like the one above, all of that apparently due to 2 clients announcing themselves as having a tcp NID.

          Please advise if there is a way to completely disable multi-rail and avoid this situation. I would recommend increasing the severity of this issue, as it has caused a lot of trouble since 2.12, but I'm glad we're finally making progress. Thanks much!
           

          sthiell Stephane Thiell added a comment - - edited

          Thanks, we're trying this drop-in file as a workaround on all clients:

          /etc/systemd/system/lnet.service.d/deps.conf

          [Unit]
          After=dkms.service network.service
          
          [Service]
          # we don't want tcp nids
          ExecStartPost=-/usr/sbin/lnetctl net del --net tcp
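
          After installing the drop-in, a sketch of how the result can be verified (standard systemd and lnetctl commands):

          systemctl daemon-reload
          systemctl restart lnet
          # confirm that only the o2ib4 NI remains and no tcp NI is configured
          lnetctl net show
          lctl list_nids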
          

          People

            ashehata Amir Shehata (Inactive)
            sthiell Stephane Thiell
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: