[LU-11937] lnet.service randomly load tcp NIDs Created: 06/Feb/19  Updated: 12/Nov/19  Resolved: 12/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

CentOS 7.6 (3.10.0-957.5.1.el7.x86_64), Lustre 2.12.0


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On clients, we're using lnet.service with the following config:

[root@sh-112-12 ~]# cat /etc/lnet.conf 
net:
    - net type: o2ib4
      local NI(s):
        - nid:
          interfaces:
              0: ib0
route: 
    - net: o2ib1
      gateway: 10.9.0.[31-32]@o2ib4
    - net: o2ib5
      gateway: 10.9.0.[41-42]@o2ib4
    - net: o2ib7
      gateway: 10.9.0.[21-24]@o2ib4
[root@sh-112-12 ~]# lctl list_nids
10.10.112.12@tcp
10.9.112.12@o2ib4
[root@sh-112-12 ~]# dmesg | grep -i lnet
[  397.762804] LNet: HW NUMA nodes: 2, HW CPU cores: 20, npartitions: 2
[  398.995449] LNet: 13837:0:(socklnd.c:2655:ksocknal_enumerate_interfaces()) Ignoring interface enp4s0f1 (down)
[  399.005708] LNet: Added LNI 10.10.112.12@tcp [8/256/0/180]
[  399.011316] LNet: Accept secure, port 988
[  399.060725] LNet: Using FastReg for registration
[  399.075936] LNet: Added LNI 10.9.112.12@o2ib4 [8/256/0/180]

It is unclear at this point why a tcp NID (10.10.112.12@tcp) gets configured at all, since /etc/lnet.conf only declares the o2ib4 network.

 

client network config:

[root@sh-112-12 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:dc:96:ae brd ff:ff:ff:ff:ff:ff
    inet 10.10.112.12/16 brd 10.10.255.255 scope global enp4s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fedc:96ae/64 scope link 
       valid_lft forever preferred_lft forever
3: enp4s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:dc:96:af brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 20:00:10:8b:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a0:9e:20 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.9.112.12/16 brd 10.9.255.255 scope global ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::268a:703:a0:9e20/64 scope link 
       valid_lft forever preferred_lft forever

 

lnet.service origin:

[root@sh-112-12 ~]# rpm -qf /usr/lib/systemd/system/lnet.service 
lustre-client-2.12.0-1.el7.x86_64
[root@sh-112-12 ~]# rpm -q --info lustre-client
Name        : lustre-client
Version     : 2.12.0
Release     : 1.el7
Architecture: x86_64
Install Date: Wed 06 Feb 2019 10:13:52 AM PST
Group       : System Environment/Kernel
Size        : 2007381
License     : GPL
Signature   : (none)
Source RPM  : lustre-client-2.12.0-1.el7.src.rpm
Build Date  : Fri 21 Dec 2018 01:53:18 PM PST
Build Host  : trevis-307-el7-x8664-3.trevis.whamcloud.com
Relocations : (not relocatable)
URL         : https://wiki.whamcloud.com/
Summary     : Lustre File System
Description :
Userspace tools and files for the Lustre file system.
[root@sh-112-12 ~]# cat /usr/lib/systemd/system/lnet.service 
[Unit]
Description=lnet management

Requires=network-online.target
After=network-online.target openibd.service rdma.service

ConditionPathExists=!/proc/sys/lnet/

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

[Install]
WantedBy=multi-user.target

This leads to many issues server-side with 2.12, as reported in LU-11888 and LU-11936.

Thanks!
Stephane



 Comments   
Comment by Peter Jones [ 06/Feb/19 ]

Related to other tickets Sonia is working on

Comment by Stephane Thiell [ 06/Feb/19 ]

Thanks, we're trying this drop-in file as a workaround on all clients:

/etc/systemd/system/lnet.service.d/deps.conf

[Unit]
After=dkms.service network.service

[Service]
# we don't want tcp nids
ExecStartPost=-/usr/sbin/lnetctl net del --net tcp
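
For reference, a drop-in like this only takes effect after systemd reloads its unit files; a minimal sketch of deploying it (standard systemd mechanics, nothing Lustre-specific):

mkdir -p /etc/systemd/system/lnet.service.d
# write deps.conf as above, then make systemd pick up the override;
# it applies the next time lnet.service starts
systemctl daemon-reload
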
Comment by Stephane Thiell [ 11/Nov/19 ]

Hello Peter,

This problem is still there and caused some trouble for us last weekend. Apparently, despite our lnet service workaround on the clients, a tcp NID was able to make its way to the Fir servers (2.12.3), which after a few hours caused an MDT deadlock.

NOTE: For us, this is a major blocker for migrating Oak from 2.10 to 2.12, as multi-rail can cause this kind of issue, especially when storage and compute are separated and managed by different teams: a misconfigured client can cause this sort of trouble on the server side. Note that in our case we don't want any tcp NIDs at all, but as far as I know there is no way to avoid that in 2.12. In 2.10, there is no risk of this situation arising on the servers.

We tracked the problem down today to the lnet.service unit, which induces a race between lnet configure and lnet import:

 

# cat /usr/lib/systemd/system/lnet.service
[Unit]
Description=lnet management

Requires=network-online.target
After=network-online.target openibd.service rdma.service opa.service

ConditionPathExists=!/proc/sys/lnet/

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

[Install]
WantedBy=multi-user.target
 

Even with our workaround, which works in most cases, some clients can come up with a tcp0 NID at lnet configure time, and thus risk announcing themselves with a tcp NID when the filesystem mount runs at the same time:

2019-11-05T22:10:26-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
2019-11-05T22:10:26-08:00 sh-117-11 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.10.117.11@tcp [8/256/0/180]
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Accept secure, port 988
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Using FastReg for registration
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.9.117.11@o2ib4 [8/256/0/180]
2019-11-05T22:10:29-08:00 sh-117-11 kernel: LNet: Removed LNI 10.10.117.11@tcp
2019-11-05T22:10:43-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 38 seconds
2019-11-05T22:11:06-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 61 seconds
2019-11-05T22:11:08-08:00 sh-117-11 shine: Starting shine:  WARNING: Nothing to mount on sh-117-11 for `regal'
2019-11-05T22:11:08-08:00 sh-117-11 shine: WARNING: Nothing was done for `regal'.
2019-11-05T22:11:08-08:00 sh-117-11 shine: Mount of fir on /scratch failed
2019-11-05T22:11:08-08:00 sh-117-11 shine: >> mount.lustre: mount 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir at /scratch failed: Input/output error
2019-11-05T22:11:08-08:00 sh-117-11 shine: Is the MGS running?
2019-11-05T22:11:08-08:00 sh-117-11 shine: [FAILED]
2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service: control process exited, code=exited status=16
2019-11-05T22:11:08-08:00 sh-117-11 systemd: Failed to start SYSV: Lustre shine mounting script.
2019-11-05T22:11:08-08:00 sh-117-11 systemd: Unit shine.service entered failed state.
2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service failed.
2019-11-05T22:11:33-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...

The servers then keep trying to contact these clients on the erroneous tcp NID, even though they have neither a tcp interface nor a route for it, and we end up having problems like these:

[Sat Nov  9 23:58:14 2019][461981.327905] LustreError: 80884:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0fe1e7f7400 ns: mdt-fir-MDT0000_UUID lock: ffffa11943671200/0x675684f65f9baf7 lrc: 3/0,0 mode: CR/CR res: [0x200038966:0x418:0x0].0x0 bits 0x9/0x0 rrc: 2 type: IBT flags: 0x50200000000000 nid: 10.10.117.11@tcp remote: 0x541e831b11b117da expref: 250 pid: 80884 timeout: 0 lvb_type: 0^M

We also think that the backtraces below are due to MDT threads being stuck with tcp NIDs:

[Sun Nov 10 06:24:59 2019][485187.377753] Pid: 67098, comm: mdt03_047 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019^M
[Sun Nov 10 06:25:00 2019][485187.388020] Call Trace:^M
[Sun Nov 10 06:25:00 2019][485187.390566]  [<ffffffffc10ccb75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.397597]  [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.404884]  [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]^M
[Sun Nov 10 06:25:00 2019][485187.411808]  [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]^M
[Sun Nov 10 06:25:00 2019][485187.418892]  [<ffffffffc15da40d>] mdt_getattr_name_lock+0x101d/0x1c30 [mdt]^M
[Sun Nov 10 06:25:00 2019][485187.425989]  [<ffffffffc15e1d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]^M
[Sun Nov 10 06:25:00 2019][485187.432638]  [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]^M
[Sun Nov 10 06:25:00 2019][485187.439230]  [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.446073]  [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.453272]  [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.459514]  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.466549]  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.474357]  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]^M
[Sun Nov 10 06:25:00 2019][485187.480780]  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0^M
[Sun Nov 10 06:25:00 2019][485187.485775]  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21^M
[Sun Nov 10 06:25:00 2019][485187.492327]  [<ffffffffffffffff>] 0xffffffffffffffff^M

Yesterday, MDT0 on Fir went down and was completely hung, after logging thousands of messages with tcp NIDs and backtraces like the one above, all of it apparently due to 2 clients announcing themselves as having a tcp NID.

Please advise if there is a way to completely disable multi-rail and avoid this situation. I would recommend increasing the severity of this issue, as it has caused a lot of trouble since 2.12, but I'm glad we're finally making progress. Thanks much!
 

Comment by Stephane Thiell [ 12/Nov/19 ]
[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
peer:
    - primary nid: 10.10.23.14@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.8.23.14@o2ib6
          state: NA
        - nid: 10.10.23.14@tcp
          state: NA
[root@fir-md1-s1 fir-MDT0000]# lctl ping 10.10.23.14@tcp
failed to ping 10.10.23.14@tcp: Input/output error
[root@fir-md1-s1 fir-MDT0000]# lctl ping 10.8.23.14@o2ib6
12345-0@lo
12345-10.8.23.14@o2ib6

I was able to manually remove the tcp NID with this:

[root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.23.14@tcp --nid 10.10.23.14@tcp
[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
show:
    - peer:
          errno: -2
          descr: "cannot get peer information: No such file or directory"

[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
show:
    - peer:
          errno: -2
          descr: "cannot get peer information: No such file or directory"

[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
peer:
    - primary nid: 10.8.23.14@o2ib6
      Multi-Rail: True
      peer ni:
        - nid: 10.8.23.14@o2ib6
          state: NA
Comment by Amir Shehata (Inactive) [ 12/Nov/19 ]

If the TCP network is configured on the node, then it'll be propagated due to the discovery feature. There are three solutions: 1) remove the tcp NID from all the nodes if you don't need it; 2) turn off discovery on all the nodes; 3) explicitly configure the peers (but that would be a lot of config). I believe you can use ip2nets syntax, though.
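
One reading of the ip2nets suggestion is the classic module-parameter form, which selects local networks by IP address and so never creates a tcp NI; a hedged sketch using this ticket's subnets (illustrative only):

# /etc/modprobe.d/lnet.conf
# create an o2ib4 NI on ib0 for addresses in the 10.9.0.0/16 subnet;
# no rule matches the ethernet subnet, so no tcp NI is created
options lnet ip2nets="o2ib4(ib0) 10.9.*.*"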

Comment by Stephane Thiell [ 12/Nov/19 ]

Amir, our lnet.conf on the clients is as follow:

[root@sh-101-01 ~]# cat /etc/lnet.conf 
global:
    - retry_count: 0
    - health_sensitivity: 0
    - transaction_timeout: 10
net:
    - net type: o2ib4
      local NI(s):
        - nid:
          interfaces:
              0: ib0
route: 
    - net: o2ib5
      gateway: 10.9.0.[41-42]@o2ib4
    - net: o2ib7
      gateway: 10.9.0.[21-24]@o2ib4

But when lnet is loaded, it does an lnet configure before the import of that file, which I think might propagate a tcp NID in some rare cases.

How can we be sure to disable discovery everywhere without any race condition? How do you do that? We really don't use multi-rail at all in our case. Thanks!!

Comment by Amir Shehata (Inactive) [ 12/Nov/19 ]

Hi Stephane,

lnetctl lnet configure should not configure any networks. The default tcp network would get configured if somewhere you're doing lctl net up; that would load the default tcp network.

To disable discovery you can add

options lnet lnet_peer_discovery_disabled=1

on all the nodes.
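
Discovery can also be toggled and verified at runtime with lnetctl (a sketch; the runtime setting does not survive a reboot, so the module option above is still needed for persistence):

# disable peer discovery on a running node
lnetctl set discovery 0
# verify: "discovery: 0" should appear in the global settings
lnetctl global show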

My hunch at the moment is that there are some nodes which are using lctl net up or lnetctl lnet configure --all. This would lead to the tcp network being loaded, especially if you don't have an "options lnet networks" line in your modprobe.d/lnet.conf file.
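
For the clients in this ticket, that line would look something like the following (a sketch; the o2ib4/ib0 pairing is taken from the configs above):

# /etc/modprobe.d/lnet.conf
# pin LNet to the IB interface so a bare "lctl net up"
# cannot fall back to the default tcp network
options lnet networks="o2ib4(ib0)"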

Would you be able to check that?

Comment by Stephane Thiell [ 12/Nov/19 ]

Hi Amir,

Thanks for your help! This was useful. I confirm that lnetctl lnet configure does not configure any networks, my bad!

I guess we've just figured out what was wrong in our setup, and you were very close: a service that mounts our Lustre filesystems was doing a modprobe lustre, and in some (rare) cases the filesystem mount that followed ran at the same time as lnet.service, leading to a tcp NID being propagated to the servers and causing the trouble I described on the server side. We have fixed the dependencies of the boot-time services on our clients, so this hopefully should not happen anymore!

modprobe lustre seems to do the same as lctl net up and does configure a default tcp network if LNet is not configured yet.
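
A sketch of the kind of ordering fix this implies, as a drop-in for the mount service (shine.service per the logs above; the drop-in file name is hypothetical):

# /etc/systemd/system/shine.service.d/lnet-dep.conf
[Unit]
# ensure LNet is fully configured from /etc/lnet.conf before
# anything loads the lustre module and triggers a mount
Requires=lnet.service
After=lnet.service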

Sorry for the noise; after all, it looks like this was never a problem with lnet.service.

Comment by Peter Jones [ 12/Nov/19 ]

Good news - so can we consider this ticket resolved?

Comment by Stephane Thiell [ 12/Nov/19 ]

Yes, good news! We appreciated the help, thanks! Sorry it took us so much time to figure that out. We've also added some "rogue NID monitoring" on the server side, just in case some clients continue to be misconfigured. We prefer to leave the default LNet discovery enabled for now, but it's good to know that we have the option to disable it if we want to. I'm OK with considering this ticket resolved at this point.
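
For the record, a minimal sketch of what such a server-side check can look like (assuming the lnetctl peer show output format seen earlier in this ticket):

# cron-friendly check: flag any peer that advertises a tcp NID
if lnetctl peer show | grep -q '@tcp'; then
    echo "rogue tcp NID detected:"
    lnetctl peer show | grep '@tcp'
fi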

Comment by Peter Jones [ 12/Nov/19 ]

ok - thanks
