[LU-11937] lnet.service randomly load tcp NIDs Created: 06/Feb/19 Updated: 12/Nov/19 Resolved: 12/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 (3.10.0-957.5.1.el7.x86_64), Lustre 2.12.0 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On clients, we're using lnet.service with the following config: [root@sh-112-12 ~]# cat /etc/lnet.conf
net:
- net type: o2ib4
local NI(s):
- nid:
interfaces:
0: ib0
route:
- net: o2ib1
gateway: 10.9.0.[31-32]@o2ib4
- net: o2ib5
gateway: 10.9.0.[41-42]@o2ib4
- net: o2ib7
gateway: 10.9.0.[21-24]@o2ib4
[root@sh-112-12 ~]# lctl list_nids
10.10.112.12@tcp
10.9.112.12@o2ib4
[root@sh-112-12 ~]# dmesg | grep -i lnet
[ 397.762804] LNet: HW NUMA nodes: 2, HW CPU cores: 20, npartitions: 2
[ 398.995449] LNet: 13837:0:(socklnd.c:2655:ksocknal_enumerate_interfaces()) Ignoring interface enp4s0f1 (down)
[ 399.005708] LNet: Added LNI 10.10.112.12@tcp [8/256/0/180]
[ 399.011316] LNet: Accept secure, port 988
[ 399.060725] LNet: Using FastReg for registration
[ 399.075936] LNet: Added LNI 10.9.112.12@o2ib4 [8/256/0/180]
It is unclear why a tcp LNI gets configured at this point.
client network config: [root@sh-112-12 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 0c:c4:7a:dc:96:ae brd ff:ff:ff:ff:ff:ff
inet 10.10.112.12/16 brd 10.10.255.255 scope global enp4s0f0
valid_lft forever preferred_lft forever
inet6 fe80::ec4:7aff:fedc:96ae/64 scope link
valid_lft forever preferred_lft forever
3: enp4s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 0c:c4:7a:dc:96:af brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 20:00:10:8b:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a0:9e:20 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.9.112.12/16 brd 10.9.255.255 scope global ib0
valid_lft forever preferred_lft forever
inet6 fe80::268a:703:a0:9e20/64 scope link
valid_lft forever preferred_lft forever
lnet.service origin:
[root@sh-112-12 ~]# rpm -qf /usr/lib/systemd/system/lnet.service
lustre-client-2.12.0-1.el7.x86_64
[root@sh-112-12 ~]# rpm -q --info lustre-client
Name        : lustre-client
Version     : 2.12.0
Release     : 1.el7
Architecture: x86_64
Install Date: Wed 06 Feb 2019 10:13:52 AM PST
Group       : System Environment/Kernel
Size        : 2007381
License     : GPL
Signature   : (none)
Source RPM  : lustre-client-2.12.0-1.el7.src.rpm
Build Date  : Fri 21 Dec 2018 01:53:18 PM PST
Build Host  : trevis-307-el7-x8664-3.trevis.whamcloud.com
Relocations : (not relocatable)
URL         : https://wiki.whamcloud.com/
Summary     : Lustre File System
Description : Userspace tools and files for the Lustre file system.
[root@sh-112-12 ~]# cat /usr/lib/systemd/system/lnet.service
[Unit]
Description=lnet management
Requires=network-online.target
After=network-online.target openibd.service rdma.service
ConditionPathExists=!/proc/sys/lnet/
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs
[Install]
WantedBy=multi-user.target
This leads to many issues server-side with 2.12, as reported in LU-11888 and LU-11936. Thanks! |
| Comments |
| Comment by Peter Jones [ 06/Feb/19 ] |
|
Related to other tickets Sonia is working on |
| Comment by Stephane Thiell [ 06/Feb/19 ] |
|
Thanks, we're trying this drop-in file as a workaround on all clients: /etc/systemd/system/lnet.service.d/deps.conf
[Unit]
After=dkms.service network.service
[Service]
# we don't want tcp nids
ExecStartPost=-/usr/sbin/lnetctl net del --net tcp |
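A drop-in like this only takes effect once systemd reloads its unit files; a minimal way to apply and check it (standard systemd/Lustre commands, shown for illustration and not taken from the ticket):
systemctl daemon-reload
systemctl restart lnet.service
# confirm the drop-in is merged into the unit and that no tcp NID remains
systemctl cat lnet.service
lctl list_nids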
| Comment by Stephane Thiell [ 11/Nov/19 ] |
|
Hello Peter, This problem is still there and caused some trouble for us last weekend. Apparently, despite our lnet.service workaround on the clients, a tcp NID was able to make its way to the Fir servers (2.12.3), which after a few hours caused an MDT deadlock.
NOTE: For us, this is a major blocker for migrating Oak from 2.10 to 2.12, as multi-rail can cause this kind of issue, especially when storage and compute are separated and managed by different teams. A misconfigured client can cause such trouble on the server side. Note that in our case we don't want any tcp NID at all, but there is no way to avoid that in 2.12 as far as I know. In 2.10 there is no risk of having this situation on the servers.
We tracked down the problem today to the lnet.service script, which induces a race between lnet configure and lnet import:
# cat /usr/lib/systemd/system/lnet.service
[Unit]
Description=lnet management
Requires=network-online.target
After=network-online.target openibd.service rdma.service opa.service
ConditionPathExists=!/proc/sys/lnet/
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs
[Install]
WantedBy=multi-user.target
Even with our workaround, which works in most cases, some clients can come up with a tcp0 NID after lnet configure, and thus risk announcing themselves with a tcp NID when the filesystem tries to mount:
2019-11-05T22:10:26-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
2019-11-05T22:10:26-08:00 sh-117-11 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.10.117.11@tcp [8/256/0/180]
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Accept secure, port 988
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Using FastReg for registration
2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.9.117.11@o2ib4 [8/256/0/180]
2019-11-05T22:10:29-08:00 sh-117-11 kernel: LNet: Removed LNI 10.10.117.11@tcp
2019-11-05T22:10:43-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 38 seconds
2019-11-05T22:11:06-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 61 seconds
2019-11-05T22:11:08-08:00 sh-117-11 shine: Starting shine: WARNING: Nothing to mount on sh-117-11 for `regal'
2019-11-05T22:11:08-08:00 sh-117-11 shine: WARNING: Nothing was done for `regal'.
2019-11-05T22:11:08-08:00 sh-117-11 shine: Mount of fir on /scratch failed
2019-11-05T22:11:08-08:00 sh-117-11 shine: >> mount.lustre: mount 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir at /scratch failed: Input/output error
2019-11-05T22:11:08-08:00 sh-117-11 shine: Is the MGS running?
2019-11-05T22:11:08-08:00 sh-117-11 shine: [FAILED]
2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service: control process exited, code=exited status=16
2019-11-05T22:11:08-08:00 sh-117-11 systemd: Failed to start SYSV: Lustre shine mounting script.
2019-11-05T22:11:08-08:00 sh-117-11 systemd: Unit shine.service entered failed state.
2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service failed.
2019-11-05T22:11:33-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
Then, the servers keep trying to contact these clients via the erroneous tcp NID, even though they have neither a tcp interface nor a route to that network, and we end up with problems like these:
[Sat Nov 9 23:58:14 2019][461981.327905] LustreError: 80884:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0fe1e7f7400 ns: mdt-fir-MDT0000_UUID lock: ffffa11943671200/0x675684f65f9baf7 lrc: 3/0,0 mode: CR/CR res: [0x200038966:0x418:0x0].0x0 bits 0x9/0x0 rrc: 2 type: IBT flags: 0x50200000000000 nid: 10.10.117.11@tcp remote: 0x541e831b11b117da expref: 250 pid: 80884 timeout: 0 lvb_type: 0
We also think that the backtraces below are due to MDT threads being stuck on tcp NIDs:
[Sun Nov 10 06:24:59 2019][485187.377753] Pid: 67098, comm: mdt03_047 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[Sun Nov 10 06:25:00 2019][485187.388020] Call Trace:
[Sun Nov 10 06:25:00 2019][485187.390566] [<ffffffffc10ccb75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.397597] [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.404884] [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[Sun Nov 10 06:25:00 2019][485187.411808] [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
[Sun Nov 10 06:25:00 2019][485187.418892] [<ffffffffc15da40d>] mdt_getattr_name_lock+0x101d/0x1c30 [mdt]
[Sun Nov 10 06:25:00 2019][485187.425989] [<ffffffffc15e1d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[Sun Nov 10 06:25:00 2019][485187.432638] [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]
[Sun Nov 10 06:25:00 2019][485187.439230] [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.446073] [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.453272] [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.459514] [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.466549] [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.474357] [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[Sun Nov 10 06:25:00 2019][485187.480780] [<ffffffffbe8c2e81>] kthread+0xd1/0xe0
[Sun Nov 10 06:25:00 2019][485187.485775] [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21
[Sun Nov 10 06:25:00 2019][485187.492327] [<ffffffffffffffff>] 0xffffffffffffffff
Yesterday, MDT0 on Fir went down and was completely hung, after logging thousands of messages with tcp NIDs and backtraces like the one above, all of that apparently due to 2 clients announcing themselves as having a tcp NID. Please advise if there is a way to completely disable multi-rail and avoid this situation. I would recommend increasing the severity of this issue, as it has caused a lot of trouble since 2.12, but I'm glad we're finally making progress. Thanks much! |
| Comment by Stephane Thiell [ 12/Nov/19 ] |
[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
peer:
- primary nid: 10.10.23.14@tcp
Multi-Rail: True
peer ni:
- nid: 10.8.23.14@o2ib6
state: NA
- nid: 10.10.23.14@tcp
state: NA
[root@fir-md1-s1 fir-MDT0000]# lctl ping 10.10.23.14@tcp
failed to ping 10.10.23.14@tcp: Input/output error
[root@fir-md1-s1 fir-MDT0000]# lctl ping 10.8.23.14@o2ib6
12345-0@lo
12345-10.8.23.14@o2ib6
I was able to manually remove the TCP nid with this: [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.23.14@tcp --nid 10.10.23.14@tcp
[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
show:
- peer:
errno: -2
descr: "cannot get peer information: No such file or directory"
[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
show:
- peer:
errno: -2
descr: "cannot get peer information: No such file or directory"
[root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
peer:
- primary nid: 10.8.23.14@o2ib6
Multi-Rail: True
peer ni:
- nid: 10.8.23.14@o2ib6
state: NA
|
| Comment by Amir Shehata (Inactive) [ 12/Nov/19 ] |
|
If the TCP network is configured on the node, then it'll be propagated due to the discovery feature. There are three solutions:
1) remove the tcp NID from all the nodes if you don't need it;
2) turn off discovery on all the nodes;
3) explicitly configure the peers (but that would be a lot of config; I believe you can use ip2nets syntax, though). |
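For option 2, a minimal sketch of disabling discovery on a 2.12 node (illustrative commands, not taken from the ticket; the module parameter makes the setting persistent, while lnetctl changes the running configuration):
# persistent across reboots: set the module parameter before lnet loads
echo "options lnet lnet_peer_discovery_disabled=1" >> /etc/modprobe.d/lnet.conf
# on an already-running node: turn discovery off and verify
lnetctl set discovery 0
lnetctl global show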
| Comment by Stephane Thiell [ 12/Nov/19 ] |
|
Amir, our lnet.conf on the clients is as follows: [root@sh-101-01 ~]# cat /etc/lnet.conf
global:
- retry_count: 0
- health_sensitivity: 0
- transaction_timeout: 10
net:
- net type: o2ib4
local NI(s):
- nid:
interfaces:
0: ib0
route:
- net: o2ib5
gateway: 10.9.0.[41-42]@o2ib4
- net: o2ib7
gateway: 10.9.0.[21-24]@o2ib4
But when lnet is loaded, it does an lnet configure before importing that file, which I think might propagate a tcp NID in some rare cases. How can we be sure to disable discovery everywhere without any race condition? How do you do that? We really don't use multi-rail at all in our case. Thanks!! |
| Comment by Amir Shehata (Inactive) [ 12/Nov/19 ] |
|
Hi Stephane, lnetctl lnet configure should not configure any networks. The default tcp network would get configured if somewhere you're doing lctl net up; that loads the default tcp network. To disable discovery you can add "options lnet lnet_peer_discovery_disabled=1" on all the nodes. My hunch at the moment is that there are some nodes which are using lctl net up or lnetctl lnet configure --all. This would lead to the tcp network being loaded, especially if you don't have an "options lnet networks=..." line in your modprobe.d/lnet.conf file. Would you be able to check that? |
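For illustration, a minimal /etc/modprobe.d/lnet.conf along these lines, assuming the clients should only ever bring up o2ib4 on ib0 (network and interface names taken from the config earlier in this ticket; this is a sketch, not the site's actual file):
# /etc/modprobe.d/lnet.conf
# pin LNet to the IB network so an implicit "lctl net up" cannot fall back to tcp
options lnet networks="o2ib4(ib0)"
# optionally also disable peer discovery entirely
options lnet lnet_peer_discovery_disabled=1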
| Comment by Stephane Thiell [ 12/Nov/19 ] |
|
Hi Amir, Thanks for your help! This was useful. I confirm that lnetctl lnet configure does not configure any networks, my bad! I guess we've just figured out what was wrong in our setup, and you were very close: a service that mounts our Lustre filesystems was doing a modprobe lustre, and in some (rare) cases the filesystem mount that followed happened at the same time as lnet.service, leading to a tcp NID being propagated to the servers and causing the trouble I described on the server side. We have fixed the dependencies of the boot-time services on our clients, so this hopefully should not happen anymore! modprobe lustre seems to do the same as lctl net up and does configure a default tcp network if LNet is not configured yet. Sorry for the noise; after all, it looks like this was never a problem with lnet.service. |
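As an illustration of this kind of ordering fix, a sketch of a systemd drop-in that forces the mount service to wait for lnet.service, assuming the mount is driven by shine.service as in the logs above (the drop-in path and contents are hypothetical, not the site's actual fix):
# /etc/systemd/system/shine.service.d/after-lnet.conf  (hypothetical drop-in)
[Unit]
Requires=lnet.service
After=lnet.service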
| Comment by Peter Jones [ 12/Nov/19 ] |
|
Good news - so can we consider this ticket resolved? |
| Comment by Stephane Thiell [ 12/Nov/19 ] |
|
Yes, good news! We appreciated the help, thanks! Sorry it took us so much time to figure that out. We've also added some "rogue NID monitoring" on the server side, just in case some clients continue to be misconfigured. We prefer to leave the default LNet discovery enabled for now, but it's good to know that we have the option to disable it if we want to. I'm OK with considering this ticket resolved at this point. |
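A minimal sketch of the kind of server-side "rogue NID" check mentioned above, assuming no peer on these servers should ever carry a tcp NID (the site's actual monitoring is not shown in the ticket):
#!/bin/bash
# flag any LNet peer that exposes a tcp NID on this server
if lnetctl peer show | grep -q '@tcp'; then
    echo "rogue tcp NID detected on $(hostname):"
    lnetctl peer show | grep '@tcp'
    exit 1
fi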
| Comment by Peter Jones [ 12/Nov/19 ] |
|
ok - thanks |