Details
Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.13.0
Labels: None
Environment: LNet routers built on Lenovo hardware with Lustre 2.13.0 installed. The IB card is a 2-port Lenovo ConnectX-5, with one port connected to the FDR fabric and one port connected to the HDR fabric.
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
Hello,
We have 8 LNet routers in production, configured through /etc/modprobe.d/lustre.conf. We are seeing many messages in /var/log/messages about the FDR and HDR IB interfaces being added to the recovery queue. The messages seem to appear about every 10 minutes. Are they benign or serious? I searched but couldn't find any answers. The LNet routers appear to be processing data, and no one is complaining at this point. A sample of the messages and some other details are below. Any insight into their severity, and into whether the issue (if there is one) can be fixed, would be appreciated.
Thanks,
Mike
Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
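To put numbers on "every 10 min", the log excerpt above can be summarized with a small script. This is only a sketch: the regular expressions are written against the exact message format shown here, the function name `summarize` is made up for illustration, and `ASSUMED_YEAR` is a placeholder because syslog timestamps carry no year.

```python
import re
from collections import Counter
from datetime import datetime

ASSUMED_YEAR = 2020  # assumption: syslog lines have no year, any fixed year works for gaps

# Matches the "added to recovery queue" lines in the format shown in the excerpt.
EVENT_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+LNetError:.*"
    r"ni (?P<nid>\S+) added to recovery queue\. Health = (?P<health>\d+)"
)
# Matches the kernel rate-limiting lines.
SKIPPED_RE = re.compile(r"Skipped (?P<n>\d+) previous similar messages")


def summarize(lines):
    """Return (reported events per NID, total skipped messages, gaps in seconds)."""
    per_nid = Counter()
    skipped = 0
    stamps = []
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            per_nid[event.group("nid")] += 1
            stamps.append(
                datetime.strptime(
                    f"{ASSUMED_YEAR} {event.group('ts')}", "%Y %b %d %H:%M:%S"
                )
            )
            continue
        skip = SKIPPED_RE.search(line)
        if skip:
            skipped += int(skip.group("n"))
    # Seconds between consecutive reported events.
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return per_nid, skipped, gaps
```

Note that the "Skipped N previous similar messages" lines mean the kernel log is rate limited, so the roughly 10 minute spacing reflects how often a message is printed rather than how often an interface is actually added to the recovery queue.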
[root@cannonlnet07 ~]# nslookup 10.31.160.253
253.160.31.10.in-addr.arpa name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.
[root@cannonlnet07 ~]# nslookup 10.31.179.178
178.179.31.10.in-addr.arpa name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.
[root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
options lnet forwarding="enabled"
options lnet lnet_peer_discovery_disabled=1
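As a reading aid for the `networks` option above, the following sketch groups the LNet networks by the interface they are bound to. The helper name `nets_by_interface` is hypothetical, and the parsing only covers the simple `net(iface[,iface...])` form used in this file; it does not attempt the full LNet networks syntax.

```python
import re
from collections import defaultdict


def nets_by_interface(networks):
    """Map each interface name to the LNet networks configured on it."""
    mapping = defaultdict(list)
    # Each entry looks like net(iface) or net(iface1,iface2).
    for net, ifaces in re.findall(r"(\w+)\(([^)]*)\)", networks):
        for iface in ifaces.split(","):
            iface = iface.strip()
            if iface:
                mapping[iface].append(net)
    return dict(mapping)


if __name__ == "__main__":
    value = "o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
    for iface, nets in nets_by_interface(value).items():
        print(iface, nets)
```

For this configuration the mapping shows that ib1 (the FDR port) carries both o2ib and o2ib2, while ib0 (the HDR port) carries only o2ib4.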
[root@cannonlnet07 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.31.160.253@o2ib
          status: up
          interfaces:
              0: ib1
    - net type: o2ib2
      local NI(s):
        - nid: 10.31.160.253@o2ib2
          status: up
          interfaces:
              0: ib1
    - net type: o2ib4
      local NI(s):
        - nid: 10.31.179.178@o2ib4
          status: up
          interfaces:
              0: ib0
    - net type: tcp
      local NI(s):
        - nid: 10.31.8.93@tcp
          status: down
          interfaces:
              0: bond0
    - net type: tcp4
      local NI(s):
        - nid: 10.31.73.39@tcp4
          status: down
          interfaces:
              0: bond0.2475
[root@cannonlnet07 ~]# lnetctl stats show
statistics:
msgs_alloc: 1517
msgs_max: 16396
rst_alloc: 568
errors: 0
send_count: 9287639
resend_count: 12378
response_timeout_count: 28115
local_interrupt_count: 0
local_dropped_count: 24050
local_aborted_count: 0
local_no_route_count: 0
local_timeout_count: 14188
local_error_count: 0
remote_dropped_count: 3862
remote_error_count: 0
remote_timeout_count: 0
network_timeout_count: 0
recv_count: 9287639
route_count: 2744617426
drop_count: 50252
send_length: 1039854144
recv_length: 283232
route_length: 125943144442551
drop_length: 24066088
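One rough way to judge the counters above is to compare the error-type counts against the traffic volume. The sketch below parses the flat `key: value` text pasted here and computes two ratios. The helper names are made up, and the choice of ratios is an assumption about what is informative, not an official health metric.

```python
def parse_stats(text):
    """Parse 'key: value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers such as "statistics:" that carry no numeric value.
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def ratio(stats, numerator, denominator):
    """Return numerator/denominator, or 0.0 if the denominator is missing or zero."""
    denom = stats.get(denominator, 0)
    return stats.get(numerator, 0) / denom if denom else 0.0


if __name__ == "__main__":
    text = """statistics:
send_count: 9287639
resend_count: 12378
route_count: 2744617426
drop_count: 50252
"""
    stats = parse_stats(text)
    print(f"drops per routed message: {ratio(stats, 'drop_count', 'route_count'):.2e}")
    print(f"resends per send:         {ratio(stats, 'resend_count', 'send_count'):.2e}")
```

With the values shown here, drops are on the order of two per hundred thousand routed messages and resends are a little over one per thousand sends, which fits the observation that the routers are still moving data. It does not by itself say whether the timeouts are harmless.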
[root@cannonlnet07 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 3
transaction_timeout: 10
health_sensitivity: 100
recovery_interval: 1
router_sensitivity: 100
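The "Health = 900" value in the log messages is consistent with the `health_sensitivity: 100` setting shown here, assuming an interface starts at a maximum health value of 1000 and loses `health_sensitivity` points on each local failure. The sketch below only illustrates that arithmetic; the constant and function names are illustrative and this is not the LNet implementation.

```python
MAX_HEALTH = 1000  # assumption: maximum LNet interface health value


def health_after_failures(failures, sensitivity, start=MAX_HEALTH):
    """Health after a number of failures, each costing `sensitivity` points."""
    return max(0, start - failures * sensitivity)


if __name__ == "__main__":
    # One failure at health_sensitivity=100 gives the value seen in the log.
    print(health_after_failures(1, 100))
```

Under that assumption, a reported health of 900 would mean the interface had taken a single decrement from full health when the message was printed.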
[root@cannonlnet07 ~]# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.26.1040
Hardware version: 0
Node GUID: 0x98039b0300907de0
System image GUID: 0x98039b0300907de0
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 1422
LMC: 0
SM lid: 1434
Capability mask: 0x2651e848
Port GUID: 0x98039b0300907de0
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4119
Number of ports: 1
Firmware version: 16.26.1040
Hardware version: 0
Node GUID: 0x98039b0300907de1
System image GUID: 0x98039b0300907de0
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 2259
LMC: 0
SM lid: 158
Capability mask: 0x2651e848
Port GUID: 0x98039b0300907de1
Link layer: InfiniBand
[root@cannonlnet07 ~]# ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.31.179.178 netmask 255.255.240.0 broadcast 10.31.191.255
inet6 fe80::9a03:9b03:90:7de0 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 343090 bytes 34310200 (32.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 112049 bytes 6723124 (6.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@cannonlnet07 ~]# ifconfig ib1
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.31.160.253 netmask 255.255.240.0 broadcast 10.31.175.255
inet6 fe80::9a03:9b03:90:7de1 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 495909 bytes 50980886 (48.6 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 18846 bytes 1130904 (1.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Both changes referenced above (LU-13071 and LU-13145) are in b2_12.