[LU-13367] lnet_handle_local_failure messages every 10 min ? Created: 17/Mar/20  Updated: 15/Oct/20  Resolved: 15/Oct/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Michael Ethier (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

LNet routers built on Lenovo hardware with Lustre 2.13.0 installed. The installed IB card is a 2-port Lenovo ConnectX-5, with one port connected to the FDR fabric and one port connected to the HDR fabric.


Attachments: File lustre.log.gz     File lustre.log.gz    
Epic/Theme: lustre-2.13
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello,
We have 8 LNet routers in production. We have set them up using /etc/modprobe.d/lustre.conf as their configuration file. We are seeing many messages in /var/log/messages about the FDR and HDR IB interfaces being in recovery. The messages appear every 10 minutes, it seems. Are these benign or are they serious? I searched and couldn't seem to find any answers. The LNet routers appear to be processing data, and no one is complaining at this point. Here is a sample of the messages below, along with some other details. Any insight would be appreciated as to their severity, and whether it is possible to fix the issue if there is one.
Thanks,
Mike

Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages

[root@cannonlnet07 ~]# nslookup 10.31.160.253
253.160.31.10.in-addr.arpa name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.

[root@cannonlnet07 ~]# nslookup 10.31.179.178
178.179.31.10.in-addr.arpa name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.

[root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
options lnet forwarding="enabled"
options lnet lnet_peer_discovery_disabled=1
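As a quick sanity check that these module options took effect, lctl list_nids prints the NIDs of every local NI that LNet actually configured (a sketch; the expected output is inferred from the lnetctl net show output below):

lctl list_nids
# expected here, one NID per configured network (0@lo is omitted):
# 10.31.160.253@o2ib
# 10.31.160.253@o2ib2
# 10.31.179.178@o2ib4
# 10.31.8.93@tcp
# 10.31.73.39@tcp4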
[root@cannonlnet07 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.31.160.253@o2ib
          status: up
          interfaces:
              0: ib1
    - net type: o2ib2
      local NI(s):
        - nid: 10.31.160.253@o2ib2
          status: up
          interfaces:
              0: ib1
    - net type: o2ib4
      local NI(s):
        - nid: 10.31.179.178@o2ib4
          status: up
          interfaces:
              0: ib0
    - net type: tcp
      local NI(s):
        - nid: 10.31.8.93@tcp
          status: down
          interfaces:
              0: bond0
    - net type: tcp4
      local NI(s):
        - nid: 10.31.73.39@tcp4
          status: down
          interfaces:
              0: bond0.2475

[root@cannonlnet07 ~]# lnetctl stats show
statistics:
msgs_alloc: 1517
msgs_max: 16396
rst_alloc: 568
errors: 0
send_count: 9287639
resend_count: 12378
response_timeout_count: 28115
local_interrupt_count: 0
local_dropped_count: 24050
local_aborted_count: 0
local_no_route_count: 0
local_timeout_count: 14188
local_error_count: 0
remote_dropped_count: 3862
remote_error_count: 0
remote_timeout_count: 0
network_timeout_count: 0
recv_count: 9287639
route_count: 2744617426
drop_count: 50252
send_length: 1039854144
recv_length: 283232
route_length: 125943144442551
drop_length: 24066088

[root@cannonlnet07 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 3
transaction_timeout: 10
health_sensitivity: 100
recovery_interval: 1
router_sensitivity: 100

[root@cannonlnet07 ~]# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.26.1040
Hardware version: 0
Node GUID: 0x98039b0300907de0
System image GUID: 0x98039b0300907de0
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 1422
LMC: 0
SM lid: 1434
Capability mask: 0x2651e848
Port GUID: 0x98039b0300907de0
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4119
Number of ports: 1
Firmware version: 16.26.1040
Hardware version: 0
Node GUID: 0x98039b0300907de1
System image GUID: 0x98039b0300907de0
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 2259
LMC: 0
SM lid: 158
Capability mask: 0x2651e848
Port GUID: 0x98039b0300907de1
Link layer: InfiniBand

[root@cannonlnet07 ~]# ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.31.179.178 netmask 255.255.240.0 broadcast 10.31.191.255
inet6 fe80::9a03:9b03:90:7de0 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 343090 bytes 34310200 (32.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 112049 bytes 6723124 (6.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

[root@cannonlnet07 ~]# ifconfig ib1
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.31.160.253 netmask 255.255.240.0 broadcast 10.31.175.255
inet6 fe80::9a03:9b03:90:7de1 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 495909 bytes 50980886 (48.6 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 18846 bytes 1130904 (1.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0



 Comments   
Comment by Michael Ethier (Inactive) [ 17/Mar/20 ]

Also, these systems are using the Mellanox OFED stack; installed on all 8 of them is the following:

[root@cannonlnet07 ~]# ofed_info
MLNX_OFED_LINUX-4.7-1.0.0.1 (OFED-4.7-1.0.0):
ar_mgr:
osm_plugins/ar_mgr/ar_mgr-1.0-0.45.MLNX20190923.g5aec6dc.tar.gz

cc_mgr:
osm_plugins/cc_mgr/cc_mgr-1.0-0.44.MLNX20190923.g5aec6dc.tar.gz

dapl:
dapl.git mlnx_ofed_4_0
commit bdb055900059d1b8d5ee8cdfb457ca653eb9dd2d
dump_pr:
osm_plugins/dump_pr//dump_pr-1.0-0.40.MLNX20190923.g5aec6dc.tar.gz

fabric-collector:
fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz

gpio-mlxbf:
mlnx_ofed_soc/gpio-mlxbf-1.0-0.g6d44a8a.src.rpm

hcoll:
mlnx_ofed_hcol/hcoll-4.4.2938-1.src.rpm

i2c-mlx:
mlnx_ofed_soc/i2c-mlx-1.0-0.gab579c6.src.rpm

ibacm:
mlnx_ofed/ibacm.git mlnx_ofed_4_1
commit 4ae5d193f628c71bac481218bc88d0f77f7eff9a
ibdump:
sniffer/sniffer-5.0.0-3/ibdump/linux/ibdump-5.0.0-3.tgz

ibsim:
mlnx_ofed_ibsim/ibsim-0.7mlnx1-0.11.g85c342b.tar.gz

ibutils:
ofed-1.5.3-rpms/ibutils/ibutils-1.5.7.1-0.12.gdcaeae2.tar.gz

ibutils2:
ibutils2/ibutils2-2.1.1-0.110.MLNX20190922.gd4efc48.tar.gz

infiniband-diags:
mlnx_ofed_infiniband_diags/infiniband-diags-5.4.0.MLNX20190908.5f40e4f.tar.gz

iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

kernel-mft:
mlnx_ofed_mft/kernel-mft-4.13.0-102.src.rpm

knem:
knem.git mellanox-master
commit 8d875eed562b1df72365302881f3aab1b33b23b8
libdisni:
upstream/disni.git master
commit b50eee08b66b14b02594ee485cdc7dfd9820bbae
libibcm:
mlnx_ofed/libibcm.git mlnx_ofed_4_1
commit e3e9fffe4d2d2f730110a7bdeb7da7b8ea97e51e
libibmad:
mlnx_ofed_libibmad/libibmad-5.4.0.MLNX20190423.1d917ae.tar.gz

libibumad:
mlnx_ofed_libibumad/libibumad-43.1.1.MLNX20190905.1080879.tar.gz

libibverbs:
mlnx_ofed/libibverbs.git mlnx_ofed_4_7
commit fc0a8c1ccb6f883b5ed321f1d17fb9f3b4f85bef
libmlx4:
mlnx_ofed/libmlx4.git mlnx_ofed_4_7
commit 819cf8a7fc4ec35065659b97035159bb11128bb4
libmlx5:
mlnx_ofed/libmlx5.git mlnx_ofed_4_7
commit aaa125580c1d20b824569ad18df71c2a051f2697
libpka:
mlnx_ofed_soc/libpka-1.0-1.g6cc68a2.src.rpm

librdmacm:
mlnx_ofed/librdmacm.git mlnx_ofed_4_7
commit 9cac524f5dbba911c2927b20e7137fb0e2f1b995
librxe:
mlnx_ofed/librxe.git master
commit afd75c6164324ac68a3e39f3c988ce85978669d4
libvma:
vma/source_rpms//libvma-8.9.4-0.src.rpm

mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

mlnx-ethtool:
mlnx_ofed/ethtool.git mlnx_ofed_4_7
commit 5b82426a01b5d76d6f5a1c58e1e419323ab29eaf
mlnx-iproute2:
mlnx_ofed/iproute2.git mlnx_ofed_4_7
commit 1c4837862432f8d4a7fdd4bc8f598d7ebab354c3
mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

mlnx-rdma-rxe:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

mlx-bootctl:
mlnx_ofed_soc/mlx-bootctl-1.1-0.g5b90483.src.rpm

mlx-l3cache:
mlnx_ofed_soc/mlx-l3cache-0.1-1.gebb0728.src.rpm

mlx-pmc:
mlnx_ofed_soc/mlx-pmc-1.1-0.g1141c2e.src.rpm

mlx-trio:
mlnx_ofed_soc/mlx-trio-0.1-1.g9d13513.src.rpm

mlxbf-livefish:
mlnx_ofed_soc/mlxbf-livefish-1.0-0.gec08328.src.rpm

mpi-selector:
ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm

mpitests:
mlnx_ofed_mpitest/mpitests-3.2.20-e1a0676.src.rpm

mstflint:
mlnx_ofed_mstflint/mstflint-4.13.0-1.41.g4e8819c.tar.gz

multiperf:
mlnx_ofed_multiperf/multiperf-3.0-0.13.gcdaa426.tar.gz

mxm:
mlnx_ofed_mxm/mxm-3.7.3112-1.src.rpm

nvme-snap:
nvme/nvme-snap-2.0.1-2.mlnx.src.rpm

ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc

openmpi:
mlnx_ofed_ompi_1.8/openmpi-4.0.2rc3-1.src.rpm

opensm:
mlnx_ofed_opensm/opensm-5.5.0.MLNX20190923.1c78385.tar.gz

openvswitch:
openvswitch.git mlnx_ofed_4_7
commit 5f4b5233413b20960336fa06ba309b53e9a99ae5
perftest:
mlnx_ofed_perftest/perftest-4.4-0.8.g7af08be.tar.gz

pka-mlxbf:
mlnx_ofed_soc/pka-mlxbf-1.0-0.gd3455da.src.rpm

qperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-4.6-1.0.1/SRPMS/qperf-0.4.9-9.46101.src.rpm

rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_4_7
commit 87e9049a32f7cda5f5aa36145400fb660f4dbc18
rshim:
mlnx_ofed_soc/rshim-1.8-0.g463f780.src.rpm

sharp:
mlnx_ofed_sharp/sharp-2.0.0.MLNX20190922.a9ebf22.tar.gz

sockperf:
sockperf/sockperf-3.6-0.git737d1e3e5576.src.rpm

srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

srptools:
srptools/srptools-41mlnx1-5.src.rpm

tmfifo:
mlnx_ofed_soc/tmfifo-1.3-0.g6aa30c7.src.rpm

ucx:
mlnx_ofed_ucx/ucx-1.7.0-1.src.rpm

Installed Packages:
-------------------

kmod-srp
libibverbs-utils
libibcm-devel
ibacm
dapl
infiniband-diags
qperf
ucx-rdmacm
knem
libibverbs
librxe
libibmad-devel
opensm
mstflint
ar_mgr
ucx-cma
mlnx-iproute2
kmod-mlnx-ofa_kernel
kmod-knem
kmod-rshim
libibverbs-devel
libmlx4-devel
librxe-devel-static
libibumad-devel
libibmad-static
librdmacm-utils
opensm-devel
dapl-devel-static
ibutils
ibdump
ucx
ucx-ib
hcoll
mlnxofed-docs
mlnx-ofa_kernel-devel
kmod-iser
mpi-selector
libibverbs-devel-static
libmlx5
libibcm
libibumad-static
ibsim
librdmacm-devel
opensm-static
dapl-utils
srptools
cc_mgr
infiniband-diags-compat
ucx-devel
ucx-ib-cm
openmpi
mpitests_openmpi
kmod-kernel-mft-mlnx
libmlx5-devel
libibmad
opensm-libs
perftest
dump_pr
sharp
mlnx-ethtool
mlnx-ofa_kernel
kmod-isert
libmlx4
libibumad
librdmacm
dapl-devel
ibutils2
mxm
ucx-knem

Comment by Amir Shehata (Inactive) [ 17/Mar/20 ]

These messages have been reduced in severity. Here is the patch:

LU-13071 lnet: reduce log severity for health events

It would be interesting to know what happens at the 10-minute interval, though.

Can you share the output of "lnetctl stats show"?

Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?
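A minimal sketch of one way to do that capture, using only the basic lctl debug interface (the output path is just an example):

lctl set_param debug=+net     # add 'net' to the kernel debug mask
lctl get_param debug          # verify the mask now includes 'net'
# ... wait for the next recovery message in /var/log/messages ...
lctl dk > /tmp/lnet-debug.txt # dump and clear the kernel debug buffer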


Comment by Michael Ethier (Inactive) [ 17/Mar/20 ]

[root@cannonlnet07 ~]# lnetctl stats show
statistics:
msgs_alloc: 1518
msgs_max: 16396
rst_alloc: 568
errors: 0
send_count: 9398737
resend_count: 12378
response_timeout_count: 28678
local_interrupt_count: 0
local_dropped_count: 24613
local_aborted_count: 0
local_no_route_count: 0
local_timeout_count: 14188
local_error_count: 0
remote_dropped_count: 3862
remote_error_count: 0
remote_timeout_count: 0
network_timeout_count: 0
recv_count: 9398737
route_count: 2778855491
drop_count: 50252
send_length: 1052296560
recv_length: 283792
route_length: 127811084565569
drop_length: 24066088

Here are the last 200 lines of LNet messages. I don't see anything of much interest, do you?
[root@cannonlnet07 ~]# tail -n 200 /var/log/messages |grep -v dhclient |grep -v puppet
Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.176.98@o2ib4: 1 seconds
Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 20 previous similar messages
Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 13:25:52 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
Mar 17 14:21:13 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 14:33:40 cannonlnet07 dbus[1267]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar 17 14:33:40 cannonlnet07 dbus[1267]: [system] Successfully activated service 'org.freedesktop.problems'
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
Mar 17 15:47:09 cannonlnet07 kernel: LNet: 3014:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4
Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 87 previous similar messages
Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4
Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) Skipped 1 previous similar message
Mar 17 15:56:14 cannonlnet07 dbus[1267]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar 17 15:56:14 cannonlnet07 dbus[1267]: [system] Successfully activated service 'org.freedesktop.problems'
Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 82 previous similar messages

Sure, I will enable debug and send the logs.

Comment by Michael Ethier (Inactive) [ 17/Mar/20 ]

Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?

What are the additional commands to capture the "logs"?

Comment by Amir Shehata (Inactive) [ 17/Mar/20 ]

Before we capture the logs, can we try the recommendation below and monitor the errors?

I see a few tx timeouts and a couple of PUT_NACKs. These could result in the failure of some RDMA operations, which triggers the health code.

I see that you have transaction_timeout: 10 and retry_count: 3.

Those have since been defaulted to 50s and 2, respectively. We found that on larger clusters the 10s timeout is too short, causing RDMA timeouts. Can you try setting it to 50s? The patch which changed the defaults is:

LU-13145 lnet: use conservative health timeouts

You can set it manually:

lnetctl set transaction_timeout 50 
lnetctl set retry_count 2
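Note that lnetctl set changes do not persist across a reboot or an LNet reload. A minimal sketch of making them permanent, assuming the node imports /etc/lnet.conf at LNet startup (as the lnet systemd service does via lnetctl import); the YAML keys mirror the lnetctl global show output:

# /etc/lnet.conf (sketch)
global:
    transaction_timeout: 50
    retry_count: 2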

Comment by Michael Ethier (Inactive) [ 17/Mar/20 ]

I set those 2 parameters and I still see the recovery messages:

[root@cannonlnet07 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 2
transaction_timeout: 50
health_sensitivity: 100
recovery_interval: 1
router_sensitivity: 100
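For context, these two tunables interact: as I understand the LNet health design, the per-attempt LND timeout is derived from them, so the change makes each transmit attempt far more forgiving:

lnd_timeout = transaction_timeout / (retry_count + 1)
old settings: 10 / (3 + 1) = 2.5 s per transmit attempt
new settings: 50 / (2 + 1) ≈ 16.7 s per transmit attempt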

Comment by Amir Shehata (Inactive) [ 17/Mar/20 ]

I'm wondering if there is a threshold for the transaction_timeout above which this goes away. Can you try setting it to 100?

If you still see the problem, I would:

lctl set_param debug=+net
lctl debug_daemon start lustre.dk [megabytes] # make it as big as possible, e.g. 1G (if you have the space)
# wait until the problem happens
lctl debug_daemon stop
lctl set_param debug=-net
lctl debug_file lustre.dk lustre.log
# attach or upload the lustre.log file depending on how big the file is

Here is the relevant lustre manual section for debug_daemon commands:

37.2.3.1. lctl debug_daemon Commands

https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140539595165600

Comment by Michael Ethier (Inactive) [ 18/Mar/20 ]

Hi, I still see the messages with the 100 setting, so I captured lustre.log when the message occurred. It's attached. The messages occurred at Mar 17 20:28:03; see below.

Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max)
Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 26 previous similar messages
Mar 17 20:28:06 cannonlnet07 kernel: Lustre: shutting down debug daemon thread...

lustre.log.gz

Comment by Michael Ethier (Inactive) [ 19/Mar/20 ]

Hello, have you had a chance to look at the log file to see if you can find the cause of the ongoing messages?
Thanks,
Mike

Comment by Amir Shehata (Inactive) [ 19/Mar/20 ]

So it looks like there are a couple of nodes which are causing all the problems:

00000800:00000100:2.0:1584491283.682847:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.176.98@o2ib4: connection failed
00000800:00000100:2.0:1584491283.691935:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.167.172@o2ib: connection failed

Whenever we try to connect to these peers, the connection fails. The code assumes the reason for the failure is local, so it puts the local NI, 10.31.179.178@o2ib4, into recovery, and that is when you see the message.

The reason connections to these NIDs are failing is:

(o2iblnd_cb.c:3174:kiblnd_cm_callback()) 10.31.176.98@o2ib4: ADDR ERROR -110

We're trying to resolve their addresses and we're timing out (-110 is ETIMEDOUT).

Are these nodes "real"? Are they leftover configuration?
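As an aside, the "Health = 900" in the messages is consistent with this picture: with the configured health_sensitivity of 100, each local failure drops an NI's health value by 100 from its 1000-point maximum, so 900 indicates a single recent failure. A sketch for watching the value recover from userspace (assuming the verbose NI health output in 2.12+/2.13 lnetctl; the exact verbosity level may vary by version):

lnetctl net show -v 3 | grep -A 4 "health stats"
# an NI at the maximum is healthy; one below it has taken recent
# failures and cycles through the recovery queue until pings succeed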

Comment by Michael Ethier (Inactive) [ 19/Mar/20 ]

Hi,
Those 2 nodes are valid. One is up and looks to be running fine; the other is down. 10.31.167.172 is up and has been running for over a day, see below (it could have been down when I collected the Lustre log); the other one, 10.31.176.98, is down. So any host out there that is down is going to cause these messages? We will probably have some hosts down at any one time; we have hundreds of nodes on our 2 fabrics. These LNet routers talk to both fabrics and are the bridges between our FDR and HDR fabrics. This situation shouldn't cause big issues on our LNet? I assume that ADDR ERROR -110 is in the Lustre debug log and won't be found in /var/log/messages.
Mike

[root@holy2c18110 ~]# ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
inet 10.31.167.172 netmask 255.255.240.0 broadcast 10.31.175.255
inet6 fe80::ee0d:9a03:12:8991 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 1024 (InfiniBand)
RX packets 687361 bytes 39139716 (37.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 7705 bytes 654574 (639.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

[root@holy2c18110 ~]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.42.5000
Hardware version: 1
Node GUID: 0xec0d9a0300128990
System image GUID: 0xec0d9a0300128993
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 832
LMC: 0
SM lid: 158
Capability mask: 0x02514868
Port GUID: 0xec0d9a0300128991
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xec0d9a0300128992
Link layer: InfiniBand
[root@holy2c18110 ~]# uptime
17:24:26 up 1 day, 7:05, 1 user, load average: 0.00, 0.01, 0.05

Comment by Amir Shehata (Inactive) [ 19/Mar/20 ]

That message has been reduced in severity as I have indicated in a previous comment. So eventually when you upgrade, you won't see it anymore. If the hosts coming up/down is expected, then there shouldn't be a problem, except for the noisiness of this message.

The address resolution error is only seen when net logging is turned on.
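Once a binary dump exists, the relevant failures can be pulled out of the converted text log, for example (a sketch using the exact strings quoted earlier in this ticket):

lctl debug_file lustre.dk lustre.log       # convert the binary dump to text
grep -E 'ADDR ERROR|connection failed' lustre.log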

Comment by Michael Ethier (Inactive) [ 20/Mar/20 ]

Hi, OK, thanks. Do you recommend we upgrade to a newer version of 2.13.x, or drop down to 2.12.4 or 2.12.5 when it comes out? We don't really care about the health and multi-rail functions. We have mostly clients at 2.10.7, the LNet routers at 2.13.0, and Lustre storage at 2.12.3 and 2.12.4, plus some really old Lustre filesystems running 2.5.34 that we are going to decommission.
Mike

Comment by Peter Jones [ 20/Mar/20 ]

My recommendation would be to use a 2.12.x release. If there is a bug fix missing from the 2.12.x branch, we can include it in 2.12.5. ashehata, do you agree?

Comment by Amir Shehata (Inactive) [ 20/Mar/20 ]

Sure. If there is no need for the features in 2.13, then the latest 2.12.x would suffice.

Comment by Michael Ethier (Inactive) [ 24/Mar/20 ]

Thanks for the help. I will let others on my team know about the 2.12 vs 2.13 recommendations.

Comment by John Hammond [ 15/Oct/20 ]

Both changes referenced above (LU-13071 and LU-13145) are in b2_12.
