<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:00:40 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13367] lnet_handle_local_failure messages every 10 min ?</title>
                <link>https://jira.whamcloud.com/browse/LU-13367</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hello,&lt;br/&gt;
We have 8 LNet routers in production, set up using /etc/modprobe.d/lustre.conf as their configuration file. We are seeing many messages in /var/log/messages about the FDR and HDR IB interfaces being in recovery. The messages seem to appear every 10 minutes. Are these benign or serious? I searched and couldn&apos;t find any answers. The LNet routers appear to be processing data, and no one is complaining at this point. A sample of the messages and some other details is below. Any insight into their severity, and whether the issue (if there is one) can be fixed, would be appreciated.&lt;br/&gt;
Thanks,&lt;br/&gt;
Mike&lt;/p&gt;


&lt;p&gt;Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages&lt;br/&gt;
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages&lt;br/&gt;
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;/p&gt;


&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# nslookup 10.31.160.253&lt;br/&gt;
253.160.31.10.in-addr.arpa	name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# nslookup 10.31.179.178&lt;br/&gt;
178.179.31.10.in-addr.arpa	name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# more /etc/modprobe.d/lustre.conf &lt;br/&gt;
options lnet networks=&quot;o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)&quot;&lt;br/&gt;
options lnet forwarding=&quot;enabled&quot;&lt;br/&gt;
options lnet lnet_peer_discovery_disabled=1&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# lnetctl net show&lt;br/&gt;
net:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;net type: lo&lt;br/&gt;
      local NI(s):&lt;/li&gt;
	&lt;li&gt;nid: 0@lo&lt;br/&gt;
          status: up&lt;/li&gt;
	&lt;li&gt;net type: o2ib&lt;br/&gt;
      local NI(s):&lt;/li&gt;
	&lt;li&gt;nid: 10.31.160.253@o2ib&lt;br/&gt;
          status: up&lt;br/&gt;
          interfaces:&lt;br/&gt;
              0: ib1&lt;/li&gt;
	&lt;li&gt;net type: o2ib2&lt;br/&gt;
      local NI(s):&lt;/li&gt;
	&lt;li&gt;nid: 10.31.160.253@o2ib2&lt;br/&gt;
          status: up&lt;br/&gt;
          interfaces:&lt;br/&gt;
              0: ib1&lt;/li&gt;
	&lt;li&gt;net type: o2ib4&lt;br/&gt;
      local NI(s):&lt;/li&gt;
	&lt;li&gt;nid: 10.31.179.178@o2ib4&lt;br/&gt;
          status: up&lt;br/&gt;
          interfaces:&lt;br/&gt;
              0: ib0&lt;/li&gt;
	&lt;li&gt;net type: tcp&lt;br/&gt;
      local NI(s):&lt;/li&gt;
	&lt;li&gt;nid: 10.31.8.93@tcp&lt;br/&gt;
          status: down&lt;br/&gt;
          interfaces:&lt;br/&gt;
              0: bond0&lt;/li&gt;
	&lt;li&gt;net type: tcp4&lt;br/&gt;
      local NI(s):&lt;/li&gt;
	&lt;li&gt;nid: 10.31.73.39@tcp4&lt;br/&gt;
          status: down&lt;br/&gt;
          interfaces:&lt;br/&gt;
              0: bond0.2475&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# lnetctl stats show&lt;br/&gt;
statistics:&lt;br/&gt;
    msgs_alloc: 1517&lt;br/&gt;
    msgs_max: 16396&lt;br/&gt;
    rst_alloc: 568&lt;br/&gt;
    errors: 0&lt;br/&gt;
    send_count: 9287639&lt;br/&gt;
    resend_count: 12378&lt;br/&gt;
    response_timeout_count: 28115&lt;br/&gt;
    local_interrupt_count: 0&lt;br/&gt;
    local_dropped_count: 24050&lt;br/&gt;
    local_aborted_count: 0&lt;br/&gt;
    local_no_route_count: 0&lt;br/&gt;
    local_timeout_count: 14188&lt;br/&gt;
    local_error_count: 0&lt;br/&gt;
    remote_dropped_count: 3862&lt;br/&gt;
    remote_error_count: 0&lt;br/&gt;
    remote_timeout_count: 0&lt;br/&gt;
    network_timeout_count: 0&lt;br/&gt;
    recv_count: 9287639&lt;br/&gt;
    route_count: 2744617426&lt;br/&gt;
    drop_count: 50252&lt;br/&gt;
    send_length: 1039854144&lt;br/&gt;
    recv_length: 283232&lt;br/&gt;
    route_length: 125943144442551&lt;br/&gt;
    drop_length: 24066088&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# lnetctl global show&lt;br/&gt;
global:&lt;br/&gt;
    numa_range: 0&lt;br/&gt;
    max_intf: 200&lt;br/&gt;
    discovery: 0&lt;br/&gt;
    drop_asym_route: 0&lt;br/&gt;
    retry_count: 3&lt;br/&gt;
    transaction_timeout: 10&lt;br/&gt;
    health_sensitivity: 100&lt;br/&gt;
    recovery_interval: 1&lt;br/&gt;
    router_sensitivity: 100&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# ibstat&lt;br/&gt;
CA &apos;mlx5_0&apos;&lt;br/&gt;
	CA type: MT4119&lt;br/&gt;
	Number of ports: 1&lt;br/&gt;
	Firmware version: 16.26.1040&lt;br/&gt;
	Hardware version: 0&lt;br/&gt;
	Node GUID: 0x98039b0300907de0&lt;br/&gt;
	System image GUID: 0x98039b0300907de0&lt;br/&gt;
	Port 1:&lt;br/&gt;
		State: Active&lt;br/&gt;
		Physical state: LinkUp&lt;br/&gt;
		Rate: 100&lt;br/&gt;
		Base lid: 1422&lt;br/&gt;
		LMC: 0&lt;br/&gt;
		SM lid: 1434&lt;br/&gt;
		Capability mask: 0x2651e848&lt;br/&gt;
		Port GUID: 0x98039b0300907de0&lt;br/&gt;
		Link layer: InfiniBand&lt;br/&gt;
CA &apos;mlx5_1&apos;&lt;br/&gt;
	CA type: MT4119&lt;br/&gt;
	Number of ports: 1&lt;br/&gt;
	Firmware version: 16.26.1040&lt;br/&gt;
	Hardware version: 0&lt;br/&gt;
	Node GUID: 0x98039b0300907de1&lt;br/&gt;
	System image GUID: 0x98039b0300907de0&lt;br/&gt;
	Port 1:&lt;br/&gt;
		State: Active&lt;br/&gt;
		Physical state: LinkUp&lt;br/&gt;
		Rate: 56&lt;br/&gt;
		Base lid: 2259&lt;br/&gt;
		LMC: 0&lt;br/&gt;
		SM lid: 158&lt;br/&gt;
		Capability mask: 0x2651e848&lt;br/&gt;
		Port GUID: 0x98039b0300907de1&lt;br/&gt;
		Link layer: InfiniBand&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# ifconfig ib0&lt;br/&gt;
ib0: flags=4163&amp;lt;UP,BROADCAST,RUNNING,MULTICAST&amp;gt;  mtu 2044&lt;br/&gt;
        inet 10.31.179.178  netmask 255.255.240.0  broadcast 10.31.191.255&lt;br/&gt;
        inet6 fe80::9a03:9b03:90:7de0  prefixlen 64  scopeid 0x20&amp;lt;link&amp;gt;&lt;br/&gt;
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).&lt;br/&gt;
        infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)&lt;br/&gt;
        RX packets 343090  bytes 34310200 (32.7 MiB)&lt;br/&gt;
        RX errors 0  dropped 0  overruns 0  frame 0&lt;br/&gt;
        TX packets 112049  bytes 6723124 (6.4 MiB)&lt;br/&gt;
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# ifconfig ib1&lt;br/&gt;
ib1: flags=4163&amp;lt;UP,BROADCAST,RUNNING,MULTICAST&amp;gt;  mtu 2044&lt;br/&gt;
        inet 10.31.160.253  netmask 255.255.240.0  broadcast 10.31.175.255&lt;br/&gt;
        inet6 fe80::9a03:9b03:90:7de1  prefixlen 64  scopeid 0x20&amp;lt;link&amp;gt;&lt;br/&gt;
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).&lt;br/&gt;
        infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)&lt;br/&gt;
        RX packets 495909  bytes 50980886 (48.6 MiB)&lt;br/&gt;
        RX errors 0  dropped 0  overruns 0  frame 0&lt;br/&gt;
        TX packets 18846  bytes 1130904 (1.0 MiB)&lt;br/&gt;
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0&lt;/p&gt;</description>
<environment>LNet routers built on Lenovo hardware with Lustre 2.13.0 installed. The IB card is a 2-port Lenovo ConnectX-5, with one port connected to the FDR fabric and one to the HDR fabric.</environment>
        <key id="58408">LU-13367</key>
            <summary>lnet_handle_local_failure messages every 10 min ?</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="mre64">Michael Ethier</reporter>
                        <labels>
                    </labels>
                <created>Tue, 17 Mar 2020 18:55:39 +0000</created>
                <updated>Thu, 15 Oct 2020 13:28:06 +0000</updated>
                            <resolved>Thu, 15 Oct 2020 13:28:02 +0000</resolved>
                                    <version>Lustre 2.13.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="265483" author="mre64" created="Tue, 17 Mar 2020 19:17:30 +0000"  >&lt;p&gt;Also, these systems are using the Mellanox OFED stack. Installed on all 8 of them is the following:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# ofed_info &lt;br/&gt;
MLNX_OFED_LINUX-4.7-1.0.0.1 (OFED-4.7-1.0.0):&lt;br/&gt;
ar_mgr:&lt;br/&gt;
osm_plugins/ar_mgr/ar_mgr-1.0-0.45.MLNX20190923.g5aec6dc.tar.gz&lt;/p&gt;

&lt;p&gt;cc_mgr:&lt;br/&gt;
osm_plugins/cc_mgr/cc_mgr-1.0-0.44.MLNX20190923.g5aec6dc.tar.gz&lt;/p&gt;

&lt;p&gt;dapl:&lt;br/&gt;
dapl.git mlnx_ofed_4_0&lt;br/&gt;
commit bdb055900059d1b8d5ee8cdfb457ca653eb9dd2d&lt;br/&gt;
dump_pr:&lt;br/&gt;
osm_plugins/dump_pr//dump_pr-1.0-0.40.MLNX20190923.g5aec6dc.tar.gz&lt;/p&gt;

&lt;p&gt;fabric-collector:&lt;br/&gt;
fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz&lt;/p&gt;

&lt;p&gt;gpio-mlxbf:&lt;br/&gt;
mlnx_ofed_soc/gpio-mlxbf-1.0-0.g6d44a8a.src.rpm&lt;/p&gt;

&lt;p&gt;hcoll:&lt;br/&gt;
mlnx_ofed_hcol/hcoll-4.4.2938-1.src.rpm&lt;/p&gt;

&lt;p&gt;i2c-mlx:&lt;br/&gt;
mlnx_ofed_soc/i2c-mlx-1.0-0.gab579c6.src.rpm&lt;/p&gt;

&lt;p&gt;ibacm:&lt;br/&gt;
mlnx_ofed/ibacm.git mlnx_ofed_4_1&lt;br/&gt;
commit 4ae5d193f628c71bac481218bc88d0f77f7eff9a&lt;br/&gt;
ibdump:&lt;br/&gt;
sniffer/sniffer-5.0.0-3/ibdump/linux/ibdump-5.0.0-3.tgz&lt;/p&gt;

&lt;p&gt;ibsim:&lt;br/&gt;
mlnx_ofed_ibsim/ibsim-0.7mlnx1-0.11.g85c342b.tar.gz&lt;/p&gt;

&lt;p&gt;ibutils:&lt;br/&gt;
ofed-1.5.3-rpms/ibutils/ibutils-1.5.7.1-0.12.gdcaeae2.tar.gz&lt;/p&gt;

&lt;p&gt;ibutils2:&lt;br/&gt;
ibutils2/ibutils2-2.1.1-0.110.MLNX20190922.gd4efc48.tar.gz&lt;/p&gt;

&lt;p&gt;infiniband-diags:&lt;br/&gt;
mlnx_ofed_infiniband_diags/infiniband-diags-5.4.0.MLNX20190908.5f40e4f.tar.gz&lt;/p&gt;

&lt;p&gt;iser:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;isert:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;kernel-mft:&lt;br/&gt;
mlnx_ofed_mft/kernel-mft-4.13.0-102.src.rpm&lt;/p&gt;

&lt;p&gt;knem:&lt;br/&gt;
knem.git mellanox-master&lt;br/&gt;
commit 8d875eed562b1df72365302881f3aab1b33b23b8&lt;br/&gt;
libdisni:&lt;br/&gt;
upstream/disni.git master&lt;br/&gt;
commit b50eee08b66b14b02594ee485cdc7dfd9820bbae&lt;br/&gt;
libibcm:&lt;br/&gt;
mlnx_ofed/libibcm.git mlnx_ofed_4_1&lt;br/&gt;
commit e3e9fffe4d2d2f730110a7bdeb7da7b8ea97e51e&lt;br/&gt;
libibmad:&lt;br/&gt;
mlnx_ofed_libibmad/libibmad-5.4.0.MLNX20190423.1d917ae.tar.gz&lt;/p&gt;

&lt;p&gt;libibumad:&lt;br/&gt;
mlnx_ofed_libibumad/libibumad-43.1.1.MLNX20190905.1080879.tar.gz&lt;/p&gt;

&lt;p&gt;libibverbs:&lt;br/&gt;
mlnx_ofed/libibverbs.git mlnx_ofed_4_7&lt;br/&gt;
commit fc0a8c1ccb6f883b5ed321f1d17fb9f3b4f85bef&lt;br/&gt;
libmlx4:&lt;br/&gt;
mlnx_ofed/libmlx4.git mlnx_ofed_4_7&lt;br/&gt;
commit 819cf8a7fc4ec35065659b97035159bb11128bb4&lt;br/&gt;
libmlx5:&lt;br/&gt;
mlnx_ofed/libmlx5.git mlnx_ofed_4_7&lt;br/&gt;
commit aaa125580c1d20b824569ad18df71c2a051f2697&lt;br/&gt;
libpka:&lt;br/&gt;
mlnx_ofed_soc/libpka-1.0-1.g6cc68a2.src.rpm&lt;/p&gt;

&lt;p&gt;librdmacm:&lt;br/&gt;
mlnx_ofed/librdmacm.git mlnx_ofed_4_7&lt;br/&gt;
commit 9cac524f5dbba911c2927b20e7137fb0e2f1b995&lt;br/&gt;
librxe:&lt;br/&gt;
mlnx_ofed/librxe.git master&lt;br/&gt;
commit afd75c6164324ac68a3e39f3c988ce85978669d4&lt;br/&gt;
libvma:&lt;br/&gt;
vma/source_rpms//libvma-8.9.4-0.src.rpm&lt;/p&gt;

&lt;p&gt;mlnx-en:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;mlnx-ethtool:&lt;br/&gt;
mlnx_ofed/ethtool.git mlnx_ofed_4_7&lt;br/&gt;
commit 5b82426a01b5d76d6f5a1c58e1e419323ab29eaf&lt;br/&gt;
mlnx-iproute2:&lt;br/&gt;
mlnx_ofed/iproute2.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4837862432f8d4a7fdd4bc8f598d7ebab354c3&lt;br/&gt;
mlnx-nfsrdma:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;mlnx-nvme:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;mlnx-ofa_kernel:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;mlnx-rdma-rxe:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;mlx-bootctl:&lt;br/&gt;
mlnx_ofed_soc/mlx-bootctl-1.1-0.g5b90483.src.rpm&lt;/p&gt;

&lt;p&gt;mlx-l3cache:&lt;br/&gt;
mlnx_ofed_soc/mlx-l3cache-0.1-1.gebb0728.src.rpm&lt;/p&gt;

&lt;p&gt;mlx-pmc:&lt;br/&gt;
mlnx_ofed_soc/mlx-pmc-1.1-0.g1141c2e.src.rpm&lt;/p&gt;

&lt;p&gt;mlx-trio:&lt;br/&gt;
mlnx_ofed_soc/mlx-trio-0.1-1.g9d13513.src.rpm&lt;/p&gt;

&lt;p&gt;mlxbf-livefish:&lt;br/&gt;
mlnx_ofed_soc/mlxbf-livefish-1.0-0.gec08328.src.rpm&lt;/p&gt;

&lt;p&gt;mpi-selector:&lt;br/&gt;
ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm&lt;/p&gt;

&lt;p&gt;mpitests:&lt;br/&gt;
mlnx_ofed_mpitest/mpitests-3.2.20-e1a0676.src.rpm&lt;/p&gt;

&lt;p&gt;mstflint:&lt;br/&gt;
mlnx_ofed_mstflint/mstflint-4.13.0-1.41.g4e8819c.tar.gz&lt;/p&gt;

&lt;p&gt;multiperf:&lt;br/&gt;
mlnx_ofed_multiperf/multiperf-3.0-0.13.gcdaa426.tar.gz&lt;/p&gt;

&lt;p&gt;mxm:&lt;br/&gt;
mlnx_ofed_mxm/mxm-3.7.3112-1.src.rpm&lt;/p&gt;

&lt;p&gt;nvme-snap:&lt;br/&gt;
nvme/nvme-snap-2.0.1-2.mlnx.src.rpm&lt;/p&gt;

&lt;p&gt;ofed-docs:&lt;br/&gt;
docs.git mlnx_ofed-4.0&lt;br/&gt;
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc&lt;/p&gt;

&lt;p&gt;openmpi:&lt;br/&gt;
mlnx_ofed_ompi_1.8/openmpi-4.0.2rc3-1.src.rpm&lt;/p&gt;

&lt;p&gt;opensm:&lt;br/&gt;
mlnx_ofed_opensm/opensm-5.5.0.MLNX20190923.1c78385.tar.gz&lt;/p&gt;

&lt;p&gt;openvswitch:&lt;br/&gt;
openvswitch.git mlnx_ofed_4_7&lt;br/&gt;
commit 5f4b5233413b20960336fa06ba309b53e9a99ae5&lt;br/&gt;
perftest:&lt;br/&gt;
mlnx_ofed_perftest/perftest-4.4-0.8.g7af08be.tar.gz&lt;/p&gt;

&lt;p&gt;pka-mlxbf:&lt;br/&gt;
mlnx_ofed_soc/pka-mlxbf-1.0-0.gd3455da.src.rpm&lt;/p&gt;

&lt;p&gt;qperf:&lt;br/&gt;
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-4.6-1.0.1/SRPMS/qperf-0.4.9-9.46101.src.rpm&lt;/p&gt;

&lt;p&gt;rdma-core:&lt;br/&gt;
mlnx_ofed/rdma-core.git mlnx_ofed_4_7&lt;br/&gt;
commit 87e9049a32f7cda5f5aa36145400fb660f4dbc18&lt;br/&gt;
rshim:&lt;br/&gt;
mlnx_ofed_soc/rshim-1.8-0.g463f780.src.rpm&lt;/p&gt;

&lt;p&gt;sharp:&lt;br/&gt;
mlnx_ofed_sharp/sharp-2.0.0.MLNX20190922.a9ebf22.tar.gz&lt;/p&gt;

&lt;p&gt;sockperf:&lt;br/&gt;
sockperf/sockperf-3.6-0.git737d1e3e5576.src.rpm&lt;/p&gt;

&lt;p&gt;srp:&lt;br/&gt;
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7&lt;br/&gt;
commit 1c4bf4249c363d1ab555dcc246f88d99cae96175&lt;/p&gt;

&lt;p&gt;srptools:&lt;br/&gt;
srptools/srptools-41mlnx1-5.src.rpm&lt;/p&gt;

&lt;p&gt;tmfifo:&lt;br/&gt;
mlnx_ofed_soc/tmfifo-1.3-0.g6aa30c7.src.rpm&lt;/p&gt;

&lt;p&gt;ucx:&lt;br/&gt;
mlnx_ofed_ucx/ucx-1.7.0-1.src.rpm&lt;/p&gt;


&lt;p&gt;Installed Packages:&lt;br/&gt;
-------------------&lt;/p&gt;

&lt;p&gt;kmod-srp&lt;br/&gt;
libibverbs-utils&lt;br/&gt;
libibcm-devel&lt;br/&gt;
ibacm&lt;br/&gt;
dapl&lt;br/&gt;
infiniband-diags&lt;br/&gt;
qperf&lt;br/&gt;
ucx-rdmacm&lt;br/&gt;
knem&lt;br/&gt;
libibverbs&lt;br/&gt;
librxe&lt;br/&gt;
libibmad-devel&lt;br/&gt;
opensm&lt;br/&gt;
mstflint&lt;br/&gt;
ar_mgr&lt;br/&gt;
ucx-cma&lt;br/&gt;
mlnx-iproute2&lt;br/&gt;
kmod-mlnx-ofa_kernel&lt;br/&gt;
kmod-knem&lt;br/&gt;
kmod-rshim&lt;br/&gt;
libibverbs-devel&lt;br/&gt;
libmlx4-devel&lt;br/&gt;
librxe-devel-static&lt;br/&gt;
libibumad-devel&lt;br/&gt;
libibmad-static&lt;br/&gt;
librdmacm-utils&lt;br/&gt;
opensm-devel&lt;br/&gt;
dapl-devel-static&lt;br/&gt;
ibutils&lt;br/&gt;
ibdump&lt;br/&gt;
ucx&lt;br/&gt;
ucx-ib&lt;br/&gt;
hcoll&lt;br/&gt;
mlnxofed-docs&lt;br/&gt;
mlnx-ofa_kernel-devel&lt;br/&gt;
kmod-iser&lt;br/&gt;
mpi-selector&lt;br/&gt;
libibverbs-devel-static&lt;br/&gt;
libmlx5&lt;br/&gt;
libibcm&lt;br/&gt;
libibumad-static&lt;br/&gt;
ibsim&lt;br/&gt;
librdmacm-devel&lt;br/&gt;
opensm-static&lt;br/&gt;
dapl-utils&lt;br/&gt;
srptools&lt;br/&gt;
cc_mgr&lt;br/&gt;
infiniband-diags-compat&lt;br/&gt;
ucx-devel&lt;br/&gt;
ucx-ib-cm&lt;br/&gt;
openmpi&lt;br/&gt;
mpitests_openmpi&lt;br/&gt;
kmod-kernel-mft-mlnx&lt;br/&gt;
libmlx5-devel&lt;br/&gt;
libibmad&lt;br/&gt;
opensm-libs&lt;br/&gt;
perftest&lt;br/&gt;
dump_pr&lt;br/&gt;
sharp&lt;br/&gt;
mlnx-ethtool&lt;br/&gt;
mlnx-ofa_kernel&lt;br/&gt;
kmod-isert&lt;br/&gt;
libmlx4&lt;br/&gt;
libibumad&lt;br/&gt;
librdmacm&lt;br/&gt;
dapl-devel&lt;br/&gt;
ibutils2&lt;br/&gt;
mxm&lt;br/&gt;
ucx-knem&lt;/p&gt;</comment>
                            <comment id="265484" author="ashehata" created="Tue, 17 Mar 2020 19:26:55 +0000"  >&lt;p&gt;These messages have been reduced in severity. Here is the patch:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
LU-13071 lnet: reduce log severity for health events&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It would be interesting to know what happens at the 10-minute interval, though.&lt;/p&gt;

&lt;p&gt;Can you share the output of &quot;lnetctl stats show&quot;?&lt;/p&gt;

&lt;p&gt;Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?&lt;/p&gt;

</comment>
                            <comment id="265485" author="mre64" created="Tue, 17 Mar 2020 20:01:12 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# lnetctl stats show&lt;br/&gt;
statistics:&lt;br/&gt;
    msgs_alloc: 1518&lt;br/&gt;
    msgs_max: 16396&lt;br/&gt;
    rst_alloc: 568&lt;br/&gt;
    errors: 0&lt;br/&gt;
    send_count: 9398737&lt;br/&gt;
    resend_count: 12378&lt;br/&gt;
    response_timeout_count: 28678&lt;br/&gt;
    local_interrupt_count: 0&lt;br/&gt;
    local_dropped_count: 24613&lt;br/&gt;
    local_aborted_count: 0&lt;br/&gt;
    local_no_route_count: 0&lt;br/&gt;
    local_timeout_count: 14188&lt;br/&gt;
    local_error_count: 0&lt;br/&gt;
    remote_dropped_count: 3862&lt;br/&gt;
    remote_error_count: 0&lt;br/&gt;
    remote_timeout_count: 0&lt;br/&gt;
    network_timeout_count: 0&lt;br/&gt;
    recv_count: 9398737&lt;br/&gt;
    route_count: 2778855491&lt;br/&gt;
    drop_count: 50252&lt;br/&gt;
    send_length: 1052296560&lt;br/&gt;
    recv_length: 283792&lt;br/&gt;
    route_length: 127811084565569&lt;br/&gt;
    drop_length: 24066088&lt;/p&gt;

&lt;p&gt;Here are the last 200 lines of LNet messages. I don&apos;t see anything of much interest, do you?&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# tail -n 200 /var/log/messages |grep -v dhclient |grep -v puppet&lt;br/&gt;
Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages&lt;br/&gt;
Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.176.98@o2ib4: 1 seconds&lt;br/&gt;
Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 20 previous similar messages&lt;br/&gt;
Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 13:25:52 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds&lt;br/&gt;
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages&lt;br/&gt;
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages&lt;br/&gt;
Mar 17 14:21:13 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds&lt;br/&gt;
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 14:33:40 cannonlnet07 dbus&lt;span class=&quot;error&quot;&gt;&amp;#91;1267&amp;#93;&lt;/span&gt;: &lt;span class=&quot;error&quot;&gt;&amp;#91;system&amp;#93;&lt;/span&gt; Activating service name=&apos;org.freedesktop.problems&apos; (using servicehelper)&lt;br/&gt;
Mar 17 14:33:40 cannonlnet07 dbus&lt;span class=&quot;error&quot;&gt;&amp;#91;1267&amp;#93;&lt;/span&gt;: &lt;span class=&quot;error&quot;&gt;&amp;#91;system&amp;#93;&lt;/span&gt; Successfully activated service &apos;org.freedesktop.problems&apos;&lt;br/&gt;
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages&lt;br/&gt;
Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages&lt;br/&gt;
Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages&lt;br/&gt;
Mar 17 15:47:09 cannonlnet07 kernel: LNet: 3014:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4&lt;br/&gt;
Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 87 previous similar messages&lt;br/&gt;
Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4&lt;br/&gt;
Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) Skipped 1 previous similar message&lt;br/&gt;
Mar 17 15:56:14 cannonlnet07 dbus&lt;span class=&quot;error&quot;&gt;&amp;#91;1267&amp;#93;&lt;/span&gt;: &lt;span class=&quot;error&quot;&gt;&amp;#91;system&amp;#93;&lt;/span&gt; Activating service name=&apos;org.freedesktop.problems&apos; (using servicehelper)&lt;br/&gt;
Mar 17 15:56:14 cannonlnet07 dbus&lt;span class=&quot;error&quot;&gt;&amp;#91;1267&amp;#93;&lt;/span&gt;: &lt;span class=&quot;error&quot;&gt;&amp;#91;system&amp;#93;&lt;/span&gt; Successfully activated service &apos;org.freedesktop.problems&apos;&lt;br/&gt;
Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 82 previous similar messages&lt;/p&gt;

&lt;p&gt;Sure, I will enable debug and send the logs.&lt;/p&gt;</comment>
                            <comment id="265486" author="mre64" created="Tue, 17 Mar 2020 20:26:56 +0000"  >&lt;p&gt;Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?&lt;/p&gt;

&lt;p&gt;What are the additional commands to capture the &quot;logs&quot;?&lt;/p&gt;</comment>
                            <comment id="265487" author="ashehata" created="Tue, 17 Mar 2020 20:28:37 +0000"  >&lt;p&gt;Before we capture the logs, can we try the recommendation below and monitor the errors?&lt;/p&gt;

&lt;p&gt;I see a few tx timeouts and a couple of PUT_NACKs. These could cause some RDMA operations to fail, which triggers the health code.&lt;/p&gt;

&lt;p&gt;I see that you have: transaction_timeout: 10 and retry_count: 3&lt;/p&gt;

&lt;p&gt;The defaults have since been changed to 50s and 2, respectively. We found that on larger clusters a 10s timeout is too short, causing RDMA timeouts. Can you try setting it to 50s? The patch which changed the default is:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
LU-13145 lnet: use conservative health timeouts&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You can set it manually:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl set transaction_timeout 50 
lnetctl set retry_count 2&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
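[Editorial sketch, not from the ticket: the `lnetctl set` commands above only change the running configuration. A hedged example of making them persistent, assuming the standard `lnet.service`/`/etc/lnet.conf` import workflow is in use (paths and service names may differ by distribution):]

```shell
# Apply the recommended values at runtime, as in the comment above:
lnetctl set transaction_timeout 50
lnetctl set retry_count 2

# Dump the running configuration (including the global settings) so
# lnet.service can re-import it at boot. Assumes lnet.service runs
# "lnetctl import /etc/lnet.conf" at startup:
lnetctl export > /etc/lnet.conf
systemctl enable lnet

# Alternatively, since this site configures LNet via modprobe, the
# lnet module parameters lnet_transaction_timeout and lnet_retry_count
# could be set in /etc/modprobe.d/lustre.conf instead.
```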
                            <comment id="265491" author="mre64" created="Tue, 17 Mar 2020 21:45:21 +0000"  >&lt;p&gt;I set those 2 parameters and I still see the recovery messages:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet07 ~&amp;#93;&lt;/span&gt;# lnetctl global show&lt;br/&gt;
global:&lt;br/&gt;
    numa_range: 0&lt;br/&gt;
    max_intf: 200&lt;br/&gt;
    discovery: 0&lt;br/&gt;
    drop_asym_route: 0&lt;br/&gt;
    retry_count: 2&lt;br/&gt;
    transaction_timeout: 50&lt;br/&gt;
    health_sensitivity: 100&lt;br/&gt;
    recovery_interval: 1&lt;br/&gt;
    router_sensitivity: 100&lt;/p&gt;</comment>
                            <comment id="265498" author="ashehata" created="Tue, 17 Mar 2020 22:45:21 +0000"  >&lt;p&gt;I&apos;m wondering if there is a threshold for the transaction_timeout where this goes away. Can you try setting it to 100?&lt;/p&gt;

&lt;p&gt;If you still see the problem, I would:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl set_param debug=+net
lctl debug_daemon start lustre.dk [megabytes] # make it as big as possible, e.g. 1G (if you have the space)
# wait until the problem happens
lctl debug_daemon stop
lctl set_param debug=-net
lctl debug_file lustre.dk lustre.log
# attach or upload the lustre.log file depending on how big the file is&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here is the relevant lustre manual section for debug_daemon commands:&lt;/p&gt;

&lt;p&gt;37.2.3.1. lctl debug_daemon Commands&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140539595165600&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140539595165600&lt;/a&gt;&lt;/p&gt;</comment>
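[Editorial sketch, not from the ticket: once the daemon is stopped and the binary log converted per the steps above, the dump can be trimmed before attaching. The function names used in the grep pattern are taken from the messages reported in this ticket:]

```shell
# Convert the binary debug log to text (as in the steps above):
lctl debug_file lustre.dk lustre.log

# Optionally keep just the LNet/o2iblnd lines of interest so the
# attachment stays small:
grep -E 'lnet_handle_local_failure|kiblnd_' lustre.log > lustre-net.log

# Compress before uploading:
gzip lustre-net.log
```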
                            <comment id="265508" author="mre64" created="Wed, 18 Mar 2020 00:38:18 +0000"  >&lt;p&gt;Hi, I still see the messages with the 100 setting, so I captured lustre.log when the message occurred. It&apos;s attached. The messages occurred at Mar 17 20:28:03, see below.&lt;/p&gt;

&lt;p&gt;Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max)&lt;br/&gt;
Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900&lt;br/&gt;
Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 26 previous similar messages&lt;br/&gt;
Mar 17 20:28:06 cannonlnet07 kernel: Lustre: shutting down debug daemon thread...&lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/34469/34469_lustre.log.gz&quot; title=&quot;lustre.log.gz attached to LU-13367&quot;&gt;lustre.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;</comment>
                            <comment id="265659" author="mre64" created="Thu, 19 Mar 2020 14:41:34 +0000"  >&lt;p&gt;Hello, have you had a chance to look at the log file to see if it shows the cause of the ongoing messages?&lt;br/&gt;
Thanks,&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="265679" author="ashehata" created="Thu, 19 Mar 2020 17:53:02 +0000"  >&lt;p&gt;So it looks like there are a couple of nodes which are causing all the problems:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
00000800:00000100:2.0:1584491283.682847:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.176.98@o2ib4: connection failed
00000800:00000100:2.0:1584491283.691935:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.167.172@o2ib: connection failed&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Whenever we try to connect to these peers, the connection fails. The code assumes the reason for the failure is local, so it puts the local NI 10.31.179.178@o2ib4 into recovery, and that&apos;s when you see the message.&lt;/p&gt;

&lt;p&gt;The reason connections to these NIDs are failing is:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
(o2iblnd_cb.c:3174:kiblnd_cm_callback()) 10.31.176.98@o2ib4: ADDR ERROR -110&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We&apos;re trying to resolve the peer&apos;s address and timing out (-110 is -ETIMEDOUT).&lt;/p&gt;

&lt;p&gt;Are these nodes &quot;real&quot;? Are they leftover configuration?&lt;/p&gt;</comment>
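[Editorial sketch, not from the ticket: one way to watch the per-NI health values discussed above, assuming a health-aware LNet (2.12 or later):]

```shell
# Verbose NI listing; includes per-NI statistics with the health
# value that the recovery-queue messages report:
lnetctl net show -v

# Peer-side view, to check whether specific NIDs (e.g. the two
# failing peers identified above) are accumulating failures:
lnetctl peer show -v
```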
                            <comment id="265690" author="mre64" created="Thu, 19 Mar 2020 21:32:09 +0000"  >&lt;p&gt;Hi,&lt;br/&gt;
Those 2 nodes are valid. One is up and looks to be running fine; the other is down. 10.31.167.172 has been up and running for over 1 day, see below (it could have been down when I collected the lustre log); the other one, 10.31.176.98, is down. So any host out there that is down is going to cause these messages? We will probably have some hosts down at any one time; we have hundreds of nodes on our 2 fabrics. These lnet routers talk to both fabrics and are the bridges between our FDR and HDR fabrics. This situation shouldn&apos;t cause big issues on our lnet? I assume that ADDR ERROR -110 is in the lustre debug log and won&apos;t be found in /var/log/messages.&lt;br/&gt;
Mike&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c18110 ~&amp;#93;&lt;/span&gt;# ifconfig ib0&lt;br/&gt;
ib0: flags=4163&amp;lt;UP,BROADCAST,RUNNING,MULTICAST&amp;gt;  mtu 65520&lt;br/&gt;
        inet 10.31.167.172  netmask 255.255.240.0  broadcast 10.31.175.255&lt;br/&gt;
        inet6 fe80::ee0d:9a03:12:8991  prefixlen 64  scopeid 0x20&amp;lt;link&amp;gt;&lt;br/&gt;
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).&lt;br/&gt;
        infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)&lt;br/&gt;
        RX packets 687361  bytes 39139716 (37.3 MiB)&lt;br/&gt;
        RX errors 0  dropped 0  overruns 0  frame 0&lt;br/&gt;
        TX packets 7705  bytes 654574 (639.2 KiB)&lt;br/&gt;
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c18110 ~&amp;#93;&lt;/span&gt;# ibstat&lt;br/&gt;
CA &apos;mlx4_0&apos;&lt;br/&gt;
	CA type: MT4099&lt;br/&gt;
	Number of ports: 2&lt;br/&gt;
	Firmware version: 2.42.5000&lt;br/&gt;
	Hardware version: 1&lt;br/&gt;
	Node GUID: 0xec0d9a0300128990&lt;br/&gt;
	System image GUID: 0xec0d9a0300128993&lt;br/&gt;
	Port 1:&lt;br/&gt;
		State: Active&lt;br/&gt;
		Physical state: LinkUp&lt;br/&gt;
		Rate: 56&lt;br/&gt;
		Base lid: 832&lt;br/&gt;
		LMC: 0&lt;br/&gt;
		SM lid: 158&lt;br/&gt;
		Capability mask: 0x02514868&lt;br/&gt;
		Port GUID: 0xec0d9a0300128991&lt;br/&gt;
		Link layer: InfiniBand&lt;br/&gt;
	Port 2:&lt;br/&gt;
		State: Down&lt;br/&gt;
		Physical state: Polling&lt;br/&gt;
		Rate: 10&lt;br/&gt;
		Base lid: 0&lt;br/&gt;
		LMC: 0&lt;br/&gt;
		SM lid: 0&lt;br/&gt;
		Capability mask: 0x02514868&lt;br/&gt;
		Port GUID: 0xec0d9a0300128992&lt;br/&gt;
		Link layer: InfiniBand&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c18110 ~&amp;#93;&lt;/span&gt;# uptime&lt;br/&gt;
 17:24:26 up 1 day,  7:05,  1 user,  load average: 0.00, 0.01, 0.05&lt;/p&gt;</comment>
                            <comment id="265695" author="ashehata" created="Thu, 19 Mar 2020 22:23:28 +0000"  >&lt;p&gt;That message has been reduced in severity as I have indicated in a previous comment. So eventually when you upgrade, you won&apos;t see it anymore. If the hosts coming up/down is expected, then there shouldn&apos;t be a problem, except for the noisiness of this message.&lt;/p&gt;

&lt;p&gt;The address resolution error is only seen when net logging is turned on.&lt;/p&gt;</comment>
                            <comment id="265710" author="mre64" created="Fri, 20 Mar 2020 01:32:44 +0000"  >&lt;p&gt;Hi, ok, thanks. Do you recommend we upgrade to a newer 2.13.x version, or drop down to 2.12.4 (or 2.12.5 when it comes out)? We don&apos;t really care about the health and multi-rail functions. We have mostly clients at 2.10.7, the lnet routers at 2.13.0, lustre storage at 2.12.3 and 2.12.4, and some really old lustre FS we are going to decommission running 2.5.34.&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="265782" author="pjones" created="Fri, 20 Mar 2020 21:28:24 +0000"  >&lt;p&gt;My recommendation would be to use a 2.12.x release. If there is a bug fix missing from the 2.12.x branch we can include that in 2.12.5. &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ashehata&quot; class=&quot;user-hover&quot; rel=&quot;ashehata&quot;&gt;ashehata&lt;/a&gt; do you agree?&lt;/p&gt;</comment>
                            <comment id="265783" author="ashehata" created="Fri, 20 Mar 2020 22:40:55 +0000"  >&lt;p&gt;Sure. If there is no need for the features in 2.13, then the latest 2.12.x would suffice.&lt;/p&gt;</comment>
                            <comment id="266014" author="mre64" created="Tue, 24 Mar 2020 16:48:22 +0000"  >&lt;p&gt;Thanks for the help. I will let others on my team know about the 2.12 vs 2.13 recommendations.&lt;/p&gt;</comment>
                            <comment id="282312" author="jhammond" created="Thu, 15 Oct 2020 13:27:52 +0000"  >&lt;p&gt;Both changes referenced above (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13071&quot; title=&quot;LNet Health: reduce log severity&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13071&quot;&gt;&lt;del&gt;LU-13071&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13145&quot; title=&quot;LNet Health: increase transaction timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13145&quot;&gt;&lt;del&gt;LU-13145&lt;/del&gt;&lt;/a&gt;) are in &lt;tt&gt;b2_12&lt;/tt&gt;.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="34469" name="lustre.log.gz" size="35814303" author="mre64" created="Wed, 18 Mar 2020 00:38:28 +0000"/>
                            <attachment id="34468" name="lustre.log.gz" size="35814303" author="mre64" created="Wed, 18 Mar 2020 00:38:22 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>lustre-2.13</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00vpr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>