Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13367

lnet_handle_local_failure messages every 10 min ?

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.13.0
    • None
    • Lnet routers built with Lenovo hardware with lustre 2.13.0 installed. IB card installed is a 2 port Lenovo ConnectX-5. One port connected to FDR one port connected to HDR fabric.
    • 3
    • 9223372036854775807

    Description

      Hello,
      We have 8 lnet routers in production. We have set them up to using /etc/modprobe.d/lustre.conf as their configuration file. We are seeing many messages in the /var/log/messages about the FDR and HDR ib interfaces in recovery. The messages appear every 10 min it seems. Are these benign or are they serious ? I search and couldn't seem to find any answers. It seems the lnet routers are processing data, no one is complaining at this point. Here is a sample of the messages below and some other details. And insight would be appreciated as to their severity, and if its possible to fix the issue if there is one.
      Thanks,
      Mike

      Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages

      [root@cannonlnet07 ~]# nslookup 10.31.160.253
      253.160.31.10.in-addr.arpa name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# nslookup 10.31.179.178
      178.179.31.10.in-addr.arpa name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
      options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
      options lnet forwarding="enabled"
      options lnet lnet_peer_discovery_disabled=1
      [root@cannonlnet07 ~]# lnetctl net show
      net:

      • net type: lo
        local NI(s):
      • nid: 0@lo
        status: up
      • net type: o2ib
        local NI(s):
      • nid: 10.31.160.253@o2ib
        status: up
        interfaces:
        0: ib1
      • net type: o2ib2
        local NI(s):
      • nid: 10.31.160.253@o2ib2
        status: up
        interfaces:
        0: ib1
      • net type: o2ib4
        local NI(s):
      • nid: 10.31.179.178@o2ib4
        status: up
        interfaces:
        0: ib0
      • net type: tcp
        local NI(s):
      • nid: 10.31.8.93@tcp
        status: down
        interfaces:
        0: bond0
      • net type: tcp4
        local NI(s):
      • nid: 10.31.73.39@tcp4
        status: down
        interfaces:
        0: bond0.2475
        [root@cannonlnet07 ~]# lnetctl stats show
        statistics:
        msgs_alloc: 1517
        msgs_max: 16396
        rst_alloc: 568
        errors: 0
        send_count: 9287639
        resend_count: 12378
        response_timeout_count: 28115
        local_interrupt_count: 0
        local_dropped_count: 24050
        local_aborted_count: 0
        local_no_route_count: 0
        local_timeout_count: 14188
        local_error_count: 0
        remote_dropped_count: 3862
        remote_error_count: 0
        remote_timeout_count: 0
        network_timeout_count: 0
        recv_count: 9287639
        route_count: 2744617426
        drop_count: 50252
        send_length: 1039854144
        recv_length: 283232
        route_length: 125943144442551
        drop_length: 24066088
        [root@cannonlnet07 ~]# lnetctl global show
        global:
        numa_range: 0
        max_intf: 200
        discovery: 0
        drop_asym_route: 0
        retry_count: 3
        transaction_timeout: 10
        health_sensitivity: 100
        recovery_interval: 1
        router_sensitivity: 100
        [root@cannonlnet07 ~]# ibstat
        CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.26.1040
        Hardware version: 0
        Node GUID: 0x98039b0300907de0
        System image GUID: 0x98039b0300907de0
        Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 1422
        LMC: 0
        SM lid: 1434
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300907de0
        Link layer: InfiniBand
        CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.26.1040
        Hardware version: 0
        Node GUID: 0x98039b0300907de1
        System image GUID: 0x98039b0300907de0
        Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 2259
        LMC: 0
        SM lid: 158
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300907de1
        Link layer: InfiniBand
        [root@cannonlnet07 ~]# ifconfig ib0
        ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        inet 10.31.179.178 netmask 255.255.240.0 broadcast 10.31.191.255
        inet6 fe80::9a03:9b03:90:7de0 prefixlen 64 scopeid 0x20<link>
        Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
        RX packets 343090 bytes 34310200 (32.7 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 112049 bytes 6723124 (6.4 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      [root@cannonlnet07 ~]# ifconfig ib1
      ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
      inet 10.31.160.253 netmask 255.255.240.0 broadcast 10.31.175.255
      inet6 fe80::9a03:9b03:90:7de1 prefixlen 64 scopeid 0x20<link>
      Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
      infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
      RX packets 495909 bytes 50980886 (48.6 MiB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 18846 bytes 1130904 (1.0 MiB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      Attachments

        1. lustre.log.gz
          34.16 MB
        2. lustre.log.gz
          34.16 MB

        Activity

          [LU-13367] lnet_handle_local_failure messages every 10 min ?

          Hi I still have the messages with the 100 setting so i captured the lustre.log when the message occurred. Its attached. The messages occurred at Mar 17 20:28:03, see below.

          Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max)
          Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 26 previous similar messages
          Mar 17 20:28:06 cannonlnet07 kernel: Lustre: shutting down debug daemon thread...

          lustre.log.gz

          mre64 Michael Ethier (Inactive) added a comment - Hi I still have the messages with the 100 setting so i captured the lustre.log when the message occurred. Its attached. The messages occurred at Mar 17 20:28:03, see below. Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max) Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 26 previous similar messages Mar 17 20:28:06 cannonlnet07 kernel: Lustre: shutting down debug daemon thread... lustre.log.gz

          I'm wondering if there is a threshold for the transaction_timeout where this goes a way. Can you try setting that to 100.

          If you still see the problem, I would:

          lctl set_param debug=+net
          lctl debug_daemon start lustre.dk [megabytes] # make it as big as possible 1G (if you have the space)
          # wait until the problem happens
          lctl debug_daemon stop
          lctl set_param debug=-net
          lctl debug_file lustre.dk lustre.log
          # attach or upload the lustre.log file depending on how big the file is

          Here is the relevant lustre manual section for debug_daemon commands:

          37.2.3.1. lctl debug_daemon Commands

          https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140539595165600

          ashehata Amir Shehata (Inactive) added a comment - I'm wondering if there is a threshold for the transaction_timeout where this goes a way. Can you try setting that to 100. If you still see the problem, I would: lctl set_param debug=+net lctl debug_daemon start lustre.dk [megabytes] # make it as big as possible 1G ( if you have the space) # wait until the problem happens lctl debug_daemon stop lctl set_param debug=-net lctl debug_file lustre.dk lustre.log # attach or upload the lustre.log file depending on how big the file is Here is the relevant lustre manual section for debug_daemon commands: 37.2.3.1. lctl debug_daemon Commands https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140539595165600

          I set those 2 parameters and I still see the recovery messages:

          [root@cannonlnet07 ~]# lnetctl global show
          global:
          numa_range: 0
          max_intf: 200
          discovery: 0
          drop_asym_route: 0
          retry_count: 2
          transaction_timeout: 50
          health_sensitivity: 100
          recovery_interval: 1
          router_sensitivity: 100

          mre64 Michael Ethier (Inactive) added a comment - I set those 2 parameters and I still see the recovery messages: [root@cannonlnet07 ~] # lnetctl global show global: numa_range: 0 max_intf: 200 discovery: 0 drop_asym_route: 0 retry_count: 2 transaction_timeout: 50 health_sensitivity: 100 recovery_interval: 1 router_sensitivity: 100
          ashehata Amir Shehata (Inactive) added a comment - - edited

          Before we capture the logs, can we try the below recommendation and monitor the errors.

          I see a few tx timeouts and a couple of PUT_NACKs. These could result in the failures of some RDMAs, which triggers the health code.

          I see that you have: transaction_timeout: 10 and retry_count: 3

          That has been defaulted to 50s and 2 respectively. We found that on larger clusters that timeout is too short, causing RDMA timeouts. Can you try setting that to 50s. The patch which changed the default is:

          LU-13145 lnet: use conservative health timeouts

          You can set it manually:

          lnetctl set transaction_timeout 50 
          lnetctl set retry_count 2
          ashehata Amir Shehata (Inactive) added a comment - - edited Before we capture the logs, can we try the below recommendation and monitor the errors. I see a few tx timeouts and a couple of PUT_NACKs. These could result in the failures of some RDMAs, which triggers the health code. I see that you have: transaction_timeout: 10 and retry_count: 3 That has been defaulted to 50s and 2 respectively. We found that on larger clusters that timeout is too short, causing RDMA timeouts. Can you try setting that to 50s. The patch which changed the default is: LU-13145 lnet: use conservative health timeouts You can set it manually: lnetctl set transaction_timeout 50 lnetctl set retry_count 2

          Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?

          what are the additional commands to capture the "logs" ?

          mre64 Michael Ethier (Inactive) added a comment - Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time? what are the additional commands to capture the "logs" ?

          [root@cannonlnet07 ~]# lnetctl stats show
          statistics:
          msgs_alloc: 1518
          msgs_max: 16396
          rst_alloc: 568
          errors: 0
          send_count: 9398737
          resend_count: 12378
          response_timeout_count: 28678
          local_interrupt_count: 0
          local_dropped_count: 24613
          local_aborted_count: 0
          local_no_route_count: 0
          local_timeout_count: 14188
          local_error_count: 0
          remote_dropped_count: 3862
          remote_error_count: 0
          remote_timeout_count: 0
          network_timeout_count: 0
          recv_count: 9398737
          route_count: 2778855491
          drop_count: 50252
          send_length: 1052296560
          recv_length: 283792
          route_length: 127811084565569
          drop_length: 24066088

          Last 200 lines of lnet messages, I don't see anything of big interest, do you ?
          [root@cannonlnet07 ~]# tail -n 200 /var/log/messages |grep -v dhclient |grep -v puppet
          Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
          Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.176.98@o2ib4: 1 seconds
          Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 20 previous similar messages
          Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 13:25:52 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds
          Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
          Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
          Mar 17 14:21:13 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds
          Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 14:33:40 cannonlnet07 dbus[1267]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
          Mar 17 14:33:40 cannonlnet07 dbus[1267]: [system] Successfully activated service 'org.freedesktop.problems'
          Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:47:09 cannonlnet07 kernel: LNet: 3014:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4
          Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 87 previous similar messages
          Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4
          Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) Skipped 1 previous similar message
          Mar 17 15:56:14 cannonlnet07 dbus[1267]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
          Mar 17 15:56:14 cannonlnet07 dbus[1267]: [system] Successfully activated service 'org.freedesktop.problems'
          Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 82 previous similar messages

          Sure I will enable debug and send the logs

          mre64 Michael Ethier (Inactive) added a comment - [root@cannonlnet07 ~] # lnetctl stats show statistics: msgs_alloc: 1518 msgs_max: 16396 rst_alloc: 568 errors: 0 send_count: 9398737 resend_count: 12378 response_timeout_count: 28678 local_interrupt_count: 0 local_dropped_count: 24613 local_aborted_count: 0 local_no_route_count: 0 local_timeout_count: 14188 local_error_count: 0 remote_dropped_count: 3862 remote_error_count: 0 remote_timeout_count: 0 network_timeout_count: 0 recv_count: 9398737 route_count: 2778855491 drop_count: 50252 send_length: 1052296560 recv_length: 283792 route_length: 127811084565569 drop_length: 24066088 Last 200 lines of lnet messages, I don't see anything of big interest, do you ? [root@cannonlnet07 ~] # tail -n 200 /var/log/messages |grep -v dhclient |grep -v puppet Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.176.98@o2ib4: 1 seconds Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 20 previous similar messages Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages Mar 17 13:25:52 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages Mar 17 14:21:13 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages Mar 17 14:33:40 cannonlnet07 dbus [1267] : [system] Activating service name='org.freedesktop.problems' (using servicehelper) Mar 17 14:33:40 cannonlnet07 dbus [1267] : [system] Successfully activated service 'org.freedesktop.problems' Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900 Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages Mar 17 15:47:09 cannonlnet07 kernel: LNet: 3014:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4 Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 87 previous similar messages Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4 Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) Skipped 1 previous similar message Mar 17 15:56:14 cannonlnet07 dbus [1267] : [system] Activating service name='org.freedesktop.problems' (using servicehelper) Mar 17 15:56:14 cannonlnet07 dbus [1267] : [system] Successfully activated service 'org.freedesktop.problems' Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900 Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 82 previous similar messages Sure I will enable debug and send the logs

          These messages have been reduced in severity. Here is the patch:

          LU-13071 lnet: reduce log severity for health events

          Would be interesting to know what happens at the 10 minute interval, though.

          Can you share the output of "lnetctl stats show"

          Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?

           

          ashehata Amir Shehata (Inactive) added a comment - These messages have been reduced in severity. Here is the patch: LU-13071 lnet: reduce log severity for health events Would be interesting to know what happens at the 10 minute interval, though. Can you share the output of "lnetctl stats show" Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?  

          Also these systems are using Mellanox OFED stack installed on all 8 of these is the following:

          [root@cannonlnet07 ~]# ofed_info
          MLNX_OFED_LINUX-4.7-1.0.0.1 (OFED-4.7-1.0.0):
          ar_mgr:
          osm_plugins/ar_mgr/ar_mgr-1.0-0.45.MLNX20190923.g5aec6dc.tar.gz

          cc_mgr:
          osm_plugins/cc_mgr/cc_mgr-1.0-0.44.MLNX20190923.g5aec6dc.tar.gz

          dapl:
          dapl.git mlnx_ofed_4_0
          commit bdb055900059d1b8d5ee8cdfb457ca653eb9dd2d
          dump_pr:
          osm_plugins/dump_pr//dump_pr-1.0-0.40.MLNX20190923.g5aec6dc.tar.gz

          fabric-collector:
          fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz

          gpio-mlxbf:
          mlnx_ofed_soc/gpio-mlxbf-1.0-0.g6d44a8a.src.rpm

          hcoll:
          mlnx_ofed_hcol/hcoll-4.4.2938-1.src.rpm

          i2c-mlx:
          mlnx_ofed_soc/i2c-mlx-1.0-0.gab579c6.src.rpm

          ibacm:
          mlnx_ofed/ibacm.git mlnx_ofed_4_1
          commit 4ae5d193f628c71bac481218bc88d0f77f7eff9a
          ibdump:
          sniffer/sniffer-5.0.0-3/ibdump/linux/ibdump-5.0.0-3.tgz

          ibsim:
          mlnx_ofed_ibsim/ibsim-0.7mlnx1-0.11.g85c342b.tar.gz

          ibutils:
          ofed-1.5.3-rpms/ibutils/ibutils-1.5.7.1-0.12.gdcaeae2.tar.gz

          ibutils2:
          ibutils2/ibutils2-2.1.1-0.110.MLNX20190922.gd4efc48.tar.gz

          infiniband-diags:
          mlnx_ofed_infiniband_diags/infiniband-diags-5.4.0.MLNX20190908.5f40e4f.tar.gz

          iser:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          isert:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          kernel-mft:
          mlnx_ofed_mft/kernel-mft-4.13.0-102.src.rpm

          knem:
          knem.git mellanox-master
          commit 8d875eed562b1df72365302881f3aab1b33b23b8
          libdisni:
          upstream/disni.git master
          commit b50eee08b66b14b02594ee485cdc7dfd9820bbae
          libibcm:
          mlnx_ofed/libibcm.git mlnx_ofed_4_1
          commit e3e9fffe4d2d2f730110a7bdeb7da7b8ea97e51e
          libibmad:
          mlnx_ofed_libibmad/libibmad-5.4.0.MLNX20190423.1d917ae.tar.gz

          libibumad:
          mlnx_ofed_libibumad/libibumad-43.1.1.MLNX20190905.1080879.tar.gz

          libibverbs:
          mlnx_ofed/libibverbs.git mlnx_ofed_4_7
          commit fc0a8c1ccb6f883b5ed321f1d17fb9f3b4f85bef
          libmlx4:
          mlnx_ofed/libmlx4.git mlnx_ofed_4_7
          commit 819cf8a7fc4ec35065659b97035159bb11128bb4
          libmlx5:
          mlnx_ofed/libmlx5.git mlnx_ofed_4_7
          commit aaa125580c1d20b824569ad18df71c2a051f2697
          libpka:
          mlnx_ofed_soc/libpka-1.0-1.g6cc68a2.src.rpm

          librdmacm:
          mlnx_ofed/librdmacm.git mlnx_ofed_4_7
          commit 9cac524f5dbba911c2927b20e7137fb0e2f1b995
          librxe:
          mlnx_ofed/librxe.git master
          commit afd75c6164324ac68a3e39f3c988ce85978669d4
          libvma:
          vma/source_rpms//libvma-8.9.4-0.src.rpm

          mlnx-en:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          mlnx-ethtool:
          mlnx_ofed/ethtool.git mlnx_ofed_4_7
          commit 5b82426a01b5d76d6f5a1c58e1e419323ab29eaf
          mlnx-iproute2:
          mlnx_ofed/iproute2.git mlnx_ofed_4_7
          commit 1c4837862432f8d4a7fdd4bc8f598d7ebab354c3
          mlnx-nfsrdma:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          mlnx-nvme:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          mlnx-ofa_kernel:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          mlnx-rdma-rxe:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          mlx-bootctl:
          mlnx_ofed_soc/mlx-bootctl-1.1-0.g5b90483.src.rpm

          mlx-l3cache:
          mlnx_ofed_soc/mlx-l3cache-0.1-1.gebb0728.src.rpm

          mlx-pmc:
          mlnx_ofed_soc/mlx-pmc-1.1-0.g1141c2e.src.rpm

          mlx-trio:
          mlnx_ofed_soc/mlx-trio-0.1-1.g9d13513.src.rpm

          mlxbf-livefish:
          mlnx_ofed_soc/mlxbf-livefish-1.0-0.gec08328.src.rpm

          mpi-selector:
          ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm

          mpitests:
          mlnx_ofed_mpitest/mpitests-3.2.20-e1a0676.src.rpm

          mstflint:
          mlnx_ofed_mstflint/mstflint-4.13.0-1.41.g4e8819c.tar.gz

          multiperf:
          mlnx_ofed_multiperf/multiperf-3.0-0.13.gcdaa426.tar.gz

          mxm:
          mlnx_ofed_mxm/mxm-3.7.3112-1.src.rpm

          nvme-snap:
          nvme/nvme-snap-2.0.1-2.mlnx.src.rpm

          ofed-docs:
          docs.git mlnx_ofed-4.0
          commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc

          openmpi:
          mlnx_ofed_ompi_1.8/openmpi-4.0.2rc3-1.src.rpm

          opensm:
          mlnx_ofed_opensm/opensm-5.5.0.MLNX20190923.1c78385.tar.gz

          openvswitch:
          openvswitch.git mlnx_ofed_4_7
          commit 5f4b5233413b20960336fa06ba309b53e9a99ae5
          perftest:
          mlnx_ofed_perftest/perftest-4.4-0.8.g7af08be.tar.gz

          pka-mlxbf:
          mlnx_ofed_soc/pka-mlxbf-1.0-0.gd3455da.src.rpm

          qperf:
          /sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-4.6-1.0.1/SRPMS/qperf-0.4.9-9.46101.src.rpm

          rdma-core:
          mlnx_ofed/rdma-core.git mlnx_ofed_4_7
          commit 87e9049a32f7cda5f5aa36145400fb660f4dbc18
          rshim:
          mlnx_ofed_soc/rshim-1.8-0.g463f780.src.rpm

          sharp:
          mlnx_ofed_sharp/sharp-2.0.0.MLNX20190922.a9ebf22.tar.gz

          sockperf:
          sockperf/sockperf-3.6-0.git737d1e3e5576.src.rpm

          srp:
          mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7
          commit 1c4bf4249c363d1ab555dcc246f88d99cae96175

          srptools:
          srptools/srptools-41mlnx1-5.src.rpm

          tmfifo:
          mlnx_ofed_soc/tmfifo-1.3-0.g6aa30c7.src.rpm

          ucx:
          mlnx_ofed_ucx/ucx-1.7.0-1.src.rpm

          Installed Packages:
          -------------------

          kmod-srp
          libibverbs-utils
          libibcm-devel
          ibacm
          dapl
          infiniband-diags
          qperf
          ucx-rdmacm
          knem
          libibverbs
          librxe
          libibmad-devel
          opensm
          mstflint
          ar_mgr
          ucx-cma
          mlnx-iproute2
          kmod-mlnx-ofa_kernel
          kmod-knem
          kmod-rshim
          libibverbs-devel
          libmlx4-devel
          librxe-devel-static
          libibumad-devel
          libibmad-static
          librdmacm-utils
          opensm-devel
          dapl-devel-static
          ibutils
          ibdump
          ucx
          ucx-ib
          hcoll
          mlnxofed-docs
          mlnx-ofa_kernel-devel
          kmod-iser
          mpi-selector
          libibverbs-devel-static
          libmlx5
          libibcm
          libibumad-static
          ibsim
          librdmacm-devel
          opensm-static
          dapl-utils
          srptools
          cc_mgr
          infiniband-diags-compat
          ucx-devel
          ucx-ib-cm
          openmpi
          mpitests_openmpi
          kmod-kernel-mft-mlnx
          libmlx5-devel
          libibmad
          opensm-libs
          perftest
          dump_pr
          sharp
          mlnx-ethtool
          mlnx-ofa_kernel
          kmod-isert
          libmlx4
          libibumad
          librdmacm
          dapl-devel
          ibutils2
          mxm
          ucx-knem

          mre64 Michael Ethier (Inactive) added a comment - Also these systems are using Mellanox OFED stack installed on all 8 of these is the following: [root@cannonlnet07 ~] # ofed_info MLNX_OFED_LINUX-4.7-1.0.0.1 (OFED-4.7-1.0.0): ar_mgr: osm_plugins/ar_mgr/ar_mgr-1.0-0.45.MLNX20190923.g5aec6dc.tar.gz cc_mgr: osm_plugins/cc_mgr/cc_mgr-1.0-0.44.MLNX20190923.g5aec6dc.tar.gz dapl: dapl.git mlnx_ofed_4_0 commit bdb055900059d1b8d5ee8cdfb457ca653eb9dd2d dump_pr: osm_plugins/dump_pr//dump_pr-1.0-0.40.MLNX20190923.g5aec6dc.tar.gz fabric-collector: fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz gpio-mlxbf: mlnx_ofed_soc/gpio-mlxbf-1.0-0.g6d44a8a.src.rpm hcoll: mlnx_ofed_hcol/hcoll-4.4.2938-1.src.rpm i2c-mlx: mlnx_ofed_soc/i2c-mlx-1.0-0.gab579c6.src.rpm ibacm: mlnx_ofed/ibacm.git mlnx_ofed_4_1 commit 4ae5d193f628c71bac481218bc88d0f77f7eff9a ibdump: sniffer/sniffer-5.0.0-3/ibdump/linux/ibdump-5.0.0-3.tgz ibsim: mlnx_ofed_ibsim/ibsim-0.7mlnx1-0.11.g85c342b.tar.gz ibutils: ofed-1.5.3-rpms/ibutils/ibutils-1.5.7.1-0.12.gdcaeae2.tar.gz ibutils2: ibutils2/ibutils2-2.1.1-0.110.MLNX20190922.gd4efc48.tar.gz infiniband-diags: mlnx_ofed_infiniband_diags/infiniband-diags-5.4.0.MLNX20190908.5f40e4f.tar.gz iser: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 isert: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 kernel-mft: mlnx_ofed_mft/kernel-mft-4.13.0-102.src.rpm knem: knem.git mellanox-master commit 8d875eed562b1df72365302881f3aab1b33b23b8 libdisni: upstream/disni.git master commit b50eee08b66b14b02594ee485cdc7dfd9820bbae libibcm: mlnx_ofed/libibcm.git mlnx_ofed_4_1 commit e3e9fffe4d2d2f730110a7bdeb7da7b8ea97e51e libibmad: mlnx_ofed_libibmad/libibmad-5.4.0.MLNX20190423.1d917ae.tar.gz libibumad: mlnx_ofed_libibumad/libibumad-43.1.1.MLNX20190905.1080879.tar.gz libibverbs: mlnx_ofed/libibverbs.git mlnx_ofed_4_7 commit fc0a8c1ccb6f883b5ed321f1d17fb9f3b4f85bef libmlx4: mlnx_ofed/libmlx4.git mlnx_ofed_4_7 commit 819cf8a7fc4ec35065659b97035159bb11128bb4 libmlx5: mlnx_ofed/libmlx5.git mlnx_ofed_4_7 commit aaa125580c1d20b824569ad18df71c2a051f2697 libpka: mlnx_ofed_soc/libpka-1.0-1.g6cc68a2.src.rpm librdmacm: mlnx_ofed/librdmacm.git mlnx_ofed_4_7 commit 9cac524f5dbba911c2927b20e7137fb0e2f1b995 librxe: mlnx_ofed/librxe.git master commit afd75c6164324ac68a3e39f3c988ce85978669d4 libvma: vma/source_rpms//libvma-8.9.4-0.src.rpm mlnx-en: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 mlnx-ethtool: mlnx_ofed/ethtool.git mlnx_ofed_4_7 commit 5b82426a01b5d76d6f5a1c58e1e419323ab29eaf mlnx-iproute2: mlnx_ofed/iproute2.git mlnx_ofed_4_7 commit 1c4837862432f8d4a7fdd4bc8f598d7ebab354c3 mlnx-nfsrdma: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 mlnx-nvme: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 mlnx-ofa_kernel: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 mlnx-rdma-rxe: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 mlx-bootctl: mlnx_ofed_soc/mlx-bootctl-1.1-0.g5b90483.src.rpm mlx-l3cache: mlnx_ofed_soc/mlx-l3cache-0.1-1.gebb0728.src.rpm mlx-pmc: mlnx_ofed_soc/mlx-pmc-1.1-0.g1141c2e.src.rpm mlx-trio: mlnx_ofed_soc/mlx-trio-0.1-1.g9d13513.src.rpm mlxbf-livefish: mlnx_ofed_soc/mlxbf-livefish-1.0-0.gec08328.src.rpm mpi-selector: ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm mpitests: mlnx_ofed_mpitest/mpitests-3.2.20-e1a0676.src.rpm mstflint: mlnx_ofed_mstflint/mstflint-4.13.0-1.41.g4e8819c.tar.gz multiperf: mlnx_ofed_multiperf/multiperf-3.0-0.13.gcdaa426.tar.gz mxm: mlnx_ofed_mxm/mxm-3.7.3112-1.src.rpm nvme-snap: nvme/nvme-snap-2.0.1-2.mlnx.src.rpm ofed-docs: docs.git mlnx_ofed-4.0 commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc openmpi: mlnx_ofed_ompi_1.8/openmpi-4.0.2rc3-1.src.rpm opensm: mlnx_ofed_opensm/opensm-5.5.0.MLNX20190923.1c78385.tar.gz openvswitch: openvswitch.git mlnx_ofed_4_7 commit 5f4b5233413b20960336fa06ba309b53e9a99ae5 perftest: mlnx_ofed_perftest/perftest-4.4-0.8.g7af08be.tar.gz pka-mlxbf: mlnx_ofed_soc/pka-mlxbf-1.0-0.gd3455da.src.rpm qperf: /sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-4.6-1.0.1/SRPMS/qperf-0.4.9-9.46101.src.rpm rdma-core: mlnx_ofed/rdma-core.git mlnx_ofed_4_7 commit 87e9049a32f7cda5f5aa36145400fb660f4dbc18 rshim: mlnx_ofed_soc/rshim-1.8-0.g463f780.src.rpm sharp: mlnx_ofed_sharp/sharp-2.0.0.MLNX20190922.a9ebf22.tar.gz sockperf: sockperf/sockperf-3.6-0.git737d1e3e5576.src.rpm srp: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_7 commit 1c4bf4249c363d1ab555dcc246f88d99cae96175 srptools: srptools/srptools-41mlnx1-5.src.rpm tmfifo: mlnx_ofed_soc/tmfifo-1.3-0.g6aa30c7.src.rpm ucx: mlnx_ofed_ucx/ucx-1.7.0-1.src.rpm Installed Packages: ------------------- kmod-srp libibverbs-utils libibcm-devel ibacm dapl infiniband-diags qperf ucx-rdmacm knem libibverbs librxe libibmad-devel opensm mstflint ar_mgr ucx-cma mlnx-iproute2 kmod-mlnx-ofa_kernel kmod-knem kmod-rshim libibverbs-devel libmlx4-devel librxe-devel-static libibumad-devel libibmad-static librdmacm-utils opensm-devel dapl-devel-static ibutils ibdump ucx ucx-ib hcoll mlnxofed-docs mlnx-ofa_kernel-devel kmod-iser mpi-selector libibverbs-devel-static libmlx5 libibcm libibumad-static ibsim librdmacm-devel opensm-static dapl-utils srptools cc_mgr infiniband-diags-compat ucx-devel ucx-ib-cm openmpi mpitests_openmpi kmod-kernel-mft-mlnx libmlx5-devel libibmad opensm-libs perftest dump_pr sharp mlnx-ethtool mlnx-ofa_kernel kmod-isert libmlx4 libibumad librdmacm dapl-devel ibutils2 mxm ucx-knem

          People

            ashehata Amir Shehata (Inactive)
            mre64 Michael Ethier (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: