Lustre / LU-13367

lnet_handle_local_failure messages every 10 min?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.13.0
    • None
    • Environment: LNet routers built on Lenovo hardware with Lustre 2.13.0 installed. The installed IB card is a 2-port Lenovo ConnectX-5, with one port connected to the FDR fabric and one port connected to the HDR fabric.
    • 3

    Description

      Hello,
      We have 8 lnet routers in production. We have set them up using /etc/modprobe.d/lustre.conf as their configuration file. We are seeing many messages in /var/log/messages about the FDR and HDR IB interfaces being added to the recovery queue. The messages appear roughly every 10 minutes. Are these benign or serious? I searched and couldn't find any answers. The lnet routers appear to be processing data and no one is complaining at this point. Here is a sample of the messages and some other details below. Any insight would be appreciated as to their severity, and whether it's possible to fix the issue if there is one.
      Thanks,
      Mike

      Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
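
      To gauge how often each interface is being added to the recovery queue, a quick tally over the syslog can help (a rough sketch based on the log format above; the log path is the same /var/log/messages quoted here):

      grep 'added to recovery queue' /var/log/messages | grep -o 'ni [0-9.]*@o2ib[0-9]*' | sort | uniq -c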

      [root@cannonlnet07 ~]# nslookup 10.31.160.253
      253.160.31.10.in-addr.arpa name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# nslookup 10.31.179.178
      178.179.31.10.in-addr.arpa name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
      options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
      options lnet forwarding="enabled"
      options lnet lnet_peer_discovery_disabled=1
      [root@cannonlnet07 ~]# lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib
            local NI(s):
              - nid: 10.31.160.253@o2ib
                status: up
                interfaces:
                    0: ib1
          - net type: o2ib2
            local NI(s):
              - nid: 10.31.160.253@o2ib2
                status: up
                interfaces:
                    0: ib1
          - net type: o2ib4
            local NI(s):
              - nid: 10.31.179.178@o2ib4
                status: up
                interfaces:
                    0: ib0
          - net type: tcp
            local NI(s):
              - nid: 10.31.8.93@tcp
                status: down
                interfaces:
                    0: bond0
          - net type: tcp4
            local NI(s):
              - nid: 10.31.73.39@tcp4
                status: down
                interfaces:
                    0: bond0.2475
      [root@cannonlnet07 ~]# lnetctl stats show
        statistics:
        msgs_alloc: 1517
        msgs_max: 16396
        rst_alloc: 568
        errors: 0
        send_count: 9287639
        resend_count: 12378
        response_timeout_count: 28115
        local_interrupt_count: 0
        local_dropped_count: 24050
        local_aborted_count: 0
        local_no_route_count: 0
        local_timeout_count: 14188
        local_error_count: 0
        remote_dropped_count: 3862
        remote_error_count: 0
        remote_timeout_count: 0
        network_timeout_count: 0
        recv_count: 9287639
        route_count: 2744617426
        drop_count: 50252
        send_length: 1039854144
        recv_length: 283232
        route_length: 125943144442551
        drop_length: 24066088
        [root@cannonlnet07 ~]# lnetctl global show
        global:
        numa_range: 0
        max_intf: 200
        discovery: 0
        drop_asym_route: 0
        retry_count: 3
        transaction_timeout: 10
        health_sensitivity: 100
        recovery_interval: 1
        router_sensitivity: 100
        [root@cannonlnet07 ~]# ibstat
        CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.26.1040
        Hardware version: 0
        Node GUID: 0x98039b0300907de0
        System image GUID: 0x98039b0300907de0
        Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 1422
        LMC: 0
        SM lid: 1434
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300907de0
        Link layer: InfiniBand
        CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.26.1040
        Hardware version: 0
        Node GUID: 0x98039b0300907de1
        System image GUID: 0x98039b0300907de0
        Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 2259
        LMC: 0
        SM lid: 158
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300907de1
        Link layer: InfiniBand
        [root@cannonlnet07 ~]# ifconfig ib0
        ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        inet 10.31.179.178 netmask 255.255.240.0 broadcast 10.31.191.255
        inet6 fe80::9a03:9b03:90:7de0 prefixlen 64 scopeid 0x20<link>
        Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
        RX packets 343090 bytes 34310200 (32.7 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 112049 bytes 6723124 (6.4 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      [root@cannonlnet07 ~]# ifconfig ib1
      ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
      inet 10.31.160.253 netmask 255.255.240.0 broadcast 10.31.175.255
      inet6 fe80::9a03:9b03:90:7de1 prefixlen 64 scopeid 0x20<link>
      Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
      infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
      RX packets 495909 bytes 50980886 (48.6 MiB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 18846 bytes 1130904 (1.0 MiB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      Attachments

        1. lustre.log.gz
          34.16 MB
        2. lustre.log.gz
          34.16 MB

        Activity

          [LU-13367] lnet_handle_local_failure messages every 10 min?

          Hi, OK, thanks. Do you recommend we upgrade to a newer 2.13.x version, or drop down to 2.12.4, or to 2.12.5 when it comes out? We don't really care about the health and multi-rail functions. We have mostly clients at 2.10.7, the lnet routers at 2.13.0, Lustre storage at 2.12.3 and 2.12.4, and some really old Lustre filesystems running 2.5.34 that we are going to decommission.
          Mike

          mre64 Michael Ethier (Inactive) added a comment

          That message has been reduced in severity, as I indicated in a previous comment, so once you upgrade you won't see it anymore. If the hosts going up and down is expected, then there shouldn't be a problem beyond the noisiness of this message.

          The address resolution error is only seen when net logging is turned on.
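
          For reference, whether net logging is currently enabled can be checked with lctl; the set_param commands below are the same ones used elsewhere in this ticket, and get_param simply prints the active debug mask:

          lctl get_param debug        # 'net' appears in the mask when net logging is on
          lctl set_param debug=+net   # enable net logging
          lctl set_param debug=-net   # disable it again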

          ashehata Amir Shehata (Inactive) added a comment

          Hi,
          Those 2 nodes are valid. One is up and looks to be running fine, and the other is down. 10.31.167.172 has been up and running for over 1 day, see below (it could have been down when I collected the lustre log); the other one, 10.31.176.98, is down. So any host out there that is down is going to cause these messages? We will probably have some hosts down at any one time; we have hundreds of nodes on our 2 fabrics. These lnet routers talk to both fabrics and are the bridges between our FDR and HDR fabrics. This situation shouldn't cause big issues on our lnet, right? And I assume that ADDR ERROR -110 only shows up in the lustre debug log and won't be found in /var/log/messages?
          Mike

          [root@holy2c18110 ~]# ifconfig ib0
          ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
          inet 10.31.167.172 netmask 255.255.240.0 broadcast 10.31.175.255
          inet6 fe80::ee0d:9a03:12:8991 prefixlen 64 scopeid 0x20<link>
          Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
          infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 1024 (InfiniBand)
          RX packets 687361 bytes 39139716 (37.3 MiB)
          RX errors 0 dropped 0 overruns 0 frame 0
          TX packets 7705 bytes 654574 (639.2 KiB)
          TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

          [root@holy2c18110 ~]# ibstat
          CA 'mlx4_0'
          CA type: MT4099
          Number of ports: 2
          Firmware version: 2.42.5000
          Hardware version: 1
          Node GUID: 0xec0d9a0300128990
          System image GUID: 0xec0d9a0300128993
          Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 56
          Base lid: 832
          LMC: 0
          SM lid: 158
          Capability mask: 0x02514868
          Port GUID: 0xec0d9a0300128991
          Link layer: InfiniBand
          Port 2:
          State: Down
          Physical state: Polling
          Rate: 10
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x02514868
          Port GUID: 0xec0d9a0300128992
          Link layer: InfiniBand
          [root@holy2c18110 ~]# uptime
          17:24:26 up 1 day, 7:05, 1 user, load average: 0.00, 0.01, 0.05

          mre64 Michael Ethier (Inactive) added a comment

          So it looks like there are a couple of nodes which are causing all the problems:

          00000800:00000100:2.0:1584491283.682847:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.176.98@o2ib4: connection failed
          00000800:00000100:2.0:1584491283.691935:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.167.172@o2ib: connection failed

          Whenever we try to connect to these peers they fail. The code assumes the reason for the failure is local, so it puts the local NI (10.31.179.178@o2ib4) into recovery, and that's when you see the message.

          The reason connections to these NIDs are failing is:

          (o2iblnd_cb.c:3174:kiblnd_cm_callback()) 10.31.176.98@o2ib4: ADDR ERROR -110

          We're trying to resolve them and we're timing out.

          Are these nodes "real"? Are they left over configuration?
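
          One way to check whether these NIDs are actually reachable from the router is an LNet-level ping against them (a quick sanity check; the NIDs are the two from the log excerpts above):

          lctl ping 10.31.176.98@o2ib4    # times out or errors if the peer is down or is stale configuration
          lctl ping 10.31.167.172@o2ib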

          ashehata Amir Shehata (Inactive) added a comment

          Hello, have you had a chance to look at the log file to see if you can find the cause of the ongoing messages?
          Thanks,
          Mike

          mre64 Michael Ethier (Inactive) added a comment

          Hi, I still see the messages with the 100 setting, so I captured lustre.log when the message occurred. It's attached. The messages occurred at Mar 17 20:28:03; see below.

          Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max)
          Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 26 previous similar messages
          Mar 17 20:28:06 cannonlnet07 kernel: Lustre: shutting down debug daemon thread...

          lustre.log.gz

          mre64 Michael Ethier (Inactive) added a comment

          I'm wondering if there is a threshold for transaction_timeout where this goes away. Can you try setting it to 100?

          If you still see the problem, I would:

          lctl set_param debug=+net
          lctl debug_daemon start lustre.dk [megabytes] # make it as big as possible, e.g. 1G (if you have the space)
          # wait until the problem happens
          lctl debug_daemon stop
          lctl set_param debug=-net
          lctl debug_file lustre.dk lustre.log
          # attach or upload the lustre.log file depending on how big the file is

          Here is the relevant lustre manual section for debug_daemon commands:

          37.2.3.1. lctl debug_daemon Commands

          https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140539595165600
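
          Putting the steps above together, a rough capture wrapper could look like the following (a sketch only: the /tmp paths, the 1024 MB size, and the ~11 minute wait are assumptions based on the roughly 10-minute message interval reported above):

          #!/bin/bash
          # capture LNet debug traffic around the next recovery-queue message
          lctl set_param debug=+net
          lctl debug_daemon start /tmp/lustre.dk 1024   # ~1 GB buffer, adjust to available space
          sleep 660                                     # messages have been appearing about every 10 minutes
          lctl debug_daemon stop
          lctl set_param debug=-net
          lctl debug_file /tmp/lustre.dk /tmp/lustre.log
          # attach or upload /tmp/lustre.log to the ticket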

          ashehata Amir Shehata (Inactive) added a comment

          I set those 2 parameters and I still see the recovery messages:

          [root@cannonlnet07 ~]# lnetctl global show
          global:
          numa_range: 0
          max_intf: 200
          discovery: 0
          drop_asym_route: 0
          retry_count: 2
          transaction_timeout: 50
          health_sensitivity: 100
          recovery_interval: 1
          router_sensitivity: 100

          mre64 Michael Ethier (Inactive) added a comment
          ashehata Amir Shehata (Inactive) added a comment (edited)

          Before we capture the logs, can we try the recommendation below and monitor the errors?

          I see a few tx timeouts and a couple of PUT_NACKs. These could result in the failure of some RDMAs, which triggers the health code.

          I see that you have: transaction_timeout: 10 and retry_count: 3

          The defaults have since been changed to 50s and 2 respectively; we found that on larger clusters the shorter timeout causes RDMA timeouts. Can you try setting it to 50s? The patch which changed the defaults is:

          LU-13145 lnet: use conservative health timeouts

          You can set it manually:

          lnetctl set transaction_timeout 50 
          lnetctl set retry_count 2
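
          If these values work out, they could also be made persistent across reboots in the existing /etc/modprobe.d/lustre.conf (an assumption: this relies on the lnet_transaction_timeout and lnet_retry_count module parameters being available in this build):

          options lnet lnet_transaction_timeout=50 lnet_retry_count=2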

          Are there any other errors around the same time? Would it be possible to enable net logging (lctl set_param debug=+net) and capture the logs around that time?

          What are the additional commands to capture the "logs"?

          mre64 Michael Ethier (Inactive) added a comment

          [root@cannonlnet07 ~]# lnetctl stats show
          statistics:
          msgs_alloc: 1518
          msgs_max: 16396
          rst_alloc: 568
          errors: 0
          send_count: 9398737
          resend_count: 12378
          response_timeout_count: 28678
          local_interrupt_count: 0
          local_dropped_count: 24613
          local_aborted_count: 0
          local_no_route_count: 0
          local_timeout_count: 14188
          local_error_count: 0
          remote_dropped_count: 3862
          remote_error_count: 0
          remote_timeout_count: 0
          network_timeout_count: 0
          recv_count: 9398737
          route_count: 2778855491
          drop_count: 50252
          send_length: 1052296560
          recv_length: 283792
          route_length: 127811084565569
          drop_length: 24066088

          Last 200 lines of lnet messages; I don't see anything of big interest, do you?
          [root@cannonlnet07 ~]# tail -n 200 /var/log/messages |grep -v dhclient |grep -v puppet
          Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 12:36:53 cannonlnet07 kernel: LNetError: 79056:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
          Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.176.98@o2ib4: 1 seconds
          Mar 17 12:41:51 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 20 previous similar messages
          Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 12:46:53 cannonlnet07 kernel: LNetError: 81823:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 12:57:03 cannonlnet07 kernel: LNetError: 82106:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 13:07:08 cannonlnet07 kernel: LNetError: 83710:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 13:25:52 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds
          Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
          Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
          Mar 17 14:21:13 cannonlnet07 kernel: LNet: 3013:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.31.167.172@o2ib: 0 seconds
          Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 14:33:40 cannonlnet07 dbus[1267]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
          Mar 17 14:33:40 cannonlnet07 dbus[1267]: [system] Successfully activated service 'org.freedesktop.problems'
          Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
          Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 14:48:23 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
          Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 14:58:38 cannonlnet07 kernel: LNetError: 88976:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:08:48 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 15:18:53 cannonlnet07 kernel: LNetError: 91059:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
          Mar 17 15:28:53 cannonlnet07 kernel: LNetError: 92167:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:38:58 cannonlnet07 kernel: LNetError: 92448:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
          Mar 17 15:47:09 cannonlnet07 kernel: LNet: 3014:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4
          Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:49:08 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 87 previous similar messages
          Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.31.179.139@o2ib4
          Mar 17 15:51:00 cannonlnet07 kernel: LNet: 3015:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) Skipped 1 previous similar message
          Mar 17 15:56:14 cannonlnet07 dbus[1267]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
          Mar 17 15:56:14 cannonlnet07 dbus[1267]: [system] Successfully activated service 'org.freedesktop.problems'
          Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 15:59:13 cannonlnet07 kernel: LNetError: 93319:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 82 previous similar messages

          Sure, I will enable debug and send the logs.

          mre64 Michael Ethier (Inactive) added a comment

          People

            ashehata Amir Shehata (Inactive)
            mre64 Michael Ethier (Inactive)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: