Lustre / LU-13367

lnet_handle_local_failure messages every 10 min?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Environment: LNet routers built on Lenovo hardware with Lustre 2.13.0 installed. The IB card is a 2-port Lenovo ConnectX-5, with one port connected to the FDR fabric and one to the HDR fabric.
    • Severity: 3

    Description

      Hello,
      We have 8 lnet routers in production. We have set them up using /etc/modprobe.d/lustre.conf as their configuration file. We are seeing many messages in /var/log/messages about the FDR and HDR IB interfaces being in recovery. The messages appear every 10 minutes, it seems. Are these benign or are they serious? I searched and couldn't find any answers. The lnet routers appear to be processing data, and no one is complaining at this point. Here is a sample of the messages below, along with some other details. Any insight into their severity would be appreciated, and whether it's possible to fix the issue if there is one.
      Thanks,
      Mike

      Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
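
      For reference, a minimal sketch of how the per-NI health state behind these messages can be inspected on the router; the verbosity level needed for the health stats section and the availability of the debug sub-command vary by release, so treat these as assumptions rather than output from this system:

      lnetctl net show --net o2ib4 -v 3     # higher verbosity adds a "health stats" section with the NI health value
      lnetctl net show --net o2ib -v 3
      lnetctl debug recovery --local        # if available in this release: lists local NIs currently on the recovery queue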

      [root@cannonlnet07 ~]# nslookup 10.31.160.253
      253.160.31.10.in-addr.arpa name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# nslookup 10.31.179.178
      178.179.31.10.in-addr.arpa name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
      options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
      options lnet forwarding="enabled"
      options lnet lnet_peer_discovery_disabled=1
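
      The same configuration could also be applied at runtime with lnetctl and persisted for the lnet service; a minimal sketch, assuming the stock lnetctl from this release, with interface names taken from the modprobe options above:

      modprobe lnet
      lnetctl lnet configure
      lnetctl net add --net o2ib --if ib1
      lnetctl net add --net o2ib2 --if ib1
      lnetctl net add --net o2ib4 --if ib0
      lnetctl net add --net tcp --if bond0
      lnetctl net add --net tcp4 --if bond0.2475
      lnetctl set routing 1                    # equivalent of forwarding="enabled"
      lnetctl set discovery 0                  # equivalent of lnet_peer_discovery_disabled=1
      lnetctl export > /etc/lnet.conf          # picked up by "systemctl start lnet" on boot
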
      [root@cannonlnet07 ~]# lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib
            local NI(s):
              - nid: 10.31.160.253@o2ib
                status: up
                interfaces:
                    0: ib1
          - net type: o2ib2
            local NI(s):
              - nid: 10.31.160.253@o2ib2
                status: up
                interfaces:
                    0: ib1
          - net type: o2ib4
            local NI(s):
              - nid: 10.31.179.178@o2ib4
                status: up
                interfaces:
                    0: ib0
          - net type: tcp
            local NI(s):
              - nid: 10.31.8.93@tcp
                status: down
                interfaces:
                    0: bond0
          - net type: tcp4
            local NI(s):
              - nid: 10.31.73.39@tcp4
                status: down
                interfaces:
                    0: bond0.2475

      [root@cannonlnet07 ~]# lnetctl stats show
      statistics:
          msgs_alloc: 1517
          msgs_max: 16396
          rst_alloc: 568
          errors: 0
          send_count: 9287639
          resend_count: 12378
          response_timeout_count: 28115
          local_interrupt_count: 0
          local_dropped_count: 24050
          local_aborted_count: 0
          local_no_route_count: 0
          local_timeout_count: 14188
          local_error_count: 0
          remote_dropped_count: 3862
          remote_error_count: 0
          remote_timeout_count: 0
          network_timeout_count: 0
          recv_count: 9287639
          route_count: 2744617426
          drop_count: 50252
          send_length: 1039854144
          recv_length: 283232
          route_length: 125943144442551
          drop_length: 24066088

      [root@cannonlnet07 ~]# lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 0
          drop_asym_route: 0
          retry_count: 3
          transaction_timeout: 10
          health_sensitivity: 100
          recovery_interval: 1
          router_sensitivity: 100
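
      The recovery-queue messages are driven by the LNet health feature reflected in the health_sensitivity value above; a minimal sketch of how that behaviour could be tuned, with illustrative values rather than a recommendation from this ticket:

      lnetctl set health_sensitivity 0     # 0 disables health-based NI recovery, so the messages should stop
      lnetctl set transaction_timeout 50   # illustrative: tolerate slower peers before a timeout counts against health
      lnetctl set retry_count 2            # illustrative: fewer automatic resends per message
      lnetctl global show                  # confirm the new values
      # to persist across reboots, the equivalent module option is lnet_health_sensitivity=0 in lustre.conf
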
      [root@cannonlnet07 ~]# ibstat
      CA 'mlx5_0'
              CA type: MT4119
              Number of ports: 1
              Firmware version: 16.26.1040
              Hardware version: 0
              Node GUID: 0x98039b0300907de0
              System image GUID: 0x98039b0300907de0
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 100
                      Base lid: 1422
                      LMC: 0
                      SM lid: 1434
                      Capability mask: 0x2651e848
                      Port GUID: 0x98039b0300907de0
                      Link layer: InfiniBand
      CA 'mlx5_1'
              CA type: MT4119
              Number of ports: 1
              Firmware version: 16.26.1040
              Hardware version: 0
              Node GUID: 0x98039b0300907de1
              System image GUID: 0x98039b0300907de0
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 56
                      Base lid: 2259
                      LMC: 0
                      SM lid: 158
                      Capability mask: 0x2651e848
                      Port GUID: 0x98039b0300907de1
                      Link layer: InfiniBand

      [root@cannonlnet07 ~]# ifconfig ib0
      ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
      inet 10.31.179.178 netmask 255.255.240.0 broadcast 10.31.191.255
      inet6 fe80::9a03:9b03:90:7de0 prefixlen 64 scopeid 0x20<link>
      Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
      infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
      RX packets 343090 bytes 34310200 (32.7 MiB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 112049 bytes 6723124 (6.4 MiB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      [root@cannonlnet07 ~]# ifconfig ib1
      ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
      inet 10.31.160.253 netmask 255.255.240.0 broadcast 10.31.175.255
      inet6 fe80::9a03:9b03:90:7de1 prefixlen 64 scopeid 0x20<link>
      Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
      infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
      RX packets 495909 bytes 50980886 (48.6 MiB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 18846 bytes 1130904 (1.0 MiB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      Attachments

        1. lustre.log.gz
          34.16 MB

        Activity

          jhammond John Hammond added a comment -

          Both changes referenced above (LU-13071 and LU-13145) are in b2_12.


          mre64 Michael Ethier (Inactive) added a comment -

          Thanks for the help. I will let others on my team know about the 2.12 vs 2.13 recommendations.

          ashehata Amir Shehata (Inactive) added a comment -

          Sure. If there is no need for the features in 2.13, then the latest 2.12.x would suffice.

          pjones Peter Jones added a comment -

          My recommendation would be to use a 2.12.x release. If there is a bug fix missing from the 2.12.x branch we can include that in 2.12.5. ashehata do you agree?


          mre64 Michael Ethier (Inactive) added a comment -

          Hi, ok thanks. Do you recommend we upgrade to a newer 2.13.x release, or drop down to 2.12.4 or 2.12.5 when it comes out? We don't really care about the health and multi-rail functions. We have mostly clients at 2.10.7, the lnet routers at 2.13.0, and lustre storage at 2.12.3 and 2.12.4, plus some really old lustre filesystems running 2.5.34 that we are going to decommission.
          Mike

          ashehata Amir Shehata (Inactive) added a comment -

          That message has been reduced in severity, as I indicated in a previous comment, so once you upgrade you won't see it anymore. If the hosts coming up and down is expected, then there shouldn't be a problem beyond the noisiness of this message.

          The address resolution error is only seen when net logging is turned on.

          mre64 Michael Ethier (Inactive) added a comment -

          Hi,
          Those 2 nodes are valid. One is up and looks to be running fine; the other is down. 10.31.167.172 has been up and running for over 1 day, see below (it could have been down when I collected the lustre log); the other one, 10.31.176.98, is down. So any host out there that is down is going to cause these messages? We will probably have some hosts down at any one time; we have hundreds of nodes on our 2 fabrics. These lnet routers talk to both fabrics and are the bridges between our FDR and HDR fabrics. This situation shouldn't cause big issues on our lnet? I assume that ADDR ERROR -110 is in the lustre debug log and won't be found in /var/log/messages.
          Mike

          [root@holy2c18110 ~]# ifconfig ib0
          ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
          inet 10.31.167.172 netmask 255.255.240.0 broadcast 10.31.175.255
          inet6 fe80::ee0d:9a03:12:8991 prefixlen 64 scopeid 0x20<link>
          Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
          infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 1024 (InfiniBand)
          RX packets 687361 bytes 39139716 (37.3 MiB)
          RX errors 0 dropped 0 overruns 0 frame 0
          TX packets 7705 bytes 654574 (639.2 KiB)
          TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

          [root@holy2c18110 ~]# ibstat
          CA 'mlx4_0'
                  CA type: MT4099
                  Number of ports: 2
                  Firmware version: 2.42.5000
                  Hardware version: 1
                  Node GUID: 0xec0d9a0300128990
                  System image GUID: 0xec0d9a0300128993
                  Port 1:
                          State: Active
                          Physical state: LinkUp
                          Rate: 56
                          Base lid: 832
                          LMC: 0
                          SM lid: 158
                          Capability mask: 0x02514868
                          Port GUID: 0xec0d9a0300128991
                          Link layer: InfiniBand
                  Port 2:
                          State: Down
                          Physical state: Polling
                          Rate: 10
                          Base lid: 0
                          LMC: 0
                          SM lid: 0
                          Capability mask: 0x02514868
                          Port GUID: 0xec0d9a0300128992
                          Link layer: InfiniBand
          [root@holy2c18110 ~]# uptime
          17:24:26 up 1 day, 7:05, 1 user, load average: 0.00, 0.01, 0.05


          ashehata Amir Shehata (Inactive) added a comment -

          So it looks like there are a couple of nodes which are causing all the problems:

          00000800:00000100:2.0:1584491283.682847:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.176.98@o2ib4: connection failed
          00000800:00000100:2.0:1584491283.691935:0:110122:0:(o2iblnd_cb.c:2289:kiblnd_peer_connect_failed()) Deleting messages for 10.31.167.172@o2ib: connection failed

          Whenever we try to connect to these peers they fail. The code assumes the reason for the failure is local, so it puts the local NI: 10.31.179.178@o2ib4 into recovery and that's when you see the message.

          The reason connections to these NIDs are failing is:

          (o2iblnd_cb.c:3174:kiblnd_cm_callback()) 10.31.176.98@o2ib4: ADDR ERROR -110

          We're trying to resolve them and we're timing out.

          Are these nodes "real", or are they leftover configuration?

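          A minimal sketch of how the two suspect NIDs could be checked from the router, assuming standard lnetctl and ping usage rather than commands taken from this ticket:

          lnetctl ping 10.31.176.98@o2ib4                 # LNet-level ping of the suspect peer NID
          lnetctl ping 10.31.167.172@o2ib
          lnetctl peer show --nid 10.31.176.98@o2ib4      # any peer state currently held for it
          ping -c 3 10.31.176.98                          # basic IPoIB reachability
          lnetctl peer del --prim_nid 10.31.176.98@o2ib4  # only if the node is truly gone and its peer entry is stale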

          mre64 Michael Ethier (Inactive) added a comment -

          Hello, have you had a chance to look at the log file to see if you can find the cause of the ongoing messages?
          Thanks,
          Mike

          mre64 Michael Ethier (Inactive) added a comment -

          Hi, I still have the messages with the 100 setting, so I captured lustre.log when the message occurred; it is attached. The messages occurred at Mar 17 20:28:03, see below.

          Mar 17 20:27:26 cannonlnet07 kernel: Lustre: debug daemon will attempt to start writing to /root/lustre.dk (512000kB max)
          Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
          Mar 17 20:28:03 cannonlnet07 kernel: LNetError: 110122:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 26 previous similar messages
          Mar 17 20:28:06 cannonlnet07 kernel: Lustre: shutting down debug daemon thread...

          lustre.log.gz
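
          For reference, a capture sequence along these lines would produce a debug dump like the attached one; a minimal sketch using standard lctl commands, with the debug mask and size assumed rather than taken from the ticket:

          lctl set_param debug=+net                          # assumed: add net tracing to the debug mask
          lctl debug_daemon start /root/lustre.dk 500        # 500 MB matches the "512000kB max" in the console message
          # ... wait for the recovery-queue message to appear in /var/log/messages ...
          lctl debug_daemon stop
          lctl debug_file /root/lustre.dk /root/lustre.log   # convert the binary dump to readable text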


          People

            Assignee:
            ashehata Amir Shehata (Inactive)
            Reporter:
            mre64 Michael Ethier (Inactive)
            Votes:
            0
            Watchers:
            4
