Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13367

lnet_handle_local_failure messages every 10 min ?

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.13.0
    • None
    • Lnet routers built with Lenovo hardware with lustre 2.13.0 installed. IB card installed is a 2 port Lenovo ConnectX-5. One port connected to FDR one port connected to HDR fabric.
    • 3
    • 9223372036854775807

    Description

      Hello,
      We have 8 lnet routers in production. We have set them up to using /etc/modprobe.d/lustre.conf as their configuration file. We are seeing many messages in the /var/log/messages about the FDR and HDR ib interfaces in recovery. The messages appear every 10 min it seems. Are these benign or are they serious ? I search and couldn't seem to find any answers. It seems the lnet routers are processing data, no one is complaining at this point. Here is a sample of the messages below and some other details. And insight would be appreciated as to their severity, and if its possible to fix the issue if there is one.
      Thanks,
      Mike

      Mar 17 13:17:18 cannonlnet07 kernel: LNetError: 84267:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:27:23 cannonlnet07 kernel: LNetError: 84537:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 13:37:23 cannonlnet07 kernel: LNetError: 85075:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:47:33 cannonlnet07 kernel: LNetError: 85903:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 13:57:48 cannonlnet07 kernel: LNetError: 86445:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.160.253@o2ib added to recovery queue. Health = 900
      Mar 17 14:07:58 cannonlnet07 kernel: LNetError: 87049:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 86 previous similar messages
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:18:03 cannonlnet07 kernel: LNetError: 87442:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 83 previous similar messages
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:28:13 cannonlnet07 kernel: LNetError: 88018:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 85 previous similar messages
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.31.179.178@o2ib4 added to recovery queue. Health = 900
      Mar 17 14:38:18 cannonlnet07 kernel: LNetError: 88683:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 84 previous similar messages

      [root@cannonlnet07 ~]# nslookup 10.31.160.253
      253.160.31.10.in-addr.arpa name = cannonlnet07-fdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# nslookup 10.31.179.178
      178.179.31.10.in-addr.arpa name = cannonlnet07-hdr-ib.rc.fas.harvard.edu.

      [root@cannonlnet07 ~]# more /etc/modprobe.d/lustre.conf
      options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
      options lnet forwarding="enabled"
      options lnet lnet_peer_discovery_disabled=1
      [root@cannonlnet07 ~]# lnetctl net show
      net:

      • net type: lo
        local NI(s):
      • nid: 0@lo
        status: up
      • net type: o2ib
        local NI(s):
      • nid: 10.31.160.253@o2ib
        status: up
        interfaces:
        0: ib1
      • net type: o2ib2
        local NI(s):
      • nid: 10.31.160.253@o2ib2
        status: up
        interfaces:
        0: ib1
      • net type: o2ib4
        local NI(s):
      • nid: 10.31.179.178@o2ib4
        status: up
        interfaces:
        0: ib0
      • net type: tcp
        local NI(s):
      • nid: 10.31.8.93@tcp
        status: down
        interfaces:
        0: bond0
      • net type: tcp4
        local NI(s):
      • nid: 10.31.73.39@tcp4
        status: down
        interfaces:
        0: bond0.2475
        [root@cannonlnet07 ~]# lnetctl stats show
        statistics:
        msgs_alloc: 1517
        msgs_max: 16396
        rst_alloc: 568
        errors: 0
        send_count: 9287639
        resend_count: 12378
        response_timeout_count: 28115
        local_interrupt_count: 0
        local_dropped_count: 24050
        local_aborted_count: 0
        local_no_route_count: 0
        local_timeout_count: 14188
        local_error_count: 0
        remote_dropped_count: 3862
        remote_error_count: 0
        remote_timeout_count: 0
        network_timeout_count: 0
        recv_count: 9287639
        route_count: 2744617426
        drop_count: 50252
        send_length: 1039854144
        recv_length: 283232
        route_length: 125943144442551
        drop_length: 24066088
        [root@cannonlnet07 ~]# lnetctl global show
        global:
        numa_range: 0
        max_intf: 200
        discovery: 0
        drop_asym_route: 0
        retry_count: 3
        transaction_timeout: 10
        health_sensitivity: 100
        recovery_interval: 1
        router_sensitivity: 100
        [root@cannonlnet07 ~]# ibstat
        CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.26.1040
        Hardware version: 0
        Node GUID: 0x98039b0300907de0
        System image GUID: 0x98039b0300907de0
        Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 1422
        LMC: 0
        SM lid: 1434
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300907de0
        Link layer: InfiniBand
        CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.26.1040
        Hardware version: 0
        Node GUID: 0x98039b0300907de1
        System image GUID: 0x98039b0300907de0
        Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 2259
        LMC: 0
        SM lid: 158
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300907de1
        Link layer: InfiniBand
        [root@cannonlnet07 ~]# ifconfig ib0
        ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        inet 10.31.179.178 netmask 255.255.240.0 broadcast 10.31.191.255
        inet6 fe80::9a03:9b03:90:7de0 prefixlen 64 scopeid 0x20<link>
        Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
        RX packets 343090 bytes 34310200 (32.7 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 112049 bytes 6723124 (6.4 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      [root@cannonlnet07 ~]# ifconfig ib1
      ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
      inet 10.31.160.253 netmask 255.255.240.0 broadcast 10.31.175.255
      inet6 fe80::9a03:9b03:90:7de1 prefixlen 64 scopeid 0x20<link>
      Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
      infiniband 20:00:19:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
      RX packets 495909 bytes 50980886 (48.6 MiB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 18846 bytes 1130904 (1.0 MiB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      Attachments

        1. lustre.log.gz
          34.16 MB
        2. lustre.log.gz
          34.16 MB

        Activity

          People

            ashehata Amir Shehata (Inactive)
            mre64 Michael Ethier (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: