Lustre / LU-12824

Unable to add single InfiniBand interface to multiple o2ib LNets

    Description

      Configuring a single IB interface on multiple LNets was broken by

      commit 75ab841d92a7109cf9f4da69a58ae4d21d360a4c
      Author: James Simmons <jsimmons@infradead.org>
      Date:   Mon Jul 8 10:42:47 2019 -0700
      
         LU-11893 lnet: consoldate secondary IP address handling
      

      Prior to this commit, configuring an IB device for multiple LNets created only a single struct ib_dev object. That object was created by kiblnd_create_dev(), which initializes it with a call to kiblnd_dev_failover(). kiblnd_dev_failover() in turn creates the struct rdma_cm_id object and calls rdma_bind_addr(). Once the ib_dev object has been created successfully, it is added to a global list of devices:

              list_add_tail(&dev->ibd_list, &kiblnd_data.kib_devs);
      

      When the interface was added to additional LNets, the kiblnd_startup() routine searched the kiblnd_data.kib_devs list for an existing ib_dev object for the interface being configured. If it found one, that ib_dev object was reused.
      The LU-11893 patch noted above removed the logic for searching this list. kiblnd_startup() now always creates a new ib_dev object, so I believe the second rdma_bind_addr() call for the same address is what fails with EADDRINUSE (-98 in the logs below).
      It should be pretty straightforward to reintroduce the logic for searching the kib_devs list.

      Reproducer with kernel module parameter:

      [root@snx11922n002 ~]# cat /etc/lustre/ip2nets.dat
      o2ib040(ib0) 10.12.0.*;
      o2ib041(ib0) 10.12.0.50;
      [root@snx11922n002 ~]# modprobe lnet
      [root@snx11922n002 ~]# lctl net up
      LNET configure error 100: Network is down
      [root@snx11922n002 ~]# dmesg | tail
      [604327.506043] alg: No test for adler32 (adler32-zlib)
      [604327.512517] alg: No test for crc32 (crc32-table)
      [604328.280286] LNet: live_router_check_interval and dead_router_check_interval have been deprecated. Use alive_router_check_interval instead. Ignoring these deprecated parameters.
      [604330.561491] LNet: 3809:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth2: it's down
      [604330.591143] LNet: Using FastReg for registration
      [604330.614353] LNet: Added LNI 10.12.0.50@o2ib40 [16/2048/0/0]
      [604330.621410] LNetError: 3809:0:(o2iblnd.c:2776:kiblnd_dev_failover()) Failed to bind ib0:10.12.0.50 to device(ffff881f96ff8000): -98
      [604330.636010] LNetError: 3809:0:(o2iblnd.c:3266:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -98
      [604330.647163] LNetError: 105-4: Error -100 starting up LNI o2ib
      [604331.659240] LNet: Removed LNI 10.12.0.50@o2ib40
      

      Reproducer with lnetctl:

      [root@snx11922n002 ~]# modprobe lnet
      [root@snx11922n002 ~]# lctl mark mark
      [root@snx11922n002 ~]# lnetctl lnet configure
      [root@snx11922n002 ~]# lnetctl net add --net o2ib040 --if ib0
      [root@snx11922n002 ~]# lnetctl net add --net o2ib041 --if ib0
      add:
          - net:
                errno: -100
                descr: "cannot add network: Network is down"
      [root@snx11922n002 ~]# dmesg | tail
      [604760.221364] alg: No test for crc32 (crc32-table)
      [604760.983433] LNet: live_router_check_interval and dead_router_check_interval have been deprecated. Use alive_router_check_interval instead. Ignoring these deprecated parameters.
      [604763.557036] Lustre: DEBUG MARKER: mark
      [604777.372005] LNet: 7487:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth2: it's down
      [604777.382924] LNet: Using FastReg for registration
      [604777.402400] LNet: Added LNI 10.12.0.50@o2ib40 [16/2048/0/0]
      [604781.025699] LNet: 7528:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth2: it's down
      [604781.036209] LNetError: 7528:0:(o2iblnd.c:2776:kiblnd_dev_failover()) Failed to bind ib0:10.12.0.50 to device(ffff881f96ff8000): -98
      [604781.050103] LNetError: 7528:0:(o2iblnd.c:3266:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -98
      [604781.060933] LNetError: 105-4: Error -100 starting up LNI o2ib
      [root@snx11922n002 ~]#
      

          People

            hornc Chris Horn