
doesn't handle logical network interface properly.

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0
    • None
    • None
    • 2.12
    • 3
    • 9223372036854775807

    Description

       # ifconfig | grep ib
       Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
       Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
       Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
       Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
       ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
       infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
       ib0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
       infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
       ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
       infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
       ib1:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
       infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
      

      Lustre-2.10.5 works well

       # cat /etc/modprobe.d/lustre.conf 
       options lnet networks="o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"
       # modprobe lustre
      
       Jan 28 12:52:17 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
       Jan 28 12:52:17 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
       Jan 28 12:52:18 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.10.5_ddn7_2_g7fd8383
       Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
       Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
       Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.252.20@o2ib1 [8/256/0/180]
       Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.253.20@o2ib2 [8/256/0/180]
       Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.254.20@o2ib3 [8/256/0/180]
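
The networks= option above maps each LNet network to one interface, and two of its four entries (ib0:0 and ib1:0) are logical interface names. The sketch below lists the net/interface pairs from such a string and marks the names containing a colon. It is illustrative only: list_lnet_interfaces is a hypothetical helper, not part of Lustre, and it assumes every entry carries an explicit interface list in parentheses.

```shell
# Hypothetical helper: print "net interface kind" for each interface named in
# an LNet networks= string. kind is "alias" when the name contains ':'
# (for example ib0:0) and "device" otherwise.
# Assumption: every entry has the form net(if1,if2,...).
list_lnet_interfaces() {
    # Strip spaces, then split on ")" so each "net(if1,if2" is on its own line.
    printf '%s\n' "$1" | tr -d ' ' | tr ')' '\n' | while IFS= read -r entry; do
        entry=${entry#,}              # drop the comma that separated entries
        [ -n "$entry" ] || continue   # skip the empty trailing line
        net=${entry%%\(*}             # text before "(" is the net name
        iflist=${entry#*\(}           # text after "(" is the interface list
        for ifname in $(printf '%s' "$iflist" | tr ',' ' '); do
            case $ifname in
                *:*) kind=alias ;;
                *)   kind=device ;;
            esac
            printf '%s %s %s\n' "$net" "$ifname" "$kind"
        done
    done
}
```

Calling list_lnet_interfaces with the string from lustre.conf above prints one line per interface, which makes it easy to see which nets depend on a logical interface before loading the module.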
      

      Lustre-2.12 doesn't handle logical interfaces properly

      # modprobe lustre
      modprobe: ERROR: could not insert 'lustre': Network is down
      
      Jan 28 13:00:56 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
      Jan 28 13:00:56 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
      Jan 28 13:00:57 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.12.0
      Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
      Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
      Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(lib-socket.c:105:lnet_ipif_query()) Can't get flags for interface ib0:0
      Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(o2iblnd.c:2879:kiblnd_create_dev()) Can't query IPoIB interface ib0:0: -19
      Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
      Jan 28 13:00:58 ai200-7f94-vm00 kernel: LNet: Removed LNI 172.16.251.20@o2ib
      Jan 28 13:00:58 ai200-7f94-vm00 kernel: LustreError: 6305:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
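
The numeric codes in the log above are negative Linux errno values: -19 from kiblnd_create_dev() and -100 from the LNI startup, the latter matching the "Network is down" message printed by modprobe. A small lookup for reading these logs is sketched below. lnet_errno_name is a hypothetical helper, and it covers only the three values that appear in this ticket (-2 shows up in the socklnd logs in the comments).

```shell
# Hypothetical helper: translate the negative errno values seen in the LNet
# log lines of this ticket into their Linux names.
# Only the three codes that occur here are covered.
lnet_errno_name() {
    case ${1#-} in                      # strip the leading minus sign
        2)   echo "ENOENT (No such file or directory)" ;;
        19)  echo "ENODEV (No such device)" ;;
        100) echo "ENETDOWN (Network is down)" ;;
        *)   echo "unknown errno $1" ;;
    esac
}
```

For example, lnet_errno_name -19 reports ENODEV, which fits the interface name ib0:0 not being found as a device.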
      

          Activity

            [LU-11893] doesn't handle logical network interface properly.

            I have updated both:

            https://review.whamcloud.com/#/c/34392

            https://review.whamcloud.com/#/c/33968

            I have tested the above combo and it works for socklnd. Haven't tried o2iblnd just yet.

            simmonsja James A Simmons added a comment
            Patch https://review.whamcloud.com/#/c/34392 is the base patch. The rest need to be rebased. Patch 34392 might be just what is needed to resolve this bug. Give it a try by itself.

            simmonsja James A Simmons added a comment (edited)

            Is patch https://review.whamcloud.com/34392 meant to be applied on top of https://review.whamcloud.com/#/c/33968 and https://review.whamcloud.com/#/c/34234 ? Patch 34392 conflicts in that case, though.
            I did a quick test with only patch 34392, but it still didn't help. I will collect a debug log to see what is going on.

            sihara Shuichi Ihara added a comment

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34392
            Subject: LU-11893 lnet: add secondary IP address handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d87840dcf8f42fe84bc08c7364df52efc0716121

            gerrit Gerrit Updater added a comment

            So you are using the pre-DLC method. What has been happening is that the code has been moving toward having each LND driver handle the interface mapping. In the ksocklnd case it was assumed that in a non-Multi-Rail setup the default is just one interface, and that if the module parameter "use_tcp_bonding" is enabled, all interfaces are mapped to the defined net. The ko2iblnd driver doesn't even handle multiple interfaces in the MR case. For ksocklnd Multi-Rail the user can specify which interfaces to use. So there are general bugs all over the place for this stuff.

            simmonsja James A Simmons added a comment

            This is not specifically an IPoIB interface problem; a normal Ethernet adapter has the same problem.

            # ifconfig eno1:0 10.128.11.184 netmask 255.255.248.0
            # ifconfig eno1:1 10.128.12.184 netmask 255.255.248.0
            # echo  'options lnet networks="tcp10(eno1:0,eno1:1)"' > /etc/modprobe.d/lustre.conf
            # lustre_rmmod ; modprobe lustre
            modprobe: ERROR: could not insert 'lustre': Network is down
            
            [87853.150375] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
            [87853.154186] alg: No test for adler32 (adler32-zlib)
            [87853.962152] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
            [87854.057590] LNetError: 314552:0:(socklnd.c:2610:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
            [87854.057651] LNetError: 314552:0:(socklnd.c:2828:ksocknal_startup()) Can't get interface eno1:0 info: -2
            [87855.057635] LNetError: 105-4: Error -100 starting up LNI tcp
            [87855.057984] LustreError: 314552:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
            
            sihara Shuichi Ihara added a comment
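
For context, on Linux a name such as eno1:0 is a label attached to an additional IP address on eno1 rather than a separate network device, which is consistent with the "Can't get interface eno1:0 info: -2" error in the reproduction above. The sketch below shows that relationship; it is not taken from the Lustre patches, and base_ifname and parent_device_exists are hypothetical helpers.

```shell
# Hypothetical helper: return the parent device of a logical interface name.
# "eno1:0" -> "eno1"; a plain device name is returned unchanged.
base_ifname() {
    printf '%s\n' "${1%%:*}"
}

# Hypothetical check: does the parent device exist on this host?
# /sys/class/net lists network devices only; alias labels such as eno1:0
# do not get an entry of their own there.
parent_device_exists() {
    [ -e "/sys/class/net/$(base_ifname "$1")" ]
}
```

With these helpers, base_ifname eno1:0 yields eno1, and parent_device_exists eno1:0 succeeds on a host where eno1 is present even though no device named eno1:0 exists.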

            Hello James,
            the patch still didn't help with logical interfaces on IPoIB (o2iblnd):

            # ifconfig ib0:0 10.0.100.184 netmask 255.240.0.0
            # ifconfig ib0:1 10.0.101.184 netmask 255.240.0.0
            # echo  'options lnet networks="o2ib10(ib0:0,ib0:1)"' > /etc/modprobe.d/lustre.conf
            # lustre_rmmod ; modprobe lustre
            modprobe: ERROR: could not insert 'lustre': Network is down
            
            [86757.579401] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
            [86757.583579] alg: No test for adler32 (adler32-zlib)
            [86758.393116] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
            [86758.527410] LNetError: 314255:0:(o2iblnd.c:2883:kiblnd_create_dev()) Can't find IPoIB interface ib0:0
            [86759.526610] LNetError: 105-4: Error -100 starting up LNI o2ib
            [86759.526954] LustreError: 314255:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
            

            socklnd still has the same problem.

            # echo  'options lnet networks="tcp10(ib0:0,ib0:1)"' > /etc/modprobe.d/lustre.conf
            # lustre_rmmod ; modprobe lustre
            modprobe: ERROR: could not insert 'lustre': Network is down
            
            [87053.811185] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
            [87053.815151] alg: No test for adler32 (adler32-zlib)
            [87054.628463] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
            [87054.726405] LNetError: 314402:0:(socklnd.c:2610:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
            [87054.726457] LNetError: 314402:0:(socklnd.c:2828:ksocknal_startup()) Can't get interface ib0:0 info: -2
            [87055.726386] LNetError: 105-4: Error -100 starting up LNI tcp
            [87055.726698] LustreError: 314402:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
            
            sihara Shuichi Ihara added a comment (edited)

            Try the following patches with current master:

            https://review.whamcloud.com/#/c/33968

            https://review.whamcloud.com/#/c/34234

            simmonsja James A Simmons added a comment

            OK, I would change the priority to Major, since this is an LNet configuration compatibility issue for any server or client that has been using logical interfaces.

            sihara Shuichi Ihara added a comment

            The last patch, which I haven't pushed yet, will completely remove lnet_sock_ioctl(). Before doing that I need to remove all the uses of lnet_ipif_enumerate(), which doesn't work in 4.18 kernels (RHEL8). I expect all of this will be backported to 2.12 LTS.

            simmonsja James A Simmons added a comment

            Which exact patch under LU-11838 could solve the problem? A number of LU-11838 patches have already landed in master, and I've tested the latest master (263e80f), but it still didn't help.

            sihara Shuichi Ihara added a comment

            People

              simmonsja James A Simmons
              sihara Shuichi Ihara