Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.13.0, Lustre 2.12.3
    • master

    Description

      Somehow, o2iblnd in the latest master branch (225e7b8) is looking for the wrong IB interface.

      # cat /etc/modprobe.d/lustre.conf 
      options lnet networks="o2ib0(ib0)"
      
      # modprobe lustre
      modprobe: ERROR: could not insert 'lustre': Network is down
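
      For reference, the networks option above pairs an LNet network name (o2ib0) with the interface it should bind to (ib0). The following is a minimal userspace sketch of how a single net(interface) entry could be split. It is not Lustre's actual parser; the function name parse_net_entry and the buffer handling are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split one "net(iface)" entry such as "o2ib0(ib0)".
 * Returns 0 on success, -1 if the entry is malformed or a field does not fit.
 * Illustration only; this is not the parser used by LNet. */
static int parse_net_entry(const char *entry, char *net, size_t net_len,
                           char *iface, size_t iface_len)
{
    const char *lparen = strchr(entry, '(');
    const char *rparen = lparen ? strchr(lparen, ')') : NULL;

    /* Require exactly one "(...)" group that ends the string. */
    if (!lparen || !rparen || rparen[1] != '\0')
        return -1;
    /* Reject an empty network name or an empty interface name. */
    if (lparen == entry || rparen == lparen + 1)
        return -1;
    /* Make sure both fields fit, including the terminating NUL. */
    if ((size_t)(lparen - entry) >= net_len ||
        (size_t)(rparen - lparen - 1) >= iface_len)
        return -1;

    memcpy(net, entry, (size_t)(lparen - entry));
    net[lparen - entry] = '\0';
    memcpy(iface, lparen + 1, (size_t)(rparen - lparen - 1));
    iface[rparen - lparen - 1] = '\0';
    return 0;
}
```

      Called with "o2ib0(ib0)", this sketch yields the network name "o2ib0" and the interface name "ib0", which is the interface the report expects o2iblnd to use.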
      

      It is trying to use 'eth0' instead of ib0, although nothing on eth0 is configured for o2iblnd.

      Jun  4 03:10:10 sv160 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
      Jun  4 03:10:10 sv160 kernel: alg: No test for adler32 (adler32-zlib)
      Jun  4 03:10:11 sv160 kernel: Lustre: Lustre: Build Version: 2.12.53_117_g20fe8e6_dirty
      Jun  4 03:10:11 sv160 kernel: LNetError: 3057:0:(o2iblnd.c:2892:kiblnd_create_dev()) Can't query IPoIB interface eth0: it's down
      Jun  4 03:10:11 sv160 kernel: LNetError: 3057:0:(o2iblnd.c:2944:kiblnd_create_dev()) LIBCFS: free NULL 'dev' (352 bytes) at /tmp/rpmbuild-lustre-root-KMkogdPP/BUILD/lustre-2.12.53_117_g20fe8e6_dirty/lnet/klnds/o2iblnd/o2iblnd.c:2944
      Jun  4 03:10:12 sv160 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
      Jun  4 03:10:12 sv160 kernel: LustreError: 3057:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
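
      The log above shows the failure mode: the device lookup settled on eth0, which is down, instead of the configured ib0, and startup then failed with error -100 (on Linux this value normally corresponds to ENETDOWN, "Network is down"). As a rough illustration of the expected behavior, the sketch below matches on the configured name only, so an unrelated interface that happens to be down never affects the result. The struct iface type and the pick_interface function are hypothetical; they are not taken from o2iblnd.c and do not describe the actual fix.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a network interface. */
struct iface {
    const char *name;
    int up;            /* nonzero if the interface is up */
};

/* Return the index of the interface named `wanted`.
 * Returns -ENETDOWN if that interface exists but is down, and
 * -ENODEV if no interface has that name. Interfaces with other
 * names are skipped regardless of their state. */
static int pick_interface(const struct iface *tbl, size_t n,
                          const char *wanted)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tbl[i].name, wanted) != 0)
            continue;  /* not the interface that was requested */
        return tbl[i].up ? (int)i : -ENETDOWN;
    }
    return -ENODEV;
}
```

      With a table listing enp0s5 (up), eth0 (down), ib0 (up) and lo (up), as on the reporting node, asking for "ib0" returns its index, while only an explicit request for "eth0" would produce -ENETDOWN.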
      

          Activity

            [LU-12381] o2iblnd uses wrong IB interface

            simmonsja James A Simmons added a comment -

            Fixed.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35098/
            Subject: LU-12381 ko2iblnd: ignore down interfaces
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1dea5aac9d9be99c4b317a491f308872b97bf0e6

            pjones Peter Jones added a comment -

            Thanks James!


            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/35098
            Subject: LU-12381 ko2iblnd: ignore non IB devices
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a28aa551ddc39a52d73cff4383b10471c528dbaa

            simmonsja James A Simmons added a comment -

            Actually, I can push a one-line change to fix this if needed.

            pjones Peter Jones added a comment -

            Can we revert the broken patch in the meantime?


            simmonsja James A Simmons added a comment -

            Just master, but we need a bunch of fixes for 2.12 for proper IP alias support.

            pjones Peter Jones added a comment -

            Of course you can take the ticket! Do you know how long this has been broken? Is it only master broken or does this impact b2_12 at all?


            simmonsja James A Simmons added a comment -

            Actually, I know exactly what the bug is. I see what I did wrong and I know what the fix is. Peter, can I take the ticket? I can fix it with https://review.whamcloud.com/#/c/34993/

            pjones Peter Jones added a comment -

            Sonia

            Could you please investigate?

            Peter


            sihara Shuichi Ihara added a comment -

            Here is the ifconfig output.

            [root@sv160 lustre-release]# ifconfig -a
            enp0s5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
                    inet 10.36.5.160  netmask 255.255.240.0  broadcast 10.36.15.255
                    inet6 fe80::93ff:fe2e:34ae  prefixlen 64  scopeid 0x20<link>
                    ether 02:00:93:2e:34:ae  txqueuelen 1000  (Ethernet)
                    RX packets 2696139  bytes 686667727 (654.8 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 3690693  bytes 3414372140 (3.1 GiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            eth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
                    ether 52:54:00:12:34:56  txqueuelen 1000  (Ethernet)
                    RX packets 0  bytes 0 (0.0 B)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 0  bytes 0 (0.0 B)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
                    inet 172.16.254.160  netmask 255.255.0.0  broadcast 172.16.255.255
                    inet6 fe80::9a03:9b03:74:df48  prefixlen 64  scopeid 0x20<link>
            Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
                    infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
                    RX packets 17874  bytes 3318650 (3.1 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 18930  bytes 1188418 (1.1 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
                    inet 127.0.0.1  netmask 255.0.0.0
                    inet6 ::1  prefixlen 128  scopeid 0x10<host>
                    loop  txqueuelen 1000  (Local Loopback)
                    RX packets 48825  bytes 406567982 (387.7 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 48825  bytes 406567982 (387.7 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


            I disabled eth0 completely, then loaded the ko2iblnd module, and that works. It looks like Lustre is trying to use the first available interface, regardless of whether it is an IPoIB interface or the one defined in modprobe.conf?

            [root@sv160 lustre-release]# ifconfig -a
            enp0s5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
                    inet 10.36.5.160  netmask 255.255.240.0  broadcast 10.36.15.255
                    inet6 fe80::93ff:fe2e:34ae  prefixlen 64  scopeid 0x20<link>
                    ether 02:00:93:2e:34:ae  txqueuelen 1000  (Ethernet)
                    RX packets 2706039  bytes 687428990 (655.5 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 3693103  bytes 3414586308 (3.1 GiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
                    inet 172.16.254.160  netmask 255.255.0.0  broadcast 172.16.255.255
                    inet6 fe80::9a03:9b03:74:df48  prefixlen 64  scopeid 0x20<link>
            Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
                    infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
                    RX packets 18167  bytes 3348538 (3.1 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 19223  bytes 1205998 (1.1 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
                    inet 127.0.0.1  netmask 255.0.0.0
                    inet6 ::1  prefixlen 128  scopeid 0x10<host>
                    loop  txqueuelen 1000  (Local Loopback)
                    RX packets 49893  bytes 406646134 (387.8 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 49893  bytes 406646134 (387.8 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            # modprobe lustre
            # tail /var/log/messages
            Jun  4 04:44:42 sv160 kernel: LNet: Removed LNI 172.16.254.160@o2ib
            Jun  4 04:46:34 sv160 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
            Jun  4 04:46:34 sv160 kernel: alg: No test for adler32 (adler32-zlib)
            Jun  4 04:46:35 sv160 kernel: Lustre: Lustre: Build Version: 2.12.53_117_g20fe8e6_dirty
            Jun  4 04:46:35 sv160 kernel: LNet: Using FastReg for registration
            Jun  4 04:46:35 sv160 kernel: LNet: Added LNI 172.16.254.160@o2ib [8/256/0/180]

            People

              simmonsja James A Simmons
              sihara Shuichi Ihara