[LU-12381] o2iblnd uses wrong IB interface Created: 04/Jun/19 Updated: 03/Jul/19 Resolved: 17/Jun/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
master |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Somehow, o2iblnd in the latest master branch (225e7b8) is looking for the wrong IB interface.
# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down
It is trying to use 'eth0' instead of ib0, even though eth0 is not configured for o2iblnd at all.
Jun 4 03:10:10 sv160 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jun 4 03:10:10 sv160 kernel: alg: No test for adler32 (adler32-zlib)
Jun 4 03:10:11 sv160 kernel: Lustre: Lustre: Build Version: 2.12.53_117_g20fe8e6_dirty
Jun 4 03:10:11 sv160 kernel: LNetError: 3057:0:(o2iblnd.c:2892:kiblnd_create_dev()) Can't query IPoIB interface eth0: it's down
Jun 4 03:10:11 sv160 kernel: LNetError: 3057:0:(o2iblnd.c:2944:kiblnd_create_dev()) LIBCFS: free NULL 'dev' (352 bytes) at /tmp/rpmbuild-lustre-root-KMkogdPP/BUILD/lustre-2.12.53_117_g20fe8e6_dirty/lnet/klnds/o2iblnd/o2iblnd.c:2944
Jun 4 03:10:12 sv160 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
Jun 4 03:10:12 sv160 kernel: LustreError: 3057:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed |
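The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of Lustre: `parse_networks` and `queried_interface` are hypothetical helper names, and the parser assumes only the simple `net(iface[,iface...])` form of the `networks` option rather than the full LNet grammar. It compares the interface configured in lustre.conf with the one named in the `kiblnd_create_dev()` error.

```python
import re


def parse_networks(spec):
    """Map each LNet network name to the interfaces listed for it.

    Assumption: only the simple 'net(iface[,iface...])' form is handled.
    """
    networks = {}
    for net, ifaces in re.findall(r"([A-Za-z0-9]+)(?:\(([^)]*)\))?", spec):
        networks[net] = [i.strip() for i in ifaces.split(",") if i.strip()]
    return networks


def queried_interface(log_line):
    """Return the interface named in a "Can't query IPoIB interface" error, or None."""
    match = re.search(r"Can't query IPoIB interface (\S+):", log_line)
    return match.group(1) if match else None


# Values copied from the description above.
configured = parse_networks("o2ib0(ib0)")
log_line = ("LNetError: 3057:0:(o2iblnd.c:2892:kiblnd_create_dev()) "
            "Can't query IPoIB interface eth0: it's down")
used = queried_interface(log_line)

# True when the LND queried an interface that was never configured.
mismatch = used not in configured["o2ib0"]
print(configured, used, mismatch)
```

Here the configured interface is ib0 while the error names eth0, so `mismatch` is true, which matches the reported symptom.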
| Comments |
| Comment by Shuichi Ihara [ 04/Jun/19 ] |
|
Here is the ifconfig output.
[root@sv160 lustre-release]# ifconfig -a
enp0s5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.36.5.160 netmask 255.255.240.0 broadcast 10.36.15.255
inet6 fe80::93ff:fe2e:34ae prefixlen 64 scopeid 0x20<link>
ether 02:00:93:2e:34:ae txqueuelen 1000 (Ethernet)
RX packets 2696139 bytes 686667727 (654.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3690693 bytes 3414372140 (3.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 52:54:00:12:34:56 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 172.16.254.160 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::9a03:9b03:74:df48 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 17874 bytes 3318650 (3.1 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 18930 bytes 1188418 (1.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 48825 bytes 406567982 (387.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 48825 bytes 406567982 (387.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I disabled eth0 completely and then loaded the ko2iblnd module, and that works. It looks like Lustre is trying to use the first available interface, regardless of whether it is IPoIB and regardless of what is defined in modprobe.conf?
[root@sv160 lustre-release]# ifconfig -a
enp0s5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.36.5.160 netmask 255.255.240.0 broadcast 10.36.15.255
inet6 fe80::93ff:fe2e:34ae prefixlen 64 scopeid 0x20<link>
ether 02:00:93:2e:34:ae txqueuelen 1000 (Ethernet)
RX packets 2706039 bytes 687428990 (655.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3693103 bytes 3414586308 (3.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 172.16.254.160 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::9a03:9b03:74:df48 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 18167 bytes 3348538 (3.1 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 19223 bytes 1205998 (1.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 49893 bytes 406646134 (387.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 49893 bytes 406646134 (387.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
# modprobe lustre
# tail /var/log/messages
Jun 4 04:44:42 sv160 kernel: LNet: Removed LNI 172.16.254.160@o2ib
Jun 4 04:46:34 sv160 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jun 4 04:46:34 sv160 kernel: alg: No test for adler32 (adler32-zlib)
Jun 4 04:46:35 sv160 kernel: Lustre: Lustre: Build Version: 2.12.53_117_g20fe8e6_dirty
Jun 4 04:46:35 sv160 kernel: LNet: Using FastReg for registration
Jun 4 04:46:35 sv160 kernel: LNet: Added LNI 172.16.254.160@o2ib [8/256/0/180] |
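The `flags=` values in the ifconfig listings above show why the first attempt reported eth0 as down. As a small sketch (the helper names `decode_flags` and `is_up` are illustrative, and only the flag bits that appear in this ticket are included, with values from Linux `<linux/if.h>`), the numeric flags can be decoded like this:

```python
# Interface flag bits from Linux <linux/if.h>; only those seen in this ticket.
IFF_FLAGS = {
    0x1: "UP",
    0x2: "BROADCAST",
    0x8: "LOOPBACK",
    0x40: "RUNNING",
    0x1000: "MULTICAST",
}


def decode_flags(value):
    """Return the names of the known flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(IFF_FLAGS.items()) if value & bit]


def is_up(value):
    """An interface is administratively up when the IFF_UP bit (0x1) is set."""
    return bool(value & 0x1)


# Flag values copied from the first ifconfig output in this comment.
for name, flags in [("enp0s5", 4163), ("eth0", 4098), ("ib0", 4163), ("lo", 73)]:
    print(name, decode_flags(flags), "up" if is_up(flags) else "down")
```

eth0 (4098) carries only BROADCAST and MULTICAST, so it is down, while ib0 (4163) is UP and RUNNING. That is consistent with the "Can't query IPoIB interface eth0: it's down" error, and with the module loading correctly once eth0 was out of the picture.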
| Comment by Peter Jones [ 05/Jun/19 ] |
|
Sonia, could you please investigate? Peter |
| Comment by James A Simmons [ 05/Jun/19 ] |
|
Actually, I know exactly what the bug is. I see what I did wrong and I know what the fix is. Peter, can I take the ticket? I can fix it with https://review.whamcloud.com/#/c/34993/ |
| Comment by Peter Jones [ 05/Jun/19 ] |
|
Of course you can take the ticket! Do you know how long this has been broken? Is only master broken, or does this impact b2_12 at all? |
| Comment by James A Simmons [ 05/Jun/19 ] |
|
Just master, but we need a bunch of fixes for 2.12 for proper IP alias support. |
| Comment by Peter Jones [ 07/Jun/19 ] |
|
Can we revert the broken patch in the meantime? |
| Comment by James A Simmons [ 07/Jun/19 ] |
|
Actually, I can push a one-line change to fix this if needed. |
| Comment by Gerrit Updater [ 07/Jun/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/35098 |
| Comment by Peter Jones [ 07/Jun/19 ] |
|
Thanks James! |
| Comment by Gerrit Updater [ 16/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35098/ |
| Comment by James A Simmons [ 17/Jun/19 ] |
|
Fixed. |
| Comment by Gerrit Updater [ 17/Jun/19 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35249 |
| Comment by Gerrit Updater [ 03/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35249/ |