Details
Bug
Resolution: Fixed
Major
None
None
2.12
3
Description
# ifconfig | grep ib
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
ib0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
ib1:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
        infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
Lustre 2.10.5 works well:
# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"
# modprobe lustre
Jan 28 12:52:17 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jan 28 12:52:17 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 28 12:52:18 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.10.5_ddn7_2_g7fd8383
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.252.20@o2ib1 [8/256/0/180]
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.253.20@o2ib2 [8/256/0/180]
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.254.20@o2ib3 [8/256/0/180]
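The networks option in this report maps four LNet networks onto two IPoIB ports plus one alias label on each port. As a rough illustration only, the Python sketch below splits such a string into (net, interface) pairs and marks which entries are alias labels of the form base:label. The helper names parse_networks and is_alias are hypothetical, and the regular expression is a simplification that covers only the net(interface) form used here, not the full syntax that LNet accepts.

```python
import re

def parse_networks(spec):
    """Split an lnet 'networks' string into (net, interface) pairs.

    Simplified: handles only comma-separated net(interface) entries.
    """
    pairs = []
    for net, iface in re.findall(r"(\w+)\(([^)]+)\)", spec):
        pairs.append((net, iface.strip()))
    return pairs

def is_alias(iface):
    """An ifconfig-style alias label looks like base:label, e.g. ib0:0."""
    return ":" in iface

# The option string from /etc/modprobe.d/lustre.conf in this report.
spec = "o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"

for net, iface in parse_networks(spec):
    if is_alias(iface):
        kind = "alias of " + iface.split(":")[0]
    else:
        kind = "base device"
    print(net, iface, kind)
```

Running it on the string above shows that o2ib1 and o2ib3 are the two networks bound to alias labels, which are the entries that matter for the 2.12 behaviour reported in this ticket.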
Lustre 2.12 doesn't handle logical interfaces properly:
# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down
Jan 28 13:00:56 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jan 28 13:00:56 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 28 13:00:57 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.12.0
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(lib-socket.c:105:lnet_ipif_query()) Can't get flags for interface ib0:0
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(o2iblnd.c:2879:kiblnd_create_dev()) Can't query IPoIB interface ib0:0: -19
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
Jan 28 13:00:58 ai200-7f94-vm00 kernel: LNet: Removed LNI 172.16.251.20@o2ib
Jan 28 13:00:58 ai200-7f94-vm00 kernel: LustreError: 6305:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
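The two negative numbers in this log, -19 and -100, are kernel return codes, that is, negated errno values. On Linux, 19 is ENODEV ("No such device") and 100 is ENETDOWN, which matches the "Network is down" message that modprobe prints. A small Python sketch for decoding such codes follows. The helper name decode_kernel_rc is hypothetical, and the lookup uses the errno table of the machine running the script, so the numbers only line up with the log when it is run on Linux.

```python
import errno
import os

def decode_kernel_rc(rc):
    """Return (symbolic name, message) for a negative kernel return code."""
    code = -rc
    # errno.errorcode maps numeric values to names on the local platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# Return codes taken from the LNetError lines in this report.
for rc in (-19, -100):
    name, message = decode_kernel_rc(rc)
    print(rc, name, message)
```

Read this way, the log suggests that the lookup of ib0:0 by name failed with ENODEV and that the LNI startup then reported ENETDOWN to the module loader.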
So you are using the pre-DLC method. What has been happening is that the code has been moving toward having each LND driver handle the interface mapping. In the ksocklnd case, a non-Multi-Rail setup is assumed to default to just one interface, and if the module parameter "use_tcp_bonding" is enabled, all interfaces are mapped to the defined net. The ko2iblnd driver, in the MR case, doesn't even handle multiple interfaces.
For ksocklnd Multi-Rail, the user can specify which interfaces to use. So there are general bugs all over the place in this area.