[LU-11893] doesn't handle logical network interface properly. Created: 28/Jan/19 Updated: 07/May/20 Resolved: 07/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
2.12 |
||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
# ifconfig | grep ib
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
ib0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
ib1:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)

Lustre-2.10.5 works well

# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"
# modprobe lustre
Jan 28 12:52:17 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jan 28 12:52:17 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 28 12:52:18 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.10.5_ddn7_2_g7fd8383
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.252.20@o2ib1 [8/256/0/180]
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.253.20@o2ib2 [8/256/0/180]
Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.254.20@o2ib3 [8/256/0/180]

lustre-2.12 doesn't handle logical interface properly

# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down
Jan 28 13:00:56 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jan 28 13:00:56 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 28 13:00:57 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.12.0
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(lib-socket.c:105:lnet_ipif_query()) Can't get flags for interface ib0:0
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(o2iblnd.c:2879:kiblnd_create_dev()) Can't query IPoIB interface ib0:0: -19
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
Jan 28 13:00:58 ai200-7f94-vm00 kernel: LNet: Removed LNI 172.16.251.20@o2ib
Jan 28 13:00:58 ai200-7f94-vm00 kernel: LustreError: 6305:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed |
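To make the failing configuration easier to reason about, the `networks` option from the description can be split into its net-to-interface mapping. This is a minimal sketch in Python, not LNet's own parser: `parse_networks` is a hypothetical helper, and it only understands the `net(iface[,iface...])` form that appears in this ticket.

```python
import re

def parse_networks(spec):
    """Split an LNet 'networks' string into (net, [interfaces]) pairs.

    Hypothetical helper: only the net(iface[,iface...]) form seen in this
    ticket is handled; anything richer is outside the scope of the sketch.
    """
    pairs = []
    for match in re.finditer(r"([A-Za-z0-9]+)\(([^)]*)\)", spec):
        net = match.group(1)
        # Interfaces inside the parentheses are comma-separated.
        ifaces = [name.strip() for name in match.group(2).split(",") if name.strip()]
        pairs.append((net, ifaces))
    return pairs

spec = "o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"
for net, ifaces in parse_networks(spec):
    print(net, ifaces)
```

In the 2.12.0 log above, startup stops at the second entry, `o2ib1(ib0:0)`, which is the first one that names a logical interface.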
| Comments |
| Comment by James A Simmons [ 28/Jan/19 ] |
|
This was also reported under ticket |
| Comment by Shuichi Ihara [ 03/Feb/19 ] |
|
which exact patch under |
| Comment by James A Simmons [ 03/Feb/19 ] |
|
The last patch, which I haven't pushed yet, will completely remove lnet_sock_ioctl(). Before doing that I need to remove all the uses of lnet_ipif_enumerate(), which doesn't work in 4.18 kernels (RHEL8). I expect all of this will be backported to 2.12 LTS. |
| Comment by Shuichi Ihara [ 04/Feb/19 ] |
|
OK, I am changing the priority to Major, since this is a compatibility issue for LNet configurations where a server or client has been using logical interfaces. |
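A site that wants to know whether it is exposed to this compatibility issue could check its configured interface names against the real network devices. On Linux, names such as `ib0:0` created with `ifconfig` are address labels on the parent device rather than separate devices, so they do not appear in `/sys/class/net`. The sketch below is illustrative only; `split_alias` and `check_interfaces` are hypothetical helpers, not part of Lustre.

```python
def split_alias(name):
    """Split 'ib0:0' into ('ib0', '0'); a plain name yields (name, None)."""
    base, sep, label = name.partition(":")
    return (base, label if sep else None)

def check_interfaces(configured, devices):
    """Classify each configured name against a list of real device names."""
    report = {}
    for name in configured:
        base, label = split_alias(name)
        if label is None:
            report[name] = "device" if name in devices else "missing"
        else:
            # An alias is only meaningful if its parent device exists.
            report[name] = "alias of " + base if base in devices else "missing"
    return report

# Device list supplied by hand here; on a Linux host it could come from
# os.listdir("/sys/class/net").
print(check_interfaces(["ib0", "ib0:0", "ib1", "ib1:0"], ["lo", "ib0", "ib1"]))
```

Any name reported as an alias is one that worked with 2.10.5 in this ticket but failed with 2.12.0.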
| Comment by James A Simmons [ 12/Feb/19 ] |
|
Try the following patches with current master: |
| Comment by Shuichi Ihara [ 09/Mar/19 ] |
|
Hello James,

# ifconfig ib0:0 10.0.100.184 netmask 255.240.0.0
# ifconfig ib0:1 10.0.101.184 netmask 255.240.0.0
# echo 'options lnet networks="o2ib10(ib0:0,ib0:1)"' > /etc/modprobe.d/lustre.conf
# lustre_rmmod ; modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down
[86757.579401] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[86757.583579] alg: No test for adler32 (adler32-zlib)
[86758.393116] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
[86758.527410] LNetError: 314255:0:(o2iblnd.c:2883:kiblnd_create_dev()) Can't find IPoIB interface ib0:0
[86759.526610] LNetError: 105-4: Error -100 starting up LNI o2ib
[86759.526954] LustreError: 314255:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed

socklnd still has the problem as well.

# echo 'options lnet networks="tcp10(ib0:0,ib0:1)"' > /etc/modprobe.d/lustre.conf
# lustre_rmmod ; modprobe lustre
# modprobe: ERROR: could not insert 'lustre': Network is down
[87053.811185] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[87053.815151] alg: No test for adler32 (adler32-zlib)
[87054.628463] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
[87054.726405] LNetError: 314402:0:(socklnd.c:2610:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
[87054.726457] LNetError: 314402:0:(socklnd.c:2828:ksocknal_startup()) Can't get interface ib0:0 info: -2
[87055.726386] LNetError: 105-4: Error -100 starting up LNI tcp
[87055.726698] LustreError: 314402:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed |
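When comparing runs like the ones above, it can help to pull the source file, line, and function out of each `LNetError`/`LustreError` line so the failing call site is obvious. The following is a rough sketch based only on the line format visible in this ticket; `parse_error` and `LustreLogError` are hypothetical names, and treating the first numeric field as a process id is an assumption.

```python
import re
from typing import NamedTuple, Optional

class LustreLogError(NamedTuple):
    kind: str       # "LNetError" or "LustreError"
    pid: int        # first numeric field; assumed to be a process id
    source: str     # source file, e.g. "o2iblnd.c"
    line: int       # line number within the source file
    function: str   # reporting function, without the trailing "()"
    message: str    # remaining text of the log line

# Matches lines of the form: Kind: N:N:(file.c:LINE:func()) message
_PATTERN = re.compile(
    r"(LNetError|LustreError): (\d+):\d+:"
    r"\(([^:()]+):(\d+):(\w+)\(\)\) (.*)"
)

def parse_error(text: str) -> Optional[LustreLogError]:
    """Return the parsed call site, or None if the line has another shape."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    kind, pid, source, line, function, message = m.groups()
    return LustreLogError(kind, int(pid), source, int(line), function, message.strip())

sample = ("[86758.527410] LNetError: 314255:0:(o2iblnd.c:2883:kiblnd_create_dev()) "
          "Can't find IPoIB interface ib0:0")
print(parse_error(sample))
```

Lines without a call site, such as `LNetError: 105-4: Error -100 starting up LNI o2ib`, do not match and return `None`.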
| Comment by Shuichi Ihara [ 09/Mar/19 ] |
|
This is not specifically an IPoIB interface problem; a normal Ethernet adapter still has the same problem.

# ifconfig eno1:0 10.128.11.184 netmask 255.255.248.0
# ifconfig eno1:1 10.128.12.184 netmask 255.255.248.0
# echo 'options lnet networks="tcp10(eno1:0,eno1:1)"' > /etc/modprobe.d/lustre.conf
# lustre_rmmod ; modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down
[87853.150375] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[87853.154186] alg: No test for adler32 (adler32-zlib)
[87853.962152] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
[87854.057590] LNetError: 314552:0:(socklnd.c:2610:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
[87854.057651] LNetError: 314552:0:(socklnd.c:2828:ksocknal_startup()) Can't get interface eno1:0 info: -2
[87855.057635] LNetError: 105-4: Error -100 starting up LNI tcp
[87855.057984] LustreError: 314552:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed |
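The negative numbers in these logs (`-19`, `-2`, `-100`) are negated Linux errno values. A small lookup makes them readable; the table below hard-codes the Linux values for just the three codes that appear in this ticket, and `describe` is a hypothetical helper.

```python
# Linux errno values for the codes that appear in this ticket's logs.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    19: ("ENODEV", "No such device"),
    100: ("ENETDOWN", "Network is down"),
}

def describe(rc):
    """Render a kernel-style negative return code with its errno name."""
    name, text = LINUX_ERRNO[abs(rc)]
    return f"{rc}: {name} ({text})"

for rc in (-19, -2, -100):
    print(describe(rc))
```

The `ENETDOWN` text matches the `Network is down` message that `modprobe` reports, while `ENODEV` and `ENOENT` are the lower-level results of looking up the logical interface name.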
| Comment by James A Simmons [ 11/Mar/19 ] |
|
So you are using the pre-DLC method. First, what has been happening is that the code has been moving toward having each LND driver handle the interface mapping. In the ksocklnd case it was assumed that in a non-Multi-Rail setup the default is just one interface, and that if the module parameter "use_tcp_bonding" is enabled then all interfaces are mapped to the defined net. For the ko2iblnd driver, in the MR case it doesn't even handle multiple interfaces |
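The ksocklnd behaviour described in this comment can be pictured with a toy model. This is only an illustration of the comment's wording, not the driver's code: `interfaces_for_net` is a hypothetical function, and choosing the first listed interface as "the one interface" is an assumption made for the sketch.

```python
def interfaces_for_net(configured, use_tcp_bonding=False):
    """Toy model of the interface mapping described in the comment.

    Assumption: without bonding, "just one interface" is taken to mean the
    first one listed; the real driver may choose differently.
    """
    if not configured:
        return []
    if use_tcp_bonding:
        # With use_tcp_bonding enabled, every listed interface is mapped.
        return list(configured)
    return [configured[0]]

print(interfaces_for_net(["eno1:0", "eno1:1"]))
print(interfaces_for_net(["eno1:0", "eno1:1"], use_tcp_bonding=True))
```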
| Comment by Gerrit Updater [ 11/Mar/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34392 |
| Comment by Shuichi Ihara [ 11/Mar/19 ] |
|
Is patch https://review.whamcloud.com/34392 meant to apply on top of https://review.whamcloud.com/#/c/33968 and https://review.whamcloud.com/#/c/34234 ? Patch 34392 conflicts in that case, though. |
| Comment by James A Simmons [ 11/Mar/19 ] |
|
Patch https://review.whamcloud.com/#/c/34392 is the base patch. The rest need to be rebased. Patch 34392 might be all that is needed to resolve this bug. Give it a try by itself. |
| Comment by James A Simmons [ 11/Mar/19 ] |
|
I have updated both: https://review.whamcloud.com/#/c/34392 https://review.whamcloud.com/#/c/33968 I have tested the above combo and it works for socklnd. Haven't tried o2iblnd just yet. |
| Comment by Shuichi Ihara [ 12/Mar/19 ] |
|
Hi James, |
| Comment by James A Simmons [ 20/Mar/19 ] |
|
I updated patch https://review.whamcloud.com/#/c/34392/ so everything should work now, including ko2iblnd. Please try it out. I have been testing on my end. |
| Comment by Gerrit Updater [ 20/Mar/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34476 |
| Comment by James A Simmons [ 20/Mar/19 ] |
|
Amir asked me to break up the patch. So two patches exist to address this issue. Later patches will be done to unify what is being done which also was requested. |
| Comment by James A Simmons [ 21/May/19 ] |
|
Please review https://review.whamcloud.com/#/c/34392. It's blocking RHEL8 support. |
| Comment by Gerrit Updater [ 29/May/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34392/ |
| Comment by Gerrit Updater [ 29/May/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34993 |
| Comment by James A Simmons [ 29/May/19 ] |
|
Two patches left. |
| Comment by Gerrit Updater [ 01/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34476/ |
| Comment by Peter Jones [ 01/Jun/19 ] |
|
The countdown continues- one to go |
| Comment by Gerrit Updater [ 11/Jun/19 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35159 |
| Comment by Gerrit Updater [ 17/Jun/19 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35248 |
| Comment by James A Simmons [ 21/Jun/19 ] |
|
I think I resolved the ip2net string issues. |
| Comment by James A Simmons [ 24/Jun/19 ] |
|
I got positive feedback from Chris Horn. Looks like https://review.whamcloud.com/#/c/34993 is ready for reviews. |
| Comment by Gerrit Updater [ 27/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35159/ |
| Comment by Gerrit Updater [ 03/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35248/ |
| Comment by Gerrit Updater [ 08/Jul/19 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35442 |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34993/ |
| Comment by Gerrit Updater [ 20/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35442/ |