[LU-11893] doesn't handle logical network interface properly. Created: 28/Jan/19  Updated: 07/May/20  Resolved: 07/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None
Environment: 2.12

Issue Links:
Related
is related to LU-6399 Libcfs socket cleanup Resolved
is related to LU-12269 Support RHEL 8.0 Resolved
is related to LU-12381 o2iblnd uses wrong IB interface Resolved
is related to LU-12511 Prepare lustre for adoption into the ... Open
is related to LU-11838 Support linux kernel version 4.18 Resolved
Severity: 3

 Description   
 # ifconfig | grep ib
 Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
 Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
 Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
 Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
 ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
 infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
 ib0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
 infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
 ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
 infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
 ib1:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
 infiniband 20:00:18:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)

Lustre 2.10.5 works well:

 # cat /etc/modprobe.d/lustre.conf 
 options lnet networks="o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"
 # modprobe lustre
 Jan 28 12:52:17 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
 Jan 28 12:52:17 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
 Jan 28 12:52:18 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.10.5_ddn7_2_g7fd8383
 Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
 Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
 Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.252.20@o2ib1 [8/256/0/180]
 Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.253.20@o2ib2 [8/256/0/180]
 Jan 28 12:52:18 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.254.20@o2ib3 [8/256/0/180]
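
To make the configuration above easier to read, here is a minimal sketch that splits a `networks=` string into a net-to-interface mapping and picks out the alias-style names (those containing a colon). It is only an illustration, not the LNet parser: `parse_networks` and `aliases` are hypothetical helper names, and the sketch assumes every entry has the form `net(iface[,iface...])`, ignoring any other syntax the option may accept.

```python
import re


def parse_networks(spec):
    """Split an LNet 'networks' string into {net: [interface, ...]}.

    Assumption: every entry looks like net(iface[,iface...]).
    """
    mapping = {}
    # Match each net name together with its parenthesised interface list.
    for net, ifaces in re.findall(r"(\w+)\(([^)]*)\)", spec):
        mapping[net] = [name.strip() for name in ifaces.split(",") if name.strip()]
    return mapping


def aliases(mapping):
    """Return (net, interface) pairs whose interface is an alias label like ib0:0."""
    return [
        (net, name)
        for net, names in mapping.items()
        for name in names
        if ":" in name
    ]


# The value from /etc/modprobe.d/lustre.conf in this ticket.
spec = "o2ib0(ib0), o2ib1(ib0:0), o2ib2(ib1), o2ib3(ib1:0)"
nets = parse_networks(spec)
print(nets)
print(aliases(nets))
```

In this configuration two of the four nets (o2ib1 and o2ib3) are bound to alias names, and o2ib1 is the one reported as failing in the 2.12 log.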

Lustre 2.12 doesn't handle logical interfaces properly:

# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down
Jan 28 13:00:56 ai200-7f94-vm00 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jan 28 13:00:56 ai200-7f94-vm00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 28 13:00:57 ai200-7f94-vm00 kernel: Lustre: Lustre: Build Version: 2.12.0
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Using FastReg for registration
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNet: Added LNI 172.16.251.20@o2ib [8/256/0/180]
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(lib-socket.c:105:lnet_ipif_query()) Can't get flags for interface ib0:0
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 6305:0:(o2iblnd.c:2879:kiblnd_create_dev()) Can't query IPoIB interface ib0:0: -19
Jan 28 13:00:57 ai200-7f94-vm00 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
Jan 28 13:00:58 ai200-7f94-vm00 kernel: LNet: Removed LNI 172.16.251.20@o2ib
Jan 28 13:00:58 ai200-7f94-vm00 kernel: LustreError: 6305:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
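
The numeric codes in these log lines are negated Linux errno values: -19 is ENODEV and -100 is ENETDOWN, the latter matching the "Network is down" text printed by modprobe. The small sketch below decodes them; the table is hard-coded with the Linux values (rather than read from the running system) so it gives the same answer on any platform, and `describe` is a hypothetical helper name. ENOENT (2) is included because it shows up in the socklnd logs later in this ticket.

```python
# Linux errno values seen in the logs of this ticket (hard-coded on purpose).
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    19: ("ENODEV", "No such device"),
    100: ("ENETDOWN", "Network is down"),
}


def describe(code):
    """Turn a negative kernel return code such as -19 into a readable string."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


for code in (-19, -100, -2):
    print(describe(code))
```

Read this way, the 2.12 log says that the lookup of `ib0:0` found no such device, and that the LNI startup then failed with "network is down".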


 Comments   
Comment by James A Simmons [ 28/Jan/19 ]

This was also reported under ticket LU-6399. The work being done under LU-11838 to support 4.18 kernels will resolve this bug.

Comment by Shuichi Ihara [ 03/Feb/19 ]

Which exact patch under LU-11838 could solve this problem? A number of LU-11838 patches have already landed in master, and I've tested the latest master (263e80f), but it still didn't help.

Comment by James A Simmons [ 03/Feb/19 ]

The last patch, which I haven't pushed yet, will completely remove lnet_sock_ioctl(). Before I can do that, I need to remove all uses of lnet_ipif_enumerate(), which doesn't work on 4.18 kernels (RHEL8). I expect all of this will be backported to 2.12 LTS.

Comment by Shuichi Ihara [ 04/Feb/19 ]

OK, I'm changing the priority to Major, since this is a compatibility issue for LNet configuration on any server or client that has been using logical interfaces.

Comment by James A Simmons [ 12/Feb/19 ]

Try the following patches with current master:

https://review.whamcloud.com/#/c/33968

https://review.whamcloud.com/#/c/34234

Comment by Shuichi Ihara [ 09/Mar/19 ]

Hello James,
The patches still didn't help with logical interfaces on IPoIB. With o2iblnd:

# ifconfig ib0:0 10.0.100.184 netmask 255.240.0.0
# ifconfig ib0:1 10.0.101.184 netmask 255.240.0.0
# echo  'options lnet networks="o2ib10(ib0:0,ib0:1)"' > /etc/modprobe.d/lustre.conf
# lustre_rmmod ; modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down

[86757.579401] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[86757.583579] alg: No test for adler32 (adler32-zlib)
[86758.393116] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
[86758.527410] LNetError: 314255:0:(o2iblnd.c:2883:kiblnd_create_dev()) Can't find IPoIB interface ib0:0
[86759.526610] LNetError: 105-4: Error -100 starting up LNI o2ib
[86759.526954] LustreError: 314255:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed

socklnd is still a problem too:

# echo  'options lnet networks="tcp10(ib0:0,ib0:1)"' > /etc/modprobe.d/lustre.conf
# lustre_rmmod ; modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down

[87053.811185] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[87053.815151] alg: No test for adler32 (adler32-zlib)
[87054.628463] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
[87054.726405] LNetError: 314402:0:(socklnd.c:2610:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
[87054.726457] LNetError: 314402:0:(socklnd.c:2828:ksocknal_startup()) Can't get interface ib0:0 info: -2
[87055.726386] LNetError: 105-4: Error -100 starting up LNI tcp
[87055.726698] LustreError: 314402:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed

Comment by Shuichi Ihara [ 09/Mar/19 ]

This is not specific to IPoIB interfaces; a normal Ethernet adapter has the same problem.

# ifconfig eno1:0 10.128.11.184 netmask 255.255.248.0
# ifconfig eno1:1 10.128.12.184 netmask 255.255.248.0
# echo  'options lnet networks="tcp10(eno1:0,eno1:1)"' > /etc/modprobe.d/lustre.conf
# lustre_rmmod ; modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down

[87853.150375] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[87853.154186] alg: No test for adler32 (adler32-zlib)
[87853.962152] Lustre: Lustre: Build Version: 2.12.51_100_g3c4a659
[87854.057590] LNetError: 314552:0:(socklnd.c:2610:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
[87854.057651] LNetError: 314552:0:(socklnd.c:2828:ksocknal_startup()) Can't get interface eno1:0 info: -2
[87855.057635] LNetError: 105-4: Error -100 starting up LNI tcp
[87855.057984] LustreError: 314552:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed

Comment by James A Simmons [ 11/Mar/19 ]

So you are using the pre-DLC method. First, what has been happening is that the code has been moving toward having each LND driver handle the interface mapping. In the ksocklnd case, it was assumed that in a non-Multi-Rail setup the default is just one interface, and that if the module parameter "use_tcp_bonding" is enabled, all interfaces are mapped to the defined net. The ko2iblnd driver doesn't even handle multiple interfaces in the MR case. For ksocklnd Multi-Rail, the user can specify which interfaces to use. So there are general bugs all over the place in this area.

Comment by Gerrit Updater [ 11/Mar/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34392
Subject: LU-11893 lnet: add secondary IP address handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d87840dcf8f42fe84bc08c7364df52efc0716121
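
The patch subject refers to secondary IP addresses because, on Linux, an alias created with `ifconfig ib0:0 ...` is not a separate device: it is a labelled secondary IPv4 address on the underlying device. The sketch below illustrates why a lookup that treats the whole string as a device name fails, while a lookup that splits off the base device and then matches the label succeeds. It is a toy model, not the LNet code: `ADDRESSES`, `resolve`, and `naive_resolve` are hypothetical names, and the address table is an assumption inferred from the 2.10.5 log in the description.

```python
# Toy address table: one entry per real device, holding (label, IPv4) pairs.
# Addresses are assumed from the "Added LNI" lines of the 2.10.5 log.
ADDRESSES = {
    "ib0": [("ib0", "172.16.251.20"), ("ib0:0", "172.16.252.20")],
    "ib1": [("ib1", "172.16.253.20"), ("ib1:0", "172.16.254.20")],
}


def naive_resolve(name, table=ADDRESSES):
    """Look up by device name only; alias labels are not device names."""
    if name not in table:
        raise LookupError(f"no such device: {name}")
    return name, table[name][0][1]


def resolve(name, table=ADDRESSES):
    """Look up the base device first, then match the address label."""
    base = name.partition(":")[0]  # 'ib0:0' -> 'ib0'
    if base not in table:
        raise LookupError(f"no such device: {base}")
    for label, addr in table[base]:
        if label == name:
            return base, addr
    raise LookupError(f"{base} has no address labelled {name}")


print(resolve("ib0:0"))
try:
    naive_resolve("ib0:0")
except LookupError as err:
    print("naive lookup failed:", err)
```

The naive lookup fails in the same way as the "Can't find IPoIB interface ib0:0" and "Can't get interface eno1:0 info" errors quoted earlier, which is consistent with the failing code path treating the alias as a device name.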

Comment by Shuichi Ihara [ 11/Mar/19 ]

Is patch https://review.whamcloud.com/34392 meant to be applied on top of https://review.whamcloud.com/#/c/33968 and https://review.whamcloud.com/#/c/34234? Patch 34392 conflicts with them in that case.
I just did a quick test with only patch 34392, but it still didn't help. I will collect a debug log to see what is going on.

Comment by James A Simmons [ 11/Mar/19 ]

Patch https://review.whamcloud.com/#/c/34392 is the base patch. The rest need to be rebased. Patch 34392 might be all that is needed to resolve this bug. Give it a try by itself.

Comment by James A Simmons [ 11/Mar/19 ]

I have updated both:

https://review.whamcloud.com/#/c/34392

https://review.whamcloud.com/#/c/33968

I have tested the above combo and it works for socklnd. Haven't tried o2iblnd just yet.

Comment by Shuichi Ihara [ 12/Mar/19 ]

Hi James,
Thanks! I also confirmed that the patch works for socklnd, but o2iblnd still didn't work. I think o2iblnd needs an approach similar to https://review.whamcloud.com/#/c/33968.

Comment by James A Simmons [ 20/Mar/19 ]

I updated patch https://review.whamcloud.com/#/c/34392/ so everything should work now, including ko2iblnd. Please try it out. I have been testing on my end.

Comment by Gerrit Updater [ 20/Mar/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34476
Subject: LU-11893 o2iblnd: add secondary IP address handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8fce282ce0de93e54d373d64e275d161025063f4

Comment by James A Simmons [ 20/Mar/19 ]

Amir asked me to break up the patch, so two patches now exist to address this issue. Later patches will unify the handling, which was also requested.

Comment by James A Simmons [ 21/May/19 ]

Please review https://review.whamcloud.com/#/c/34392. It's blocking RHEL8 support.

Comment by Gerrit Updater [ 29/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34392/
Subject: LU-11893 ksocklnd: add secondary IP address handling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9a2013af0668737dc56424c5c6eaac01621f6c17

Comment by Gerrit Updater [ 29/May/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34993
Subject: LU-11893 lnet: consoldate secondary IP address handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0e53ce7c9c31d37fc6514608843a6049e9167ddd

Comment by James A Simmons [ 29/May/19 ]

Two patches left.

Comment by Gerrit Updater [ 01/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34476/
Subject: LU-11893 o2iblnd: add secondary IP address handling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c4b39bf56bbcacd49d7f888a0745cd4b5580b36b

Comment by Peter Jones [ 01/Jun/19 ]

The countdown continues: one to go.

Comment by Gerrit Updater [ 11/Jun/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35159
Subject: LU-11893 ksocklnd: add secondary IP address handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 9f98099692cf4bfc18000226ac09bee3be2d6e74

Comment by Gerrit Updater [ 17/Jun/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35248
Subject: LU-11893 o2iblnd: add secondary IP address handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3034ad8fdb7fda499ded48313ab9f0479d063188

Comment by James A Simmons [ 21/Jun/19 ]

I think I resolved the ip2net string issues.

Comment by James A Simmons [ 24/Jun/19 ]

I got positive feedback from Chris Horn. Looks like https://review.whamcloud.com/#/c/34993 is ready for reviews.

Comment by Gerrit Updater [ 27/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35159/
Subject: LU-11893 ksocklnd: add secondary IP address handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: eb46e3374193895b01fe16a8975553c87133a52e

Comment by Gerrit Updater [ 03/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35248/
Subject: LU-11893 o2iblnd: add secondary IP address handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 4d863145dd63d387e138bb20dcf2d5f1b66a52aa

Comment by Gerrit Updater [ 08/Jul/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35442
Subject: LU-11893 lnet: consoldate secondary IP address handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3d3ae9946b67c66ee9ce8a7e916d899e8a2b7197

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34993/
Subject: LU-11893 lnet: consoldate secondary IP address handling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b770d7117f35a972bd2c9ffef03a17dbcb036d20

Comment by Gerrit Updater [ 20/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35442/
Subject: LU-11893 lnet: consoldate secondary IP address handling
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 19fdb52725360d4233f6e13de9f399f344a15109
