[LU-12381] o2iblnd uses wrong IB interface Created: 04/Jun/19  Updated: 03/Jul/19  Resolved: 17/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None
Environment:

master


Issue Links:
Duplicate
is duplicated by LU-12413 Lustre don't able to start if one int... Resolved
Related
is related to LU-11893 doesn't handle logical network interf... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Somehow, o2iblnd in the latest master branch (225e7b8) is looking for the wrong IB interface.

# cat /etc/modprobe.d/lustre.conf 
options lnet networks="o2ib0(ib0)"

# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Network is down

It is trying to use 'eth0' instead of ib0, even though eth0 is not configured for o2iblnd at all.

Jun  4 03:10:10 sv160 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jun  4 03:10:10 sv160 kernel: alg: No test for adler32 (adler32-zlib)
Jun  4 03:10:11 sv160 kernel: Lustre: Lustre: Build Version: 2.12.53_117_g20fe8e6_dirty
Jun  4 03:10:11 sv160 kernel: LNetError: 3057:0:(o2iblnd.c:2892:kiblnd_create_dev()) Can't query IPoIB interface eth0: it's down
Jun  4 03:10:11 sv160 kernel: LNetError: 3057:0:(o2iblnd.c:2944:kiblnd_create_dev()) LIBCFS: free NULL 'dev' (352 bytes) at /tmp/rpmbuild-lustre-root-KMkogdPP/BUILD/lustre-2.12.53_117_g20fe8e6_dirty/lnet/klnds/o2iblnd/o2iblnd.c:2944
Jun  4 03:10:12 sv160 kernel: LNetError: 105-4: Error -100 starting up LNI o2ib
Jun  4 03:10:12 sv160 kernel: LustreError: 3057:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
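
The "Error -100" in the LNet log and the "Network is down" message from modprobe describe the same failure: on Linux, errno 100 is ENETDOWN. A minimal Python sketch for decoding such a line, assuming the log format shown above (the helper name lni_errno is made up for illustration and is not part of any Lustre tool):

```python
import errno
import os
import re


def lni_errno(line):
    """Return the positive errno from an 'Error -N starting up LNI' line, or None."""
    m = re.search(r"Error -(\d+) starting up LNI", line)
    return int(m.group(1)) if m else None


line = "LNetError: 105-4: Error -100 starting up LNI o2ib"
code = lni_errno(line)
# errno numbering is platform specific; the symbolic name is meaningful on Linux.
print(code, errno.errorcode.get(code, "unknown"), os.strerror(code))
```

Run this on the node that produced the log, since the number-to-name mapping differs between operating systems.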


 Comments   
Comment by Shuichi Ihara [ 04/Jun/19 ]

Here is the ifconfig output.

[root@sv160 lustre-release]# ifconfig -a
enp0s5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.36.5.160  netmask 255.255.240.0  broadcast 10.36.15.255
        inet6 fe80::93ff:fe2e:34ae  prefixlen 64  scopeid 0x20<link>
        ether 02:00:93:2e:34:ae  txqueuelen 1000  (Ethernet)
        RX packets 2696139  bytes 686667727 (654.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3690693  bytes 3414372140 (3.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 52:54:00:12:34:56  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 172.16.254.160  netmask 255.255.0.0  broadcast 172.16.255.255
        inet6 fe80::9a03:9b03:74:df48  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 17874  bytes 3318650 (3.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 18930  bytes 1188418 (1.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 48825  bytes 406567982 (387.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 48825  bytes 406567982 (387.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
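
The numeric flags values in this output show why eth0 is reported as down: 4098 lacks the UP bit that 4163 has. A small Python sketch that decodes them, assuming the standard Linux interface flag bits from <linux/if.h> (only the bits that occur above are listed, and decode_flags is an illustrative helper name):

```python
# Interface flag bits from Linux <linux/if.h>; only the bits seen in the output above.
IFF_BITS = {
    0x1: "UP",
    0x2: "BROADCAST",
    0x8: "LOOPBACK",
    0x40: "RUNNING",
    0x1000: "MULTICAST",
}


def decode_flags(value):
    """Return the names of the known flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(IFF_BITS.items()) if value & bit]


for ifname, flags in [("enp0s5", 4163), ("eth0", 4098), ("ib0", 4163), ("lo", 73)]:
    names = decode_flags(flags)
    print(ifname, flags, names, "up" if "UP" in names else "down")
```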

I disabled eth0 completely and then loaded the ko2iblnd module, and that works. It looks like Lustre is trying to use the first available interface, regardless of whether it is IPoIB and whether it is the one defined in modprobe.conf?

[root@sv160 lustre-release]# ifconfig -a
enp0s5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.36.5.160  netmask 255.255.240.0  broadcast 10.36.15.255
        inet6 fe80::93ff:fe2e:34ae  prefixlen 64  scopeid 0x20<link>
        ether 02:00:93:2e:34:ae  txqueuelen 1000  (Ethernet)
        RX packets 2706039  bytes 687428990 (655.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3693103  bytes 3414586308 (3.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 172.16.254.160  netmask 255.255.0.0  broadcast 172.16.255.255
        inet6 fe80::9a03:9b03:74:df48  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:10:86:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 18167  bytes 3348538 (3.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19223  bytes 1205998 (1.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 49893  bytes 406646134 (387.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 49893  bytes 406646134 (387.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
# modprobe lustre
# tail /var/log/messages
Jun  4 04:44:42 sv160 kernel: LNet: Removed LNI 172.16.254.160@o2ib
Jun  4 04:46:34 sv160 kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
Jun  4 04:46:34 sv160 kernel: alg: No test for adler32 (adler32-zlib)
Jun  4 04:46:35 sv160 kernel: Lustre: Lustre: Build Version: 2.12.53_117_g20fe8e6_dirty
Jun  4 04:46:35 sv160 kernel: LNet: Using FastReg for registration
Jun  4 04:46:35 sv160 kernel: LNet: Added LNI 172.16.254.160@o2ib [8/256/0/180]
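
The suspicion raised in this comment can be illustrated with a toy model. The Python sketch below is not the o2iblnd code. The function names, the interface list, and its ordering are all assumptions made for illustration. It only contrasts "take the first enumerated interface" with "take the interface named in the networks option, and only if it is up".

```python
import re


def parse_networks(value):
    """Split an LNet networks string such as 'o2ib0(ib0)' into (net, ifname) pairs."""
    return re.findall(r"(\w+)\((\w+)\)", value)


# (name, is_up) pairs in a hypothetical enumeration order in which the down eth0
# precedes ib0; the real enumeration order is not stated in this ticket.
INTERFACES = [("eth0", False), ("ib0", True)]


def pick_first(interfaces):
    """Suspected faulty behaviour: take whatever is enumerated first."""
    return interfaces[0][0]


def pick_configured(interfaces, ifname):
    """Intended behaviour: match the configured name and skip down interfaces."""
    for name, is_up in interfaces:
        if name == ifname and is_up:
            return name
    return None


net, ifname = parse_networks("o2ib0(ib0)")[0]
print(pick_first(INTERFACES), pick_configured(INTERFACES, ifname))
```

In this model the first strategy lands on eth0, matching the "Can't query IPoIB interface eth0" error, while the second returns ib0 as configured.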
Comment by Peter Jones [ 05/Jun/19 ]

Sonia

Could you please investigate?

Peter

Comment by James A Simmons [ 05/Jun/19 ]

Actually, I know exactly what the bug is. I see what I did wrong and I know what the fix is. Peter, can I take the ticket? I can fix it with https://review.whamcloud.com/#/c/34993/

Comment by Peter Jones [ 05/Jun/19 ]

Of course you can take the ticket! Do you know how long this has been broken? Is only master broken, or does this impact b2_12 at all?

Comment by James A Simmons [ 05/Jun/19 ]

Just master, but we need a bunch of fixes for 2.12 for proper IP alias support.

Comment by Peter Jones [ 07/Jun/19 ]

Can we revert the broken patch in the meantime?

Comment by James A Simmons [ 07/Jun/19 ]

Actually, I can push a one-line change to fix this if needed.

Comment by Gerrit Updater [ 07/Jun/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/35098
Subject: LU-12381 ko2iblnd: ignore non IB devices
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a28aa551ddc39a52d73cff4383b10471c528dbaa

Comment by Peter Jones [ 07/Jun/19 ]

Thanks James!

Comment by Gerrit Updater [ 16/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35098/
Subject: LU-12381 ko2iblnd: ignore down interfaces
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1dea5aac9d9be99c4b317a491f308872b97bf0e6

Comment by James A Simmons [ 17/Jun/19 ]

Fixed.

Comment by Gerrit Updater [ 17/Jun/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35249
Subject: LU-12381 ko2iblnd: ignore down interfaces
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3ba56d464e2f784ad18a518f10dd14eb3cd5ab7d

Comment by Gerrit Updater [ 03/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35249/
Subject: LU-12381 ko2iblnd: ignore down interfaces
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 707a350138976a588edfa1250d368b328465c619
