Details
-
Task
-
Resolution: Not a Bug
-
Minor
-
None
-
Lustre 2.14.0
-
None
Description
I tried starting 2.14.0 in my local VM (just llmount.sh on kernel 3.10.0-1160.21.1.el7_lustre.ddn13.x86_64 after a clean build) and twice got a crash after lnet_inet_enumerate() reported a down interface:
[52067.917105] LNet: 8128:0:(config.c:1565:lnet_inet_enumerate()) lnet: Ignoring interface : it's down
[52067.926616] BUG: unable to handle kernel NULL pointer dereference at 0000000000000168
[52067.933558] IP: [<ffffffff8464f1f6>] dev_get_flags+0x6/0x70
[52068.014929] CPU: 1 PID: 8128 Comm: insmod Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.21.1.el7_lustre.ddn13.x86_64 #1
[52068.024565] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[52068.098650] Call Trace:
[52068.101291] [<ffffffffc0727149>] ? lnet_inet_enumerate+0x59/0x2d0 [lnet]
[52068.112194] [<ffffffffc07ce9ea>] ksocknal_startup+0x12a/0xf90 [ksocklnd]
[52068.117704] [<ffffffffc0721de5>] lnet_startup_lndnet+0x135/0x800 [lnet]
[52068.123269] [<ffffffffc0724085>] LNetNIInit+0x735/0xcf0 [lnet]
[52068.133225] [<ffffffffc0b3d1aa>] ptlrpc_ni_init+0x2a/0x1a0 [ptlrpc]
[52068.138628] [<ffffffffc0b3d331>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
[52068.144344] [<ffffffffc0d131ae>] ptlrpc_init+0x1ae/0x1000 [ptlrpc]
[52068.148379] [<ffffffff8400210a>] do_one_initcall+0xba/0x240
[52068.153886] [<ffffffff8411e62a>] load_module+0x271a/0x2bb0
It would be useful for the lnet_inet_enumerate() message to print which interface is down. Looking at the code, it is already trying to do that, so the interface name must be empty. Putting quotes around the name in the format string ('%s') would make that clearer. Adding a bit of extra debugging shows that it fails right away, on the first interface, without checking any others:
[ 1297.032731] LNet: 4540:0:(config.c:1560:lnet_inet_enumerate()) lnet: checking interface ''
[ 1297.040231] LNet: 4540:0:(config.c:1566:lnet_inet_enumerate()) lnet: Ignoring interface '': it's down
[ 1297.048456] BUG: unable to handle kernel NULL pointer dereference at 0000000000000168
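The effect of quoting the name can be sketched in userspace C. This is only an illustration under my own assumptions: iface_name_is_empty() and format_ignored_iface() are hypothetical helpers, not functions from the Lustre tree, and the message text is simply copied from the log lines above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not Lustre code): true if there is no usable name. */
int iface_name_is_empty(const char *name)
{
	return name == NULL || strlen(name) == 0;
}

/*
 * Hypothetical helper (not Lustre code): build the "interface is down"
 * message with the name wrapped in quotes, so that an empty name shows up
 * as '' instead of silently disappearing from the message.
 * Returns the value from snprintf().
 */
int format_ignored_iface(char *buf, size_t len, const char *name)
{
	assert(buf != NULL && len > 0);

	/* Treat a NULL name like an empty one rather than passing NULL to %s. */
	return snprintf(buf, len, "lnet: Ignoring interface '%s': it's down",
			iface_name_is_empty(name) ? "" : name);
}
```

With an empty name the buffer holds "lnet: Ignoring interface '': it's down", which is much easier to spot in a log than the unquoted form.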
It seems like it is somehow trying to start with an empty device list, but the VM definitely has an interface that is up (I was logged into the VM via SSH when running the llmount.sh command), and I haven't had any issues running other releases in this VM (I don't recall whether I've ever run vanilla 2.14.0 in it):
[ 38.761576] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 38.771906] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
[ 38.781866] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s3: link becomes ready
:
:
# ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.10.99 netmask 255.255.255.0 broadcast 192.168.10.255
        inet6 fe80::e9c4:7d8c:e641:5e6e prefixlen 64 scopeid 0x20<link>
        ether 08:00:27:1d:4b:97 txqueuelen 1000 (Ethernet)
        RX packets 1594 bytes 272650 (266.2 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 441 bytes 79324 (77.4 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp0s3:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.20.99 netmask 255.255.255.0 broadcast 192.168.20.255
        ether 08:00:27:1d:4b:97 txqueuelen 1000 (Ethernet)
There are no lnet or socklnd module options in use:
# cat /etc/modprobe.d/lustre.conf
options mdt max_mod_rpcs_per_client=16
options ptlrpc at_min=10 at_max=900
I'll try the tip of master next (v2_14_51-85-ga2b5290d4284) in case this has already been fixed. I'm filing this ticket now to capture the details while I have them, and to leave breadcrumbs to the fix in case anyone else running vanilla 2.14.0 hits the same problem.