[LU-14638] crash after lnet_inet_enumerate() failed Created: 23/Apr/21  Updated: 24/Apr/21  Resolved: 24/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Task Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

 Description   

I tried starting 2.14.0 in my local VM (just llmount.sh on kernel 3.10.0-1160.21.1.el7_lustre.ddn13.x86_64 after a clean build) and twice got a crash after lnet_inet_enumerate() reported a down interface:

[52067.917105] LNet: 8128:0:(config.c:1565:lnet_inet_enumerate()) lnet: Ignoring interface : it's down
[52067.926616] BUG: unable to handle kernel NULL pointer dereference at 0000000000000168
[52067.933558] IP: [<ffffffff8464f1f6>] dev_get_flags+0x6/0x70
[52068.014929] CPU: 1 PID: 8128 Comm: insmod Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.21.1.el7_lustre.ddn13.x86_64 #1
[52068.024565] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[52068.098650] Call Trace:
[52068.101291]  [<ffffffffc0727149>] ? lnet_inet_enumerate+0x59/0x2d0 [lnet]
[52068.112194]  [<ffffffffc07ce9ea>] ksocknal_startup+0x12a/0xf90 [ksocklnd]
[52068.117704]  [<ffffffffc0721de5>] lnet_startup_lndnet+0x135/0x800 [lnet]
[52068.123269]  [<ffffffffc0724085>] LNetNIInit+0x735/0xcf0 [lnet]
[52068.133225]  [<ffffffffc0b3d1aa>] ptlrpc_ni_init+0x2a/0x1a0 [ptlrpc]
[52068.138628]  [<ffffffffc0b3d331>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
[52068.144344]  [<ffffffffc0d131ae>] ptlrpc_init+0x1ae/0x1000 [ptlrpc]
[52068.148379]  [<ffffffff8400210a>] do_one_initcall+0xba/0x240
[52068.153886]  [<ffffffff8411e62a>] load_module+0x271a/0x2bb0

It would be useful for the lnet_inet_enumerate() message to print which interface is down. Looking at the code, it does try to print the name, so the interface name must be empty. Quoting the name in the format string ('%s') would make that clearer. Adding a bit of extra debugging shows that it fails right away, without checking any other interfaces:

[ 1297.032731] LNet: 4540:0:(config.c:1560:lnet_inet_enumerate()) lnet: checking interface ''
[ 1297.040231] LNet: 4540:0:(config.c:1566:lnet_inet_enumerate()) lnet: Ignoring interface '': it's down
[ 1297.048456] BUG: unable to handle kernel NULL pointer dereference at 0000000000000168
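
The quoting suggested above can be sketched in userspace C. This is only an illustration, not the LNet code: format_down_msg() is a hypothetical helper, and the message text is copied from the debug output above. It shows how quoting makes an empty name visible as '' instead of a stray double space.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of LNet): build the "interface is down"
 * message with the interface name wrapped in single quotes, so that an
 * empty name shows up as '' in the log.
 */
static int format_down_msg(char *buf, size_t len, const char *ifname)
{
	assert(buf != NULL && len > 0);

	/* Treat a missing name like an empty one rather than passing NULL to %s. */
	if (ifname == NULL)
		ifname = "";

	return snprintf(buf, len,
			"lnet: Ignoring interface '%s': it's down", ifname);
}
```

Calling format_down_msg(buf, sizeof(buf), "") yields the same text as the second debug line above, while a real name such as "enp0s3" appears between the quotes.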

It seems to be starting with an empty device list somehow, but the VM definitely has an interface that is up (I was logged into the VM via SSH when running the llmount.sh command). I haven't had any issues running other releases in this VM, though I don't recall whether I've ever run vanilla 2.14.0 in it:

[   38.761576] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   38.771906] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
[   38.781866] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s3: link becomes ready
:
:
# ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.10.99  netmask 255.255.255.0  broadcast 192.168.10.255
        inet6 fe80::e9c4:7d8c:e641:5e6e  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:1d:4b:97  txqueuelen 1000  (Ethernet)
        RX packets 1594  bytes 272650 (266.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 441  bytes 79324 (77.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s3:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.20.99  netmask 255.255.255.0  broadcast 192.168.20.255
        ether 08:00:27:1d:4b:97  txqueuelen 1000  (Ethernet)
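
As an independent cross-check of the ifconfig output above, the interface list can also be walked from userspace with getifaddrs(). This is a minimal sketch and says nothing about how lnet_inet_enumerate() walks devices inside the kernel; count_up_ipv4_interfaces() is a hypothetical helper name. It only confirms which IPv4-configured interfaces userspace sees as IFF_UP.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* expose IFF_UP under strict -std= modes */
#endif
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: walk the userspace interface list and count the
 * IPv4 addresses that sit on an interface flagged IFF_UP. If 'out' is
 * non-NULL, print one line per IPv4 address with the name quoted.
 * Returns the count, or -1 if getifaddrs() fails.
 */
static int count_up_ipv4_interfaces(FILE *out)
{
	struct ifaddrs *list = NULL;
	struct ifaddrs *ifa;
	int count = 0;

	if (getifaddrs(&list) != 0)
		return -1;

	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		int up;

		/* Entries without an address, or with a non-IPv4 one, are skipped. */
		if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
			continue;

		up = (ifa->ifa_flags & IFF_UP) != 0;
		if (out != NULL)
			fprintf(out, "'%s': %s\n", ifa->ifa_name,
				up ? "up" : "down");
		if (up)
			count++;
	}

	freeifaddrs(list);
	return count;
}
```

Calling count_up_ipv4_interfaces(stdout) on the VM described here would be expected to list enp0s3 and its alias alongside the loopback device, though the exact output depends on the host.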

There are no lnet or socklnd module options in use:

# cat /etc/modprobe.d/lustre.conf
options mdt max_mod_rpcs_per_client=16
options ptlrpc at_min=10 at_max=900

I'll try the tip of master next (v2_14_51-85-ga2b5290d4284) in case this has already been fixed, but I'm filing this ticket to capture the details while I have them. If anyone else running vanilla 2.14.0 hits the same problem, it will provide breadcrumbs to find the fix.



 Comments   
Comment by James A Simmons [ 23/Apr/21 ]

I noticed you have both IPv4 and IPv6 addresses assigned to enp0s3. Can you try removing the inet6 address and see if it stops crashing?

Comment by Andreas Dilger [ 24/Apr/21 ]

After doing a full clean build I am not able to reproduce this, with or without IPv6 addresses on the interfaces.

Generated at Sat Feb 10 03:11:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.