  1. Lustre
  2. LU-14638

crash after lnet_inet_enumerate() failed

Details

    • Task
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 2.14.0
    • None

    Description

      I tried starting 2.14.0 in my local VM (just llmount.sh on kernel 3.10.0-1160.21.1.el7_lustre.ddn13.x86_64 after a clean build) and twice got a crash after lnet_inet_enumerate() reported a down interface:

      [52067.917105] LNet: 8128:0:(config.c:1565:lnet_inet_enumerate()) lnet: Ignoring interface : it's down
      [52067.926616] BUG: unable to handle kernel NULL pointer dereference at 0000000000000168
      [52067.933558] IP: [<ffffffff8464f1f6>] dev_get_flags+0x6/0x70
      [52068.014929] CPU: 1 PID: 8128 Comm: insmod Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.21.1.el7_lustre.ddn13.x86_64 #1
      [52068.024565] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [52068.098650] Call Trace:
      [52068.101291]  [<ffffffffc0727149>] ? lnet_inet_enumerate+0x59/0x2d0 [lnet]
      [52068.112194]  [<ffffffffc07ce9ea>] ksocknal_startup+0x12a/0xf90 [ksocklnd]
      [52068.117704]  [<ffffffffc0721de5>] lnet_startup_lndnet+0x135/0x800 [lnet]
      [52068.123269]  [<ffffffffc0724085>] LNetNIInit+0x735/0xcf0 [lnet]
      [52068.133225]  [<ffffffffc0b3d1aa>] ptlrpc_ni_init+0x2a/0x1a0 [ptlrpc]
      [52068.138628]  [<ffffffffc0b3d331>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      [52068.144344]  [<ffffffffc0d131ae>] ptlrpc_init+0x1ae/0x1000 [ptlrpc]
      [52068.148379]  [<ffffffff8400210a>] do_one_initcall+0xba/0x240
      [52068.153886]  [<ffffffff8411e62a>] load_module+0x271a/0x2bb0
      

      It would be useful for the lnet_inet_enumerate() message to print which interface is down. Looking at the code, it is already trying to do that, so the interface name must be empty. Quoting the name in the message ('%s') would make that clearer. Adding a bit of extra debugging shows that it fails right away, without checking any other interfaces:

      [ 1297.032731] LNet: 4540:0:(config.c:1560:lnet_inet_enumerate()) lnet: checking interface ''
      [ 1297.040231] LNet: 4540:0:(config.c:1566:lnet_inet_enumerate()) lnet: Ignoring interface '': it's down
      [ 1297.048456] BUG: unable to handle kernel NULL pointer dereference at 0000000000000168
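
      The effect of quoting the name can be illustrated with a small user-space sketch. This is not the Lustre code: format_down_msg() is a hypothetical helper that only mimics the wording of the log message, to show that an empty name stays visible once it is wrapped in quotes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of LNet): build the "interface is down"
 * message with the interface name in single quotes, so that an empty
 * name shows up as '' instead of silently disappearing.
 */
static int format_down_msg(char *buf, size_t len, const char *name)
{
	assert(buf != NULL && len > 0);

	/* Treat a NULL name like an empty one so "%s" is always safe. */
	return snprintf(buf, len, "Ignoring interface '%s': it's down",
			name != NULL ? name : "");
}
```

      Calling format_down_msg(buf, sizeof(buf), "") yields a message containing '', which matches the debugging output above and makes it obvious that the name itself is empty.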
      

      It seems to be starting with an empty device list somehow, but the VM definitely has an interface that is up: I was logged into the VM via SSH when running the llmount.sh command. I haven't had any issues running other releases in this VM, though I don't recall whether I've ever run vanilla 2.14.0 in it:

      [   38.761576] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
      [   38.771906] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
      [   38.781866] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s3: link becomes ready
      :
      :
      # ifconfig
      enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 192.168.10.99  netmask 255.255.255.0  broadcast 192.168.10.255
              inet6 fe80::e9c4:7d8c:e641:5e6e  prefixlen 64  scopeid 0x20<link>
              ether 08:00:27:1d:4b:97  txqueuelen 1000  (Ethernet)
              RX packets 1594  bytes 272650 (266.2 KiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 441  bytes 79324 (77.4 KiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      enp0s3:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 192.168.20.99  netmask 255.255.255.0  broadcast 192.168.20.255
              ether 08:00:27:1d:4b:97  txqueuelen 1000  (Ethernet)
      

      There are no lnet or socklnd module options in use:

      # cat /etc/modprobe.d/lustre.conf
      options mdt max_mod_rpcs_per_client=16
      options ptlrpc at_min=10 at_max=900
      

      I'll try the tip of master next (v2_14_51-85-ga2b5290d4284) in case this has already been fixed. I'm filing this ticket to capture the details while I have them, and to leave breadcrumbs to the fix for anyone else who hits the same problem running vanilla 2.14.0.

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Andreas Dilger (adilger)