Lustre / LU-5884

bad lnet conf causes LBUG

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.7.0
    • None
    • el7
    • 3
    • 16452

    Description

      A bad lnet config file in /etc/modprobe.d can cause kernel LBUGs. In particular, naming a network interface that doesn't exist causes an LBUG and panic at lnet startup time. At the very least, this kind of misconfiguration should fail cleanly and report errors that an admin can act on. It shouldn't panic the node.

      This was seen in our test environment when testing el7. Our test framework installs an /etc/modprobe.d/lustre-lnet.conf file that says:

      options lnet accept=all networks="tcp0(eth0)" accept_port=7988

      This has always worked in the past, but in current el7 installs 'eth0' is no longer the default name of the primary Ethernet interface. As a result, every lnet startup hits an LBUG and panics, with traces like:

      14:29:49:[  207.450923] LNetError: 2024:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
      14:29:49:[  207.451763] LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) Can't get interface eth0 info: -19
      14:29:49:[  208.452162] LNetError: 105-4: Error -100 starting up LNI tcp
      14:29:49:[  208.453865] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed: 
      14:29:49:[  208.456181] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) LBUG
      14:29:49:[  208.456661] Pid: 2024, comm: modprobe
      14:29:49:[  208.456947] 
      14:29:49:[  208.456947] Call Trace:
      14:29:49:[  208.457281]  [<ffffffffa0432853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      14:29:49:[  208.457816]  [<ffffffffa0432df5>] lbug_with_loc+0x45/0xc0 [libcfs]
      14:29:49:[  208.458309]  [<ffffffffa04d4877>] lnet_unprepare+0x297/0x340 [lnet]
      14:29:49:[  208.458784]  [<ffffffffa04d749e>] LNetNIInit+0x30e/0xa50 [lnet]
      14:29:49:[  208.459271]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
      14:29:49:[  208.459771]  [<ffffffffa07d5f4c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
      14:29:49:[  208.460301]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
      14:29:49:[  208.460800]  [<ffffffffa07d60d1>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      14:29:49:[  208.461346]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
      14:29:49:[  208.461841]  [<ffffffffa08dd187>] init_module+0x187/0x1000 [ptlrpc]
      14:29:49:[  208.462338]  [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
      14:29:49:[  208.462778]  [<ffffffff810ca9cb>] load_module+0x12ab/0x1aa0
      14:29:49:[  208.463229]  [<ffffffff812da1a0>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
      14:29:49:[  208.463751]  [<ffffffff810c72f3>] ? copy_module_from_fd.isra.43+0x53/0x150
      14:29:49:[  208.464287]  [<ffffffff810cb376>] SyS_finit_module+0xa6/0xd0
      14:29:49:[  208.464721]  [<ffffffff815f2b19>] system_call_fastpath+0x16/0x1b
      14:29:49:[  208.465189] 
      14:29:49:[  208.466970] Kernel panic - not syncing: LBUG
      14:29:49:[  208.467019] CPU: 0 PID: 2024 Comm: modprobe Tainted: GF          O--------------   3.10.0-123.9.2.el7.x86_64 #1
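
For readers less familiar with kernel return codes, here is a minimal sketch that decodes the two negative codes in the LNetError lines above. The table holds only the Linux errno values that appear in this trace (19 and 100). The helper name `decode_lnet_error` and the regular expression are illustrative assumptions, not part of Lustre.

```python
import re

# Linux errno values for the two codes seen in the trace above.
# Only these two are listed; extend the table as needed.
LINUX_ERRNO = {
    19: ("ENODEV", "No such device"),
    100: ("ENETDOWN", "Network is down"),
}

# Matches a negative code such as "info: -19" or "Error -100".
_CODE_RE = re.compile(r"(?:info:|Error)\s+-(\d+)")


def decode_lnet_error(line):
    """Return (code, name, text) for the errno in a log line, or None.

    Hypothetical helper for illustration; it only understands the two
    message shapes that appear in this report.
    """
    match = _CODE_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return code, name, text


if __name__ == "__main__":
    for sample in (
        "LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) "
        "Can't get interface eth0 info: -19",
        "LNetError: 105-4: Error -100 starting up LNI tcp",
    ):
        print(decode_lnet_error(sample))
```

Read this way, the trace says the socket LND could not find the device named eth0 (ENODEV), the tcp network interface then failed to start (ENETDOWN), and the cleanup path hit the assertion instead of returning that error to modprobe.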
      

      We can probably take action in TEI to avoid this problem for el7 testing, but it highlights the fact that a user or admin can crash nodes with reasonable-looking but incorrect lnet config options. A wrong config like this should return actionable errors, not cause LBUGs.
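
As a stopgap on the test-framework side, a pre-flight check along the following lines could catch the bad interface name before the lnet module is loaded. This is only a sketch under stated assumptions: the function names are hypothetical, the parser handles only the simple `net(iface[,iface...])` form of the `networks` option shown in this report, and it does not replace fixing the LBUG in lnet itself.

```python
import re
import socket

# One entry of the "networks" option, e.g. "tcp0(eth0)" or plain "tcp".
_NET_RE = re.compile(r"^\s*([A-Za-z0-9]+)\s*(?:\(([^)]*)\))?\s*$")


def networks_from_options(line):
    """Extract the networks= value from a modprobe 'options lnet ...' line."""
    match = re.search(r'networks="?([^"\s]+)', line)
    return match.group(1) if match else None


def parse_networks(value):
    """Split 'tcp0(eth0),o2ib(ib0,ib1)' into [(net, [ifaces...]), ...]."""
    entries = []
    # Split on commas that are not inside parentheses.
    for part in re.split(r",(?![^(]*\))", value):
        match = _NET_RE.match(part)
        if match is None:
            raise ValueError(f"cannot parse network entry: {part!r}")
        raw = match.group(2) or ""
        ifaces = [name.strip() for name in raw.split(",") if name.strip()]
        entries.append((match.group(1), ifaces))
    return entries


def missing_interfaces(value, present=None):
    """Return (net, iface) pairs whose interface is not on this host.

    'present' can be passed in for testing; by default the names come
    from socket.if_nameindex().
    """
    if present is None:
        present = {name for _, name in socket.if_nameindex()}
    return [
        (net, iface)
        for net, ifaces in parse_networks(value)
        for iface in ifaces
        if iface not in present
    ]


if __name__ == "__main__":
    conf = 'options lnet accept=all networks="tcp0(eth0)" accept_port=7988'
    for net, iface in missing_interfaces(networks_from_options(conf)):
        print(f"{net}: interface {iface} does not exist on this host")
```

Running the check against the config line from this report on a host without eth0 would report `tcp0` and `eth0` as a mismatch, which is the kind of actionable error the module load should ideally produce on its own.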

            People

              ashehata Amir Shehata (Inactive)
              bogl Bob Glossman (Inactive)
