Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.7.0
    • None
    • el7
    • 3
    • 16452

    Description

      Having a bad lnet config file in /etc/modprobe.d can cause kernel LBUGs. In particular, specifying by name a network interface that doesn't exist causes an LBUG and panic at lnet startup time. At the very least this sort of thing should fail nicely and report errors that an admin can act on; it shouldn't panic the node.

      This was seen in our test environment when testing el7. Our test framework installs an /etc/modprobe.d/lustre-lnet.conf file that says:

      options lnet accept=all networks="tcp0(eth0)" accept_port=7988

      This has always worked in the past, but in current el7 installs 'eth0' is no longer the default name of the primary ethernet interface. This causes all lnet startups to LBUG and panic, with traces like:

      14:29:49:[  207.450923] LNetError: 2024:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
      14:29:49:[  207.451763] LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) Can't get interface eth0 info: -19
      14:29:49:[  208.452162] LNetError: 105-4: Error -100 starting up LNI tcp
      14:29:49:[  208.453865] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed: 
      14:29:49:[  208.456181] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) LBUG
      14:29:49:[  208.456661] Pid: 2024, comm: modprobe
      14:29:49:[  208.456947] 
      14:29:49:[  208.456947] Call Trace:
      14:29:49:[  208.457281]  [<ffffffffa0432853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      14:29:49:[  208.457816]  [<ffffffffa0432df5>] lbug_with_loc+0x45/0xc0 [libcfs]
      14:29:49:[  208.458309]  [<ffffffffa04d4877>] lnet_unprepare+0x297/0x340 [lnet]
      14:29:49:[  208.458784]  [<ffffffffa04d749e>] LNetNIInit+0x30e/0xa50 [lnet]
      14:29:49:[  208.459271]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
      14:29:49:[  208.459771]  [<ffffffffa07d5f4c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
      14:29:49:[  208.460301]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
      14:29:49:[  208.460800]  [<ffffffffa07d60d1>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
      14:29:49:[  208.461346]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
      14:29:49:[  208.461841]  [<ffffffffa08dd187>] init_module+0x187/0x1000 [ptlrpc]
      14:29:49:[  208.462338]  [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
      14:29:49:[  208.462778]  [<ffffffff810ca9cb>] load_module+0x12ab/0x1aa0
      14:29:49:[  208.463229]  [<ffffffff812da1a0>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
      14:29:49:[  208.463751]  [<ffffffff810c72f3>] ? copy_module_from_fd.isra.43+0x53/0x150
      14:29:49:[  208.464287]  [<ffffffff810cb376>] SyS_finit_module+0xa6/0xd0
      14:29:49:[  208.464721]  [<ffffffff815f2b19>] system_call_fastpath+0x16/0x1b
      14:29:49:[  208.465189] 
      14:29:49:[  208.466970] Kernel panic - not syncing: LBUG
      14:29:49:[  208.467019] CPU: 0 PID: 2024 Comm: modprobe Tainted: GF          O--------------   3.10.0-123.9.2.el7.x86_64 #1
      

      We can probably take action in TEI to avoid this problem for el7 testing, but it highlights the fact that a user or admin can crash nodes with reasonable-looking but incorrect lnet config options. Such a wrong config should return actionable errors, not cause LBUGs.
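
A test framework or admin could catch this kind of mistake before the modules are loaded. The following is a minimal sketch, assuming only the simple `networks="net(iface,...)"` form seen in this ticket; it does not handle other parts of the LNet option syntax. The function names `parse_networks` and `missing_interfaces` are hypothetical helpers, not part of Lustre. It relies on Python's `socket.if_nameindex()` to list the interfaces present on the node.

```python
import re
import socket

# Value of the networks= option, quoted or unquoted.
NETWORKS_RE = re.compile(r'\bnetworks=(?:"([^"]*)"|(\S+))')
# One entry such as tcp0(eth0) or o2ib(ib0,ib1); the interface list is optional.
ENTRY_RE = re.compile(r'([A-Za-z0-9]+)(?:\(([^)]*)\))?')


def parse_networks(options_line):
    """Return [(network, [interface, ...]), ...] from an 'options lnet' line."""
    match = NETWORKS_RE.search(options_line)
    if not match:
        return []
    value = match.group(1) if match.group(1) is not None else match.group(2)
    result = []
    for net, ifaces in ENTRY_RE.findall(value):
        names = [name.strip() for name in ifaces.split(",") if name.strip()]
        result.append((net, names))
    return result


def missing_interfaces(options_line, existing=None):
    """Return (network, interface) pairs whose interface is not on this node."""
    if existing is None:
        # if_nameindex() yields (index, name) tuples for the local interfaces.
        existing = {name for _, name in socket.if_nameindex()}
    return [
        (net, iface)
        for net, names in parse_networks(options_line)
        for iface in names
        if iface not in existing
    ]
```

Run against the line from lustre-lnet.conf above on an el7 node without an `eth0`, `missing_interfaces` would report the `tcp0`/`eth0` pair, which a wrapper script could turn into a readable error before calling modprobe.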

          Activity

            [LU-5884] bad lnet conf causes LBUG

            adilger Andreas Dilger added a comment - Close as duplicate of LU-5568.

            liang Liang Zhen (Inactive) added a comment - I believe this is just another instance of LU-5568

            bogl Bob Glossman (Inactive) added a comment - I don't see the problem in b2_6. If I do a client mount with a bad config as shown above, I get an error reported, with no panic:

            # mount -t lustre -o flock,user_xattr centos2:/lustre /mnt/lustre
            mount.lustre: mount centos2:/lustre at /mnt/lustre failed: No such device
            Are the lustre modules loaded?
            Check /etc/modprobe.conf and /proc/filesystems

            /var/log/messages says:

            Nov  7 12:19:03 centos7-2 kernel: LNetError: 103301:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
            Nov  7 12:19:03 centos7-2 kernel: LNetError: 103301:0:(socklnd.c:2826:ksocknal_startup()) Can't get interface eth0 info: -19
            Nov  7 12:19:04 centos7-2 kernel: LNetError: 105-4: Error -100 starting up LNI tcp
            Nov  7 12:19:04 centos7-2 kernel: LustreError: 103301:0:(events.c:809:ptlrpc_init_portals()) network initialisation failed
            Nov  7 12:19:04 centos7-2 kernel: LustreError: 165-2: Nothing registered for client mount! Is the 'lustre' module loaded?
            Nov  7 12:19:04 centos7-2 kernel: LustreError: 103272:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-19)

            This suggests the problem went into the lnet code recently.
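
The numeric codes in the log lines above are negated Linux errno values: -19 is ENODEV, which matches the "No such device" text printed by mount.lustre, and -100 is ENETDOWN. Here is a small sketch for annotating such lines, assuming Linux errno numbering. The table covers only the two codes seen in this ticket, and the name `explain_codes` is illustrative, not part of Lustre.

```python
import re

# Linux errno values for the two codes that appear in this ticket's logs.
LINUX_ERRNO = {
    19: ("ENODEV", "No such device"),
    100: ("ENETDOWN", "Network is down"),
}

# A negative integer that is not glued to a preceding word or digit,
# so message IDs such as "105-4" are not mistaken for error codes.
NEG_CODE_RE = re.compile(r'(?<![\w-])-(\d+)\b')


def explain_codes(log_line):
    """Return a note for each negative return code found in a log line."""
    notes = []
    for digits in NEG_CODE_RE.findall(log_line):
        name, text = LINUX_ERRNO.get(int(digits), ("unknown", "not in this table"))
        notes.append(f"-{digits} {name} ({text})")
    return notes
```

Passing each LNetError or LustreError line through `explain_codes` gives a quick reading of why startup failed: the interface lookup returned ENODEV, and the LNI startup then reported ENETDOWN.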

            bogl Bob Glossman (Inactive) added a comment - Andreas, I just don't know the answer. I suspect that this is a long-standing issue, not new. I will check back on old versions to try to find out.

            adilger Andreas Dilger added a comment - Bob, does this problem exist in Lustre 2.6 or earlier? Trying to figure out if this is caused by DLC or is an old bug that we've never noticed.

            People

              ashehata Amir Shehata (Inactive)
              bogl Bob Glossman (Inactive)
              Votes: 0
              Watchers: 4