Details
Bug
Resolution: Duplicate
Major
None
Lustre 2.7.0
None
el7
3
16452
Description
A bad lnet config file in /etc/modprobe.d can cause kernel LBUGs. In particular, naming a network interface that doesn't exist causes an LBUG and a panic at lnet startup time. At the very least, this should fail cleanly and report errors that an admin can act on; it shouldn't panic the node.
This was seen in our test environment when testing el7. Our test framework installs an /etc/modprobe.d/lustre-lnet.conf file that says:
options lnet accept=all networks="tcp0(eth0)" accept_port=7988
This has always worked in the past, but in current el7 installs 'eth0' is no longer the default name of the primary ethernet interface. This causes all lnet startups to LBUG and panic, with traces like:
14:29:49:[ 207.450923] LNetError: 2024:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
14:29:49:[ 207.451763] LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) Can't get interface eth0 info: -19
14:29:49:[ 208.452162] LNetError: 105-4: Error -100 starting up LNI tcp
14:29:49:[ 208.453865] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed:
14:29:49:[ 208.456181] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) LBUG
14:29:49:[ 208.456661] Pid: 2024, comm: modprobe
14:29:49:[ 208.456947]
14:29:49:[ 208.456947] Call Trace:
14:29:49:[ 208.457281] [<ffffffffa0432853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
14:29:49:[ 208.457816] [<ffffffffa0432df5>] lbug_with_loc+0x45/0xc0 [libcfs]
14:29:49:[ 208.458309] [<ffffffffa04d4877>] lnet_unprepare+0x297/0x340 [lnet]
14:29:49:[ 208.458784] [<ffffffffa04d749e>] LNetNIInit+0x30e/0xa50 [lnet]
14:29:49:[ 208.459271] [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[ 208.459771] [<ffffffffa07d5f4c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
14:29:49:[ 208.460301] [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[ 208.460800] [<ffffffffa07d60d1>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
14:29:49:[ 208.461346] [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[ 208.461841] [<ffffffffa08dd187>] init_module+0x187/0x1000 [ptlrpc]
14:29:49:[ 208.462338] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
14:29:49:[ 208.462778] [<ffffffff810ca9cb>] load_module+0x12ab/0x1aa0
14:29:49:[ 208.463229] [<ffffffff812da1a0>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
14:29:49:[ 208.463751] [<ffffffff810c72f3>] ? copy_module_from_fd.isra.43+0x53/0x150
14:29:49:[ 208.464287] [<ffffffff810cb376>] SyS_finit_module+0xa6/0xd0
14:29:49:[ 208.464721] [<ffffffff815f2b19>] system_call_fastpath+0x16/0x1b
14:29:49:[ 208.465189]
14:29:49:[ 208.466970] Kernel panic - not syncing: LBUG
14:29:49:[ 208.467019] CPU: 0 PID: 2024 Comm: modprobe Tainted: GF O-------------- 3.10.0-123.9.2.el7.x86_64 #1
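For anyone triaging a similar trace, the two numeric codes in the first lines can be read as negated Linux errno values, which is the usual convention for kernel return codes. The sketch below is only an illustration: the `LINUX_ERRNO` table and the `decode_errors` helper are hypothetical names, not part of Lustre, and the table covers only the two codes seen in this log.

```python
import re

# Linux errno values for the two codes that appear in the trace above.
# Hypothetical lookup table for illustration; it is deliberately not exhaustive.
LINUX_ERRNO = {
    19: ("ENODEV", "No such device"),
    100: ("ENETDOWN", "Network is down"),
}


def decode_errors(log_text):
    """Find 'info: -N' and 'Error -N' codes in log_text and name them."""
    decoded = []
    for code in re.findall(r"(?:info:|Error)\s+-(\d+)", log_text):
        name, text = LINUX_ERRNO.get(int(code), ("unknown", "not in table"))
        decoded.append((-int(code), name, text))
    return decoded


if __name__ == "__main__":
    sample = (
        "LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) "
        "Can't get interface eth0 info: -19\n"
        "LNetError: 105-4: Error -100 starting up LNI tcp\n"
    )
    for code, name, text in decode_errors(sample):
        print(code, name, text)
```

Read this way, the log is consistent with the description: the interface lookup for eth0 fails with "no such device", the tcp network interface then fails to start, and the cleanup path hits the assertion.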
We can probably take action in TEI to avoid this problem for el7 testing, but it highlights the fact that a user or admin can crash nodes with reasonable-looking but incorrect lnet config options. A wrong config like this should return actionable errors, not cause LBUGs.
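Until lnet itself fails cleanly, a test framework could check the config before loading the module. The following is a minimal sketch of such a pre-flight check, not something TEI or Lustre provides: `parse_lnet_networks` and `missing_interfaces` are hypothetical helpers, and the parser only handles the simple `net(iface,...)` form shown in this ticket, not richer `networks` syntax.

```python
import re
import socket

# Matches the value of networks=... in an "options lnet" line.
_NETWORKS_RE = re.compile(r'networks="?([^"\s]+)"?')
# Matches one entry such as tcp0(eth0); the interface list is optional.
_NET_ENTRY_RE = re.compile(r"([A-Za-z0-9]+)(?:\(([^)]*)\))?")


def parse_lnet_networks(options_line):
    """Return [(network, [interface, ...]), ...] from an 'options lnet' line."""
    match = _NETWORKS_RE.search(options_line)
    if match is None:
        return []
    parsed = []
    for net, ifaces in _NET_ENTRY_RE.findall(match.group(1)):
        names = [name.strip() for name in ifaces.split(",") if name.strip()]
        parsed.append((net, names))
    return parsed


def missing_interfaces(options_line, present=None):
    """Return (network, interface) pairs whose interface is not on this host.

    'present' is a set of interface names; by default it is read from the
    running system with socket.if_nameindex().
    """
    if present is None:
        present = {name for _, name in socket.if_nameindex()}
    return [
        (net, iface)
        for net, ifaces in parse_lnet_networks(options_line)
        for iface in ifaces
        if iface not in present
    ]


if __name__ == "__main__":
    line = 'options lnet accept=all networks="tcp0(eth0)" accept_port=7988'
    for net, iface in missing_interfaces(line):
        print(f"lnet config error: network {net} names missing interface {iface}")
```

Run against the options line from this ticket on an el7 node without eth0, the check would report the missing interface as an ordinary error message, so the framework could stop before modprobe is ever invoked.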