[LU-5884] bad lnet conf causes LBUG
Created: 07/Nov/14  Updated: 20/Feb/15  Resolved: 10/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Bob Glossman (Inactive)
Assignee: Amir Shehata (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None
Environment: el7

Issue Links:
Duplicate
duplicates LU-5568 kernel crash when network initia... Resolved
Related
is related to LU-5022 support for 3.10 rhel7 linux kernel Resolved
is related to LU-2456 Dynamic LNet Config Main Development ... Resolved
Severity: 3
Rank (Obsolete): 16452

 Description   

A bad lnet config file in /etc/modprobe.d can cause kernel LBUGs. In particular, naming a network interface that doesn't exist causes an LBUG and a panic at lnet startup time. At the very least this sort of thing should fail cleanly and report errors that an admin can act on; it shouldn't panic the node.

This was seen in our test environment when testing el7. Our test framework installs an /etc/modprobe.d/lustre-lnet.conf file that says:

options lnet accept=all networks="tcp0(eth0)" accept_port=7988

This has always worked in the past, but in current el7 installs 'eth0' is no longer the default name of the primary Ethernet interface. As a result, every lnet startup hits an LBUG and panics, with traces like:

14:29:49:[  207.450923] LNetError: 2024:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
14:29:49:[  207.451763] LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) Can't get interface eth0 info: -19
14:29:49:[  208.452162] LNetError: 105-4: Error -100 starting up LNI tcp
14:29:49:[  208.453865] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed: 
14:29:49:[  208.456181] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) LBUG
14:29:49:[  208.456661] Pid: 2024, comm: modprobe
14:29:49:[  208.456947] 
14:29:49:[  208.456947] Call Trace:
14:29:49:[  208.457281]  [<ffffffffa0432853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
14:29:49:[  208.457816]  [<ffffffffa0432df5>] lbug_with_loc+0x45/0xc0 [libcfs]
14:29:49:[  208.458309]  [<ffffffffa04d4877>] lnet_unprepare+0x297/0x340 [lnet]
14:29:49:[  208.458784]  [<ffffffffa04d749e>] LNetNIInit+0x30e/0xa50 [lnet]
14:29:49:[  208.459271]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[  208.459771]  [<ffffffffa07d5f4c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
14:29:49:[  208.460301]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[  208.460800]  [<ffffffffa07d60d1>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
14:29:49:[  208.461346]  [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[  208.461841]  [<ffffffffa08dd187>] init_module+0x187/0x1000 [ptlrpc]
14:29:49:[  208.462338]  [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
14:29:49:[  208.462778]  [<ffffffff810ca9cb>] load_module+0x12ab/0x1aa0
14:29:49:[  208.463229]  [<ffffffff812da1a0>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
14:29:49:[  208.463751]  [<ffffffff810c72f3>] ? copy_module_from_fd.isra.43+0x53/0x150
14:29:49:[  208.464287]  [<ffffffff810cb376>] SyS_finit_module+0xa6/0xd0
14:29:49:[  208.464721]  [<ffffffff815f2b19>] system_call_fastpath+0x16/0x1b
14:29:49:[  208.465189] 
14:29:49:[  208.466970] Kernel panic - not syncing: LBUG
14:29:49:[  208.467019] CPU: 0 PID: 2024 Comm: modprobe Tainted: GF          O--------------   3.10.0-123.9.2.el7.x86_64 #1

We can probably take action in TEI to avoid this problem for el7 testing, but it highlights the fact that a user or admin can crash nodes with reasonable-looking but incorrect lnet config options. Such a wrong config should return actionable errors, not cause LBUGs.
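
As an illustration of the kind of check that would turn this into an actionable error, here is a minimal sketch of a pre-flight test that could run before the lnet module is loaded. It is not part of Lustre or TEI. The function names and the `sysfs_net` parameter are hypothetical, and it assumes the simple `net(iface[,iface...])` form of the `networks` option shown above; anything more elaborate is not handled.

```python
import os
import re


def interfaces_in_networks(options_line):
    """Return the interface names listed in a networks=... option.

    Assumes the simple form net(iface[,iface...]), as in tcp0(eth0).
    """
    match = re.search(r'networks=(?:"([^"]*)"|(\S+))', options_line)
    if not match:
        return []
    value = match.group(1) if match.group(1) is not None else match.group(2)
    names = []
    # Each parenthesised group holds one or more comma-separated interfaces.
    for group in re.findall(r'\(([^)]*)\)', value):
        names.extend(n.strip() for n in group.split(',') if n.strip())
    return names


def missing_interfaces(options_line, sysfs_net="/sys/class/net"):
    """Return the configured interfaces that have no entry under sysfs_net."""
    return [name for name in interfaces_in_networks(options_line)
            if not os.path.exists(os.path.join(sysfs_net, name))]


if __name__ == "__main__":
    line = 'options lnet accept=all networks="tcp0(eth0)" accept_port=7988'
    for name in missing_interfaces(line):
        print(f"lnet config names interface {name!r}, which does not exist")
```

On Linux, each network interface has a directory under /sys/class/net, so a missing entry there means the name in the config does not match any interface on the node.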



 Comments   
Comment by Andreas Dilger [ 07/Nov/14 ]

Bob, does this problem exist in Lustre 2.6 or earlier? Trying to figure out if this is caused by DLC or is an old bug that we've never noticed.

Comment by Bob Glossman (Inactive) [ 07/Nov/14 ]

Andreas, I just don't know the answer. I suspect that this is a long standing issue, not new. I will check back on old versions to try to find out.

Comment by Bob Glossman (Inactive) [ 07/Nov/14 ]

I don't see the problem in b2_6. A client mount with the bad config shown above reports an error but does not panic:

# mount -t lustre -o flock,user_xattr centos2:/lustre /mnt/lustre
mount.lustre: mount centos2:/lustre at /mnt/lustre failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems

/var/log/messages says:

Nov  7 12:19:03 centos7-2 kernel: LNetError: 103301:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
Nov  7 12:19:03 centos7-2 kernel: LNetError: 103301:0:(socklnd.c:2826:ksocknal_startup()) Can't get interface eth0 info: -19
Nov  7 12:19:04 centos7-2 kernel: LNetError: 105-4: Error -100 starting up LNI tcp
Nov  7 12:19:04 centos7-2 kernel: LustreError: 103301:0:(events.c:809:ptlrpc_init_portals()) network initialisation failed
Nov  7 12:19:04 centos7-2 kernel: LustreError: 165-2: Nothing registered for client mount! Is the 'lustre' module loaded?
Nov  7 12:19:04 centos7-2 kernel: LustreError: 103272:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-19)

This suggests the problem was introduced into the lnet code recently.
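
For anyone reading the logs, the two return codes are standard Linux errno values. The sketch below decodes them from a small hard-coded table; the `LINUX_ERRNO` and `describe` names are illustrative only and are not taken from Lustre.

```python
# Linux errno values that appear in the logs above.
LINUX_ERRNO = {
    19: ("ENODEV", "No such device"),
    100: ("ENETDOWN", "Network is down"),
}


def describe(rc):
    """Translate a negative kernel return code into 'rc: NAME (message)'."""
    name, message = LINUX_ERRNO.get(-rc, ("unknown", "not in table"))
    return f"{rc}: {name} ({message})"


for rc in (-19, -100):
    print(describe(rc))
```

The -19 from ksocknal_startup() is the same "No such device" that mount.lustre reports, and the -100 reported for LNI tcp means the network is down.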

Comment by Liang Zhen (Inactive) [ 08/Nov/14 ]

I believe this is just another instance of LU-5568

Comment by Andreas Dilger [ 10/Nov/14 ]

Close as duplicate of LU-5568.
