[LU-5884] bad lnet conf causes LBUG Created: 07/Nov/14 Updated: 20/Feb/15 Resolved: 10/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Bob Glossman (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | el7 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16452 |
| Description |
|
Having a bad lnet config file in /etc/modprobe.d can cause kernel LBUGs. In particular, specifying by name a network interface that doesn't exist causes an LBUG and panic at lnet startup time. At the very least this sort of thing should fail nicely and report errors that an admin can act on; it shouldn't panic the node.

This was seen in our test environment when testing el7. Our test framework installs an /etc/modprobe.d/lustre-lnet.conf file that says:

options lnet accept=all networks="tcp0(eth0)" accept_port=7988

This has always worked in the past, but in current el7 installs 'eth0' is no longer the default name of the primary ethernet interface. This causes all lnet startups to LBUG and panic, with traces like:

14:29:49:[ 207.450923] LNetError: 2024:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
14:29:49:[ 207.451763] LNetError: 2024:0:(socklnd.c:2829:ksocknal_startup()) Can't get interface eth0 info: -19
14:29:49:[ 208.452162] LNetError: 105-4: Error -100 starting up LNI tcp
14:29:49:[ 208.453865] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed:
14:29:49:[ 208.456181] LNetError: 2024:0:(api-ni.c:829:lnet_unprepare()) LBUG
14:29:49:[ 208.456661] Pid: 2024, comm: modprobe
14:29:49:[ 208.456947]
14:29:49:[ 208.456947] Call Trace:
14:29:49:[ 208.457281] [<ffffffffa0432853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
14:29:49:[ 208.457816] [<ffffffffa0432df5>] lbug_with_loc+0x45/0xc0 [libcfs]
14:29:49:[ 208.458309] [<ffffffffa04d4877>] lnet_unprepare+0x297/0x340 [lnet]
14:29:49:[ 208.458784] [<ffffffffa04d749e>] LNetNIInit+0x30e/0xa50 [lnet]
14:29:49:[ 208.459271] [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[ 208.459771] [<ffffffffa07d5f4c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
14:29:49:[ 208.460301] [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[ 208.460800] [<ffffffffa07d60d1>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
14:29:49:[ 208.461346] [<ffffffffa08dd000>] ? init_module+0x0/0x1000 [ptlrpc]
14:29:49:[ 208.461841] [<ffffffffa08dd187>] init_module+0x187/0x1000 [ptlrpc]
14:29:49:[ 208.462338] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
14:29:49:[ 208.462778] [<ffffffff810ca9cb>] load_module+0x12ab/0x1aa0
14:29:49:[ 208.463229] [<ffffffff812da1a0>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
14:29:49:[ 208.463751] [<ffffffff810c72f3>] ? copy_module_from_fd.isra.43+0x53/0x150
14:29:49:[ 208.464287] [<ffffffff810cb376>] SyS_finit_module+0xa6/0xd0
14:29:49:[ 208.464721] [<ffffffff815f2b19>] system_call_fastpath+0x16/0x1b
14:29:49:[ 208.465189]
14:29:49:[ 208.466970] Kernel panic - not syncing: LBUG
14:29:49:[ 208.467019] CPU: 0 PID: 2024 Comm: modprobe Tainted: GF O-------------- 3.10.0-123.9.2.el7.x86_64 #1

We can probably take action in TEI to avoid this problem for el7 test, but it highlights the fact that a user or admin can crash nodes with reasonable-looking but incorrect lnet config options. Such wrong config should return actionable errors, not cause LBUGs. |
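Until the module itself fails cleanly, a test framework could check the interface names before loading lnet. The following is only a minimal sketch of such a pre-flight check, not part of Lustre or TEI. The helper names (`interfaces_in_options`, `missing_interfaces`, `system_interfaces`) are hypothetical, and the parser assumes the simple `net(iface[,iface])` form of the `networks=` option shown above; it does not attempt to handle other forms of LNet network configuration.

```python
import re
from pathlib import Path

# Assumption: networks= uses the simple "net(iface[,iface...])" form,
# e.g. tcp0(eth0) or o2ib(ib0,ib1). Other LNet syntax is not handled.
_NETWORKS = re.compile(r'networks="?([^"\s]+)"?')
_NET_ENTRY = re.compile(r"(\w+)\(([^)]*)\)")


def interfaces_in_options(line):
    """Return (network, interface) pairs named in an 'options lnet' line."""
    match = _NETWORKS.search(line)
    if not match:
        return []
    pairs = []
    for net, ifaces in _NET_ENTRY.findall(match.group(1)):
        for iface in ifaces.split(","):
            iface = iface.strip()
            if iface:
                pairs.append((net, iface))
    return pairs


def missing_interfaces(line, present):
    """Return the (network, interface) pairs whose interface is not in 'present'."""
    return [(net, iface)
            for net, iface in interfaces_in_options(line)
            if iface not in present]


def system_interfaces(sysfs="/sys/class/net"):
    """List interface names from sysfs (Linux); empty set if the path is absent."""
    root = Path(sysfs)
    return {entry.name for entry in root.iterdir()} if root.is_dir() else set()
```

Calling `missing_interfaces(line, system_interfaces())` with the `options lnet` line from the config file would report `("tcp0", "eth0")` on a host where `eth0` does not exist, which a wrapper script could turn into an error message before running modprobe.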
| Comments |
| Comment by Andreas Dilger [ 07/Nov/14 ] |
|
Bob, does this problem exist in Lustre 2.6 or earlier? Trying to figure out if this is caused by DLC or is an old bug that we've never noticed. |
| Comment by Bob Glossman (Inactive) [ 07/Nov/14 ] |
|
Andreas, I just don't know the answer. I suspect that this is a long-standing issue, not new. I will check back on old versions to try to find out. |
| Comment by Bob Glossman (Inactive) [ 07/Nov/14 ] |
|
Don't see the problem in b2_6. If I do a client mount with a bad config as shown above I get an error reported, no panic:

# mount -t lustre -o flock,user_xattr centos2:/lustre /mnt/lustre
mount.lustre: mount centos2:/lustre at /mnt/lustre failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems

/var/log/messages says:

Nov 7 12:19:03 centos7-2 kernel: LNetError: 103301:0:(linux-tcpip.c:127:libcfs_ipif_query()) Can't get flags for interface eth0
Nov 7 12:19:03 centos7-2 kernel: LNetError: 103301:0:(socklnd.c:2826:ksocknal_startup()) Can't get interface eth0 info: -19
Nov 7 12:19:04 centos7-2 kernel: LNetError: 105-4: Error -100 starting up LNI tcp
Nov 7 12:19:04 centos7-2 kernel: LustreError: 103301:0:(events.c:809:ptlrpc_init_portals()) network initialisation failed
Nov 7 12:19:04 centos7-2 kernel: LustreError: 165-2: Nothing registered for client mount! Is the 'lustre' module loaded?
Nov 7 12:19:04 centos7-2 kernel: LustreError: 103272:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount (-19)

This suggests the problem went into lnet code recently. |
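For readers matching the numeric codes in these logs to the mount error, the negative values are kernel-style errno returns. On Linux, -19 is ENODEV (the "No such device" that mount.lustre prints) and -100 is ENETDOWN. A small sketch for decoding them follows; `describe_errno` is a hypothetical helper name, and the symbols it returns depend on the platform's errno table.

```python
import errno
import os


def describe_errno(code):
    """Map a kernel-style negative return code to (symbol, message)."""
    number = abs(code)
    symbol = errno.errorcode.get(number, "UNKNOWN")
    return symbol, os.strerror(number)


# Codes taken from the LNetError and LustreError lines above.
for code in (-19, -100):
    symbol, message = describe_errno(code)
    print(f"{code}: {symbol} ({message})")
```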
| Comment by Liang Zhen (Inactive) [ 08/Nov/14 ] |
|
I believe this is just another instance of |
| Comment by Andreas Dilger [ 10/Nov/14 ] |
|
Close as duplicate of |