[LU-5568] kernel crash when network initialization failed Created: 01/Sep/14  Updated: 17/Dec/14  Resolved: 17/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Wang Shilong (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: MB, patch

Issue Links:
Duplicate
duplicates LU-5664 assertion in failure handling of LNet... Closed
is duplicated by LU-5884 bad lnet conf causes LBUG Resolved
Related
is related to LU-2456 Dynamic LNet Config Main Development ... Resolved
is related to LU-5734 LNet dynamic control: lnet_dyn_add_ni... Resolved
Epic/Theme: lnet
Severity: 3
Rank (Obsolete): 15529

 Description   

[ 105.976884] Lustre: Lustre: Build Version: v2_6_51_0-g16c7568-CHANGED-3.10.1.el7_lustre
[ 105.990490] LNetError: 2145:0:(socklnd.c:2660:ksocknal_enumerate_interfaces()) Can't find any usable interfaces
[ 106.990120] LNetError: 105-4: Error -100 starting up LNI tcp
[ 106.992703] LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) ASSERTION( list_empty(&the_lnet.ln_nis) ) failed:
[ 106.994560] LNetError: 2145:0:(api-ni.c:823:lnet_unprepare()) LBUG
[ 106.994561] Pid: 2145, comm: modprobe
[ 106.994561] \x0aCall Trace:
[ 106.994574] [<ffffffffa044f853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[ 106.994578] [<ffffffffa044fdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
[ 106.994585] [<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
[ 106.994587] [<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
[ 106.994592] [<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
[ 106.994631] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
[ 106.994658] [<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
[ 106.994679] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
[ 106.994703] [<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
[ 106.994722] [<ffffffffa0983000>] ? init_module+0x0/0x1000 [ptlrpc]
[ 106.994739] [<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
[ 106.994742] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
[ 106.994744] [<ffffffff810ca7fb>] load_module+0x129b/0x1a90
[ 106.994745] [<ffffffff812da590>] ? ddebug_dyndbg_module_param_cb+0x0/0x60
[ 106.994747] [<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
[ 106.994748] [<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
[ 106.994750] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[ 106.994750]
[ 106.995032] Kernel panic - not syncing: LBUG
[ 106.995034] CPU: 3 PID: 2145 Comm: modprobe Tainted: GF O-------------- 3.10.1.el7_lustre #1
[ 106.995034] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 106.995036] ffffffffa0474d0d 00000000d711e588 ffff880036601bf0 ffffffff815e19ba
[ 106.995037] ffff880036601c70 ffffffff815db549 ffffffff00000008 ffff880036601c80
[ 106.995037] ffff880036601c20 00000000d711e588 ffffffffa051574f 0000000000000000
[ 106.995038] Call Trace:
[ 106.995052] [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[ 106.995055] [<ffffffff815db549>] panic+0xd8/0x1e7
[ 106.995062] [<ffffffffa044fe5b>] lbug_with_loc+0xab/0xc0 [libcfs]
[ 106.995067] [<ffffffffa04f3267>] lnet_unprepare+0x297/0x340 [lnet]
[ 106.995070] [<ffffffffa04f3b5c>] LNetNIInit+0x25c/0x3e0 [lnet]
[ 106.995072] [<ffffffff81061bc6>] ? put_online_cpus+0x56/0x80
[ 106.995088] [<ffffffffa0983000>] ? 0xffffffffa0982fff
[ 106.995113] [<ffffffffa081310c>] ptlrpc_ni_init+0x2c/0x1a0 [ptlrpc]
[ 106.995117] [<ffffffffa0983000>] ? 0xffffffffa0982fff
[ 106.995138] [<ffffffffa0813291>] ptlrpc_init_portals+0x11/0xf0 [ptlrpc]
[ 106.995141] [<ffffffffa0983000>] ? 0xffffffffa0982fff
[ 106.995179] [<ffffffffa09831c4>] init_module+0x1c4/0x1000 [ptlrpc]
[ 106.995181] [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
[ 106.995182] [<ffffffff810ca7fb>] load_module+0x129b/0x1a90
[ 106.995185] [<ffffffff812da590>] ? ddebug_proc_write+0xf0/0xf0
[ 106.995186] [<ffffffff810c7133>] ? copy_module_from_fd.isra.43+0x53/0x150
[ 106.995187] [<ffffffff810cb1a6>] SyS_finit_module+0xa6/0xd0
[ 106.995189] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b



 Comments   
Comment by Wang Shilong (Inactive) [ 01/Sep/14 ]

http://review.whamcloud.com/#/c/11718/

Comment by Peter Jones [ 01/Sep/14 ]

Amir

Could you please review this patch?

Thanks

Peter

Comment by Isaac Huang (Inactive) [ 03/Sep/14 ]

The bug seems to have been introduced by commit 92c51841c50cc4061c20b277d3f7c4468f2a80cc. While the proposed patch fixes the symptom, it leaves the underlying API inconsistency unfixed: the LNet internal initialization APIs are somewhat transaction-like, in that if an init function fails it cleans up after itself, so the caller doesn't have to call the corresponding fini function. For example, if lnet_prepare() fails there's no need to call lnet_unprepare().

But with 92c51841c50cc4061c20b277d3f7c4468f2a80cc, lnet_shutdown_lndnis() was removed from lnet_startup_lndnis(), so now the callers of lnet_startup_lndnis() are responsible for cleaning up if lnet_startup_lndnis() fails, which breaks the convention that init() functions clean up after themselves on failure. This kind of inconsistency will cause us trouble in the future. I'd suggest:
1. Move lnet_shutdown_lndnis() back into lnet_startup_lndnis() so that lnet_startup_lndnis() cleans up after itself.
2. Move the code in lnet_startup_lndnis() that starts a single NI into a new function, e.g. startup_a_single_ni().
3. Make lnet_dyn_add_ni() call startup_a_single_ni() instead of lnet_startup_lndnis().
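
To illustrate the convention described above, here is a minimal, self-contained sketch of a transaction-like startup function. It is not the LNet code: every `sketch_*` name, the struct layout, and the `fail_index` parameter are hypothetical stand-ins invented for this example. Only the idea (the startup function undoes its own partial work, so a later "list must be empty" assertion holds) comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network interface; not the real lnet_ni_t. */
struct sketch_ni {
	int               up;    /* 1 once the interface has been started */
	struct sketch_ni *next;  /* link in the global list of started NIs */
};

/* Plays the role of the_lnet.ln_nis in this sketch. */
static struct sketch_ni *sketch_nis;

/* Start a single NI. 'fail' lets the caller simulate a startup error. */
static int sketch_startup_single_ni(struct sketch_ni *ni, int fail)
{
	if (fail)
		return -100;     /* same error value as in the log above */
	ni->up = 1;
	ni->next = sketch_nis;
	sketch_nis = ni;
	return 0;
}

/* Shut down every NI that is currently on the global list. */
static void sketch_shutdown_nis(void)
{
	while (sketch_nis != NULL) {
		struct sketch_ni *ni = sketch_nis;

		sketch_nis = ni->next;
		ni->next = NULL;
		ni->up = 0;
	}
}

/*
 * Transaction-like startup: either all NIs are started, or none are.
 * 'fail_index' (hypothetical) selects which NI fails; -1 means none.
 */
static int sketch_startup_nis(struct sketch_ni *nis, int count, int fail_index)
{
	int i, rc;

	for (i = 0; i < count; i++) {
		rc = sketch_startup_single_ni(&nis[i], i == fail_index);
		if (rc != 0) {
			sketch_shutdown_nis();  /* undo our own partial work */
			return rc;
		}
	}
	return 0;
}

/* Mirrors the spirit of the list_empty() assertion in lnet_unprepare(). */
static void sketch_unprepare(void)
{
	assert(sketch_nis == NULL);
}
```

With this shape, a caller that sees a nonzero return from `sketch_startup_nis()` can go straight to `sketch_unprepare()` without tripping the assertion, and a caller that adds one interface dynamically can use `sketch_startup_single_ni()` directly.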

Comment by Oleg Drokin [ 07/Oct/14 ]

Wang, do you plan to address Isaac's points in your patch?

Comment by Wang Shilong (Inactive) [ 08/Oct/14 ]

Hi Oleg Drokin,

Yeah, I did address Isaac's comment!

Comment by Wang Shilong (Inactive) [ 09/Oct/14 ]

Hi Isaac Huang,

Thanks very much for your comments. Could you take a look and respond to my reply
to your last comment:

"When we goto failed here, the ni is no longer on the nilist, then where is the ni going to be freed?"

My reply: lnet_shutdown_lndnis() could do that?

btw, I am not sure why, but when I reply to comments at redmine, it does not send an email or anything...

Comment by Isaac Huang (Inactive) [ 13/Oct/14 ]

lnet_shutdown_lndnis() will not free the ni if the failure happened early, i.e. the ni hasn't been added to any global lists. For example, if it failed in the first few conditional checks in lnet_startup_lndni(), lnet_shutdown_lndnis() will not be called at all.
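
The early-failure case can be sketched as follows. This is an illustration only, under assumed names: `early_ni`, `early_list`, `early_live`, and the `valid` flag are all hypothetical and do not correspond to real LNet identifiers. The point it shows is that a shutdown routine which walks a global list cannot free an object that failed before it was ever linked in, so that path has to release the object some other way (here, directly in the startup function, which is one possible choice and not necessarily what the final patch does).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical NI object; 'valid' stands in for the early sanity checks. */
struct early_ni {
	int              valid;
	struct early_ni *next;
};

static struct early_ni *early_list;  /* NIs that were fully started */
static int early_live;               /* allocated but not yet freed */

static struct early_ni *early_ni_alloc(int valid)
{
	struct early_ni *ni = calloc(1, sizeof(*ni));

	if (ni != NULL) {
		ni->valid = valid;
		early_live++;
	}
	return ni;
}

static void early_ni_free(struct early_ni *ni)
{
	assert(early_live > 0);
	early_live--;
	free(ni);
}

/* Frees only what is on the global list; it cannot see anything else. */
static void early_shutdown_all(void)
{
	while (early_list != NULL) {
		struct early_ni *ni = early_list;

		early_list = ni->next;
		early_ni_free(ni);
	}
}

/* Takes ownership of 'ni' on both success and failure. */
static int early_startup_one(struct early_ni *ni)
{
	if (!ni->valid) {
		/*
		 * Early failure: 'ni' was never added to early_list, so
		 * early_shutdown_all() would not find it. Free it here.
		 */
		early_ni_free(ni);
		return -22;
	}
	ni->next = early_list;
	early_list = ni;
	return 0;
}
```

Dropping the `early_ni_free()` call in the failure branch would leave `early_live` nonzero after `early_shutdown_all()`, which is the leak being described.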

Comment by Isaac Huang (Inactive) [ 13/Oct/14 ]

I think the fix should be considered together with LU-5734 - the two are closely related.

Comment by Wang Shilong (Inactive) [ 20/Oct/14 ]

Hi Isaac Huang,

I am sorry to bother you. Could you please help review the new version
and give me some feedback? Thank you very much!

Best regards,
Wang Shilong

Comment by Peter Jones [ 30/Oct/14 ]

Landed for 2.7

Comment by Peter Jones [ 30/Oct/14 ]

Patch reverted. Amir, could you please look into this issue? Thanks

Comment by Gerrit Updater [ 23/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12512/
Subject: LU-5568 lnet: fix kernel crash when network failed to start
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 66e9055b23433bd0aa8da5e49f3b665fb1b95532
