[LU-7362] During our larger scale testing DVS was accidentally started on a router, which caused LNet to crash the kernel Created: 30/Oct/15  Updated: 13/Oct/16  Resolved: 18/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: James A Simmons Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Cray Routers running latest Lustre pre-2.8


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing the latest Lustre pre-2.8 release on one of our Cray systems, DVS was enabled by mistake on a router, and LNet then crashed with the following backtrace:

2015-10-27T11:39:24.142188-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker
2015-10-27T11:39:24.142197-04:00 c0-1c1s1n1 Call Trace:
2015-10-27T11:39:24.142213-04:00 c0-1c1s1n1 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
2015-10-27T11:39:24.142225-04:00 c0-1c1s1n1 [<ffffffff81004eb9>] dump_trace+0x89/0x430
2015-10-27T11:39:24.142240-04:00 c0-1c1s1n1 [<ffffffffa025b897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2015-10-27T11:39:24.142255-04:00 c0-1c1s1n1 [<ffffffffa025bde7>] lbug_with_loc+0x47/0xc0 [libcfs]
2015-10-27T11:39:24.181560-04:00 c0-1c1s1n1 [<ffffffffa02f29b6>] lnet_router_checker+0x566/0x5a0 [lnet]
2015-10-27T11:39:24.181581-04:00 c0-1c1s1n1 [<ffffffff81067ace>] kthread+0x9e/0xb0
2015-10-27T11:39:24.181609-04:00 c0-1c1s1n1 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
2015-10-27T11:39:24.181616-04:00 c0-1c1s1n1 Kernel panic - not syncing: LBUG
2015-10-27T11:39:24.181627-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
2015-10-27T11:39:24.211395-04:00 c0-1c1s1n1 Call Trace:
2015-10-27T11:39:24.211415-04:00 c0-1c1s1n1 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
2015-10-27T11:39:24.211422-04:00 c0-1c1s1n1 [<ffffffff81004eb9>] dump_trace+0x89/0x430
2015-10-27T11:39:24.211476-04:00 c0-1c1s1n1 [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
2015-10-27T11:39:24.211488-04:00 c0-1c1s1n1 [<ffffffff810060f5>] show_trace+0x15/0x20
2015-10-27T11:39:24.211515-04:00 c0-1c1s1n1 [<ffffffff8148b31c>] dump_stack+0x79/0x84
2015-10-27T11:39:24.211531-04:00 c0-1c1s1n1 [<ffffffff8148b3bb>] panic+0x94/0x1da
2015-10-27T11:39:24.211560-04:00 c0-1c1s1n1 [<ffffffffa025be4b>] lbug_with_loc+0xab/0xc0 [libcfs]
2015-10-27T11:39:24.211579-04:00 c0-1c1s1n1 [<ffffffffa02f29b6>] lnet_router_checker+0x566/0x5a0 [lnet]
2015-10-27T11:39:24.211586-04:00 c0-1c1s1n1 [<ffffffff81067ace>] kthread+0x9e/0xb0
2015-10-27T11:39:24.241857-04:00 c0-1c1s1n1 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10

While DVS is an external utility on top of LNet, it shouldn't be able to crash an LNet router.



 Comments   
Comment by Doug Oucharek (Inactive) [ 30/Oct/15 ]

James: This appears to be from an LASSERT. Can you tell us which of the two possible asserts in the router checker routine this is coming from:

LASSERT (the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING);
or
LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_STOPPING);

It seems likely to be the first case (the check for RUNNING). Initial thought: during startup we launch the router checker thread. Before it can run, DVS stops LNet (for some reason). That changes the state to STOPPING before the router checker thread starts and hits that first assert.

Personally, I hate asserts for checks like this. I'd like to address this ticket by removing that first assert and letting the router checker loop terminate immediately because the state has changed (i.e., neither of these asserts is really protecting us from anything, and neither is a valid reason for crashing a system).

As DVS is a Cray tool, could someone at Cray comment on whether DVS would be stopping LNet immediately upon startup like this?
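
To illustrate the proposal above, here is a minimal user-space sketch of a checker loop that has no entry assert and simply falls through when the state is no longer RUNNING. It is not the kernel code from router.c: struct rc_model, rc_model_run(), and the stop_after field are hypothetical names invented for this illustration. Only the three state values are taken from lib-types.h.

```c
#include <assert.h>

/* State values as defined in lnet/include/lnet/lib-types.h. */
#define LNET_RC_STATE_SHUTDOWN 0 /* not started */
#define LNET_RC_STATE_RUNNING  1 /* started up OK */
#define LNET_RC_STATE_STOPPING 2 /* telling thread to stop */

/* Hypothetical stand-in for the_lnet.ln_rc_state plus test bookkeeping. */
struct rc_model {
	int state;      /* current router checker state */
	int passes;     /* loop iterations performed so far */
	int stop_after; /* model only: request a stop after this many passes */
};

/*
 * Hypothetical model of the checker thread body with the entry assert
 * removed. If the state was changed to STOPPING before the thread ever
 * ran, the loop body is skipped and the thread shuts down cleanly.
 * Returns the number of passes made.
 */
static int rc_model_run(struct rc_model *rc)
{
	while (rc->state == LNET_RC_STATE_RUNNING) {
		rc->passes++;
		/* Model an external stop request arriving mid-run. */
		if (rc->passes >= rc->stop_after)
			rc->state = LNET_RC_STATE_STOPPING;
	}

	/* No assert here either: just finish the shutdown. */
	rc->state = LNET_RC_STATE_SHUTDOWN;
	return rc->passes;
}
```

In this model, a thread that first runs after the state has already been set to STOPPING performs zero passes and ends in SHUTDOWN instead of triggering an LBUG.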

Comment by Matt Ezell [ 30/Oct/15 ]

crash> sym the_lnet
ffffffffa030cb60 (B) the_lnet [lnet]
crash> lnet_t ffffffffa030cb60 | grep ln_rc_state
  ln_rc_state = 2,

lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_SHUTDOWN		0	/* not started */
lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_RUNNING		1	/* started up OK */
lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_STOPPING		2	/* telling thread to stop */

2015-10-27T11:39:22.941791-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni110 [16/8192/0/0]
2015-10-27T11:39:22.941799-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni111 [16/8192/0/0]
2015-10-27T11:39:22.941805-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni112 [16/8192/0/0]
2015-10-27T11:39:22.941812-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib106 [63/2560/0/180]
2015-10-27T11:39:22.941818-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib225 [63/2560/0/180]
2015-10-27T11:39:24.142128-04:00 c0-1c1s1n1 DVS: dvs_lnet_init: No network ID found on configured lnd (gni100)
2015-10-27T11:39:24.142174-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) ASSERTION( the_lnet.ln_rc_state == 1 ) failed: 
2015-10-27T11:39:24.142181-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) LBUG
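
For reference, the crash output above can be read against the quoted defines with a small decoder. This is only a sketch: rc_state_name() and rc_entry_assert_holds() are hypothetical helpers written for this ticket, not functions in LNet. The state values are the ones quoted from lib-types.h, and the checked condition is the one printed in the ASSERTION message.

```c
#include <assert.h>
#include <string.h> /* strcmp(), for callers that compare state names */

/* State values as quoted from lnet/include/lnet/lib-types.h above. */
#define LNET_RC_STATE_SHUTDOWN 0 /* not started */
#define LNET_RC_STATE_RUNNING  1 /* started up OK */
#define LNET_RC_STATE_STOPPING 2 /* telling thread to stop */

/* Hypothetical helper: map an ln_rc_state value to a readable name. */
static const char *rc_state_name(int state)
{
	switch (state) {
	case LNET_RC_STATE_SHUTDOWN:
		return "SHUTDOWN";
	case LNET_RC_STATE_RUNNING:
		return "RUNNING";
	case LNET_RC_STATE_STOPPING:
		return "STOPPING";
	default:
		return "UNKNOWN";
	}
}

/*
 * Hypothetical helper: the condition reported in the log,
 * ASSERTION( the_lnet.ln_rc_state == 1 ). Returns 1 if it would hold.
 */
static int rc_entry_assert_holds(int state)
{
	return state == LNET_RC_STATE_RUNNING;
}
```

With the dumped value of 2, rc_state_name() gives STOPPING and rc_entry_assert_holds() gives 0, which matches the failed check for state == 1 in the log.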
Comment by Chris Horn [ 30/Oct/15 ]

DVS will call LNetNIFini() (stopping LNet) when it can't find the "network ID...on configured lnd"

Comment by Doug Oucharek (Inactive) [ 30/Oct/15 ]

Thanks Matt. That seems to confirm my suspicion. I'm assuming that dvs_lnet_init() stops LNet because it cannot find the network ID it is expecting.

So, I'm proposing to remove the offending assert. It does not protect us from anything, and it has itself become a problem.

Comment by Gerrit Updater [ 30/Oct/15 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17003
Subject: LU-7362 lnet: Remove LASSERTS from router checker
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6d6f7ab62877c5abc445734f7eaef4a81863796

Comment by Gerrit Updater [ 13/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17003/
Subject: LU-7362 lnet: Remove LASSERTS from router checker
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df6cf859bbb29392064e6ddb701f3357e01b3a13

Comment by Joseph Gmitter (Inactive) [ 18/Nov/15 ]

Landed for 2.8
