[LU-7362] During our larger-scale testing, DVS was accidentally started on a router, which could cause LNet to crash the kernel Created: 30/Oct/15 Updated: 13/Oct/16 Resolved: 18/Nov/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James A Simmons | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Cray Routers running latest Lustre pre-2.8 |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
While testing the latest Lustre pre-2.8 release on one of our Cray systems, DVS was enabled by mistake on a router, and LNet then crashed with the following backtrace:
2015-10-27T11:39:24.142188-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker
While DVS is an external utility on top of LNet, it shouldn't be able to crash an LNet router. |
| Comments |
| Comment by Doug Oucharek (Inactive) [ 30/Oct/15 ] |
|
James: This appears to be from an LASSERT. Can you tell us which of the two possible asserts in the router checker routine this is coming from:
LASSERT (the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING);
It seems likely to be the first case (check for RUNNING).
Initial thought: we are starting up and launch the router checker thread. Before it can run, DVS stops LNet (for some reason). That changes the state to STOPPING before the router checker thread starts and hits that first assert.
Personally, I hate asserts for checks like this. I'd like to address this ticket by removing that first assert and letting the router checker loop terminate immediately because the state has changed (i.e. neither of these asserts is really protecting us from anything, and neither is a valid reason for crashing a system).
As DVS is a Cray tool, could someone at Cray comment on whether DVS would be stopping LNet immediately upon startup like this? |
| Comment by Matt Ezell [ 30/Oct/15 ] |
crash> sym the_lnet
ffffffffa030cb60 (B) the_lnet [lnet]
crash> lnet_t ffffffffa030cb60 | grep ln_rc_state
ln_rc_state = 2,

lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_SHUTDOWN 0 /* not started */
lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_RUNNING 1 /* started up OK */
lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_STOPPING 2 /* telling thread to stop */

2015-10-27T11:39:22.941791-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni110 [16/8192/0/0]
2015-10-27T11:39:22.941799-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni111 [16/8192/0/0]
2015-10-27T11:39:22.941805-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni112 [16/8192/0/0]
2015-10-27T11:39:22.941812-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib106 [63/2560/0/180]
2015-10-27T11:39:22.941818-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib225 [63/2560/0/180]
2015-10-27T11:39:24.142128-04:00 c0-1c1s1n1 DVS: dvs_lnet_init: No network ID found on configured lnd (gni100)
2015-10-27T11:39:24.142174-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) ASSERTION( the_lnet.ln_rc_state == 1 ) failed:
2015-10-27T11:39:24.142181-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) LBUG |
| Comment by Chris Horn [ 30/Oct/15 ] |
|
DVS will call LNetNIFini() (stopping LNet) when it can't find the "network ID...on configured lnd" |
| Comment by Doug Oucharek (Inactive) [ 30/Oct/15 ] |
|
Thanks Matt. That seems to confirm my suspicion. I'm assuming that dvs_lnet_init() stops LNet because it cannot find the network ID it is expecting. So, I'm proposing to remove the offending assert. Rather than protecting us from anything (which it does not), it has become a problem. |
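For readers following along, the idea of dropping the entry assert can be sketched in user-space C as below. This is an illustration only, not the patch that was uploaded for this ticket. The names `fake_lnet`, `checker_thread` and `stop_after_rounds` are hypothetical stand-ins. Only the three state values are taken from the `lib-types.h` lines quoted earlier in the ticket. Resetting the state to SHUTDOWN on exit is likewise an assumption of the sketch.

```c
/* State values as quoted from lnet/include/lnet/lib-types.h in this ticket. */
enum rc_state {
    RC_STATE_SHUTDOWN = 0, /* not started */
    RC_STATE_RUNNING  = 1, /* started up OK */
    RC_STATE_STOPPING = 2  /* telling thread to stop */
};

/* Hypothetical stand-in for the_lnet; only the fields this sketch needs. */
struct fake_lnet {
    enum rc_state rc_state;
    int pings_sent;
};

/*
 * Hypothetical checker body. There is no assertion on entry: if a stop was
 * requested before the thread first ran, the loop condition is already
 * false, so the body never executes and the thread falls through to cleanup.
 * stop_after_rounds simulates another thread requesting a stop, so that the
 * sketch terminates when the state starts out as RUNNING.
 * Returns the number of loop rounds executed.
 */
static int checker_thread(struct fake_lnet *ln, int stop_after_rounds)
{
    int rounds = 0;

    while (ln->rc_state == RC_STATE_RUNNING) {
        ln->pings_sent++;                     /* placeholder for real work */
        rounds++;
        if (rounds >= stop_after_rounds)
            ln->rc_state = RC_STATE_STOPPING; /* simulated stop request */
    }

    /* Assumption: the thread marks itself as shut down on the way out. */
    ln->rc_state = RC_STATE_SHUTDOWN;
    return rounds;
}
```

Calling `checker_thread()` on a structure whose state is already `RC_STATE_STOPPING` performs zero rounds and returns cleanly, which is the behaviour proposed here in place of an LBUG.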
| Comment by Gerrit Updater [ 30/Oct/15 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17003 |
| Comment by Gerrit Updater [ 13/Nov/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17003/ |
| Comment by Joseph Gmitter (Inactive) [ 18/Nov/15 ] |
|
Landed for 2.8 |