
During our larger-scale testing, DVS was accidentally started on a router, which caused LNet to crash the kernel

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Cray Routers running latest Lustre pre-2.8
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      While testing the latest Lustre pre-2.8 release on one of our Cray systems, DVS was enabled by mistake on a router, and LNet then crashed with the following backtrace:

      2015-10-27T11:39:24.142188-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker
      2015-10-27T11:39:24.142197-04:00 c0-1c1s1n1 Call Trace:
      2015-10-27T11:39:24.142213-04:00 c0-1c1s1n1 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      2015-10-27T11:39:24.142225-04:00 c0-1c1s1n1 [<ffffffff81004eb9>] dump_trace+0x89/0x430
      2015-10-27T11:39:24.142240-04:00 c0-1c1s1n1 [<ffffffffa025b897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2015-10-27T11:39:24.142255-04:00 c0-1c1s1n1 [<ffffffffa025bde7>] lbug_with_loc+0x47/0xc0 [libcfs]
      2015-10-27T11:39:24.181560-04:00 c0-1c1s1n1 [<ffffffffa02f29b6>] lnet_router_checker+0x566/0x5a0 [lnet]
      2015-10-27T11:39:24.181581-04:00 c0-1c1s1n1 [<ffffffff81067ace>] kthread+0x9e/0xb0
      2015-10-27T11:39:24.181609-04:00 c0-1c1s1n1 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10
      2015-10-27T11:39:24.181616-04:00 c0-1c1s1n1 Kernel panic - not syncing: LBUG
      2015-10-27T11:39:24.181627-04:00 c0-1c1s1n1 Pid: 16118, comm: router_checker Tainted: P 3.0.101-0.46.1_1.0502.8871-cray_gem_s #1
      2015-10-27T11:39:24.211395-04:00 c0-1c1s1n1 Call Trace:
      2015-10-27T11:39:24.211415-04:00 c0-1c1s1n1 [<ffffffff81006651>] try_stack_unwind+0x161/0x1a0
      2015-10-27T11:39:24.211422-04:00 c0-1c1s1n1 [<ffffffff81004eb9>] dump_trace+0x89/0x430
      2015-10-27T11:39:24.211476-04:00 c0-1c1s1n1 [<ffffffff810060bc>] show_trace_log_lvl+0x5c/0x80
      2015-10-27T11:39:24.211488-04:00 c0-1c1s1n1 [<ffffffff810060f5>] show_trace+0x15/0x20
      2015-10-27T11:39:24.211515-04:00 c0-1c1s1n1 [<ffffffff8148b31c>] dump_stack+0x79/0x84
      2015-10-27T11:39:24.211531-04:00 c0-1c1s1n1 [<ffffffff8148b3bb>] panic+0x94/0x1da
      2015-10-27T11:39:24.211560-04:00 c0-1c1s1n1 [<ffffffffa025be4b>] lbug_with_loc+0xab/0xc0 [libcfs]
      2015-10-27T11:39:24.211579-04:00 c0-1c1s1n1 [<ffffffffa02f29b6>] lnet_router_checker+0x566/0x5a0 [lnet]
      2015-10-27T11:39:24.211586-04:00 c0-1c1s1n1 [<ffffffff81067ace>] kthread+0x9e/0xb0
      2015-10-27T11:39:24.241857-04:00 c0-1c1s1n1 [<ffffffff81490074>] kernel_thread_helper+0x4/0x10

      While DVS is an external utility on top of LNet, it shouldn't be able to crash an LNet router.

        Activity

          [LU-7362] During our larger-scale testing, DVS was accidentally started on a router, which caused LNet to crash the kernel

          jgmitter Joseph Gmitter (Inactive) added a comment -

          Landed for 2.8

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17003/
          Subject: LU-7362 lnet: Remove LASSERTS from router checker
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: df6cf859bbb29392064e6ddb701f3357e01b3a13

          gerrit Gerrit Updater added a comment -

          Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17003
          Subject: LU-7362 lnet: Remove LASSERTS from router checker
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: e6d6f7ab62877c5abc445734f7eaef4a81863796

          doug Doug Oucharek (Inactive) added a comment -

          Thanks Matt. That seems to confirm my suspicion. I'm assuming that dvs_lnet_init() stops LNet because it cannot find the network ID it is expecting.

          So, I'm proposing to remove the offending assert. Rather than protecting us from anything (which it does not), it has become a problem.
          hornc Chris Horn added a comment -

          DVS will call LNetNIFini() (stopping LNet) when it can't find the "network ID...on configured lnd"

          ezell Matt Ezell added a comment -
          crash> sym the_lnet
          ffffffffa030cb60 (B) the_lnet [lnet]
          crash> lnet_t ffffffffa030cb60 | grep ln_rc_state
            ln_rc_state = 2,
          
          lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_SHUTDOWN		0	/* not started */
          lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_RUNNING		1	/* started up OK */
          lnet/include/lnet/lib-types.h:#define LNET_RC_STATE_STOPPING		2	/* telling thread to stop */
          
          2015-10-27T11:39:22.941791-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni110 [16/8192/0/0]
          2015-10-27T11:39:22.941799-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni111 [16/8192/0/0]
          2015-10-27T11:39:22.941805-04:00 c0-1c1s1n1 LNet: Added LNI 701@gni112 [16/8192/0/0]
          2015-10-27T11:39:22.941812-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib106 [63/2560/0/180]
          2015-10-27T11:39:22.941818-04:00 c0-1c1s1n1 LNet: Added LNI 10.36.230.100@o2ib225 [63/2560/0/180]
          2015-10-27T11:39:24.142128-04:00 c0-1c1s1n1 DVS: dvs_lnet_init: No network ID found on configured lnd (gni100)
          2015-10-27T11:39:24.142174-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) ASSERTION( the_lnet.ln_rc_state == 1 ) failed: 
          2015-10-27T11:39:24.142181-04:00 c0-1c1s1n1 LNetError: 16118:0:(router.c:1241:lnet_router_checker()) LBUG
          
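To make the crash output in the comment above easier to read, here is a minimal sketch that decodes a raw `ln_rc_state` value using the three constants quoted from `lib-types.h`. The helpers `rc_state_name()` and `rc_running_check_holds()` are hypothetical names introduced for illustration; they are not part of LNet.

```c
#include <stdio.h>

/* Values copied from the lib-types.h lines quoted in the comment above. */
#define LNET_RC_STATE_SHUTDOWN  0 /* not started */
#define LNET_RC_STATE_RUNNING   1 /* started up OK */
#define LNET_RC_STATE_STOPPING  2 /* telling thread to stop */

/* Hypothetical helper (not an LNet function): name a raw state value. */
static const char *rc_state_name(int state)
{
	switch (state) {
	case LNET_RC_STATE_SHUTDOWN: return "SHUTDOWN";
	case LNET_RC_STATE_RUNNING:  return "RUNNING";
	case LNET_RC_STATE_STOPPING: return "STOPPING";
	default:                     return "UNKNOWN";
	}
}

/* Hypothetical helper: does the condition "ln_rc_state == 1" hold? */
static int rc_running_check_holds(int state)
{
	return state == LNET_RC_STATE_RUNNING;
}
```

With the value reported by `crash` (`ln_rc_state = 2`), `rc_state_name(2)` yields `"STOPPING"` and `rc_running_check_holds(2)` is false, which matches the failed `ASSERTION( the_lnet.ln_rc_state == 1 )` in the log.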

          doug Doug Oucharek (Inactive) added a comment -

          James: This appears to be from an LASSERT. Can you tell us which of the two possible asserts in the router checker routine this is coming from:

          LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING);
          or
          LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_STOPPING);

          It seems likely to be the first case (check for RUNNING). Initial thought: we are starting up and launch the router checker thread. Before it can run, DVS stops LNet (for some reason). That changes the state to STOPPING before the router checker thread starts and hits that first assert.

          Personally, I hate asserts for checks like this. I'd like to address this ticket by removing that first assert and letting the router checker loop terminate immediately because the state has changed (i.e. neither of these asserts is really protecting us from anything, and neither is a valid reason for crashing a system).

          As DVS is a Cray tool, could someone at Cray comment on whether DVS would be stopping LNet immediately upon startup like this?
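The approach proposed in the comment above (no assertion on entry, let the loop condition decide) could be sketched roughly as follows. This is a user-space illustration under stated assumptions, not the actual `lnet_router_checker()` code or the merged patch: `router_checker_sketch()`, its `state` pointer, and the `max_passes` bound are hypothetical, and the final transition to SHUTDOWN is an assumption about what the thread does on exit.

```c
#include <stdio.h>

#define LNET_RC_STATE_SHUTDOWN  0 /* not started */
#define LNET_RC_STATE_RUNNING   1 /* started up OK */
#define LNET_RC_STATE_STOPPING  2 /* telling thread to stop */

/*
 * Hypothetical stand-in for the router checker thread body.
 * *state plays the role of the_lnet.ln_rc_state; max_passes only
 * bounds the loop so this sketch terminates.
 * Returns the number of checker passes that were executed.
 */
static int router_checker_sketch(int *state, int max_passes)
{
	int passes = 0;

	/* No assert that *state is RUNNING here: if LNet was stopped
	 * before this thread got to run, the loop is simply skipped. */
	while (*state == LNET_RC_STATE_RUNNING && passes < max_passes)
		passes++; /* a real checker would ping routers and sleep */

	/* Assumption: on exit the checker records that it has stopped. */
	*state = LNET_RC_STATE_SHUTDOWN;
	return passes;
}
```

If the state is already STOPPING when the function is entered, which is the race described above, it performs zero passes and returns cleanly instead of hitting an LBUG.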

          People

            Assignee: doug Doug Oucharek (Inactive)
            Reporter: simmonsja James A Simmons
            Votes: 0
            Watchers: 8
