Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12568

LNetError: 28086:0:(lib-move.c:2862:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed

Details

    • 3
    • 9223372036854775807

    Description

      There is a use after free in the LNet response tracking code.

      If an MD is unlinked with a non-zero refcount the lnet_libhandle is invalidated so that future lookups of the MD are failed.

      /* must be called with lnet_res_lock held */
      void
      lnet_md_unlink(struct lnet_libmd *md)
      {
              if ((md->md_flags & LNET_MD_FLAG_ZOMBIE) == 0) {
                      /* first unlink attempt... */
                      struct lnet_me *me = md->md_me;
      
                      md->md_flags |= LNET_MD_FLAG_ZOMBIE;
      
                      /* Disassociate from ME (if any), and unlink it if it was created
                       * with LNET_UNLINK */
                      if (me != NULL) {
                              /* detach MD from portal */
                              lnet_ptl_detach_md(me, md);
                              if (me->me_unlink == LNET_UNLINK)
                                      lnet_me_unlink(me);
                      }
      
                      /* ensure all future handle lookups fail */
                      lnet_res_lh_invalidate(&md->md_lh);
              }
      
              if (md->md_refcount != 0) {
                      CDEBUG(D_NET, "Queueing unlink of md %p\n", md);
                      return;
              }
      

      If a response tracker is attached to such an MD then it is possible for the lnet_finalize_expired_responses loop to free the rspt before it has been detached from the MD.

      static void
      lnet_finalize_expired_responses(bool force)
      {
      <snip>
                             if (ktime_compare(ktime_get(), rspt->rspt_deadline) >= 0 ||
                                  force) {
                                      struct lnet_peer_ni *lpni;
                                      lnet_nid_t nid;
      
                                      md = lnet_handle2md(&rspt->rspt_mdh);
                                      if (!md) {
                                              LNetInvalidateMDHandle(&rspt->rspt_mdh);
                                              lnet_res_unlock(i);
                                              list_del_init(&rspt->rspt_on_list);
                                              lnet_rspt_free(rspt, i);
                                              continue;
                                      }
      

      When the final operation on the MD completes the MD is detached from the lnet_msg, the response tracker is detached from the MD, and the assert can be tripped:

      lnet_finalize()->lnet_msg_detach_md()->lnet_detach_rsp_tracker()
      
      void
      lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt)
      {
              struct lnet_rsp_tracker *rspt;
      
              /*
               * msg has a refcount on the MD so the MD is not going away.
               * The rspt queue for the cpt is protected by
               * the lnet_net_lock(cpt). cpt is the cpt of the MD cookie.
               */
              if (!md->md_rspt_ptr)
                      return;
      
              rspt = md->md_rspt_ptr;
              md->md_rspt_ptr = NULL;
      
              /* debug code */
              LASSERT(rspt->rspt_cpt == cpt);
      

      Attachments

        Issue Links

          Activity

            [LU-12568] LNetError: 28086:0:(lib-move.c:2862:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36635/
            Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 3df41bb8515d5012d7e2f19b2d7019e3e1b64a71

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36635/ Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 3df41bb8515d5012d7e2f19b2d7019e3e1b64a71

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36635
            Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 0c3f4854dea3e88bd216d04ca95d33cc7dd20d44

            gerrit Gerrit Updater added a comment - Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36635 Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 0c3f4854dea3e88bd216d04ca95d33cc7dd20d44

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35576/
            Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4a4ac34de42c57cf6963c95aee8da634a767b38a

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35576/ Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4a4ac34de42c57cf6963c95aee8da634a767b38a
            hornc Chris Horn added a comment -

            This issue was uncovered while testing the patch for LU-12441, however we haven't been able to identify a manner in which that patch causes this issue.

            hornc Chris Horn added a comment - This issue was uncovered while testing the patch for LU-12441 , however we haven't been able to identify a manner in which that patch causes this issue.

            Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35576
            Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2b069b22e1e22d5460ce266af09df9238dd4f031

            gerrit Gerrit Updater added a comment - Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35576 Subject: LU-12568 lnet: Defer rspt cleanup when MD queued for unlink Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 2b069b22e1e22d5460ce266af09df9238dd4f031
            hornc Chris Horn added a comment - - edited

            These trace messages demonstrate the bug being hit:

            00000400:00000200:10.0:1563533406.667520:000000:10121:0:(lib-move.c:4788:LNetPut()) LNetPut msg ffff881e8f3f9800 -> 12345-60@gni4
            00000400:00000200:10.0:1563533406.667521:000000:10121:0:(lib-move.c:4817:LNetPut()) attach rsp tracker to msg ffff881e8f3f9800 md ffff881ce9276b28
            00000400:00000200:10.0:1563533406.667523:000000:10121:0:(lib-move.c:4670:lnet_attach_rsp_tracker()) attach new rspt ffff881ff06b5c40 to md ffff881ce9276b28 cpt 0
            00000400:00000200:10.0:1563533406.667623:000000:10121:0:(lib-move.c:1891:lnet_handle_send()) msg ffff881e8f3f9800 rspt_next_hop_nid = 10.12.0.4@o2ib40
            00000400:00000200:0.0:1563533408.740457:000000:28061:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff881ce9276b28
            
            00000400:00000200:4.0:1563533420.155422:000000:28096:0:(lib-move.c:2941:lnet_finalize_expired_responses()) Response timeout: md = ffff881d5caa6c38: nid = 10.12.0.4@o2ib40 <<<< lnet_finalize_expired_responses is executing after the MD was queued for unlink
            
            00000400:00000100:14.0:1563533429.266380:000000:28086:0:(lib-move.c:908:lnet_post_send_locked()) Aborting msg ffff881e8f3f9800 for 12345-10.12.0.4@o2ib40: LNetM[DE]Unlink() already called on the MD/ME.
            00000400:00000200:14.0:1563533429.299202:000000:28086:0:(lib-move.c:912:lnet_post_send_locked()) msg ffff881e8f3f9800 to 12345-10.12.0.4@o2ib40 canceled and will not be resent
            00000400:00000200:14.0:1563533429.299204:000000:28086:0:(lib-msg.c:968:lnet_is_health_check()) Msg ffff881e8f3f9800 is in inconsistent state, don't perform health checking (-125, 0)
            00000400:00000200:14.0:1563533429.299205:000000:28086:0:(lib-msg.c:973:lnet_is_health_check()) msg ffff881e8f3f9800 health check = 0, status = -125, hstatus = 0
            00000400:00000200:14.0:1563533429.299209:000000:28086:0:(lib-msg.c:928:lnet_msg_detach_md()) Detach rsp tracker from msg ffff881e8f3f9800 md ffff881ce9276b28
            00000400:00000200:14.0:1563533429.299210:000000:28086:0:(lib-move.c:2860:lnet_detach_rsp_tracker()) Detach rspt ffff881ff06b5c40 from md ffff881ce9276b28 cpt 0
            00000400:00040000:14.0:1563533429.299212:000000:28086:0:(lib-move.c:2862:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed:
            

            Note the response tracker was attached at 1563533406.667523. I'm using the default lnet_transaction_timeout of 10 seconds, so when we see the lnet_finalize_expired_responses loop running at time 1563533420.155422 we can be confident that we're hitting the code I outlined in the description of this ticket.

            hornc Chris Horn added a comment - - edited These trace messages demonstrate the bug being hit: 00000400:00000200:10.0:1563533406.667520:000000:10121:0:(lib-move.c:4788:LNetPut()) LNetPut msg ffff881e8f3f9800 -> 12345-60@gni4 00000400:00000200:10.0:1563533406.667521:000000:10121:0:(lib-move.c:4817:LNetPut()) attach rsp tracker to msg ffff881e8f3f9800 md ffff881ce9276b28 00000400:00000200:10.0:1563533406.667523:000000:10121:0:(lib-move.c:4670:lnet_attach_rsp_tracker()) attach new rspt ffff881ff06b5c40 to md ffff881ce9276b28 cpt 0 00000400:00000200:10.0:1563533406.667623:000000:10121:0:(lib-move.c:1891:lnet_handle_send()) msg ffff881e8f3f9800 rspt_next_hop_nid = 10.12.0.4@o2ib40 00000400:00000200:0.0:1563533408.740457:000000:28061:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff881ce9276b28 00000400:00000200:4.0:1563533420.155422:000000:28096:0:(lib-move.c:2941:lnet_finalize_expired_responses()) Response timeout: md = ffff881d5caa6c38: nid = 10.12.0.4@o2ib40 <<<< lnet_finalize_expired_responses is executing after the MD was queued for unlink 00000400:00000100:14.0:1563533429.266380:000000:28086:0:(lib-move.c:908:lnet_post_send_locked()) Aborting msg ffff881e8f3f9800 for 12345-10.12.0.4@o2ib40: LNetM[DE]Unlink() already called on the MD/ME. 00000400:00000200:14.0:1563533429.299202:000000:28086:0:(lib-move.c:912:lnet_post_send_locked()) msg ffff881e8f3f9800 to 12345-10.12.0.4@o2ib40 canceled and will not be resent 00000400:00000200:14.0:1563533429.299204:000000:28086:0:(lib-msg.c:968:lnet_is_health_check()) Msg ffff881e8f3f9800 is in inconsistent state, don't perform health checking (-125, 0) 00000400:00000200:14.0:1563533429.299205:000000:28086:0:(lib-msg.c:973:lnet_is_health_check()) msg ffff881e8f3f9800 health check = 0, status = -125, hstatus = 0 00000400:00000200:14.0:1563533429.299209:000000:28086:0:(lib-msg.c:928:lnet_msg_detach_md()) Detach rsp tracker from msg ffff881e8f3f9800 md ffff881ce9276b28 00000400:00000200:14.0:1563533429.299210:000000:28086:0:(lib-move.c:2860:lnet_detach_rsp_tracker()) Detach rspt ffff881ff06b5c40 from md ffff881ce9276b28 cpt 0 00000400:00040000:14.0:1563533429.299212:000000:28086:0:(lib-move.c:2862:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed: Note the response tracker was attached at 1563533406.667523. I'm using the default lnet_transaction_timeout of 10 seconds, so when we see the lnet_finalize_expired_responses loop running at time 1563533420.155422 we can be confident that we're hitting the code I outlined in the description of this ticket.

            People

              hornc Chris Horn
              hornc Chris Horn
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: