[LU-12568] LNetError: 28086:0:(lib-move.c:2862:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed Created: 19/Jul/19 Updated: 06/Feb/20 Resolved: 15/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
There is a use after free in the LNet response tracking code. If an MD is unlinked with a non-zero refcount the lnet_libhandle is invalidated so that future lookups of the MD are failed. /* must be called with lnet_res_lock held */
void
lnet_md_unlink(struct lnet_libmd *md)
{
if ((md->md_flags & LNET_MD_FLAG_ZOMBIE) == 0) {
/* first unlink attempt... */
struct lnet_me *me = md->md_me;
md->md_flags |= LNET_MD_FLAG_ZOMBIE;
/* Disassociate from ME (if any), and unlink it if it was created
* with LNET_UNLINK */
if (me != NULL) {
/* detach MD from portal */
lnet_ptl_detach_md(me, md);
if (me->me_unlink == LNET_UNLINK)
lnet_me_unlink(me);
}
/* ensure all future handle lookups fail */
lnet_res_lh_invalidate(&md->md_lh);
}
if (md->md_refcount != 0) {
CDEBUG(D_NET, "Queueing unlink of md %p\n", md);
return;
}
If a response tracker is attached to such an MD then it is possible for the lnet_finalize_expired_responses loop to free the rspt before it has been detached from the MD. static void
lnet_finalize_expired_responses(bool force)
{
<snip>
if (ktime_compare(ktime_get(), rspt->rspt_deadline) >= 0 ||
force) {
struct lnet_peer_ni *lpni;
lnet_nid_t nid;
md = lnet_handle2md(&rspt->rspt_mdh);
if (!md) {
LNetInvalidateMDHandle(&rspt->rspt_mdh);
lnet_res_unlock(i);
list_del_init(&rspt->rspt_on_list);
lnet_rspt_free(rspt, i);
continue;
}
When the final operation on the MD completes the MD is detached from the lnet_msg, the response tracker is detached from the MD, and the assert can be tripped: lnet_finalize()->lnet_msg_detach_md()->lnet_detach_rsp_tracker() void
lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt)
{
struct lnet_rsp_tracker *rspt;
/*
* msg has a refcount on the MD so the MD is not going away.
* The rspt queue for the cpt is protected by
* the lnet_net_lock(cpt). cpt is the cpt of the MD cookie.
*/
if (!md->md_rspt_ptr)
return;
rspt = md->md_rspt_ptr;
md->md_rspt_ptr = NULL;
/* debug code */
LASSERT(rspt->rspt_cpt == cpt);
|
| Comments |
| Comment by Chris Horn [ 19/Jul/19 ] |
|
These trace messages demonstrate the bug being hit: 00000400:00000200:10.0:1563533406.667520:000000:10121:0:(lib-move.c:4788:LNetPut()) LNetPut msg ffff881e8f3f9800 -> 12345-60@gni4 00000400:00000200:10.0:1563533406.667521:000000:10121:0:(lib-move.c:4817:LNetPut()) attach rsp tracker to msg ffff881e8f3f9800 md ffff881ce9276b28 00000400:00000200:10.0:1563533406.667523:000000:10121:0:(lib-move.c:4670:lnet_attach_rsp_tracker()) attach new rspt ffff881ff06b5c40 to md ffff881ce9276b28 cpt 0 00000400:00000200:10.0:1563533406.667623:000000:10121:0:(lib-move.c:1891:lnet_handle_send()) msg ffff881e8f3f9800 rspt_next_hop_nid = 10.12.0.4@o2ib40 00000400:00000200:0.0:1563533408.740457:000000:28061:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff881ce9276b28 00000400:00000200:4.0:1563533420.155422:000000:28096:0:(lib-move.c:2941:lnet_finalize_expired_responses()) Response timeout: md = ffff881d5caa6c38: nid = 10.12.0.4@o2ib40 <<<< lnet_finalize_expired_responses is executing after the MD was queued for unlink 00000400:00000100:14.0:1563533429.266380:000000:28086:0:(lib-move.c:908:lnet_post_send_locked()) Aborting msg ffff881e8f3f9800 for 12345-10.12.0.4@o2ib40: LNetM[DE]Unlink() already called on the MD/ME. 00000400:00000200:14.0:1563533429.299202:000000:28086:0:(lib-move.c:912:lnet_post_send_locked()) msg ffff881e8f3f9800 to 12345-10.12.0.4@o2ib40 canceled and will not be resent 00000400:00000200:14.0:1563533429.299204:000000:28086:0:(lib-msg.c:968:lnet_is_health_check()) Msg ffff881e8f3f9800 is in inconsistent state, don't perform health checking (-125, 0) 00000400:00000200:14.0:1563533429.299205:000000:28086:0:(lib-msg.c:973:lnet_is_health_check()) msg ffff881e8f3f9800 health check = 0, status = -125, hstatus = 0 00000400:00000200:14.0:1563533429.299209:000000:28086:0:(lib-msg.c:928:lnet_msg_detach_md()) Detach rsp tracker from msg ffff881e8f3f9800 md ffff881ce9276b28 00000400:00000200:14.0:1563533429.299210:000000:28086:0:(lib-move.c:2860:lnet_detach_rsp_tracker()) Detach rspt ffff881ff06b5c40 from md ffff881ce9276b28 cpt 0 00000400:00040000:14.0:1563533429.299212:000000:28086:0:(lib-move.c:2862:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed: Note the response tracker was attached at 1563533406.667523. I'm using the default lnet_transaction_timeout of 10 seconds, so when we see the lnet_finalize_expired_responses loop running at time 1563533420.155422 we can be confident that we're hitting the code I outlined in the description of this ticket. |
| Comment by Gerrit Updater [ 20/Jul/19 ] |
|
Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35576 |
| Comment by Chris Horn [ 24/Jul/19 ] |
|
This issue was uncovered while testing the patch for |
| Comment by Gerrit Updater [ 15/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35576/ |
| Comment by Gerrit Updater [ 31/Oct/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36635 |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36635/ |