[LU-7324] Race condition on deleting lnet_msg Created: 21/Oct/15 Updated: 27/Jan/17 Resolved: 25/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sebastien Piechurski | Assignee: | Henri Doreau (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4b, patch |
| Environment: | bull lustre 2.5.3.90 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have hit, several times on different OSSs, what looks like a race condition on freeing an lnet_msg.

LustreError: 25277:0:(lu_object.c:1463:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:

with this stack:

crash> bt

The crash seems to occur here:

if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
if ((rc & LNET_MATCHMD_FINISH) != 0)

Can you please help with analyzing what can cause the race? |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 21/Oct/15 ] |
|
Hi Amir, |
| Comment by Bruno Faccini (Inactive) [ 04/Nov/15 ] |
|
Ok, here are more details after on-site analysis of some of the multiple crash-dumps for what looks to be the same issue. Some HDF5/MPIIO user programs have been identified that quite easily reproduce the issue, but always at large scale, on 2048/4096 cores. Attempts to reduce the scale have been unsuccessful until now.

All these crashes have been caused by a use after free of an lnet_msg, and they are almost always BUG()s (of the same signature described in ...). The code always involved in these crashes is in the lnet_ptl_match_md()/lnet_ptl_match_delay() (inlined)/lnet_ni_eager_recv() routines area. Interestingly, a frequent LBUG seems to occur, in an ll_ost_create thread, almost every time before the BUG(), and the concerned thread's stack also contains a reference to the same memory location as the lnet_msg...

More interesting: in order to better track the highly suspected "use after free" of the lnet_msg, we have enabled D_MALLOC+D_NET debug traces and have tried to reproduce.

<0>LNetError: 32130:0:(lib-move.c:706:lnet_ni_eager_recv()) ASSERTION( msg->msg_receiving ) failed:
<0>LNetError: 32130:0:(lib-move.c:706:lnet_ni_eager_recv()) LBUG
<4>Pid: 32130, comm: kiblnd_sd_00_01
<4>
<4>Call Trace:
<4> [<ffffffffa0521895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0521e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa062f7c0>] lnet_ni_eager_recv+0x1d0/0x210 [lnet]
<4> [<ffffffffa06356b1>] lnet_parse+0x1161/0x1870 [lnet]
<4> [<ffffffff8129b8f4>] ? snprintf+0x34/0x40
<4> [<ffffffffa0a1c9db>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
<4> [<ffffffffa0a1d6c3>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
<4> [<ffffffffa0a1d872>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
<4> [<ffffffffa0a1dc2a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
<4> [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0a1d8f0>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
<4> [<ffffffff810a101e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a0f80>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20

This is again caused by an lnet_msg (0xffff880349d12c00) that is found to be corrupted (mainly zero'ed), but this time the debug log extracted from the crash-dump can confirm that the message has been freed and even reallocated several times by the same ll_ost_create/26433 thread (and even later by others, including as a "ctx->lc_value" in the keys_init()/keys_fini() routines, due to the same "size-512" anonymous kmem_cache usage, which seems to be the same context as the frequent companion LBUG!!!) during the time window where the kiblnd_sd_00_01/32130 thread was still referencing and using it, as we can see in the following debug trace extract, which has been egrep'ed to only show the concurrent activity of threads 26433 and 32130 around the LBUG:

00000400:00000010:0.0:1446151976.446470:0:32130:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff880349d12c00 (tot 480051974).
00000400:00000010:16.0:1446151976.446472:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff880320ed55c0 (tot 480052358).
00000020:00000010:16.0:1446151976.446481:0:26433:0:(lu_object.c:1745:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff88034cca8c00.
02000000:00000010:16.0:1446151976.446483:0:26433:0:(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446486:0:26433:0:(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880320ed5e40.
00000400:00000200:0.0:1446151976.446488:0:32130:0:(lib-ptl.c:546:lnet_ptl_match_md()) Request from 12345-AA.CCC.VV.UUU@o2ib2 of length 224 into portal 7 MB=0x56138e94337c0
00000020:00000010:16.0:1446151976.446488:0:26433:0:(tgt_main.c:131:tgt_ses_key_init()) kmalloced 'value': 56 at ffff88030bfa7440.
02000000:00000010:16.0:1446151976.446493:0:26433:0:(sec_null.c:331:null_alloc_rs()) kmalloced 'rs': 632 at ffff8802c1708c00.
00010000:00000200:16.0:1446151976.446496:0:26433:0:(ldlm_lib.c:2438:target_send_reply_msg()) @@@ sending reply req@ffff88032934a050 x1514271791306720/t0(0) o13->9a0f23d4-d572-d43b-be8a-a3e62039fe4a@AA.BBB.XX.Y@o2ib3:0/0 lens 224/368 e 0 to 0 dl 1446151993 ref 1 fl Interpret:/0/0 rc 0/0
00000400:00000010:16.0:1446151976.446501:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff88032aa23c40 (tot 480053254).
00000400:00000010:16.0:1446151976.446538:0:26433:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff880347bf3000 (tot 480054438).
00000400:00000200:16.0:1446151976.446547:0:26433:0:(lib-move.c:2189:LNetPut()) LNetPut -> 12345-AA.BBB.XX.Y@o2ib3
02000000:00000010:16.0:1446151976.446549:0:26433:0:(capa.c:416:lu_capainfo_key_fini()) kfreed 'info': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446551:0:26433:0:(lu_ucred.c:53:lu_ucred_key_fini()) kfreed 'info': 72 at ffff880320ed5e40.
00000020:00000010:16.0:1446151976.446552:0:26433:0:(tgt_main.c:131:tgt_ses_key_fini()) kfreed 'info': 56 at ffff88030bfa7440.
00000020:00000010:16.0:1446151976.446554:0:26433:0:(lu_object.c:1675:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff88034cca8c00.
00000400:00000010:16.0:1446151976.446558:0:26433:0:(lib-lnet.h:500:lnet_me_alloc()) alloc '(me)': 96 at ffff880320ed5e40 (tot 480055654).
00000400:00000010:16.0:1446151976.446569:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff880320ed5cc0 (tot 480056806).
00000400:00000200:16.0:1446151976.446590:0:26433:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 7 from 12345-AA.CCC.VV.UUU@o2ib2 of length 224/224 into md 0x756560e1 [1] + 0
00000400:00000010:16.0:1446151976.446592:0:26433:0:(lib-lnet.h:508:lnet_me_free()) kfreed 'me': 96 at ffff880320ed5e40 (tot 480058214).
00000400:00000200:16.0:1446151976.446594:0:26433:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ffff880320ed5cc0
00000400:00000200:16.0:1446151976.446596:0:26433:0:(lib-ptl.c:694:lnet_ptl_attach_md()) Resuming delayed PUT from 12345-AA.CCC.VV.UUU@o2ib2 portal 7 match 1514271943112640 offset 0 length 224.
00000400:00000200:16.0:1446151976.446599:0:26433:0:(lib-move.c:2092:lnet_recv_delayed_msg_list()) Resuming delayed PUT from 12345-AA.CCC.VV.UUU@o2ib2 portal 7 match 1514271943112640 offset 0 length 224.
00000400:00000200:16.0:1446151976.446609:0:26433:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff880320ed5cc0
00000400:00000010:16.0:1446151976.446610:0:26433:0:(lib-lnet.h:491:lnet_md_free()) kfreed 'md': 128 at ffff880320ed5cc0 (tot 480058054).
00000400:00000010:16.0:1446151976.446612:0:26433:0:(lib-lnet.h:528:lnet_msg_free()) kfreed 'msg': 352 at ffff880349d12c00 (tot 480057702).
00000800:00000200:16.0:1446151976.446614:0:26433:0:(o2iblnd_cb.c:185:kiblnd_post_rx()) conn[ffff88043cb91000] (23)++
00000800:00000200:16.0:1446151976.446615:0:26433:0:(o2iblnd_cb.c:972:kiblnd_check_sends()) conn[ffff88043cb91000] (24)++
00000800:00000200:16.0:1446151976.446617:0:26433:0:(o2iblnd_cb.c:999:kiblnd_check_sends()) conn[ffff88043cb91000] (25)--
00000800:00000200:16.0:1446151976.446618:0:26433:0:(o2iblnd_cb.c:214:kiblnd_post_rx()) conn[ffff88043cb91000] (24)--
00000020:00000010:16.0:1446151976.446621:0:26433:0:(lu_object.c:1745:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff880349d12c00.
02000000:00000010:16.0:1446151976.446623:0:26433:0:(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446624:0:26433:0:(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880320ed5e40.
00000020:00000010:16.0:1446151976.446625:0:26433:0:(tgt_main.c:131:tgt_ses_key_init()) kmalloced 'value': 56 at ffff88030bfa7440.
02000000:00000010:16.0:1446151976.446629:0:26433:0:(sec_null.c:331:null_alloc_rs()) kmalloced 'rs': 632 at ffff8802c1708800.
00010000:00000200:16.0:1446151976.446632:0:26433:0:(ldlm_lib.c:2438:target_send_reply_msg()) @@@ sending reply req@ffff880320839050 x1514271650790784/t0(0) o13->4b00db46-7119-e7ab-b4f7-f1ffe607543e@AA.CCC.WW.ZZ@o2ib2:0/0 lens 224/368 e 0 to 0 dl 1446151993 ref 1 fl Interpret:/0/0 rc 0/0
00000400:00000010:16.0:1446151976.446637:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff88032aa23f40 (tot 480058022).
00000400:00000010:16.0:1446151976.446646:0:26433:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff88034cca8c00 (tot 480058758).
00000400:00000200:16.0:1446151976.446665:0:26433:0:(lib-move.c:2189:LNetPut()) LNetPut -> 12345-AA.CCC.WW.ZZ@o2ib2
02000000:00000010:16.0:1446151976.446668:0:26433:0:(capa.c:416:lu_capainfo_key_fini()) kfreed 'info': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446669:0:26433:0:(lu_ucred.c:53:lu_ucred_key_fini()) kfreed 'info': 72 at ffff880320ed5e40.
00000020:00000010:16.0:1446151976.446671:0:26433:0:(tgt_main.c:131:tgt_ses_key_fini()) kfreed 'info': 56 at ffff88030bfa7440.
00000020:00000010:16.0:1446151976.446672:0:26433:0:(lu_object.c:1675:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880349d12c00.
00000400:00000200:0.0:1446151976.446681:0:32130:0:(lib-ptl.c:603:lnet_ptl_match_md()) Delaying PUT from 12345-AA.CCC.VV.UUU@o2ib2 ptl 7 MB 0x56138e94337c0 off 0 len 224
00000400:00040000:0.0:1446151976.446703:0:32130:0:(lib-move.c:706:lnet_ni_eager_recv()) ASSERTION( msg->msg_receiving ) failed:

The fact that the kiblnd_sd_00_01/32130 thread appears not to have been running/scheduled for some time is a bit curious, but could simply be a pure kernel/scheduling consequence (or an intermediate block on either?). Anyway, this clearly indicates that there is a window where a thread can manipulate an lnet_msg that may have been freed in the meantime.
lnet_parse()
lnet_parse_put()
again: lnet_ptl_match_md()
lnet_ptl_match_delay()
where :
cfs_list_add_tail(&msg->msg_list,
&ptl->ptl_msg_stealing);
or
cfs_list_del_init(&msg->msg_list);
or
msg->msg_rx_delayed = 1;
list_add_tail(&msg->msg_list,
&ptl->ptl_msg_delayed);
all of this in a for(;;) loop
with lnet_[ptl,res]_[lock,unlock]()
around each iteration
<<<<<< some delay has occurred >>>>>>>
lnet_recv_put(ni, msg)
or
lnet_ni_eager_recv(ni, msg)
goto again
During the delay when the kiblnd_sd thread was prevented from running or being scheduled, the ll_ost_create/26433 thread may have executed the following code:
LNetMDAttach()
lnet_ptl_attach_md()
lnet_ptl_lock()
again:
where msgs are respectively moved from
ptl_msg_stealing then ptl_msg_delayed lists
to a private/"matches" list
goto again
lnet_ptl_unlock()
lnet_recv_delayed_msg_list(&matches)
lnet_recv_put(msg)
lnet_ni_recv()
kiblnd_recv()
lnet_finalize()
lnet_complete_msg_locked()
lnet_msg_free(msg)
then msg memory
can be re-alloc!!
So it seems to me that, to be safe, at least the lnet_ptl_[un]lock() protection must cover all of the lnet_msg references in lnet_parse_put(). |
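To make the window concrete, here is a simplified userspace model of the handoff (an illustrative sketch only: struct msg, delayed_q, parse_put_model() and attach_md_model() are invented names, not the LNet code). It demonstrates the discipline discussed later in this ticket: once a message has been queued for another thread, any decision about it must be captured in a local variable while the lock is still held, because the message may be freed and reallocated as soon as the lock is dropped.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct msg {
    struct msg *next;
    int         rx_delayed;
};

static pthread_mutex_t msg_lock = PTHREAD_MUTEX_INITIALIZER;
static struct msg *delayed_q;           /* stands in for ptl_msg_delayed */

/* Models the lnet_parse_put()/lnet_ptl_match_delay() side (kiblnd thread). */
static int parse_put_model(struct msg *m)
{
    int delayed;

    pthread_mutex_lock(&msg_lock);
    m->rx_delayed = 1;
    m->next = delayed_q;                /* hand the message over */
    delayed_q = m;
    delayed = 1;                        /* decision captured under the lock */
    pthread_mutex_unlock(&msg_lock);

    /*
     * HAZARD (the race described above): reading m->rx_delayed here would
     * touch memory that attach_md_model() may already have freed and that
     * the allocator may have reused; only the local copy is safe now.
     */
    return delayed;
}

/* Models the lnet_ptl_attach_md()/lnet_recv_delayed_msg_list() side
 * (ll_ost thread): it takes full ownership of the delayed queue. */
static void *attach_md_model(void *arg)
{
    struct msg *m, *next;

    (void)arg;
    pthread_mutex_lock(&msg_lock);
    m = delayed_q;
    delayed_q = NULL;
    pthread_mutex_unlock(&msg_lock);

    for (; m != NULL; m = next) {
        next = m->next;
        free(m);                        /* msg memory can be reused at once */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct msg *m = calloc(1, sizeof(*m));

    if (m == NULL)
        return 1;
    printf("delayed decision = %d\n", parse_put_model(m));
    pthread_create(&t, NULL, attach_md_model, NULL);
    pthread_join(t, NULL);
    return 0;
}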
| Comment by Bruno Faccini (Inactive) [ 05/Nov/15 ] |
|
I am raising the priority to Critical, since due to this problem the site now suffers one or two OSS crashes per day. |
| Comment by Gerrit Updater [ 06/Nov/15 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17065 |
| Comment by Liang Zhen (Inactive) [ 06/Nov/15 ] |
|
Hi Bruno, thanks for digging into the code. This is indeed a race, although it is safe for the "stealing" case; in lnet_ptl_attach_md():
669 exhausted = (rc & LNET_MATCHMD_EXHAUSTED) != 0;
670 if ((rc & LNET_MATCHMD_NONE) != 0) {
671 if (exhausted)
672 break;
673 continue;
674 }
675
676 /* Hurrah! This _is_ a match */
677 LASSERT((rc & LNET_MATCHMD_FINISH) != 0);
678 list_del_init(&msg->msg_list);
679
680 if (head == &ptl->ptl_msg_stealing) {
681 if (exhausted)
682 break;
683 /* stealing thread will handle the message */
684 continue;
685 }
686
687 if ((rc & LNET_MATCHMD_OK) != 0) {
688 list_add_tail(&msg->msg_list, matches);
689
690 CDEBUG(D_NET, "Resuming delayed PUT from %s portal %d "
691 "match "LPU64" offset %d length %d.\n",
692 libcfs_id2str(info.mi_id),
693 info.mi_portal, info.mi_mbits,
694 info.mi_roffset, info.mi_rlength);
695 } else {
696 list_add_tail(&msg->msg_list, drops);
697 }
698
Please check L680-685: it will not add the message to the "matches" list if another thread is trying to "steal" a buffer. But it is unsafe for a "delayed" message; lnet_parse_put() should never check the "delayed" message again because it can be taken over by another thread. I've posted a patch here: http://review.whamcloud.com/#/c/17065/ |
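As a side note on the handshake described here, the stealing-list signalling works because of the list_del_init()/list_empty() semantics: unlinking the entry re-points it at itself, so the stealing thread can later test emptiness of the message's own msg_list link while it still owns the message. The short userspace sketch below imitates those kernel list helpers (all helpers are re-implemented here for illustration; this is not the LNet source):

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *e, struct list_head *h)
{
    e->prev = h->prev; e->next = h;
    h->prev->next = e; h->prev = e;
}

/* Like the kernel helper: unlink and re-point the entry at itself,
 * so list_empty() on the entry itself becomes true afterwards. */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next; e->next->prev = e->prev;
    INIT_LIST_HEAD(e);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

struct msg { struct list_head msg_list; };

int main(void)
{
    struct list_head stealing;          /* models ptl_msg_stealing */
    struct msg m;

    INIT_LIST_HEAD(&stealing);
    INIT_LIST_HEAD(&m.msg_list);

    /* stealing thread: park the message on the stealing list */
    list_add_tail(&m.msg_list, &stealing);

    /* attach_md side: a matching MD was found, signal by unlinking only */
    list_del_init(&m.msg_list);

    /* stealing thread, back under the lock: the message is no longer on
     * the list, so another thread matched an MD for it, but this thread
     * still owns the message and continues the receive itself. */
    printf("matched by another thread: %s\n",
           list_empty(&m.msg_list) ? "yes" : "no");
    return 0;
}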
| Comment by Bruno Faccini (Inactive) [ 08/Nov/15 ] |
|
Hello Liang, |
| Comment by Liang Zhen (Inactive) [ 09/Nov/15 ] |
|
Hi Bruno, as described in my previous comment, L680-L685 will not take over the message on the stealing list; instead it only unlinks it from the stealing list without moving it to the "matches" list. So the thread that is "stealing" a buffer still owns the message; it can find out that this message has a matched MD and continue recv_put().
I think our problem is the second case, because there is no context switch for the first case. Is this reasonable? |
| Comment by Bruno Faccini (Inactive) [ 16/Nov/15 ] |
|
Hello Liang, |
| Comment by Liang Zhen (Inactive) [ 18/Nov/15 ] |
|
I think the next "if (!list_empty(&msg->msg_list))" check should be done while holding the lock, right? So the question can be simplified to:
I think it is safe, right? |
| Comment by Bruno Faccini (Inactive) [ 18/Nov/15 ] |
|
This time (for the "stealing" case) I did not mean that it is unsafe from a linked-list coherency point of view, but it seems to me that the fact that lnet_ptl_attach_md() can unlink a msg from the ptl_msg_stealing list may break the algorithm in lnet_ptl_match_md()/lnet_ptl_match_delay() ... But you know it much better than me |
| Comment by Gerrit Updater [ 24/Nov/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17065/ |
| Comment by Joseph Gmitter (Inactive) [ 24/Nov/15 ] |
|
Landed for 2.8 |
| Comment by Bruno Travouillon (Inactive) [ 14/Dec/15 ] |
|
Hello, FYI, we provided a new 2.5 build to the customer a few weeks ago, with a backport of patch http://review.whamcloud.com/17065/. They have not been able to reproduce this issue on their test filesystem, with or w/o the patch. They performed the update of their production filesystem last Tuesday. A new crash occurred on Friday (similar to this issue AFAIK, i.e. a race on lnet_msg). I have not been able to review the vmcore yet, but it seems that the patch does not completely fix the issue. |
| Comment by Olaf Weber [ 15/Dec/15 ] |
|
I've been interested in this ticket because I've been looking at the msg lifecycle. A few comments:

The site of the reported panic corresponds to line 500 below. As I understand it, the concern with messages added to the ptl_msg_delayed list at lines 520-521 below is that once the ptl and res locks have been dropped at line 528, an arbitrary amount of processing can be done on msg before the checks at line 530 are executed. In particular, this implies that the check of msg->msg_rx_delayed may be reading this flag after the memory has been re-used and overwritten, which could send the code through another iteration of the loop when it should have exited the loop instead.

Note that the check of msg->msg_rx_delayed in lnet_ptl_match_md() is similarly suspect, because it follows the call of lnet_ptl_match_delay(), but at least it only decides whether a debug message should be emitted.

462 static int
463 lnet_ptl_match_delay(struct lnet_portal *ptl,
464 struct lnet_match_info *info, struct lnet_msg *msg)
465 {
466 int first = ptl->ptl_mt_maps[0]; /* read w/o lock */
467 int rc = 0;
468 int i;
469
470 /* steal buffer from other CPTs, and delay it if nothing to steal,
471 * this function is more expensive than a regular match, but we
472 * don't expect it can happen a lot */
473 LASSERT(lnet_ptl_is_wildcard(ptl));
474
475 for (i = 0; i < LNET_CPT_NUMBER; i++) {
476 struct lnet_match_table *mtable;
477 int cpt;
478
479 cpt = (first + i) % LNET_CPT_NUMBER;
480 mtable = ptl->ptl_mtables[cpt];
481 if (i != 0 && i != LNET_CPT_NUMBER - 1 && !mtable->mt_enabled)
482 continue;
483
484 lnet_res_lock(cpt);
485 lnet_ptl_lock(ptl);
486
487 if (i == 0) { /* the first try, attach on stealing list */
488 list_add_tail(&msg->msg_list,
489 &ptl->ptl_msg_stealing);
490 }
491
492 if (!list_empty(&msg->msg_list)) { /* on stealing list */
493 rc = lnet_mt_match_md(mtable, info, msg);
494
495 if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
496 mtable->mt_enabled)
497 lnet_ptl_disable_mt(ptl, cpt);
498
499 if ((rc & LNET_MATCHMD_FINISH) != 0)
* 500 list_del_init(&msg->msg_list);
501
502 } else {
503 /* could be matched by lnet_ptl_attach_md()
504 * which is called by another thread */
505 rc = msg->msg_md == NULL ?
506 LNET_MATCHMD_DROP : LNET_MATCHMD_OK;
507 }
508
509 if (!list_empty(&msg->msg_list) && /* not matched yet */
510 (i == LNET_CPT_NUMBER - 1 || /* the last CPT */
511 ptl->ptl_mt_nmaps == 0 || /* no active CPT */
512 (ptl->ptl_mt_nmaps == 1 && /* the only active CPT */
513 ptl->ptl_mt_maps[0] == cpt))) {
514 /* nothing to steal, delay or drop */
515 list_del_init(&msg->msg_list);
516
517 if (lnet_ptl_is_lazy(ptl)) {
518 msg->msg_rx_delayed = 1;
519 list_add_tail(&msg->msg_list,
520 &ptl->ptl_msg_delayed);
521 rc = LNET_MATCHMD_NONE;
522 } else {
523 rc = LNET_MATCHMD_DROP;
524 }
525 }
526
527 lnet_ptl_unlock(ptl);
* 528 lnet_res_unlock(cpt);
529
* 530 if ((rc & LNET_MATCHMD_FINISH) != 0 || msg->msg_rx_delayed)
531 break;
532 }
533
534 return rc;
535 }
|
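One way to remove the post-unlock dereference flagged at line 530 above is to capture the "delayed" decision in a local variable while the ptl/res locks are still held, so that only rc and local state are consulted once the locks are dropped. The fragment below is a schematic sketch of that idea, reusing identifiers from the quoted listing; it is not necessarily the change that eventually landed in http://review.whamcloud.com/17840.

		/* tail of the loop body, ptl and res locks still held;
		 * 'delayed' would be declared at the top of each iteration */
		int delayed = 0;

		if (!list_empty(&msg->msg_list) &&	/* not matched yet */
		    (i == LNET_CPT_NUMBER - 1 ||	/* the last CPT */
		     ptl->ptl_mt_nmaps == 0 ||		/* no active CPT */
		     (ptl->ptl_mt_nmaps == 1 &&		/* the only active CPT */
		      ptl->ptl_mt_maps[0] == cpt))) {
			/* nothing to steal, delay or drop */
			list_del_init(&msg->msg_list);

			if (lnet_ptl_is_lazy(ptl)) {
				msg->msg_rx_delayed = 1;
				list_add_tail(&msg->msg_list,
					      &ptl->ptl_msg_delayed);
				delayed = 1;	/* remembered while locked */
				rc = LNET_MATCHMD_NONE;
			} else {
				rc = LNET_MATCHMD_DROP;
			}
		}

		lnet_ptl_unlock(ptl);
		lnet_res_unlock(cpt);

		/* msg is never dereferenced after the unlock: it may already
		 * belong to lnet_ptl_attach_md() and have been freed */
		if ((rc & LNET_MATCHMD_FINISH) != 0 || delayed)
			break;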
| Comment by Bruno Faccini (Inactive) [ 21/Dec/15 ] |
|
Being on-site today, I have been able to do some analysis of the 2 crash-dumps from the 2 new crash occurrences for this issue, running with patch http://review.whamcloud.com/17065/. Here are the details from my analysis:

So this clearly indicates that not all racy code paths have been fixed, as we already suspected. |
| Comment by Gerrit Updater [ 06/Jan/16 ] |
|
Olaf Weber (olaf@sgi.com) uploaded a new patch: http://review.whamcloud.com/17840 |
| Comment by Bruno Faccini (Inactive) [ 06/Jan/16 ] |
|
Hello Olaf, |
| Comment by Gerrit Updater [ 06/Jan/16 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/17847 |
| Comment by Olaf Weber [ 06/Jan/16 ] |
|
Hi Bruno, |
| Comment by Olaf Weber [ 08/Jan/16 ] |
|
Hi Bruno, FWIW, what I understand the code attempts to do is as follows: when looking for a matching MD for an incoming message, it may be the case that there is no such MD associated with the current CPT, as determined by lnet_mt_match_md(). At that point there are two choices: receive the message anyway and cache it locally until an MD is available, or delay receiving the message until an MD becomes available.

If there is an lnd_t::lnd_eager_recv() callout then the message is received from the network anyway, and kept around on the node until an MD is attached to the portal. This is not the case that results in the failures discussed in this LU.

If there is no lnd_t::lnd_eager_recv() then lnet_ptl_match_md() calls lnet_ptl_match_delay() to look at the match tables for other CPTs, to see if a matching MD can be found that way. And if there is none, it puts the message on the lnet_portal_t::ptl_msg_delayed queue to be processed when a matching MD is attached by lnet_ptl_attach_md().

So lnet_ptl_match_delay() and lnet_ptl_attach_md() need to synchronize. The lnet_portal_t::ptl_msg_delayed and lnet_portal_t::ptl_msg_stealing queues are used for that.

In lnet_ptl_match_delay() the message is put on the stealing queue so that lnet_ptl_attach_md() can signal that an MD is now available and that the message can be received and processed. lnet_ptl_attach_md() does this by removing it from the stealing queue, and is careful not to touch the message after that.

In lnet_ptl_match_delay() the message is put on the delay queue so that lnet_ptl_attach_md() can pick it up to complete receiving and processing it. Once it has added a message to the delay queue and dropped the lnet_ptl_lock and lnet_res_lock, lnet_ptl_match_delay() has lost control over the message and should no longer reference it, and the same applies to its callers. Unfortunately, in the current code it does reference the message after dropping the locks. I believe this is the cause of the crashes.

Regarding lnet_ptl_match_delay(), there are several comments in the code saying that the "delay" case is not expected to happen a lot. That this race condition can be fairly reliably hit argues that some workloads exercise it regularly. And that in turn suggests that the worries (also expressed in the comments) that this code path is comparatively expensive are valid. The amount of work done in the loop in lnet_ptl_match_delay() can be considerable, which is why I think doing the entire loop under lnet_res_lock(LNET_LOCK_EX) might not be a good idea. |
| Comment by Bruno Faccini (Inactive) [ 08/Jan/16 ] |
|
Hello Olaf, I think your analysis of the code involved in this issue is pretty correct, as far as I understand it myself! |
| Comment by Olaf Weber [ 08/Jan/16 ] |
|
Hi Bruno, Can you explain in a bit more detail why you think lnet_res_lock(LNET_LOCK_EX) is required? As far as I can tell, avoiding any reference to a message on the delay queue is sufficient. |
| Comment by Bruno Faccini (Inactive) [ 18/Jan/16 ] |
|
Hello Olaf, And now I can tell that you were already far ahead of me in doing so, and I am now pretty convinced that your patch should make it, particularly the change to no longer access msg->msg_rx_delayed outside of the lnet_ptl_[un]lock()/lnet_res_[un]lock() protection, and to use/check the LNET_MATCHMD_NONE return value instead, in both the lnet_ptl_match_md() and lnet_ptl_match_delay() routines.

My previous doubts were coming from the analysis we did for the original crashes reported for this ticket, where the corruption (re-use after free, in fact) was causing a GPF in lnet_ptl_match_delay() when trying to list_del_init(&msg->msg_list) after the !list_empty(&msg->msg_list) condition had been verified. But I am now convinced that this can come from the fact that, after a re-use after free mainly zeroing the old lnet_msg content, msg->msg_rx_delayed could have led to an unexpected additional loop!

Thus I have abandoned http://review.whamcloud.com/17847, and since your http://review.whamcloud.com/17840 has triggered some unexpected and related errors I have re-started its auto-tests session. |
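For reference, the caller-side change described here (deciding from the lnet_ptl_match_delay() return value rather than from the message itself) could look schematically like the fragment below. It is an illustration only, pieced together from the discussion and the "Delaying PUT" trace earlier in this ticket, not the verbatim patch:

	/* in lnet_ptl_match_md(), on the wildcard-portal slow path */
	rc = lnet_ptl_match_delay(ptl, info, msg);

	/* decide from rc only: if the message went onto ptl_msg_delayed,
	 * another thread may already own (and have freed) it */
	if ((rc & LNET_MATCHMD_NONE) != 0) {
		CDEBUG(D_NET, "Delaying PUT from %s ptl %d MB "LPX64" off %d len %d\n",
		       libcfs_id2str(info->mi_id), info->mi_portal,
		       info->mi_mbits, info->mi_roffset, info->mi_rlength);
	}

	return rc;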
| Comment by Gerrit Updater [ 21/Jan/16 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/18081 (abandoned)

Please ignore this patch; the patch from Olaf is essentially the same, so I will review his patch. Sorry I didn't read through the full comments. |
| Comment by Olaf Weber [ 21/Jan/16 ] |
|
Hi Liang. I had considered adding an LNET_MATCHMD_DELAYED code, and do believe that approach would work. I ultimately decided against it in my proposal because I preferred to not change the interface of lnet_ptl_match_md(). |
| Comment by Liang Zhen (Inactive) [ 21/Jan/16 ] |
|
Olaf, yes I think your patch is quite reasonable, thanks. |
| Comment by Gerrit Updater [ 25/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17840/ |
| Comment by Joseph Gmitter (Inactive) [ 25/Jan/16 ] |
|
Landed for 2.8.0 |