[LU-7324] Race condition on deleting lnet_msg Created: 21/Oct/15  Updated: 27/Jan/17  Resolved: 25/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Sebastien Piechurski Assignee: Henri Doreau (Inactive)
Resolution: Fixed Votes: 0
Labels: p4b, patch
Environment:

bull lustre 2.5.3.90


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have hit, several times and on different OSSes, what looks like a race condition on freeing an lnet_msg.
The crash looks as follows:

LustreError: 25277:0:(lu_object.c:1463:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0602924>] lnet_ptl_match_md+0x3b4/0x870 [lnet]
PGD 0
Oops: 0002 [#1] SMP

with this stack:

crash> bt
PID: 2163   TASK: ffff8803b9df2040  CPU: 25  COMMAND: "kiblnd_sd_01_02"
 #0 [ffff880340ceb7e0] machine_kexec at ffffffff8103d30b
 #1 [ffff880340ceb840] crash_kexec at ffffffff810cc4f2
 #2 [ffff880340ceb910] oops_end at ffffffff8153d3d0
 #3 [ffff880340ceb940] no_context at ffffffff8104e8cb
 #4 [ffff880340ceb990] __bad_area_nosemaphore at ffffffff8104eb55
 #5 [ffff880340ceb9e0] bad_area_nosemaphore at ffffffff8104ec23
 #6 [ffff880340ceb9f0] __do_page_fault at ffffffff8104f31c
 #7 [ffff880340cebb10] do_page_fault at ffffffff8153f31e
 #8 [ffff880340cebb40] page_fault at ffffffff8153c6c5
    [exception RIP: lnet_ptl_match_md+948]
    RIP: ffffffffa0602924  RSP: ffff880340cebbf0  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff880340cebcf0  RCX: ffff880c75631940
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffff880bc4ed4550
    RBP: ffff880340cebc70   R8: ffff880bc4ed4550   R9: a500000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880749841800
    R13: 0000000000000002  R14: ffff881078f8b2c0  R15: 0000000000000002
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880340cebc78] lnet_parse at ffffffffa0609fb3 [lnet]
#10 [ffff880340cebd58] kiblnd_handle_rx at ffffffffa09f19db [ko2iblnd]
#11 [ffff880340cebda8] kiblnd_rx_complete at ffffffffa09f26c3 [ko2iblnd]
#12 [ffff880340cebdf8] kiblnd_complete at ffffffffa09f2872 [ko2iblnd]
#13 [ffff880340cebe08] kiblnd_scheduler at ffffffffa09f2c2a [ko2iblnd]
#14 [ffff880340cebee8] kthread at ffffffff810a101e
#15 [ffff880340cebf48] kernel_thread at ffffffff8100c28a

The crash seems to occur here:
lnet_ptl_match_delay(struct lnet_portal *ptl,
             struct lnet_match_info *info, struct lnet_msg *msg)
{
...
        if (!cfs_list_empty(&msg->msg_list)) { /* on stealing list */
            rc = lnet_mt_match_md(mtable, info, msg);

            if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
                mtable->mt_enabled)
                lnet_ptl_disable_mt(ptl, cpt);

            if ((rc & LNET_MATCHMD_FINISH) != 0)
                cfs_list_del_init(&msg->msg_list); <=== CRASH (msg->msg_list == NULL)
....
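
The fault address itself is consistent with a zeroed message: with msg->msg_list.next == NULL, the very first pointer write in list_del_init() hits offset 8 of the NULL page. A minimal rendition of the classic list helpers, for illustration only (cfs_list_del_init() wraps the same logic):

    struct list_head { struct list_head *next, *prev; };

    static inline void list_del_init(struct list_head *entry)
    {
            /* with entry->next == NULL this is a write to ((char *)0 + 8),
             * matching the reported fault address 0000000000000008 and the
             * Oops error code 0002 (write access) */
            entry->next->prev = entry->prev;
            entry->prev->next = entry->next;

            /* re-initialize the entry as an empty, self-pointing list */
            entry->next = entry;
            entry->prev = entry;
    }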

Can you please help us analyze what could be causing this race?



 Comments   
Comment by Joseph Gmitter (Inactive) [ 21/Oct/15 ]

Hi Amir,
Can you please have a look at this issue?
Thanks.
Joe

Comment by Bruno Faccini (Inactive) [ 04/Nov/15 ]

Ok, here are more details after on-site analysis of some of the multiple crash-dumps for what looks to be the same issue.

Some HDF5/MPIIO user programs have been identified that quite easily reproduce the issue, but always at large scale, on 2048/4096 cores. Attempts to reproduce at a smaller scale have been unsuccessful so far.

All these crashes have been caused by a use-after-free of an lnet_msg, and they are almost always BUG()s (with the same signature as described in the description text) or LBUG()s (when running with D_MALLOC+D_NET debug traces enabled!!) occurring in a kiblnd_sd thread's context.

The code involved in these crashes is always in the area of the lnet_ptl_match_md()/lnet_ptl_match_delay() (inlined)/lnet_ni_eager_recv() routines.

Interestingly, a frequent LBUG seems to occur in an ll_ost_create thread almost every time, shortly before the BUG(), and that thread's stack also contains a reference to the same memory location as the lnet_msg...

More interestingly, in order to better track the highly suspected use-after-free of the lnet_msg, we enabled D_MALLOC+D_NET debug traces and tried to reproduce.
We then got the following LBUG instead of the "usual" BUG() already described:

0>LNetError: 32130:0:(lib-move.c:706:lnet_ni_eager_recv()) ASSERTION( msg->msg_receiving ) failed:
<0>LNetError: 32130:0:(lib-move.c:706:lnet_ni_eager_recv()) LBUG
<4>Pid: 32130, comm: kiblnd_sd_00_01
<4>
<4>Call Trace:
<4> [<ffffffffa0521895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0521e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa062f7c0>] lnet_ni_eager_recv+0x1d0/0x210 [lnet]
<4> [<ffffffffa06356b1>] lnet_parse+0x1161/0x1870 [lnet]
<4> [<ffffffff8129b8f4>] ? snprintf+0x34/0x40
<4> [<ffffffffa0a1c9db>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
<4> [<ffffffffa0a1d6c3>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
<4> [<ffffffffa0a1d872>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
<4> [<ffffffffa0a1dc2a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
<4> [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0a1d8f0>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
<4> [<ffffffff810a101e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a0f80>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20

This is again caused by an lnet_msg (0xffff880349d12c00) that is found to be corrupted (mainly zeroed). But this time, the debug log extracted from the crash-dump confirms that the message had been freed, and even reallocated several times, by the same ll_ost_create/26433 thread (and later by others, including as "ctx->lc_value" in the keys_init()/keys_fini() routines, due to use of the same "size-512" anonymous kmem_cache; this seems to be the same context as the frequent companion LBUG!!!), during the time window where the kiblnd_sd_00_01/32130 thread was still referencing and using it. This can be seen in the following debug trace extract, which has been egrep'ed to show only the concurrent activity of threads 26433 and 32130 around the LBUG:

00000400:00000010:0.0:1446151976.446470:0:32130:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff880349d12c00 (tot 480051974).
00000400:00000010:16.0:1446151976.446472:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff880320ed55c0 (tot 480052358).
00000020:00000010:16.0:1446151976.446481:0:26433:0:(lu_object.c:1745:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff88034cca8c00.
02000000:00000010:16.0:1446151976.446483:0:26433:0:(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446486:0:26433:0:(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880320ed5e40.
00000400:00000200:0.0:1446151976.446488:0:32130:0:(lib-ptl.c:546:lnet_ptl_match_md()) Request from 12345-AA.CCC.VV.UUU@o2ib2 of length 224 into portal 7 MB=0x56138e94337c0
00000020:00000010:16.0:1446151976.446488:0:26433:0:(tgt_main.c:131:tgt_ses_key_init()) kmalloced 'value': 56 at ffff88030bfa7440.
02000000:00000010:16.0:1446151976.446493:0:26433:0:(sec_null.c:331:null_alloc_rs()) kmalloced 'rs': 632 at ffff8802c1708c00.
00010000:00000200:16.0:1446151976.446496:0:26433:0:(ldlm_lib.c:2438:target_send_reply_msg()) @@@ sending reply  req@ffff88032934a050 x1514271791306720/t0(0) o13->9a0f23d4-d572-d43b-be8a-a3e62039fe4a@AA.BBB.XX.Y@o2ib3:0/0 lens 224/368 e 0 
to 0 dl 1446151993 ref 1 fl Interpret:/0/0 rc 0/0
00000400:00000010:16.0:1446151976.446501:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff88032aa23c40 (tot 480053254).
00000400:00000010:16.0:1446151976.446538:0:26433:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff880347bf3000 (tot 480054438).
00000400:00000200:16.0:1446151976.446547:0:26433:0:(lib-move.c:2189:LNetPut()) LNetPut -> 12345-AA.BBB.XX.Y@o2ib3
02000000:00000010:16.0:1446151976.446549:0:26433:0:(capa.c:416:lu_capainfo_key_fini()) kfreed 'info': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446551:0:26433:0:(lu_ucred.c:53:lu_ucred_key_fini()) kfreed 'info': 72 at ffff880320ed5e40.
00000020:00000010:16.0:1446151976.446552:0:26433:0:(tgt_main.c:131:tgt_ses_key_fini()) kfreed 'info': 56 at ffff88030bfa7440.
00000020:00000010:16.0:1446151976.446554:0:26433:0:(lu_object.c:1675:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff88034cca8c00.
00000400:00000010:16.0:1446151976.446558:0:26433:0:(lib-lnet.h:500:lnet_me_alloc()) alloc '(me)': 96 at ffff880320ed5e40 (tot 480055654).
00000400:00000010:16.0:1446151976.446569:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff880320ed5cc0 (tot 480056806).
00000400:00000200:16.0:1446151976.446590:0:26433:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 7 from 12345-AA.CCC.VV.UUU@o2ib2 of length 224/224 into md 0x756560e1 [1] + 0
00000400:00000010:16.0:1446151976.446592:0:26433:0:(lib-lnet.h:508:lnet_me_free()) kfreed 'me': 96 at ffff880320ed5e40 (tot 480058214).
00000400:00000200:16.0:1446151976.446594:0:26433:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ffff880320ed5cc0
00000400:00000200:16.0:1446151976.446596:0:26433:0:(lib-ptl.c:694:lnet_ptl_attach_md()) Resuming delayed PUT from 12345-AA.CCC.VV.UUU@o2ib2 portal 7 match 1514271943112640 offset 0 length 224.
00000400:00000200:16.0:1446151976.446599:0:26433:0:(lib-move.c:2092:lnet_recv_delayed_msg_list()) Resuming delayed PUT from 12345-AA.CCC.VV.UUU@o2ib2 portal 7 match 1514271943112640 offset 0 length 224.
00000400:00000200:16.0:1446151976.446609:0:26433:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff880320ed5cc0
00000400:00000010:16.0:1446151976.446610:0:26433:0:(lib-lnet.h:491:lnet_md_free()) kfreed 'md': 128 at ffff880320ed5cc0 (tot 480058054).
00000400:00000010:16.0:1446151976.446612:0:26433:0:(lib-lnet.h:528:lnet_msg_free()) kfreed 'msg': 352 at ffff880349d12c00 (tot 480057702).
00000800:00000200:16.0:1446151976.446614:0:26433:0:(o2iblnd_cb.c:185:kiblnd_post_rx()) conn[ffff88043cb91000] (23)++
00000800:00000200:16.0:1446151976.446615:0:26433:0:(o2iblnd_cb.c:972:kiblnd_check_sends()) conn[ffff88043cb91000] (24)++
00000800:00000200:16.0:1446151976.446617:0:26433:0:(o2iblnd_cb.c:999:kiblnd_check_sends()) conn[ffff88043cb91000] (25)--
00000800:00000200:16.0:1446151976.446618:0:26433:0:(o2iblnd_cb.c:214:kiblnd_post_rx()) conn[ffff88043cb91000] (24)--
00000020:00000010:16.0:1446151976.446621:0:26433:0:(lu_object.c:1745:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff880349d12c00.
02000000:00000010:16.0:1446151976.446623:0:26433:0:(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446624:0:26433:0:(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880320ed5e40.
00000020:00000010:16.0:1446151976.446625:0:26433:0:(tgt_main.c:131:tgt_ses_key_init()) kmalloced 'value': 56 at ffff88030bfa7440.
02000000:00000010:16.0:1446151976.446629:0:26433:0:(sec_null.c:331:null_alloc_rs()) kmalloced 'rs': 632 at ffff8802c1708800.
00010000:00000200:16.0:1446151976.446632:0:26433:0:(ldlm_lib.c:2438:target_send_reply_msg()) @@@ sending reply  req@ffff880320839050 x1514271650790784/t0(0) o13->4b00db46-7119-e7ab-b4f7-f1ffe607543e@AA.CCC.WW.ZZ@o2ib2:0/0 lens 224/368 e 0
 to 0 dl 1446151993 ref 1 fl Interpret:/0/0 rc 0/0
00000400:00000010:16.0:1446151976.446637:0:26433:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff88032aa23f40 (tot 480058022).
00000400:00000010:16.0:1446151976.446646:0:26433:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff88034cca8c00 (tot 480058758).
00000400:00000200:16.0:1446151976.446665:0:26433:0:(lib-move.c:2189:LNetPut()) LNetPut -> 12345-AA.CCC.WW.ZZ@o2ib2
02000000:00000010:16.0:1446151976.446668:0:26433:0:(capa.c:416:lu_capainfo_key_fini()) kfreed 'info': 128 at ffff880320ed5cc0.
00000020:00000010:16.0:1446151976.446669:0:26433:0:(lu_ucred.c:53:lu_ucred_key_fini()) kfreed 'info': 72 at ffff880320ed5e40.
00000020:00000010:16.0:1446151976.446671:0:26433:0:(tgt_main.c:131:tgt_ses_key_fini()) kfreed 'info': 56 at ffff88030bfa7440.
00000020:00000010:16.0:1446151976.446672:0:26433:0:(lu_object.c:1675:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880349d12c00.
00000400:00000200:0.0:1446151976.446681:0:32130:0:(lib-ptl.c:603:lnet_ptl_match_md()) Delaying PUT from 12345-AA.CCC.VV.UUU@o2ib2 ptl 7 MB 0x56138e94337c0 off 0 len 224
00000400:00040000:0.0:1446151976.446703:0:32130:0:(lib-move.c:706:lnet_ni_eager_recv()) ASSERTION( msg->msg_receiving ) failed: 

The fact that the kiblnd_sd_00_01/32130 thread appears not to have been running/scheduled for some time is a bit curious, but this could simply be a kernel/scheduling effect (or an intermediate block somewhere?). Anyway, this clearly indicates that there is a window where a thread can manipulate an lnet_msg that can be freed in the meantime.
Then, having a look at the source code pointed to by the debug traces, here is a possible scenario, where the kiblnd_sd/32130 thread was executing the following code:

  lnet_parse()
         lnet_parse_put()
                 again: lnet_ptl_match_md()
                           lnet_ptl_match_delay()
                              where :
                                 cfs_list_add_tail(&msg->msg_list,
                                             &ptl->ptl_msg_stealing);
                              or
                                  cfs_list_del_init(&msg->msg_list);

                              or
                                 msg->msg_rx_delayed = 1;
                                 list_add_tail(&msg->msg_list,
                                              &ptl->ptl_msg_delayed);

                              all of this in a for(;;) loop  
                              with lnet_[ptl,res]_[lock,unlock]()
                              around each iteration

<<<<<< some delay has occured >>>>>>>

                 lnet_recv_put(ni, msg)
                     or
                 lnet_ni_eager_recv(ni, msg)
                 goto again

During the delay in which the kiblnd_sd thread was prevented from running or being scheduled, the ll_ost_create/26433 thread may have executed the following code:

 LNetMDAttach()                                                                                                                                          
       lnet_ptl_attach_md()
           lnet_ptl_lock()
               again:  
                    where msgs are respectively moved from
                    ptl_msg_stealing then ptl_msg_delayed lists
                    to a private/"matches" list

                    goto again                                                                 
           lnet_ptl_unlock()                                                               

       lnet_recv_delayed_msg_list(&matches)
            lnet_recv_put(msg)  
                lnet_ni_recv()                
                     kiblnd_recv()                       
                          lnet_finalize()                          
                                lnet_complete_msg_locked()                                                          
                                        lnet_msg_free(msg)                                                                  
                                             then msg memory                                                                   
                                             can be re-alloc!!                                                                 

So it seems to me that, to be safe, the lnet_ptl_[un]lock() protection must at least surround all of the lnet_msg references in lnet_parse_put().
What do you think, Amir?
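
To make the suspected window concrete, here is a small standalone pthread sketch of the same interleaving (hypothetical types and names only, nothing taken from the LNet sources): one thread parks a message on a shared list and drops the lock, the other thread frees it, and the first thread then reads the freed message:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct msg {
            struct msg *next;       /* stand-in for msg->msg_list linkage */
            int         rx_delayed; /* stand-in for msg->msg_rx_delayed */
    };

    static pthread_mutex_t ptl_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct msg *delayed_head; /* stand-in for ptl->ptl_msg_delayed */

    /* plays the kiblnd_sd thread running lnet_parse_put() */
    static void *parse_put_thread(void *arg)
    {
            struct msg *m = calloc(1, sizeof(*m));

            (void)arg;
            pthread_mutex_lock(&ptl_lock);
            m->rx_delayed = 1;
            m->next = delayed_head;
            delayed_head = m;                /* park msg on the delayed list */
            pthread_mutex_unlock(&ptl_lock); /* ownership effectively lost here */

            sched_yield();                   /* widen the race window */

            /* BUG: m may already have been freed (and even reallocated) by
             * the attach thread; this read is a use-after-free */
            if (m->rx_delayed)
                    printf("looks delayed (read of possibly freed memory)\n");
            return NULL;
    }

    /* plays the ll_ost_create thread running lnet_ptl_attach_md() */
    static void *attach_md_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&ptl_lock);
            while (delayed_head != NULL) {
                    struct msg *m = delayed_head;

                    delayed_head = m->next;
                    free(m);                 /* message finalized and freed */
            }
            pthread_mutex_unlock(&ptl_lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, parse_put_thread, NULL);
            pthread_create(&b, NULL, attach_md_thread, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
    }

The list operations themselves stay coherent under the lock; the bug is only the unlocked read after ownership has been handed off.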

Comment by Bruno Faccini (Inactive) [ 05/Nov/15 ]

I am raising the priority to Critical, since due to this problem the site now suffers one or two OSS crashes per day.

Comment by Gerrit Updater [ 06/Nov/15 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17065
Subject: LU-7324 lnet: recv could access freed message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7e15147ee15a8c50da638d49affacbc758f523c4

Comment by Liang Zhen (Inactive) [ 06/Nov/15 ]

Hi Bruno, thanks for digging into the code. This is indeed a race, although it is safe for the "stealing" case; in lnet_ptl_attach_md():

669                 exhausted = (rc & LNET_MATCHMD_EXHAUSTED) != 0;
670                 if ((rc & LNET_MATCHMD_NONE) != 0) {
671                         if (exhausted)
672                                 break;
673                         continue;
674                 }
675 
676                 /* Hurrah! This _is_ a match */
677                 LASSERT((rc & LNET_MATCHMD_FINISH) != 0);
678                 list_del_init(&msg->msg_list);
679 
680                 if (head == &ptl->ptl_msg_stealing) {
681                         if (exhausted)
682                                 break;
683                         /* stealing thread will handle the message */
684                         continue;
685                 }
686 
687                 if ((rc & LNET_MATCHMD_OK) != 0) {
688                         list_add_tail(&msg->msg_list, matches);
689 
690                         CDEBUG(D_NET, "Resuming delayed PUT from %s portal %d "
691                                "match "LPU64" offset %d length %d.\n",
692                                libcfs_id2str(info.mi_id),
693                                info.mi_portal, info.mi_mbits,
694                                info.mi_roffset, info.mi_rlength);
695                 } else {
696                         list_add_tail(&msg->msg_list, drops);
697                 }
698 

Please check lines 680-685: the message will not be added to the "matches" list if another thread is trying to "steal" a buffer.

But it is unsafe for a "delayed" message: lnet_parse_put() should never check a "delayed" message again, because it can be taken over by another thread. I've posted a patch here: http://review.whamcloud.com/#/c/17065/
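
The contract on the lnet_parse_put() side is then roughly as follows (a hedged sketch of the rule, not the actual patch text):

    rc = lnet_ptl_match_md(&info, msg);
    if (rc == LNET_MATCHMD_NONE) {
            /* msg is parked on ptl_msg_delayed: whichever thread next
             * attaches a matching MD owns it now, so it must never be
             * examined or re-matched here again */
            return 0;
    }
    if (rc == LNET_MATCHMD_DROP)
            goto drop;              /* still ours: drop it */

    /* LNET_MATCHMD_OK: still ours, go on and receive into the MD */
    lnet_recv_put(ni, msg);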

Comment by Bruno Faccini (Inactive) [ 08/Nov/15 ]

Hello Liang,
Thanks for your help and for the fix proposal already!
But it seems to me that there is also a potential race during stealing: lnet_ptl_match_delay() can add a msg to ptl_msg_stealing, then lnet_[ptl,res]_[lock,unlock]() before the next loop iteration, which lets lnet_ptl_attach_md() take lnet_ptl_lock() and then list_del_init(&msg->msg_list)?

Comment by Liang Zhen (Inactive) [ 09/Nov/15 ]

Hi Bruno, as described in my previous comment, lines 680-685 will not take over a message on the stealing list; they only unlink it from the stealing list, without moving it to the "matches" list. So the thread that is "stealing" a buffer still owns the message; it can find out that this message has a matched MD, and continue with recv_put().
Or let's say:

  • stealing buffer: I want to receive this message if possible, so I try to steal a buffer from other queues (busy try). If I get a buffer for this message, I will handle it right away. If another thread finds that I have put a message on the stealing list, it should not take over this message from me, even if it can give me a buffer for it.
  • delay: I'm giving up on this message; I will put it on the delayed list and forget about it, so another thread can take it over when it has a buffer.

I think our problem is the second case, because there is no context switch for the first case. Is this reasonable?
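
Restated in code, the two rules as they play out in lnet_ptl_attach_md() (a compressed paraphrase of the excerpt above, everything under lnet_ptl_lock(); not new code):

    list_del_init(&msg->msg_list);          /* unlink in either case */

    if (head == &ptl->ptl_msg_stealing) {
            /* rule 1, stealing: the parking thread keeps ownership; it
             * will notice the attached MD via msg->msg_md under the lock,
             * so do not move msg anywhere and do not touch it again */
            continue;
    }

    /* rule 2, delayed: ownership is taken over here; queue the message
     * so lnet_recv_delayed_msg_list() can finish receiving it, and the
     * thread that parked it must never reference it again */
    list_add_tail(&msg->msg_list, matches);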

Comment by Bruno Faccini (Inactive) [ 16/Nov/15 ]

Hello Liang,
Sorry to be back on this late.
My concern about the "stealing" case in lnet_ptl_match_delay() is that it releases both lnet_[res,ptl]_lock() between each iteration of the LNET_CPT_NUMBER loop. This leaves a potential window where another thread can run lnet_ptl_attach_md(), causing the lnet_msg to be list_del_init()'ed. Don't you think that the next "if (!list_empty(&msg->msg_list))" tests will then be fooled?

Comment by Liang Zhen (Inactive) [ 18/Nov/15 ]

I think the next "if (!list_empty(&msg->msg_list))" check is done while holding the lock, right? So the question can be simplified to:

  • thread-A takes the lock and adds the message to a list, then releases the lock
  • thread-B takes the same lock and deletes the message from the list, then releases the lock
  • thread-A takes the lock and checks whether the message is still on the list

I think this is safe, right?
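
As a sketch, with q standing for the shared queue, the three steps look like this, each under the same lock:

    /* thread-A, under the lock */
    list_add_tail(&msg->msg_list, &q);

    /* thread-B, under the lock */
    list_del_init(&msg->msg_list);

    /* thread-A, under the lock again */
    if (list_empty(&msg->msg_list)) {
            /* true here, and well-defined: list_del_init() re-pointed the
             * entry at itself, so as long as msg itself has not been freed
             * this read is safe */
    }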

Comment by Bruno Faccini (Inactive) [ 18/Nov/15 ]

This time (for the "stealing" case), I did not mean it is unsafe from a linked-list coherency point of view; rather, it seems to me that the fact that lnet_ptl_attach_md() can unlink a msg from the ptl_msg_stealing list may break the algorithm in lnet_ptl_match_md()/lnet_ptl_match_delay()... But you know it much better than me.

Comment by Gerrit Updater [ 24/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17065/
Subject: LU-7324 lnet: recv could access freed message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 01a6f01a4d26831d055552a113ff0fac9547f0aa

Comment by Joseph Gmitter (Inactive) [ 24/Nov/15 ]

Landed for 2.8

Comment by Bruno Travouillon (Inactive) [ 14/Dec/15 ]

Hello,

FYI, we provided a new 2.5 build to the customer a few weeks ago, with a backport of patch http://review.whamcloud.com/17065/. They have not been able to reproduce this issue on their test filesystem, with or without the patch. They performed the update of their production filesystem last Tuesday. A new crash occurred on Friday (similar to this issue AFAIK, i.e. a race on an lnet_msg). I have not been able to review the vmcore yet, but it seems that the patch does not completely fix the issue.

Comment by Olaf Weber [ 15/Dec/15 ]

I've been interested in this ticket because I've been looking at the msg lifecycle. A few comments:

The site of the reported panic corresponds to line 500 below.

As I understand it, the concern with messages added to the ptl_msg_delayed list at lines 520-521 below is that once the ptl and res locks have been dropped at line 528, an arbitrary amount of processing can be done on msg before the check at line 530 is executed. In particular, this implies that the test of msg->msg_rx_delayed may be reading the flag after the memory has been re-used and overwritten, which could send the code through another iteration of the loop when it should have exited the loop instead.

Note that the check of msg->msg_rx_delayed in lnet_ptl_match_md() is similarly suspect, because it follows the call to lnet_ptl_match_delay(), but at least it only decides whether a debug message should be emitted.

    462 static int
    463 lnet_ptl_match_delay(struct lnet_portal *ptl,
    464                      struct lnet_match_info *info, struct lnet_msg *msg)
    465 {
    466         int     first = ptl->ptl_mt_maps[0]; /* read w/o lock */
    467         int     rc = 0;
    468         int     i;
    469 
    470         /* steal buffer from other CPTs, and delay it if nothing to steal,
    471          * this function is more expensive than a regular match, but we
    472          * don't expect it can happen a lot */
    473         LASSERT(lnet_ptl_is_wildcard(ptl));
    474 
    475         for (i = 0; i < LNET_CPT_NUMBER; i++) {
    476                 struct lnet_match_table *mtable;
    477                 int                     cpt;
    478 
    479                 cpt = (first + i) % LNET_CPT_NUMBER;
    480                 mtable = ptl->ptl_mtables[cpt];
    481                 if (i != 0 && i != LNET_CPT_NUMBER - 1 && !mtable->mt_enabled)
    482                         continue;
    483 
    484                 lnet_res_lock(cpt);
    485                 lnet_ptl_lock(ptl);
    486 
    487                 if (i == 0) { /* the first try, attach on stealing list */
    488                         list_add_tail(&msg->msg_list,
    489                                       &ptl->ptl_msg_stealing);
    490                 }
    491 
    492                 if (!list_empty(&msg->msg_list)) { /* on stealing list */
    493                         rc = lnet_mt_match_md(mtable, info, msg);
    494 
    495                         if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
    496                             mtable->mt_enabled)
    497                                 lnet_ptl_disable_mt(ptl, cpt);
    498 
    499                         if ((rc & LNET_MATCHMD_FINISH) != 0)
*   500                                 list_del_init(&msg->msg_list);
    501 
    502                 } else {
    503                         /* could be matched by lnet_ptl_attach_md()
    504                          * which is called by another thread */
    505                         rc = msg->msg_md == NULL ?
    506                              LNET_MATCHMD_DROP : LNET_MATCHMD_OK;
    507                 }
    508 
    509                 if (!list_empty(&msg->msg_list) && /* not matched yet */
    510                     (i == LNET_CPT_NUMBER - 1 || /* the last CPT */
    511                      ptl->ptl_mt_nmaps == 0 ||   /* no active CPT */
    512                      (ptl->ptl_mt_nmaps == 1 &&  /* the only active CPT */
    513                       ptl->ptl_mt_maps[0] == cpt))) {
    514                         /* nothing to steal, delay or drop */
    515                         list_del_init(&msg->msg_list);
    516 
    517                         if (lnet_ptl_is_lazy(ptl)) {
    518                                 msg->msg_rx_delayed = 1;
    519                                 list_add_tail(&msg->msg_list,
    520                                               &ptl->ptl_msg_delayed);
    521                                 rc = LNET_MATCHMD_NONE;
    522                         } else {
    523                                 rc = LNET_MATCHMD_DROP;
    524                         }
    525                 }
    526 
    527                 lnet_ptl_unlock(ptl);
*   528                 lnet_res_unlock(cpt);
    529 
*   530                 if ((rc & LNET_MATCHMD_FINISH) != 0 || msg->msg_rx_delayed)
    531                         break;
    532         }
    533 
    534         return rc;
    535 }
Comment by Bruno Faccini (Inactive) [ 21/Dec/15 ]

Being on-site today, I have been able to do some analysis of the 2 crash-dumps from the 2 new crash occurrences for this issue, running with patch http://review.whamcloud.com/17065/.

Here are the details from my analysis:
1) the 2 crash signatures are the same as the BUG/Oops described in the Description text of this issue.
2) the value of the new ready_delay variable (introduced by patch http://review.whamcloud.com/17065/) is 1/true.

So this clearly indicates that not all racy code paths have been fixed, as we already suspected.

Comment by Gerrit Updater [ 06/Jan/16 ]

Olaf Weber (olaf@sgi.com) uploaded a new patch: http://review.whamcloud.com/17840
Subject: LU-7324 lnet: Use after free in lnet_ptl_match_delay()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aafddc2a0c62167851bb88be627fa218c04c4199

Comment by Bruno Faccini (Inactive) [ 06/Jan/16 ]

Hello Olaf,
Even if I think your patch will help in some situations, I don't think it can cover the situation/race causing the original crashes reported in this ticket.
I will try a more restrictive/conservative solution than yours.

Comment by Gerrit Updater [ 06/Jan/16 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/17847
Subject: LU-7324 lnet: avoid race on ptl_msg_[stealing,delayed] lists
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1a26b154424a56fe3915e5c14056cc04d4500f54

Comment by Olaf Weber [ 06/Jan/16 ]

Hi Bruno,
Your proposed change prevents additions to the ptl_msg_stealing list – anything added by lnet_ptl_match_delay() will also be removed before the lnet_res_lock or lnet_ptl_lock are dropped, and lnet_ptl_match_delay() is the only routine that adds to this list. Therefore that field and all code relating to it should be removed as well. This might even be a good idea, but I don't think it qualifies as a conservative change.

Comment by Olaf Weber [ 08/Jan/16 ]

Hi Bruno,

FWIW, what I understand the code attempts to do is as follows: when looking for a matching MD for an incoming message, it may be the case that there is no such MD associated with the current CPT, as determined by lnet_mt_match_md(). At that point there are two choices: receive the message anyway and cache it locally until an MD is available, or delay receiving the message until an MD becomes available.

If there is an lnd_t::lnd_eager_recv() callout then the message is received from the network anyway, and kept around on the node until an MD is attached to the portal. This is not the case that results in the failures discussed in this LU.

If there is no lnd_t::lnd_eager_recv() then lnet_ptl_match_md() calls lnet_ptl_match_delay() to look at the match tables for other CPTs, to see if a matching MD can be found that way. And if there is none, it puts the message on the lnet_portal_t::ptl_msg_delayed queue to be processed when a matching MD is attached by lnet_ptl_attach_md().

So lnet_ptl_match_delay() and lnet_ptl_attach_md() need to synchronize. The lnet_portal_t::ptl_msg_delayed and lnet_portal_t::ptl_msg_stealing queues are used for that. In lnet_ptl_match_delay() the message is put on the stealing queue so that lnet_ptl_attach_md() can signal that an MD is now available and that the message can be received and processed. lnet_ptl_attach_md() does this by removing it from the stealing queue, and is careful not to touch the message after that.

In lnet_ptl_match_delay() the message is put on the delay queue so that lnet_ptl_attach_md() can pick it up to complete receiving and processing it. Once it has added a message to the delay queue and dropped the lnet_ptl_lock and lnet_res_lock, lnet_ptl_match_delay() has lost control over the message and should no longer reference it, and the same applies to its callers. Unfortunately, in the current code it does reference the message after dropping the locks. I believe this is the cause of the crashes.

Regarding lnet_ptl_match_delay(), there are several comments in the code saying that the "delay" case is not expected to happen a lot. That this race condition can be hit fairly reliably argues that some workloads exercise it regularly. And that in turn suggests that the worries (also expressed in the comments) that this code path is comparatively expensive are valid. The amount of work done in the loop in lnet_ptl_match_delay() can be considerable, which is why I think doing the entire loop under lnet_res_lock(LNET_LOCK_EX) might not be a good idea.

Comment by Bruno Faccini (Inactive) [ 08/Jan/16 ]

Hello Olaf, I think your analysis of the code involved in this issue is pretty correct, as far as I understand it myself!
But after my wrong first fix attempt, and its rework with the help of your comments, I don't see how to quickly fix it without using LNET_LOCK_EX protection in lnet_ptl_match_delay().

Comment by Olaf Weber [ 08/Jan/16 ]

Hi Bruno,

Can you explain in a bit more detail why you think lnet_res_lock(LNET_LOCK_EX) is required? As far as I can tell, avoiding any reference to a message on the delay queue is sufficient.

Comment by Bruno Faccini (Inactive) [ 18/Jan/16 ]

Hello Olaf,
Sorry for this late update, but I wanted to be sure I fully understood the involved pieces of code before answering you.

And now I can tell that you were already far ahead of me in doing so, and I am pretty convinced that your patch should make it, particularly the change to no longer access msg->msg_rx_delayed outside of the lnet_ptl_[un]lock()/lnet_res_[un]lock() protection, and instead to use/check the LNET_MATCHMD_NONE return value, in both the lnet_ptl_match_md() and lnet_ptl_match_delay() routines.
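
For illustration, one minimal way to realize that principle at the bottom of the lnet_ptl_match_delay() loop is to capture the decision in a local variable while the locks are still held (a sketch of the idea only; the merged patch reworks the return-value handling more thoroughly):

    int delayed = 0;        /* local, set under the lock */
    ...
    if (lnet_ptl_is_lazy(ptl)) {
            msg->msg_rx_delayed = 1;
            list_add_tail(&msg->msg_list, &ptl->ptl_msg_delayed);
            delayed = 1;    /* record the decision while msg is still ours */
            rc = LNET_MATCHMD_NONE;
    } else {
            rc = LNET_MATCHMD_DROP;
    }

    lnet_ptl_unlock(ptl);
    lnet_res_unlock(cpt);

    /* exit test uses only locals: once on ptl_msg_delayed, msg can be
     * taken over and freed at any time, so msg->msg_rx_delayed must not
     * be read here anymore */
    if ((rc & LNET_MATCHMD_FINISH) != 0 || delayed)
            break;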

My previous doubts came from the analysis we did of the original crashes reported for this ticket, where the corruption (a re-use after free, in fact) was causing a GPF in lnet_ptl_match_delay() when trying to list_del_init(&msg->msg_list) after the !list_empty(&msg->msg_list) condition had been verified. But I am now convinced that this can come from the fact that, with a re-use after free mainly zeroing the old lnet_msg content, msg->msg_rx_delayed could have led to an unexpected additional loop iteration!

Thus I have abandoned http://review.whamcloud.com/17847, and since your http://review.whamcloud.com/17840 has triggered some unexpected and related errors, I have restarted its auto-test session.

Comment by Gerrit Updater [ 21/Jan/16 ]

Please ignore this patch; the patch from Olaf is essentially the same, so I will review his patch instead. Sorry, I didn't read through the full comments.

(Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/18081 (abandoned))

Comment by Olaf Weber [ 21/Jan/16 ]

Hi Liang. I had considered adding an LNET_MATCHMD_DELAYED code, and I do believe that approach would work. I ultimately decided against it in my proposal because I preferred not to change the interface of lnet_ptl_match_md().

Comment by Liang Zhen (Inactive) [ 21/Jan/16 ]

Olaf, yes I think your patch is quite reasonable, thanks.

Comment by Gerrit Updater [ 25/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17840/
Subject: LU-7324 lnet: Use after free in lnet_ptl_match_delay()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 607f6919ea67b101796630d4b55649a12ea0e859

Comment by Joseph Gmitter (Inactive) [ 25/Jan/16 ]

Landed for 2.8.0
