[LU-13534] Landing LU-12678 is highly likely to introduce a random memory corruption bug Created: 07/May/20  Updated: 17/Feb/21  Resolved: 10/Jul/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Neil Brown
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13638 lnet: discard the callback Open
is related to LU-12678 LNet simplification work from linux c... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With the landing of LU-12678, ptlrpc holds a pointer to an lnet_me object without holding a reference on it (lnet_me has no reference counting).
The scenario is:
The LNet monitor_thread finds an expired response and starts to kill the MD. Once the MD has no references, it starts to kill the ME entry, but ptlrpc still holds a pointer to the ME object and tries to kill the ME itself.

void
lnet_md_unlink(struct lnet_libmd *md)
{
        if ((md->md_flags & LNET_MD_FLAG_ZOMBIE) == 0) {
                /* first unlink attempt... */
                struct lnet_me *me = md->md_me;

                md->md_flags |= LNET_MD_FLAG_ZOMBIE;

                /* Disassociate from ME (if any), and unlink it if it was created
                 * with LNET_UNLINK */
                if (me != NULL) {
                        /* detach MD from portal */
                        lnet_ptl_detach_md(me, md);
                        if (me->me_unlink == LNET_UNLINK)
                                lnet_me_unlink(me);
                }

                /* ensure all future handle lookups fail */
                lnet_res_lh_invalidate(&md->md_lh);
        }

        if (md->md_refcount != 0) {
                CDEBUG(D_NET, "Queueing unlink of md %p\n", md);
                return;
        }

        /* ... remainder of the function omitted in this excerpt ... */
}



So the lnet_me is not protected by the MD reference: the ME is unlinked on the first unlink attempt, before md_refcount is checked.
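
To illustrate the hazard described above, here is a minimal user-space sketch of one generic way to make two independent teardown paths safe: give the ME its own reference count, so whichever path drops the last reference performs the single free. This is not the LNet code and not necessarily what the patch for this ticket does; every toy_* name below is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins only; NOT the real struct lnet_me / struct lnet_libmd. */
struct toy_me {
        int refcount;           /* number of holders of this ME */
};

struct toy_md {
        struct toy_me *me;      /* attached ME, or NULL */
};

static int toy_me_frees;        /* how many times an ME has been freed */

static struct toy_me *toy_me_alloc(void)
{
        struct toy_me *me = calloc(1, sizeof(*me));

        if (me != NULL)
                me->refcount = 1;       /* reference owned by the creator */
        return me;
}

static void toy_me_get(struct toy_me *me)
{
        assert(me->refcount > 0);
        me->refcount++;
}

static void toy_me_put(struct toy_me *me)
{
        assert(me->refcount > 0);
        if (--me->refcount == 0) {
                toy_me_frees++;
                free(me);
        }
}

/* The MD takes its own reference when the ME is attached. */
static void toy_md_attach(struct toy_md *md, struct toy_me *me)
{
        toy_me_get(me);
        md->me = me;
}

/* Models the monitor-thread path: drop only the MD's reference. */
static void toy_md_unlink(struct toy_md *md)
{
        struct toy_me *me = md->me;

        if (me != NULL) {
                md->me = NULL;
                toy_me_put(me);
        }
}

/* Runs both teardown paths; returns the number of frees, or -1 on error. */
static int toy_scenario(void)
{
        struct toy_md md = { .me = NULL };
        struct toy_me *me = toy_me_alloc();
        int frees_after_md_unlink;

        if (me == NULL)
                return -1;
        toy_me_frees = 0;
        toy_md_attach(&md, me);         /* refcount is now 2 */
        toy_md_unlink(&md);             /* MD path: refcount drops to 1 */
        frees_after_md_unlink = toy_me_frees;
        toy_me_put(me);                 /* creator path: last put frees */
        if (frees_after_md_unlink != 0 || md.me != NULL)
                return -1;
        return toy_me_frees;
}
```

In this sketch the creator (standing in for ptlrpc) and the MD each own a reference, so the order in which the two paths run does not matter; without the count, the second path would operate on freed memory.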



 Comments   
Comment by James A Simmons [ 05/Jun/20 ]

https://review.whamcloud.com/#/c/38646/

Comment by James A Simmons [ 10/Jul/20 ]

Patch https://review.whamcloud.com/#/c/38646 landed, which should have resolved this issue.
