[LU-14223] Potential deadlock in lnet_peer_data_present() Created: 15/Dec/20  Updated: 15/Dec/20  Resolved: 15/Dec/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Potential deadlock introduced by commit:

commit ae0ac29348023b9d8df7783bff463d07e3762f82
Author: Chris Horn <chris.horn@hpe.com>
Date:   Thu Aug 6 16:39:27 2020 -0500

    LUS-9193 lnet: Transfer disc src NID when merging peers

Relevant code in lnet_peer_data_present():

                        struct lnet_peer *new_lp;
                        new_lp = lpni->lpni_peer_net->lpn_peer;
...
                        spin_lock(&lp->lp_lock);
                        spin_lock(&new_lp->lp_lock);
                        if (!(lp->lp_state & LNET_PEER_NO_DISCOVERY))
                                new_lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
                        if (lp->lp_state & LNET_PEER_MULTI_RAIL)
                                new_lp->lp_state |= LNET_PEER_MULTI_RAIL;
                        /* If we're processing a ping reply then we may be
                         * about to send a push to the peer that we ping'd.
                         * Since the ping reply that we're processing was
                         * received by lp, we need to set the discovery source
                         * NID for new_lp to the NID stored in lp.
                         */
                        if (lp->lp_disc_src_nid != LNET_NID_ANY)
                                new_lp->lp_disc_src_nid = lp->lp_disc_src_nid;
                        spin_unlock(&new_lp->lp_lock);
                        spin_unlock(&lp->lp_lock);

This logic reconciles a situation where the primary NID for a known peer has changed. It works when the peer had not yet been fully discovered. However, if the peer was previously discovered and then deletes its primary NID, "lp" and "new_lp" end up pointing to the same peer object. The code then tries to take the same lp_lock twice. Because kernel spinlocks are not recursive, the second spin_lock() call never returns and we deadlock.
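
One possible guard, shown here only as a userspace sketch and not as the fix that went into the LU-13894 patch, is to skip the state transfer when both pointers refer to the same peer. All names below (mock_peer, mock_spin_lock, merge_peer_state, and the PEER_* and NID_ANY constants) are hypothetical stand-ins for the LNet types and flags quoted above; the mock lock asserts on a double acquire where a real spinlock would spin forever.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the LNet flags; values are illustrative only. */
#define PEER_NO_DISCOVERY (1u << 0)
#define PEER_MULTI_RAIL   (1u << 1)
#define NID_ANY           UINT64_MAX

/* Mock of a non-recursive lock: "held" models the locked state. */
struct mock_lock {
	int held;
};

/* Mock of the peer fields touched by the snippet in this ticket. */
struct mock_peer {
	struct mock_lock lock;
	unsigned int state;
	uint64_t disc_src_nid;
};

/* A second acquire would spin forever in the kernel; here it asserts. */
static void mock_spin_lock(struct mock_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void mock_spin_unlock(struct mock_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * Transfer discovery state from lp to new_lp.
 * Returns 1 if state was transferred, 0 if lp and new_lp are the same
 * object, in which case there is nothing to merge and taking both locks
 * would self-deadlock.
 */
static int merge_peer_state(struct mock_peer *lp, struct mock_peer *new_lp)
{
	if (lp == new_lp)
		return 0;

	mock_spin_lock(&lp->lock);
	mock_spin_lock(&new_lp->lock);
	if (!(lp->state & PEER_NO_DISCOVERY))
		new_lp->state &= ~PEER_NO_DISCOVERY;
	if (lp->state & PEER_MULTI_RAIL)
		new_lp->state |= PEER_MULTI_RAIL;
	if (lp->disc_src_nid != NID_ANY)
		new_lp->disc_src_nid = lp->disc_src_nid;
	mock_spin_unlock(&new_lp->lock);
	mock_spin_unlock(&lp->lock);
	return 1;
}
```

Calling merge_peer_state() with the same pointer for both arguments returns 0 without touching the lock, while two distinct peers go through the same transfer as the quoted code. Whether skipping the transfer is the right behavior for the real peer-merge path is an assumption of this sketch.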



 Comments   
Comment by Chris Horn [ 15/Dec/20 ]

I did not realize that the patch which introduced the regression hadn't yet landed on master. Closing as not a bug. (I'll fix the regression in the patch for LU-13894.)
