[LU-13362] Race between message finalize and discovery reply processing Created: 16/Mar/20  Updated: 21/May/21  Resolved: 24/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.5
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Two threads are trying to acquire lnet_net_lock(1):

crash_x86_64> p ((struct cfs_percpt_lock *)0xffff8808299ed880)->pcl_locks[1]
$11 = (spinlock_t *) 0xffff88076447c640
*hornc@cflosbld09 200312170801 $ grep -B 4 -A 8 ffff88076447c640 bt.all
PID: 8971   TASK: ffff88076b3c4f00  CPU: 1   COMMAND: "kiblnd_sd_00_00"
    [exception RIP: queued_spin_lock_slowpath+283]
    RIP: ffffffff810a573b  RSP: ffffc90000ea7a80  RFLAGS: 00000202
    RAX: 0000000000300101  RBX: ffff8808299ed880  RCX: 0000000000000001
    RDX: 0000000000000101  RSI: 0000000000000001  RDI: ffff88076447c640
    RBP: ffffc90000ea7a80   R8: 0000000000000101   R9: ffffffffa05d1729
    R10: 0000000000000000  R11: ffffffffa05ebb48  R12: 0000000000000008
    R13: ffff88082be96600  R14: ffff88082be96600  R15: ffff88071da3e568
    CS: 0010  SS: 0018
 #0 [ffffc90000ea7a88] _raw_spin_lock at ffffffff816b5411
 #1 [ffffc90000ea7a98] cfs_percpt_lock at ffffffffa053c546 [libcfs]
 #2 [ffffc90000ea7ac0] lnet_handle_send at ffffffffa05a3bd2 [lnet]
 #3 [ffffc90000ea7b28] lnet_select_pathway at ffffffffa05a6e18 [lnet]
--
PID: 8983   TASK: ffff8807167a3080  CPU: 11  COMMAND: "lnet_discovery"
    [exception RIP: queued_spin_lock_slowpath+184]
    RIP: ffffffff810a56d8  RSP: ffffc900286e3e48  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: ffff8808299ed880  RCX: ffff88085fae2a80
    RDX: 0000000000300101  RSI: 0000000000000101  RDI: ffff88076447c640
    RBP: ffffc900286e3e48   R8: 0000000000300000   R9: 0000000000000000
    R10: ffffc900286e3e90  R11: 0000000000000f2f  R12: 0000000000000008
    R13: ffff880828f021d0  R14: 0000000000000240  R15: 000500280a0c0035
    CS: 0010  SS: 0018
 #0 [ffffc900286e3e50] _raw_spin_lock at ffffffff816b5411
 #1 [ffffc900286e3e60] cfs_percpt_lock at ffffffffa053c546 [libcfs]
 #2 [ffffc900286e3e88] lnet_peer_discovery at ffffffffa05bc031 [lnet]
 #3 [ffffc900286e3f08] kthread at ffffffff8107e807
*hornc@cflosbld09 200312170801 $
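
For context, a rough paraphrase of the relevant definitions (not a verbatim excerpt): lnet_net_lock(1) is a per-CPT lock, so both threads above are spinning on slot 1 of the_lnet.ln_net_lock, which is exactly the spinlock address dereferenced by the crash command at the top.

/* Rough paraphrase, not a verbatim excerpt: lnet_net_lock(cpt) takes
 * one spinlock per CPT, i.e. the_lnet.ln_net_lock->pcl_locks[cpt]. */
#define lnet_net_lock(cpt)      cfs_percpt_lock(the_lnet.ln_net_lock, cpt)
#define lnet_net_unlock(cpt)    cfs_percpt_unlock(the_lnet.ln_net_lock, cpt)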

A debug patch (sketched after the backtrace below) tells us which thread is holding this lock:

crash_x86_64> p *((struct cfs_percpt_lock *)0xffff8808299ed880)->pcl_locks_info[1]
$4 = {
  pcl_fn_ptr = 0xffffffffa05a69b7 <lnet_select_pathway+103>,
  pcl_thr_ptr = 0xffff88082cd6f040
}
crash_x86_64> task 0xffff88082cd6f040 | grep PID
PID: 8965   TASK: ffff88082cd6f040  CPU: 10  COMMAND: "kgnilnd_sd_06"
crash_x86_64> bt 8965
PID: 8965   TASK: ffff88082cd6f040  CPU: 10  COMMAND: "kgnilnd_sd_06"
    [exception RIP: queued_spin_lock_slowpath+180]
    RIP: ffffffff810a56d4  RSP: ffffc90000d73ad8  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: ffff880828f02100  RCX: ffff88085faa2a80
    RDX: 00000000002c0101  RSI: 0000000000000101  RDI: ffff880828f02184
    RBP: ffffc90000d73ad8   R8: 00000000002c0000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88071e12d568
    R13: ffff880828f02100  R14: 0000000000000001  R15: ffffffffffffffff
    CS: 0010  SS: 0018
 #0 [ffffc90000d73ae0] _raw_spin_lock at ffffffff816b5411
 #1 [ffffc90000d73af0] lnet_peer_gw_discovery at ffffffffa05bd329 [lnet]
 #2 [ffffc90000d73b08] lnet_initiate_peer_discovery at ffffffffa05a31c3 [lnet]
 #3 [ffffc90000d73b40] lnet_select_pathway at ffffffffa05a6a4c [lnet]
 #4 [ffffc90000d73c40] lnet_send at ffffffffa05a7af5 [lnet]
 #5 [ffffc90000d73c60] lnet_finalize at ffffffffa05999fa [lnet]
 #6 [ffffc90000d73cc0] kgnilnd_tx_done at ffffffffa064cba4 [kgnilnd]
 #7 [ffffc90000d73d20] kgnilnd_check_fma_send_cq at ffffffffa0655884 [kgnilnd]
 #8 [ffffc90000d73e78] kgnilnd_scheduler at ffffffffa065a6ca [kgnilnd]
 #9 [ffffc90000d73f08] kthread at ffffffff8107e807
#10 [ffffc90000d73f50] ret_from_fork at ffffffff8180024a
crash_x86_64>
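
The field names (pcl_fn_ptr and pcl_thr_ptr) come from the dump; everything else below is a hypothetical reconstruction of the sort of bookkeeping such a debug patch would add, not the actual patch:

/* Hypothetical sketch of the debug instrumentation: record the call site
 * and the current task whenever a per-CPT slot is taken, so a crash dump
 * can identify the holder of pcl_locks[index]. */
struct cfs_percpt_lock_info {
        void                    *pcl_fn_ptr;    /* caller of cfs_percpt_lock() */
        struct task_struct      *pcl_thr_ptr;   /* task holding the slot */
};

void
cfs_percpt_lock(struct cfs_percpt_lock *pcl, int index)
{
        /* existing locking logic elided */
        spin_lock(pcl->pcl_locks[index]);
        pcl->pcl_locks_info[index]->pcl_fn_ptr = __builtin_return_address(0);
        pcl->pcl_locks_info[index]->pcl_thr_ptr = current;
}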

That thread, PID 8965, is trying to take lp_lock in lnet_peer_gw_discovery():

bool
lnet_peer_gw_discovery(struct lnet_peer *lp)
{
        bool rc = false;

        spin_lock(&lp->lp_lock); <<<< Thread here

The peer in question:

crash_x86_64> struct -o lnet_peer | grep lp_lock
  [132] spinlock_t lp_lock;
crash_x86_64> eval ffff880828f02184 - 132
hexadecimal: ffff880828f02100
    decimal: 18446612167360782592  (-131906348769024)
      octal: 1777774200405074020400
     binary: 1111111111111111100010000000100000101000111100000010000100000000
crash_x86_64>
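
Spelling out the arithmetic: lp_lock sits at offset 132 in struct lnet_peer, so subtracting 132 from the contended spinlock address recovers the containing peer; in code this is just container_of():

/* Manual equivalent of the crash arithmetic above. */
struct lnet_peer *lp = container_of((spinlock_t *)0xffff880828f02184,
                                    struct lnet_peer, lp_lock);
/* lp == (struct lnet_peer *)0xffff880828f02100 */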

From the backtrace info in the dump we can see a few threads with this lnet_peer address (ffff880828f02100) on their stacks.

One is a kgnilnd_sd thread:

PID: 8960   TASK: ffff8807644aa0c0  CPU: 13  COMMAND: "kgnilnd_sd_01"

This thread is finalizing a message. It already holds the peer NI's lpni_lock and is stuck acquiring the peer's lp_lock in lnet_return_rx_credits_locked():

lnet_return_rx_credits_locked(struct lnet_msg *msg)
{
...
                lp = rxpeerni->lpni_peer_net->lpn_peer;

                /* give back peer router credits */
                msg->msg_peerrtrcredit = 0;

                spin_lock(&rxpeerni->lpni_lock);
                spin_lock(&lp->lp_lock);  <<<<<<<<<< Thread is here

Meanwhile, a kiblnd_sd thread is processing a discovery reply:

crash_x86_64> bt 8977
PID: 8977   TASK: ffff88076550afc0  CPU: 15  COMMAND: "kiblnd_sd_03_00"
    [exception RIP: queued_spin_lock_slowpath+281]
    RIP: ffffffff810a5739  RSP: ffffc90000f13ae8  RFLAGS: 00000202
    RAX: 0000000000000101  RBX: ffff88082cfe3200  RCX: 0000000000000001
    RDX: 0000000000000101  RSI: 0000000000000001  RDI: ffff88082cfe32a4
    RBP: ffffc90000f13ae8   R8: 0000000000000101   R9: 000000000000000f
    R10: 0000000000000001  R11: ffff8806e12b7eea  R12: ffff880828f02100
    R13: ffff880828cb2209  R14: ffff880828f02184  R15: ffff880828cb2200
    CS: 0010  SS: 0018
 #0 [ffffc90000f13af0] _raw_spin_lock at ffffffff816b5411
 #1 [ffffc90000f13b00] lnet_peer_ni_clr_non_mr_pref_nid at ffffffffa05b81eb [lnet]
 #2 [ffffc90000f13b20] lnet_peer_clr_non_mr_pref_nids at ffffffffa05b82db [lnet]
 #3 [ffffc90000f13b40] lnet_discovery_event_handler at ffffffffa05be2cf [lnet]

This thread is holding lp_lock, which it took in lnet_discovery_event_reply(), and is now trying to take lpni_lock:

lnet_discovery_event_reply(struct lnet_peer *lp, struct lnet_event *ev)
{
...
        spin_lock(&lp->lp_lock);
...
                        CDEBUG(D_NET, "peer %s(%p) is MR capable\n",
                               libcfs_nid2str(lp->lp_primary_nid), lp);
                        lp->lp_state |= LNET_PEER_MULTI_RAIL;
                        lnet_peer_clr_non_mr_pref_nids(lp);

void
lnet_peer_clr_non_mr_pref_nids(struct lnet_peer *lp)
{
        struct lnet_peer_ni *lpni = NULL;

        while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
                lnet_peer_ni_clr_non_mr_pref_nid(lpni);
}

lnet_peer_ni_clr_non_mr_pref_nid(struct lnet_peer_ni *lpni)
{
...
        spin_lock(&lpni->lpni_lock); <<<< Thread is here

Finalize thread (PID 8960): spin_lock(lpni_lock)
Discovery thread (PID 8977): spin_lock(lp_lock)
Finalize thread (PID 8960): spin_lock(lp_lock)     <- blocks
Discovery thread (PID 8977): spin_lock(lpni_lock)  <- blocks
Deadlock.

Because these two threads are deadlocked, PID 8965 (which holds lnet_net_lock(1) and is waiting on the same lp_lock) cannot make progress, which in turn blocks every other thread trying to acquire lnet_net_lock(1).

The same issue exists with lnet_post_routed_recv_locked().
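
For illustration only (this is the generic remedy for an AB-BA inversion, not necessarily what the landed patch does): every path that needs both locks has to take them in one agreed order, or release one before taking the other. A minimal sketch with a hypothetical helper name, using the lp_lock-before-lpni_lock order that the discovery reply path already follows:

/* Illustrative sketch only, with a hypothetical helper name; not the
 * landed fix.  The point is simply that the finalize/routed-recv paths
 * and the discovery reply path must agree on a single lock order. */
static void
peer_rtr_credit_sketch(struct lnet_peer_ni *lpni)
{
        struct lnet_peer *lp = lpni->lpni_peer_net->lpn_peer;

        spin_lock(&lp->lp_lock);        /* peer-wide lock first ... */
        spin_lock(&lpni->lpni_lock);    /* ... then the per-NI lock */

        /* manipulate the peer router credits / queues here */

        spin_unlock(&lpni->lpni_lock);
        spin_unlock(&lp->lp_lock);
}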



 Comments   
Comment by Gerrit Updater [ 16/Mar/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/37937
Subject: LU-13362 lnet: Disc reply race with finalize and routed recv
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1453f3fdf7c3d6bba99a3b464b0a5b19997db1f5

Comment by Gerrit Updater [ 24/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37937/
Subject: LU-13362 lnet: Disc reply race with finalize and routed recv
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c700b4a410ab5542391d006ce541023ecf9b7a5d

Comment by Peter Jones [ 24/Mar/20 ]

Landed for 2.14
