[LU-12233] Deadlock on LNet shutdown Created: 26/Apr/19  Updated: 29/Oct/20  Resolved: 25/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13807 LNet: Net delete deadlock Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I reproduced this issue with master and Cray's 2.12 branch. For completeness I'll note that my master was slightly modified so that I can configure LNet on Cray's hardware, and I also applied the fix from LU-11756.

Here's the relevant git log. Commit '8cb7ccf54e' is on master.

86ef522cac LU-11756 o2iblnd: kib_conn leak
888adb9340 MRP-342 lnet: add config file support
d661c584c6 Revert "LU-11838 lnet: change lnet_ipaddr_enumerate() to use for_each_netdev()"
4c681cf4ee Revert "LU-11838 o2iblnd: get IP address more directly."
f4fe014620 Revert "LU-6399 lnet: socket cleanup"
8cb7ccf54e LU-11986 lnet: properly cleanup lnet debugfs files

LNetNIFini() takes the ln_api_mutex and then shuts down LNet. It doesn't release the mutex until all teardown functions have returned.

The message receive path also takes the ln_api_mutex in lnet_nid2peerni_locked().
kgnilnd_check_fma_rx->lnet_parse->lnet_nid2peerni_locked
kiblnd_handle_rx->lnet_parse->lnet_nid2peerni_locked
ksocknal_process_receive->lnet_parse->lnet_nid2peerni_locked

/*
 * Get a peer_ni for the given nid, create it if necessary. Takes a
 * hold on the peer_ni.
 */
struct lnet_peer_ni *
lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref, int cpt)
{
        struct lnet_peer_ni *lpni = NULL;
        int rc;

        if (the_lnet.ln_state != LNET_STATE_RUNNING)
                return ERR_PTR(-ESHUTDOWN);

        /*
         * find if a peer_ni already exists.
         * If so then just return that.
         */
        lpni = lnet_find_peer_ni_locked(nid);
        if (lpni)
                return lpni;

        /*
         * Slow path:
         * use the lnet_api_mutex to serialize the creation of the peer_ni
         * and the creation/deletion of the local ni/net. When a local ni is
         * created, if there exists a set of peer_nis on that network,
         * they need to be traversed and updated. When a local NI is
         * deleted, which could result in a network being deleted, then
         * all peer nis on that network need to be removed as well.
         *
         * Creation through traffic should also be serialized with
         * creation through DLC.
         */
        lnet_net_unlock(cpt);
        mutex_lock(&the_lnet.ln_api_mutex);
        ...

int
LNetNIFini()
{
        mutex_lock(&the_lnet.ln_api_mutex);

        LASSERT(the_lnet.ln_refcount > 0);

        if (the_lnet.ln_refcount != 1) {
                the_lnet.ln_refcount--;
        } else {
                LASSERT(!the_lnet.ln_niinit_self);

                lnet_fault_fini();

                lnet_router_debugfs_init();
                lnet_peer_discovery_stop();
                lnet_push_target_fini();
                lnet_monitor_thr_stop();
                lnet_ping_target_fini();

                /* Teardown fns that use my own API functions BEFORE here */
                the_lnet.ln_refcount = 0;

                lnet_acceptor_stop();
                lnet_destroy_routes();
                lnet_shutdown_lndnets(); <<<  the_lnet.ln_state = LNET_STATE_STOPPING; happens here
                lnet_unprepare();
        }

        mutex_unlock(&the_lnet.ln_api_mutex);
        return 0;
}
EXPORT_SYMBOL(LNetNIFini);

We can see there is a decent-sized window in which the deadlock can be hit: lnet_nid2peerni_locked() only checks ln_state before trying the mutex, and ln_state is not set to LNET_STATE_STOPPING until lnet_shutdown_lndnets() runs, deep inside the critical section.
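
To make the cycle concrete, here is a minimal, deliberately simplified sketch in plain pthreads (not Lustre code; all names are made up for the illustration). The main thread plays the role of LNetNIFini()/the LND shutdown path: it takes the mutex and then spins waiting for outstanding work to drain. The worker plays the role of the receive path: it needs the same mutex before it can drain, so neither side makes progress.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;
static int inflight = 1;                /* "peers still connected" */

static void *worker(void *arg)
{
        sleep(1);                       /* let shutdown grab the mutex first */
        pthread_mutex_lock(&api_mutex); /* blocks forever: shutdown holds it */
        inflight = 0;                   /* never reached */
        pthread_mutex_unlock(&api_mutex);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);

        pthread_mutex_lock(&api_mutex); /* LNetNIFini() taking ln_api_mutex */
        while (inflight)                /* "waiting for N peers to disconnect" */
                sleep(1);
        pthread_mutex_unlock(&api_mutex);

        pthread_join(t, NULL);
        printf("shutdown complete\n");  /* never reached */
        return 0;
}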

It is easy for me to reproduce:

[root@snx11922n000 ~]# pdsh -g lustre modprobe lnet; lctl net up ; lctl list_nids ; lctl ping 10.12.0.50@o2ib40 ; lctl net down ; lustre_rmmod

Sometimes the command needs to be repeated a couple of times.

I believe this regression was introduced by:

commit fa8b4e6357c53ea457ef6624b0b19bece0b0fdde
Author: Amir Shehata <amir.shehata@intel.com>
Date:   Thu May 26 15:42:39 2016 -0700

    LU-7734 lnet: peer/peer_ni handling adjustments


 Comments   
Comment by Chris Horn [ 26/Apr/19 ]

I reported this issue internally at Cray, but I don't have the cycles to work on it right now. If someone else is able to work out a patch that'd be great, and I'll be sure to update this ticket if Cray begins work on this issue.

Comment by Chris Horn [ 29/Apr/19 ]

The specific instance I found was with o2iblnd. Maybe it's specific to that LND.

LNetNIFini()->lnet_shutdown_lndnets()->lnet_shutdown_lndnet()->lnet_shutdown_lndni()->lnet_clear_zombies_nis_locked()->kiblnd_shutdown() <<<< We're holding ln_api_mutex

In kiblnd_shutdown() we're waiting for all peers to disconnect:

                while (atomic_read(&net->ibn_npeers) != 0) {
                        i++;
                        /* power of 2? */
                        CDEBUG(((i & (-i)) == i) ? D_WARNING : D_NET,
                               "%s: waiting for %d peers to disconnect\n",
                               libcfs_nid2str(ni->ni_nid),
                               atomic_read(&net->ibn_npeers));
                        set_current_state(TASK_UNINTERRUPTIBLE);
                        schedule_timeout(cfs_time_seconds(1));
                }

But the ib_cm thread is stuck trying to acquire the ln_api_mutex:

Apr 25 19:48:15 snx11922n002 kernel: INFO: task kworker/0:0:16461 blocked for more than 120 seconds.
Apr 25 19:48:15 snx11922n002 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 25 19:48:15 snx11922n002 kernel: kworker/0:0     D ffff880172949fa0     0 16461      2 0x00000000
Apr 25 19:48:15 snx11922n002 kernel: Workqueue: ib_cm cm_work_handler [ib_cm]
Apr 25 19:48:15 snx11922n002 kernel: Call Trace:
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff816b5329>] schedule_preempt_disabled+0x29/0x70
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff816b30d7>] __mutex_lock_slowpath+0xc7/0x1d0
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff816b24bf>] mutex_lock+0x1f/0x2f
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc0b2ed91>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc0b1ee34>] lnet_parse+0x794/0x1260 [lnet]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc0a42163>] kiblnd_handle_rx+0x213/0x6b0 [ko2iblnd]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc0a4267f>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc0a43576>] kiblnd_connreq_done+0x286/0x6c0 [ko2iblnd]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc0a46327>] kiblnd_cm_callback+0x11e7/0x2390 [ko2iblnd]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810aa05a>] ? __queue_delayed_work+0xaa/0x1a0
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810aa3c1>] ? try_to_grab_pending+0xb1/0x160
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc05013e0>] cma_ib_handler+0xc0/0x290 [rdma_cm]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc04ed5ab>] cm_process_work+0x2b/0x130 [ib_cm]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffffc04ef943>] cm_work_handler+0xaa3/0x12db [ib_cm]
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810abe2f>] process_one_work+0x17f/0x440
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810acaf6>] worker_thread+0x126/0x3c0
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810ac9d0>] ? manage_workers.isra.24+0x2a0/0x2a0
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810b4031>] kthread+0xd1/0xe0
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff816c155d>] ret_from_fork+0x5d/0xb0
Apr 25 19:48:15 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Apr 25 19:49:10 snx11922n002 kernel: LNet: 15887:0:(o2iblnd.c:3032:kiblnd_shutdown()) 10.12.0.50@o2ib40: waiting for 1 peers to disconnect

Another time I saw kiblnd_sd threads stuck:

Apr 26 00:00:06 snx11922n002 kernel: INFO: task kiblnd_sd_02_00:32191 blocked for more than 120 seconds.
Apr 26 00:00:06 snx11922n002 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 00:00:06 snx11922n002 kernel: kiblnd_sd_02_00 D ffff881fce3adee0     0 32191      2 0x00000000
Apr 26 00:00:06 snx11922n002 kernel: Call Trace:
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff810cc938>] ? __enqueue_entity+0x78/0x80
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff816b5329>] schedule_preempt_disabled+0x29/0x70
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff816b30d7>] __mutex_lock_slowpath+0xc7/0x1d0
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff816b24bf>] mutex_lock+0x1f/0x2f
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffffc0b799d1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffffc0b66ba1>] lnet_parse+0x791/0x11e0 [lnet]
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffffc090b153>] kiblnd_handle_rx+0x213/0x6b0 [ko2iblnd]
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffffc091228c>] kiblnd_scheduler+0xf3c/0x1180 [ko2iblnd]
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff810cb0c5>] ? sched_clock_cpu+0x85/0xc0
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff810c7c80>] ? wake_up_state+0x20/0x20
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffffc0911350>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff810b4031>] kthread+0xd1/0xe0
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff816c155d>] ret_from_fork+0x5d/0xb0
Apr 26 00:00:06 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Comment by Chris Horn [ 29/Jul/19 ]

Started hitting this internally a bit more often. Bumping priority.

Comment by Chris Horn [ 12/Sep/19 ]

Maybe this would work? The idea is to set ln_state to LNET_STATE_STOPPING (under LNET_LOCK_EX) before taking the ln_api_mutex, so that the receive path fails the LNET_STATE_RUNNING check at the top of lnet_nid2peerni_locked() and returns -ESHUTDOWN instead of blocking on the mutex:

diff --git a/lnet/lnet/api-ni.c b/lnet/lnet/api-ni.c
index d50c9395e2..f4d224d39f 100644
--- a/lnet/lnet/api-ni.c
+++ b/lnet/lnet/api-ni.c
@@ -2155,11 +2155,10 @@ lnet_shutdown_lndnets(void)
 	/* NB called holding the global mutex */

 	/* All quiet on the API front */
-	LASSERT(the_lnet.ln_state == LNET_STATE_RUNNING);
+	LASSERT(the_lnet.ln_state == LNET_STATE_STOPPING);
 	LASSERT(the_lnet.ln_refcount == 0);

 	lnet_net_lock(LNET_LOCK_EX);
-	the_lnet.ln_state = LNET_STATE_STOPPING;

 	while (!list_empty(&the_lnet.ln_nets)) {
 		/*
@@ -2746,6 +2745,10 @@ EXPORT_SYMBOL(LNetNIInit);
 int
 LNetNIFini()
 {
+	lnet_net_lock(LNET_LOCK_EX);
+	the_lnet.ln_state = LNET_STATE_STOPPING;
+	lnet_net_unlock(LNET_LOCK_EX);
+
 	mutex_lock(&the_lnet.ln_api_mutex);

 	LASSERT(the_lnet.ln_refcount > 0);
Comment by Chris Horn [ 06/Aug/20 ]

There's another potential deadlock with shutdown and the discovery thread.

lnet_peer_data_present() takes the ln_api_mutex and then checks ln_state, but the shutdown thread could already be holding the ln_api_mutex. If lnet_peer_discovery_stop(), which LNetNIFini() calls while holding the mutex, waits for the discovery thread to exit, the two block each other.

static int lnet_peer_data_present(struct lnet_peer *lp)
...
        mutex_lock(&the_lnet.ln_api_mutex);
        if (the_lnet.ln_state != LNET_STATE_RUNNING) {
                rc = -ESHUTDOWN;
                goto out;
        }
Comment by Gerrit Updater [ 16/Sep/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39933
Subject: LU-12233 lnet: deadlock on LNet shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a5d809cbf8c42b25c306b3ec38122025e799e406

Comment by Gerrit Updater [ 25/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39933/
Subject: LU-12233 lnet: deadlock on LNet shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e0c445648a38fb72cc426ac0c16c33f5183cda08

Comment by Peter Jones [ 25/Sep/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 07/Oct/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40171
Subject: LU-12233 lnet: deadlock on LNet shutdown
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c60b67b2bce3067c9a1d1a1c96ec3a86e931ca79

Comment by Gerrit Updater [ 29/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40171/
Subject: LU-12233 lnet: deadlock on LNet shutdown
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 6d92d5d0e710e60a8ede7da19e6a577696697a91
