[LU-14503] kiblnd: assertion that all net connections are closed may fail on shutdown - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.16.0
Affects Version/s: None
Labels:
- lnet
- o2iblnd

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

It appears that there is a scenario in which the following assertion in kiblnd_shutdown() may fail:

LASSERT (atomic_read(&net->ibn_nconns) == 0);

A connection may end up on the zombie list:

kiblnd_data.kib_connd_zombies

Cleaning up the connections on this list is the job of a kiblnd_connd instance:

         while (!kiblnd_data.kib_shutdown) {
                 int reconn = 0;
 
                 dropped_lock = 0;
 
                 if (!list_empty(&kiblnd_data.kib_connd_zombies)) {
                         struct kib_peer_ni *peer_ni = NULL;
 
                         conn = list_entry(kiblnd_data.kib_connd_zombies.next,
                                           struct kib_conn, ibc_list);
                         list_del(&conn->ibc_list);
                         if (conn->ibc_reconnect) {
                                 peer_ni = conn->ibc_peer;
                                 kiblnd_peer_addref(peer_ni);
                         }
 
                         spin_unlock_irqrestore(lock, flags);
                         dropped_lock = 1;
 
                         kiblnd_destroy_conn(conn);
 
                         spin_lock_irqsave(lock, flags);
                         if (!peer_ni) {
                                 LIBCFS_FREE(conn, sizeof(*conn));
                                 continue;
                         }
 
                         conn->ibc_peer = peer_ni;
                         if (peer_ni->ibp_reconnected < KIB_RECONN_HIGH_RACE)
                                 list_add_tail(&conn->ibc_list,
                                               &kiblnd_data.kib_reconn_list);
                         else
                                 list_add_tail(&conn->ibc_list,
                                               &kiblnd_data.kib_reconn_wait);
                 } 

................................
                 if (dropped_lock)
                         continue;
 
                 /* Nothing to do for 'timeout'  */
                 set_current_state(TASK_INTERRUPTIBLE);
                 add_wait_queue(&kiblnd_data.kib_connd_waitq, &wait);
                 spin_unlock_irqrestore(lock, flags);
 
                 schedule_timeout(timeout);
 
                 remove_wait_queue(&kiblnd_data.kib_connd_waitq, &wait);
                 spin_lock_irqsave(lock, flags);
         }

The loop exits when the kib_shutdown flag is set, and that flag is set only after the assertion in kiblnd_shutdown() has been evaluated. Because the kiblnd_connd instances are not signalled to wake up until kib_shutdown is set, kiblnd_connd() may not get a chance to clean up the zombie list before the assertion runs.

The kiblnd shutdown procedure needs to be modified to ensure that connections on the zombie list are cleaned up before asserting that net->ibn_nconns is zero.
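
The ordering described above can be illustrated with a minimal userspace sketch. It is not the merged patch and not kernel code: it models kiblnd_connd with a pthread, models kib_connd_waitq with a condition variable, and assumes one possible remedy, namely that shutdown wakes the worker and waits for the connection count to reach zero before asserting. The names add_zombie, net_shutdown, run_demo and nconns_zero are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kiblnd state named in the ticket. */
struct conn {
	struct conn *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t connd_waitq = PTHREAD_COND_INITIALIZER; /* ~ kib_connd_waitq */
static pthread_cond_t nconns_zero = PTHREAD_COND_INITIALIZER; /* hypothetical */
static struct conn *zombies;	/* ~ kib_connd_zombies */
static int nconns;		/* ~ net->ibn_nconns */
static bool shutting_down;	/* ~ kib_shutdown */

/* Park a connection on the zombie list without waking connd (worst case). */
static void add_zombie(void)
{
	struct conn *c = malloc(sizeof(*c));

	assert(c != NULL);
	pthread_mutex_lock(&lock);
	c->next = zombies;
	zombies = c;
	nconns++;
	pthread_mutex_unlock(&lock);
}

/* Simplified connd: destroy zombies, otherwise sleep until signalled. */
static void *connd(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!shutting_down) {
		if (zombies != NULL) {
			struct conn *c = zombies;

			zombies = c->next;
			pthread_mutex_unlock(&lock);
			free(c);	/* ~ kiblnd_destroy_conn() + LIBCFS_FREE() */
			pthread_mutex_lock(&lock);
			if (--nconns == 0)
				pthread_cond_broadcast(&nconns_zero);
			continue;
		}
		pthread_cond_wait(&connd_waitq, &lock);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Shutdown: wake connd, wait for the count to drain, and only then assert. */
static int net_shutdown(pthread_t connd_thread)
{
	int remaining;

	pthread_mutex_lock(&lock);
	pthread_cond_broadcast(&connd_waitq);	/* let connd see the zombies */
	while (nconns != 0)
		pthread_cond_wait(&nconns_zero, &lock);
	assert(nconns == 0);			/* ~ the LASSERT in the ticket */
	remaining = nconns;
	shutting_down = true;			/* only now may connd exit */
	pthread_cond_broadcast(&connd_waitq);
	pthread_mutex_unlock(&lock);
	pthread_join(connd_thread, NULL);
	return remaining;
}

/* Start connd, queue n zombies, shut down; returns connections left over. */
static int run_demo(int n)
{
	pthread_t t;
	int rc;

	shutting_down = false;
	rc = pthread_create(&t, NULL, connd, NULL);
	assert(rc == 0);
	while (n-- > 0)
		add_zombie();
	return net_shutdown(t);
}
```

Calling run_demo(3) queues three zombie connections and shuts down; the assertion holds because the shutdown path wakes the worker and waits for the count to drain before checking it. If the first broadcast and the wait loop in net_shutdown() were removed, the worker could still be asleep when the assertion is evaluated, which is the failure this ticket describes. In the kernel the same roles would be played by wait-queue primitives such as wake_up(); the ticket does not state which mechanism the merged patch uses.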

An example of the assertion going off is reported by sihara for https://review.whamcloud.com/#/c/41937/

Activity

Peter Jones added a comment - 11/Jun/22 3:26 PM

Landed for 2.16

Gerrit Updater added a comment - 11/Jun/22 5:30 AM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/42068/
Subject: LU-14503 o2iblnd: clean up zombie connections on shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2a183829cdcc7008f2b9706cb212b22b877dfce0

Gerrit Updater added a comment - 18/Mar/21 3:53 AM

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42068
Subject: LU-14503 o2iblnd: clean up zombie connections on shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 016029d97a8af446452b9934f4a01d4ea800ea7e

Shuichi Ihara added a comment - 17/Mar/21 10:48 PM

I think patch https://review.whamcloud.com/41988 solved a crash problem which was reproduced by https://review.whamcloud.com/41988 in LU-14499.
I have run the same LU-14499 reproducer more than 100 times, and the problem never occurred.

Gerrit Updater added a comment - 10/Mar/21 7:52 PM

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41988
Subject: LU-14503 o2iblnd: clean up zombie connections on shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d29f5d5998b9082f370bb52b337930ec6f246530

People

Assignee: Serguei Smirnov

Reporter: Serguei Smirnov

Votes: 0

Watchers: 5

Dates

Created: 09/Mar/21 10:39 PM

Updated: 11/Jun/22 3:26 PM

Resolved: 11/Jun/22 3:26 PM