[LU-14503] kiblnd: assertion that all net connections are closed may fail on shutdown Created: 09/Mar/21 Updated: 11/Jun/22 Resolved: 11/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lnet, o2iblnd | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It appears that there is a scenario in which the following assertion in kiblnd_shutdown() may fail:

    LASSERT (atomic_read(&net->ibn_nconns) == 0);

A connection may end up on the zombie list, kiblnd_data.kib_connd_zombies. Cleaning up the connections on this list is the job of a kiblnd_connd instance:

    while (!kiblnd_data.kib_shutdown) {
    	int reconn = 0;

    	dropped_lock = 0;

    	if (!list_empty(&kiblnd_data.kib_connd_zombies)) {
    		struct kib_peer_ni *peer_ni = NULL;

    		conn = list_entry(kiblnd_data.kib_connd_zombies.next,
    				  struct kib_conn, ibc_list);
    		list_del(&conn->ibc_list);
    		if (conn->ibc_reconnect) {
    			peer_ni = conn->ibc_peer;
    			kiblnd_peer_addref(peer_ni);
    		}

    		spin_unlock_irqrestore(lock, flags);
    		dropped_lock = 1;

    		kiblnd_destroy_conn(conn);

    		spin_lock_irqsave(lock, flags);
    		if (!peer_ni) {
    			LIBCFS_FREE(conn, sizeof(*conn));
    			continue;
    		}

    		conn->ibc_peer = peer_ni;
    		if (peer_ni->ibp_reconnected < KIB_RECONN_HIGH_RACE)
    			list_add_tail(&conn->ibc_list,
    				      &kiblnd_data.kib_reconn_list);
    		else
    			list_add_tail(&conn->ibc_list,
    				      &kiblnd_data.kib_reconn_wait);
    	}

    	................................

    	if (dropped_lock)
    		continue;

    	/* Nothing to do for 'timeout' */
    	set_current_state(TASK_INTERRUPTIBLE);
    	add_wait_queue(&kiblnd_data.kib_connd_waitq, &wait);
    	spin_unlock_irqrestore(lock, flags);

    	schedule_timeout(timeout);

    	remove_wait_queue(&kiblnd_data.kib_connd_waitq, &wait);
    	spin_lock_irqsave(lock, flags);
    }

The loop exits when the kib_shutdown flag is set, and that flag is set later than the assertion in kiblnd_shutdown(). However, kiblnd_connd() may not be given the chance to clean up before the assertion is checked, because the kiblnd_connd instances are not signalled to wake up until the kib_shutdown flag is set.

The kiblnd shutdown procedure needs to be modified to ensure that connections on the zombie list are cleaned up before the assertion is checked.

An example of the assertion going off is reported by sihara for https://review.whamcloud.com/#/c/41937/
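
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch using pthreads, not o2iblnd code, and it does not claim to reflect what the eventual patch does: struct model, connd_thread() and simulate_shutdown() are hypothetical names, the counters stand in for kib_connd_zombies and ibn_nconns, and a condition variable stands in for kib_connd_waitq. The point it shows is that the shutdown path wakes the cleaner and waits for the count to reach zero before asserting, and only then sets the shutdown flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are the o2iblnd API. */
struct model {
	pthread_mutex_t lock;
	pthread_cond_t  connd_waitq; /* stands in for kib_connd_waitq */
	pthread_cond_t  drained;     /* signalled when nconns reaches 0 */
	int             zombies;     /* entries on the zombie list */
	int             nconns;      /* stands in for net->ibn_nconns */
	bool            shutdown;    /* stands in for kib_shutdown */
};

/* Cleaner loop: destroy zombies while there are any, otherwise sleep. */
static void *connd_thread(void *arg)
{
	struct model *m = arg;

	pthread_mutex_lock(&m->lock);
	while (!m->shutdown) {
		if (m->zombies > 0) {
			m->zombies--;   /* models kiblnd_destroy_conn() */
			m->nconns--;
			if (m->nconns == 0)
				pthread_cond_signal(&m->drained);
			continue;
		}
		/* Nothing to do: sleep until someone wakes us. */
		pthread_cond_wait(&m->connd_waitq, &m->lock);
	}
	pthread_mutex_unlock(&m->lock);
	return NULL;
}

/* Returns the connection count observed at the point of the assertion. */
int simulate_shutdown(int nzombies)
{
	struct model m = { .zombies = 0, .nconns = 0, .shutdown = false };
	pthread_t tid;
	int remaining;

	pthread_mutex_init(&m.lock, NULL);
	pthread_cond_init(&m.connd_waitq, NULL);
	pthread_cond_init(&m.drained, NULL);
	pthread_create(&tid, NULL, connd_thread, &m);

	pthread_mutex_lock(&m.lock);
	/* Connections land on the zombie list while the cleaner may be asleep. */
	m.zombies = nzombies;
	m.nconns = nzombies;

	/* Wake the cleaner first, then wait for it to drain the list. */
	pthread_cond_broadcast(&m.connd_waitq);
	while (m.nconns != 0)
		pthread_cond_wait(&m.drained, &m.lock);

	/* Only now is it safe to check the counter. */
	assert(m.nconns == 0);
	remaining = m.nconns;

	/* Finally set the shutdown flag and wake the cleaner so it exits. */
	m.shutdown = true;
	pthread_cond_broadcast(&m.connd_waitq);
	pthread_mutex_unlock(&m.lock);

	pthread_join(tid, NULL);
	pthread_cond_destroy(&m.drained);
	pthread_cond_destroy(&m.connd_waitq);
	pthread_mutex_destroy(&m.lock);
	return remaining;
}
```

Calling simulate_shutdown(8) from a test program models eight connections parked on the zombie list at shutdown time. If the wake-up and wait were removed, the assertion would be checked while the cleaner could still be asleep, which is the failure mode this ticket describes.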
|
| Comments |
| Comment by Gerrit Updater [ 10/Mar/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41988 |
| Comment by Shuichi Ihara [ 17/Mar/21 ] |
|
I think patch https://review.whamcloud.com/41988 solved a crash problem which was reproduced by https://review.whamcloud.com/41988 in LU-14499. |
| Comment by Gerrit Updater [ 18/Mar/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42068 |
| Comment by Gerrit Updater [ 11/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/42068/ |
| Comment by Peter Jones [ 11/Jun/22 ] |
|
Landed for 2.16 |