[LU-13571] Refine which network errors result in LNet Health activity Created: 15/May/20 Updated: 23/Feb/21 Resolved: 03/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Epic Link: | unlabelled-LU-13422 | ||||
| Description |
|
Rather than ignore these errors, we decided that, with the enhancement in https://jira.whamcloud.com/browse/LU-13569, the LND should instead return LNET_MSG_STATUS_NETWORK_TIMEOUT to LNet so that the health of both the local NI and the remote NI is decremented. This way, if the problem really is with the remote NI, that is reflected in the remote NI's health value and can be accounted for on future sends.
Related to this, we decided that the LOCAL_TIMEOUT returned in the kiblnd_check_conns() path should also be NETWORK_TIMEOUT:
kiblnd_check_conns()
...
        /* Check tx_deadline */
        list_for_each_entry_safe(tx, tx_tmp, &peer_ni->ibp_tx_queue, tx_list) {
                if (ktime_compare(ktime_get(), tx->tx_deadline) >= 0) {
                        CWARN("Timed out tx for %s: %lld seconds\n",
                              libcfs_nid2str(peer_ni->ibp_nid),
                              ktime_ms_delta(ktime_get(),
                                             tx->tx_deadline) / MSEC_PER_SEC);
                        list_move(&tx->tx_list, &timedout_txs);
                }
        }
...
        if (!list_empty(&timedout_txs))
                kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT,
                                   LNET_MSG_STATUS_LOCAL_TIMEOUT);
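To illustrate why the status passed to kiblnd_txlist_done() matters, here is a minimal stand-alone sketch of the behaviour described above: a local timeout lowers only the local NI's health, while a network timeout lowers both the local and the remote NI's health. This is a simplified model, not LNet code. Every sketch_* name, the struct layout, and the numeric health values are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two health statuses discussed in this
 * ticket. The real LNet enum has more values. */
enum sketch_hstatus {
	SKETCH_LOCAL_TIMEOUT,
	SKETCH_NETWORK_TIMEOUT,
};

/* Hypothetical, simplified NI record: only a health value. */
struct sketch_ni {
	int health;
};

/* Illustrative numbers only; assumed for this sketch. */
#define SKETCH_MAX_HEALTH	1000
#define SKETCH_SENSITIVITY	100

/* Lower one NI's health by the sensitivity, never going below zero. */
static void sketch_dec_health(struct sketch_ni *ni)
{
	assert(ni != NULL);
	ni->health -= SKETCH_SENSITIVITY;
	if (ni->health < 0)
		ni->health = 0;
}

/* Decide which NIs are penalised for a failed send with this status. */
static void sketch_health_check(enum sketch_hstatus status,
				struct sketch_ni *local,
				struct sketch_ni *remote)
{
	switch (status) {
	case SKETCH_LOCAL_TIMEOUT:
		/* Only the local NI is blamed. */
		sketch_dec_health(local);
		break;
	case SKETCH_NETWORK_TIMEOUT:
		/* The fault could be on either side, so both are blamed. */
		sketch_dec_health(local);
		sketch_dec_health(remote);
		break;
	}
}
```

In this model, reporting the tx_deadline expiry as SKETCH_NETWORK_TIMEOUT rather than SKETCH_LOCAL_TIMEOUT is what lets a misbehaving remote NI accumulate a lower health value, so that later sends can take it into account.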
So for this ticket I plan to push three patches. Patch 3 probably needs to be based on top of the patches for |
| Comments |
| Comment by Gerrit Updater [ 14/Sep/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39898 |
| Comment by Gerrit Updater [ 14/Sep/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39901 |
| Comment by Gerrit Updater [ 14/Sep/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39899 |
| Comment by Gerrit Updater [ 14/Sep/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39900 |
| Comment by Gerrit Updater [ 17/Sep/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39965 |
| Comment by Gerrit Updater [ 26/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39898/ |
| Comment by Gerrit Updater [ 03/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39899/ |
| Comment by Gerrit Updater [ 03/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39900/ |
| Comment by Peter Jones [ 03/Dec/20 ] |
|
Landed for 2.14 |