[LU-13648] Route status can be set incorrectly via lnet_notify() Created: 08/Jun/20 Updated: 11/Jul/20 Resolved: 11/Jul/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Found an issue while testing my patch to skip health/resends for a single-rail config: https://review.whamcloud.com/#/c/38448

The problem is in lnet_notify(). The LND tells us that some nid went down, but we recalculate aliveness from the peer NI's health value and the NI status reported by discovery:

        /* recalculate aliveness */
        alive = lnet_is_peer_ni_alive(lpni);
static inline bool
lnet_is_peer_ni_alive(struct lnet_peer_ni *lpni)
{
        bool halive = false;

        halive = (atomic_read(&lpni->lpni_healthv) >=
                  (LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage / 100));

        return halive && lpni->lpni_ns_status == LNET_NI_STATUS_UP;
}
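
To make the threshold concrete: assuming LNET_MAX_HEALTH_VALUE is 1000 and router_sensitivity_percentage is left at its default of 100 (both assumptions here, check the running version), only a peer NI that has never taken a health hit passes the check. A minimal user-space sketch, where is_alive() is just a stand-in for lnet_is_peer_ni_alive() and not the in-tree code:

/* Illustrative sketch only; values and names are assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_HEALTH      1000    /* assumed LNET_MAX_HEALTH_VALUE */
#define SENSITIVITY_PCT 100     /* assumed default router_sensitivity_percentage */

static bool is_alive(int healthv, bool ns_status_up)
{
        /* Threshold works out to 1000 * 100 / 100 = 1000, so any decrement
         * of the health value makes the peer NI count as dead.
         */
        bool halive = healthv >= (MAX_HEALTH * SENSITIVITY_PCT / 100);

        return halive && ns_status_up;
}

int main(void)
{
        printf("%d\n", is_alive(1000, true));   /* 1: still at full health */
        printf("%d\n", is_alive(999, true));    /* 0: one health hit is enough */
        return 0;
}
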
And then use that to set routes up/down:

        lp = lpni->lpni_peer_net->lpn_peer;
        lp->lp_alive = alive;
        list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
                lnet_set_route_aliveness(route, alive);
But, at least with gnilnd, we can get the notification from the LND before any tx sent to the router has failed, so the health value may not have been decremented yet. This can leave routes in the wrong up/down state.
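
Here's a minimal user-space sketch of that ordering problem. The struct and function names (fake_peer_ni, fake_peer_ni_alive) are made up for illustration, and the health/sensitivity values are the assumed defaults from above:

/* Hypothetical trace of the race described above; not lnet_notify() itself. */
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the bits of struct lnet_peer_ni that matter here. */
struct fake_peer_ni {
        int healthv;            /* no tx to the gateway has failed yet */
        bool ns_status_up;      /* discovery last reported the NI as up */
};

static bool fake_peer_ni_alive(const struct fake_peer_ni *p)
{
        /* Same shape as lnet_is_peer_ni_alive(), with an assumed max health
         * of 1000 and 100% router sensitivity.
         */
        return p->healthv >= 1000 && p->ns_status_up;
}

int main(void)
{
        struct fake_peer_ni gw = { .healthv = 1000, .ns_status_up = true };
        bool lnd_alive = false;         /* what gnilnd just told lnet_notify() */
        bool route_alive;

        /* The notification handler recalculates aliveness from health and
         * NI status rather than trusting the LND, and neither has been
         * updated yet ...
         */
        route_alive = fake_peer_ni_alive(&gw);

        /* ... so the route would be left marked up despite the LND saying
         * the nid is down.
         */
        printf("lnd_alive=%d route_alive=%d\n", lnd_alive, route_alive);
        return 0;
}

None of this is the fix itself; it only shows why recalculating aliveness from health/NI status at notification time can disagree with what the LND just reported. |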
| Comments |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38862 |
| Comment by Gerrit Updater [ 10/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38862/ |
| Comment by Peter Jones [ 11/Jul/20 ] |
|
Landed for 2.14 |