LU-13648: Route status can be set incorrectly via lnet_notify()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3

      Description

      Found an issue while testing my patch to skip health/resends for single-rail configs: https://review.whamcloud.com/#/c/38448

      The problem is in lnet_notify(). The LND tells us some NID went down, but we recalculate aliveness based on health values and the NI status reported by discovery:

              /* recalculate aliveness */
              alive = lnet_is_peer_ni_alive(lpni);
      
      static inline bool
      lnet_is_peer_ni_alive(struct lnet_peer_ni *lpni)
      {
              bool halive = false;
      
              halive = (atomic_read(&lpni->lpni_healthv) >=
                       (LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage / 100));
      
              return halive && lpni->lpni_ns_status == LNET_NI_STATUS_UP;
      }
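
      To put numbers on this check: assuming the defaults LNET_MAX_HEALTH_VALUE = 1000 and router_sensitivity_percentage = 100, the threshold is 1000 * 100 / 100 = 1000, so halive is true only while lpni_healthv is still at its maximum; a single health decrement is enough to drop the peer NI below it, and an untouched health value always reads as alive.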
      

      And then use that to set routes up/down:

                      lp = lpni->lpni_peer_net->lpn_peer;
                      lp->lp_alive = alive;
                      list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
                              lnet_set_route_aliveness(route, alive);
      

      But, at least with gnilnd, we can get a notification from the LND before any tx sent to the router has failed, so the health value may not have been decremented ("ding'd") yet. This can leave routes in the wrong up/down state.
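
      To illustrate the race, here is a minimal user-space sketch (not Lustre code; the struct, constants, and function names are stand-ins modeled on the snippets above, assuming the defaults LNET_MAX_HEALTH_VALUE = 1000 and router_sensitivity_percentage = 100). The LND reports the gateway down, but no tx has failed and discovery still reports the NI up, so the "recalculated" aliveness comes back true and the routes stay up:

      #include <stdbool.h>
      #include <stdio.h>

      /* Stand-ins for the real LNet constants (assumed defaults). */
      #define LNET_MAX_HEALTH_VALUE 1000
      #define LNET_NI_STATUS_UP     1

      static int router_sensitivity_percentage = 100;

      /* Simplified model of struct lnet_peer_ni. */
      struct peer_ni {
              int healthv;    /* stands in for atomic lpni_healthv */
              int ns_status;  /* stands in for lpni_ns_status */
      };

      /* Mirrors the lnet_is_peer_ni_alive() logic quoted above. */
      static bool is_peer_ni_alive(struct peer_ni *lpni)
      {
              bool halive = lpni->healthv >=
                      (LNET_MAX_HEALTH_VALUE *
                       router_sensitivity_percentage / 100);

              return halive && lpni->ns_status == LNET_NI_STATUS_UP;
      }

      int main(void)
      {
              /* Gateway peer NI: no tx has failed yet, and discovery
               * still reports the NI up. */
              struct peer_ni gw = {
                      .healthv = LNET_MAX_HEALTH_VALUE,
                      .ns_status = LNET_NI_STATUS_UP,
              };

              /* The LND calls lnet_notify(..., alive = false), but the
               * recalculation ignores what the LND just told us. */
              bool lnd_alive = false;
              bool recalculated = is_peer_ni_alive(&gw);

              printf("LND says alive=%d, recalculated alive=%d\n",
                     lnd_alive, recalculated);

              /* Routes are set from the recalculated value, so they stay
               * up even though the LND reported the peer down. */
              return 0;
      }

      This prints "LND says alive=0, recalculated alive=1", i.e. lnet_notify() would set the routes up in the same call in which the LND told us the gateway is down.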


    People

    • Assignee: Chris Horn (hornc)
    • Reporter: Chris Horn (hornc)