Details
Type: Improvement
Resolution: Unresolved
Priority: Minor
Description
On an o2ib filesystem connected to tcp clients via LNet routers, it was observed that a client crash could cause the LNet health values of the o2ib router peer NIs to be decremented.
For example, an OSS sends bulk payload as an LNet PUT with an ACK requested. The router only sends the ACK after the message is successfully forwarded to the client. Because the client has crashed, the message cannot be forwarded, and the ACK is never sent back to the OSS. This causes an LNet "response timeout" on the OSS, and the health of the router's peer NI is decremented, which places the NI in recovery. Route selection then checks that health value:
/*
 * A peer NI is alive if it satisfies the following two conditions:
 *  1. peer NI health >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage
 *  2. the cached NI status received when we discover the peer is UP
 */
static inline bool lnet_is_peer_ni_alive(struct lnet_peer_ni *lpni)
{
	bool halive = false;

	halive = (atomic_read(&lpni->lpni_healthv) >=
		 (LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage / 100));

	return halive && lpni->lpni_ns_status == LNET_NI_STATUS_UP;
}

static struct lnet_route *
lnet_find_route_locked(struct lnet_remotenet *rnet, __u32 src_net,
		       struct lnet_peer_ni *remote_lpni,
		       struct lnet_route **prev_route,
		       struct lnet_peer_ni **gwni)
{
...
	list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
		if (!lnet_is_route_alive(route))
			continue;
A route that is not considered "alive" is not used for any sends. If no routes are "alive", the send fails, e.g.:
[5236447.659951] LNetError: 1850029:0:(lib-move.c:2341:lnet_handle_find_routed_path()) no route to 10.112.48.209@tcp5700 from 172.22.12.164@o2ib21
If a router NI's health is decremented below the threshold in condition 1 of lnet_is_peer_ni_alive(), that NI is considered dead/down (with router_sensitivity_percentage at 100, any decrement is enough). If all NIs belonging to a router are dead/down, then the route is dead/down.
Thus, crashing clients can cause all of a server's routes to go down, which can further reduce the availability of an OSS.
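The chain from a decremented health value to a dead route can be illustrated with a small standalone model. This is only a sketch: the model_* names, the struct layout, and the maximum health of 1000 are assumptions made for illustration, not the Lustre definitions, and the route check is reduced to "at least one NI is alive" as described above.

```c
#include <stdbool.h>

/* Assumed maximum health for this model only. */
#define MODEL_MAX_HEALTH 1000

enum model_ni_status { MODEL_NI_DOWN = 0, MODEL_NI_UP = 1 };

/* Hypothetical, simplified stand-in for a router peer NI. */
struct model_peer_ni {
	int healthv;                    /* current health value */
	enum model_ni_status ns_status; /* cached NI status */
};

/* Mirrors the two conditions quoted in the description. */
static bool model_peer_ni_alive(const struct model_peer_ni *ni,
				int sensitivity_pct)
{
	return ni->healthv >= MODEL_MAX_HEALTH * sensitivity_pct / 100 &&
	       ni->ns_status == MODEL_NI_UP;
}

/* In this model a route is alive if any NI of its router is alive. */
static bool model_route_alive(const struct model_peer_ni *nis, int count,
			      int sensitivity_pct)
{
	for (int i = 0; i < count; i++) {
		if (model_peer_ni_alive(&nis[i], sensitivity_pct))
			return true;
	}
	return false;
}
```

With a sensitivity of 100, a single decrement on every NI of a router is enough for model_route_alive() to return false, even though each NI still reports status UP.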
We should modify the route selection to avoid this issue. One idea is to remove consideration of the peer NI health value from lnet_is_peer_ni_alive(). We could instead use the health value as a selection criterion (i.e. prefer "healthier" routers).
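One way to picture this idea is the sketch below. It is not the uploaded patch and does not use the LNet data structures; the sketch_* names and types are hypothetical. Aliveness depends only on the cached NI status, and the health value is used solely to rank the candidates that are alive.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a gateway peer NI. */
struct sketch_gw_ni {
	int healthv;    /* current health value */
	bool status_up; /* cached NI status is UP */
};

/* Aliveness no longer looks at the health value. */
static bool sketch_gw_ni_alive(const struct sketch_gw_ni *ni)
{
	return ni->status_up;
}

/*
 * Return the index of the healthiest alive NI, or -1 if none is alive.
 * Ties are resolved in favor of the lowest index.
 */
static int sketch_pick_gateway(const struct sketch_gw_ni *nis, int count)
{
	int best = -1;

	for (int i = 0; i < count; i++) {
		if (!sketch_gw_ni_alive(&nis[i]))
			continue;
		if (best < 0 || nis[i].healthv > nis[best].healthv)
			best = i;
	}
	return best;
}
```

Under this scheme a router whose health has dropped because of crashed clients is still usable, it is just chosen last, so a send only fails when no gateway NI reports status UP.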
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57493
Subject: LU-18444 lnet: Remove per-peer health sensitivity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8d266f74c15687d7a8cb15d52c50676a5bc96a5f