Details
Bug
Resolution: Fixed
Minor
Description
The kfilnd <-> kfabric <-> kcxi_prov <-> Cassini stack has the following issue. If LNet tries to send to a kfilnd peer that is down, the Cassini retry handler can take up to 60 seconds to cancel the corresponding message. While retrying, the Cassini retry handler takes control of hardware resources and does not release them until the retries are complete. If enough of these messages are sent to down peers, the Cassini retry handler can end up holding all available hardware resources. Once this happens, Cassini cannot process new RDMA commands, and back pressure starts occurring in kcxi_prov, which is propagated to kfilnd as -EAGAIN. As seen on JT, this can result in single RDMA operations taking minutes to complete.
To help prevent this issue, kfilnd should only send to peers it knows are up.
Looking at the kfilnd code today, I believe we can have multiple hello messages in flight to a single peer. If the peer is down, this can result in a buildup of hello messages, each of which can take the CXI retry handler up to 60 seconds to complete. There should only be a single hello message in flight per peer.
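One possible way to enforce a single hello per peer is a per-peer flag that is claimed atomically before a hello is posted and released from the hello completion path. The following is a minimal userspace sketch using C11 atomics, not the kfilnd implementation: `struct peer_sketch`, `peer_try_start_hello()` and `peer_hello_done()` are hypothetical names, and kernel code would more likely use the kernel's own atomic or locking primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-peer state; not the real kfilnd peer structure. */
struct peer_sketch {
    atomic_bool hello_inflight; /* true while one hello is outstanding */
    int hellos_posted;          /* counter kept only for illustration */
};

/*
 * Try to claim the right to post a hello to this peer.
 * Returns true for exactly one caller until peer_hello_done() runs;
 * every other caller gets false and must not post another hello.
 */
static bool peer_try_start_hello(struct peer_sketch *p)
{
    bool expected = false;

    if (!atomic_compare_exchange_strong(&p->hello_inflight, &expected, true))
        return false;

    p->hellos_posted++; /* stand-in for actually posting the hello */
    return true;
}

/* Called when the hello completes, whether it succeeded or failed. */
static void peer_hello_done(struct peer_sketch *p)
{
    atomic_store(&p->hello_inflight, false);
}
```

With this guard, a down peer ties up at most one retry-handler slot for hello traffic at a time, instead of one slot per attempted send.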
As part of this work, kfilnd transactions may have to be queued until a hello message comes back. If the hello message results in a failure, all queued transactions should be finalized.
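The queueing behavior could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual patch: all type and function names are hypothetical, locking is omitted, "posting" is reduced to a status change, and the error code passed on hello failure is chosen by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical peer and transaction states for this sketch. */
enum peer_state { PEER_NEW, PEER_HELLO_PENDING, PEER_UP };
enum tn_status { TN_QUEUED, TN_POSTED, TN_FINALIZED };

struct tn_sketch {
    struct tn_sketch *next;
    enum tn_status status;
    int error;                 /* 0, or negative errno when finalized */
};

struct peer_queue_sketch {
    enum peer_state state;
    struct tn_sketch *head, *tail; /* transactions waiting on the hello */
    int hellos_posted;             /* counter kept only for illustration */
};

/* Send to a peer only if it is known to be up; otherwise queue. */
static void peer_submit(struct peer_queue_sketch *p, struct tn_sketch *tn)
{
    tn->next = NULL;
    tn->error = 0;

    if (p->state == PEER_UP) {
        tn->status = TN_POSTED; /* stand-in for posting the operation */
        return;
    }

    /* Peer state unknown: park the transaction behind the hello. */
    tn->status = TN_QUEUED;
    if (p->tail)
        p->tail->next = tn;
    else
        p->head = tn;
    p->tail = tn;

    /* Only the first queued transaction triggers a hello. */
    if (p->state == PEER_NEW) {
        p->state = PEER_HELLO_PENDING;
        p->hellos_posted++;
    }
}

/* Hello completion: post the queue on success, finalize it on failure. */
static void peer_hello_complete(struct peer_queue_sketch *p, int hello_error)
{
    struct tn_sketch *tn = p->head;

    p->head = p->tail = NULL;
    p->state = hello_error ? PEER_NEW : PEER_UP;

    while (tn) {
        struct tn_sketch *next = tn->next;

        tn->next = NULL;
        if (hello_error) {
            tn->status = TN_FINALIZED;
            tn->error = hello_error;
        } else {
            tn->status = TN_POSTED;
        }
        tn = next;
    }
}
```

In this sketch a failed hello returns the peer to `PEER_NEW`, so the next transaction submitted to it triggers a fresh hello rather than being sent directly to a peer that may still be down.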