Affects Version/s: Lustre 2.14.0
Fix Version/s: None
I'm opening this as a critical because in my testing (mix of IOR & mdtest) this issue causes serious I/O performance problems and occasionally client evictions.
The problem is that routed clients do not have a view of the health of remote server interfaces, but clients are responsible for selecting the server interface a message will be sent to. Since clients don't have credits or health tracking for remote interfaces, they simply round-robin across all available interfaces. Thus, if a server has two interfaces and one of them fails, approximately half of all messages sent by the client to the server will be destined for failure.
This can cause all progress on jobs to basically halt as clients get into a reconnection loop with targets.
1. Some RPC sent to bad interface.
2. RPC timeout.
3. Connect RPC sent to good interface.
4. Connection re-established.
5. Repeat step 1.
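As a rough illustration of why this loop keeps recurring, here is a toy model (not LNet code) of a sender that round-robins across a server's interfaces with no health information. The function name and the NIDs are made up for the example.

```python
from itertools import cycle


def round_robin_failures(interfaces, failed, num_messages):
    """Count messages a health-unaware round-robin sender aims at failed interfaces."""
    picker = cycle(interfaces)
    return sum(1 for _ in range(num_messages) if next(picker) in failed)


# Hypothetical server with two interfaces, one of which has failed.
nids = ["10.0.0.1@o2ib", "10.0.0.2@o2ib"]
lost = round_robin_failures(nids, failed={"10.0.0.2@o2ib"}, num_messages=1000)
print(f"{lost} of 1000 messages were sent to the failed interface")
```

With two interfaces and one failed, every second message targets the bad interface, which matches the "approximately half" estimate above.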
I think you can get unlucky with reconnects if you have failover partners.
1. First reconnect sent to bad interface -> fails.
2. Next reconnect goes to failover partner -> target not mounted there.
3. Next reconnect sent to good interface.
This has also, unsurprisingly, resulted in client evictions, which we know can often lead to job loss because many programs do not check for EIO and retry.
I discussed this issue with Amir and he proposed the following solution:
[T]he edge routers [ought] to advertise the health of the final destination upon change, to the relevant peers. The peers can then make proper health selection. I'm gonna summarise the approach on a wiki page.
This is a non-trivial solution. There will be scalability concerns with having routers push updates to all peers. We'll also need to account for many routers pushing the same information.
I have an alternative proposal with its own downsides, but the major upside is that it is very easy to implement (I already have a patch in hand for it). My proposal is to allow routers to perform what I'll call multi-rail forwarding.
As mentioned earlier, clients currently have the responsibility for selecting the server interface a message will be sent to. MR forwarding would allow edge routers to make this decision instead. Edge routers, a.k.a. the final hop gateway, are able to leverage LNet health to determine the health of their local peers' interfaces. Thus, if we allow them to select the destination interface then we can avoid sending traffic to interfaces that have failed once the failure has been registered by the edge routers.
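A minimal sketch of the selection an edge router could make, assuming it holds a numeric health value for each interface of its local peer. This is not the patch itself: `PeerNI`, `select_destination`, and the health scale are hypothetical names chosen for illustration, and the sketch takes the conservative route of keeping the originator's choice unless a strictly healthier interface exists.

```python
from dataclasses import dataclass


@dataclass
class PeerNI:
    nid: str
    health: int  # hypothetical scale: higher means healthier


def select_destination(requested_nid, peer_nis):
    """Return the NID to forward to; peer_nis must be non-empty.

    The originator's requested NID is kept when no other interface
    is strictly healthier; otherwise the healthiest interface wins.
    """
    by_nid = {ni.nid: ni for ni in peer_nis}
    requested = by_nid.get(requested_nid)
    best = max(peer_nis, key=lambda ni: ni.health)
    if requested is not None and requested.health >= best.health:
        return requested.nid
    return best.nid


# Hypothetical server with one healthy and one failed interface.
server_nis = [PeerNI("10.0.0.1@o2ib", 1000), PeerNI("10.0.0.2@o2ib", 0)]
print(select_destination("10.0.0.2@o2ib", server_nis))
```

Once the router has registered the failure, a message addressed to the failed interface is redirected to the healthy one, while messages addressed to the healthy interface pass through unchanged.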
One limitation of this approach is that an edge router may not know whether the originator of a message has discovered the destination. As such, the router may forward the message to an interface the originator does not know about. When a response is sent back, it can arrive from an unknown NID and be dropped.
This limitation can be solved by allowing an edge router to queue a message while it performs discovery on the message originator. Once discovery completes, the router has all the information it needs to determine whether it can perform MR forwarding.
1. If both originator and destination are multi-rail capable with discovery enabled, then it can perform MR forwarding.
2. If not, fallback to the normal forwarding.
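The decision above can be sketched as follows. This is only an illustration under stated assumptions: `PeerState`, `forward_mode`, and the returned mode strings are hypothetical, and an undiscovered originator is modelled as `None`.

```python
from dataclasses import dataclass


@dataclass
class PeerState:
    multi_rail: bool
    discovery_enabled: bool


def can_mr_forward(originator, destination):
    """True only if both ends are multi-rail capable with discovery enabled."""
    return all(p.multi_rail and p.discovery_enabled
               for p in (originator, destination))


def forward_mode(originator, destination):
    """Pick how the edge router handles a message.

    originator is None when the router has not yet discovered it, in
    which case the message is queued until discovery completes.
    """
    if originator is None:
        return "queue_and_discover"
    if can_mr_forward(originator, destination):
        return "mr_forward"
    return "normal_forward"


mr_peer = PeerState(multi_rail=True, discovery_enabled=True)
legacy_peer = PeerState(multi_rail=False, discovery_enabled=False)
print(forward_mode(None, mr_peer))
print(forward_mode(mr_peer, mr_peer))
print(forward_mode(legacy_peer, mr_peer))
```

After the queued message's originator has been discovered, the router would call the same check again and end up in one of the two forwarding modes.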
Another limitation was noted by Amir:
It's not going to be a good idea to put the "power" back in the hands of the routers. The routers should continue honouring the selection made by the peers. If not, it'll break at least one important UDSP use case, where you add a policy to prefer a specific interface on the final destination.
We could address this limitation by:
1. Making MR forwarding tunable and documenting its incompatibility with this particular UDSP policy (really not ideal).
2. Creating a new policy that could be enacted on routers to accomplish the same goal. i.e. when forwarding a message from peerA to peerB, prefer peerB's NIDs X, Y, ..., etc.
Another solution would be to require ACKs on all PUTs. With the response tracking code, if every message was ACK'd then messages sent to a failed interface would eventually experience a response timeout. This lowers the health of the remote peer NI, and the client should then be able to select the healthy interface for future sends. This is certainly the easiest solution to implement, just a few lines of code, but it obviously increases the load on the network.
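To make the mechanism concrete, here is a toy model of how response timeouts could steer selection away from a failed interface. It is not the LNet response tracking code: the class, its method names, and the `MAX_HEALTH` and `PENALTY` values are assumptions for illustration only.

```python
class RemoteHealth:
    """Toy per-interface health table driven by ACKs and response timeouts."""

    MAX_HEALTH = 1000  # assumed ceiling for a fully healthy interface
    PENALTY = 100      # assumed decrement applied per response timeout

    def __init__(self, nids):
        self.health = {nid: self.MAX_HEALTH for nid in nids}

    def on_ack(self, nid):
        # An ACK shows the interface is reachable; restore full health
        # (a simplification; real recovery would likely be gradual).
        self.health[nid] = self.MAX_HEALTH

    def on_response_timeout(self, nid):
        # No ACK arrived in time; lower the health, never below zero.
        self.health[nid] = max(0, self.health[nid] - self.PENALTY)

    def select(self):
        # Prefer the healthiest interface; ties go to the first one listed.
        return max(self.health, key=self.health.get)


tracker = RemoteHealth(["bad@o2ib", "good@o2ib"])
tracker.on_response_timeout("bad@o2ib")
print(tracker.select())
```

A single missed ACK is enough in this model to move future sends to the other interface, at the cost of an ACK for every PUT.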
Routed peers have no view of remote peer interface health. Failure of a remote interface causes serious performance problems. Possible solutions (so far), in roughly increasing order of difficulty:
1. ACK all messages and rely on response tracking to manage remote interface health.
2. Partial MR-forwarding - Allow routers to choose a healthier interface, but otherwise do not modify the destination.
3. Full MR-forwarding - Allow routers to use full MR selection criteria in choosing destination.
4. Have routers return "negative ACK" when host is unreachable.
5. Have routers propagate health information of their local peers to remote peers.
6. Gossip protocol?
Other ideas for solutions are of course welcome. I'd like us to use this ticket to decide on the path forward.