[LU-13606] Poor performance with routed clients to multi-rail servers when single server interface fails Created: 27/May/20 Updated: 17/Feb/21 Resolved: 11/Jul/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I'm opening this as critical because in my testing (a mix of IOR and mdtest) this issue causes serious I/O performance problems and occasionally client evictions. The problem is that routed clients do not have a view of the health of remote server interfaces, but clients are responsible for selecting the server interface a message will be sent to. Since clients don't have credits or health tracking for remote interfaces, they simply round-robin across all available interfaces. Thus, if a server has two interfaces and one of them fails, then approximately half of all messages sent by the client to the server will be destined for failure. This can cause all progress on jobs to basically halt as clients get into a reconnection loop with targets. 1. Some rpc sent to bad interface. I think you can get unlucky with reconnects if you have failover partners. This has also, unsurprisingly, resulted in client evictions, which we know can often lead to job loss because many programs do not check for EIO and retry. I discussed this issue with Amir and he proposed the following solution:
This is a non-trivial solution. There will be scalability concerns with having routers push updates to all peers. We'll also need to account for many routers pushing the same information. I have an alternative proposal with its own downsides, but the major upside is that it is very easy to implement (I already have a patch in hand for it). My proposal is to allow routers to perform what I'll call multi-rail forwarding. As mentioned earlier, clients currently have the responsibility for selecting the server interface a message will be sent to. MR forwarding would allow edge routers to make this decision instead. Edge routers, a.k.a. the final-hop gateways, are able to leverage LNet health to determine the health of their local peers' interfaces. Thus, if we allow them to select the destination interface, then we can avoid sending traffic to interfaces that have failed once the failure has been registered by the edge routers. The problem is that an edge router may not know whether the originator of a message has discovered the destination. As such, the router may forward the message to an interface the originator does not know about. When a response is sent back it can arrive from an unknown NID and be dropped. This limitation can be solved by allowing edge routers to queue a message while they perform discovery on the message originator. At that point, the router has all the information it needs to determine whether it can perform MR forwarding. Another limitation was noted by Amir:
We could address this limitation by: Another solution would be to require ACKs on all PUTs. With the response tracking code, if every message was ACK'd then messages sent to a failed interface would eventually experience a response timeout. This lowers the health of the remote peer NI, and the client should then be able to select the healthy interface for future sends. This is certainly the easiest solution to implement, just a few lines of code, but it obviously increases the load on the network. To recap: Other ideas for solutions are of course welcome. I'd like us to use this ticket to decide on the path forward. |
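The failure fraction described above, and the effect of ACK-driven health tracking, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the NID strings, and the health constants are hypothetical and are not taken from LNet, and health recovery is deliberately not modeled.

```python
from itertools import cycle

# Illustrative values only; they are assumptions, not read from LNet.
MAX_HEALTH = 1000
HEALTH_PENALTY = 100


def round_robin_failures(nids, failed, n_msgs):
    """Count messages a health-blind round-robin sender aims at failed NIDs."""
    picker = cycle(nids)
    return sum(1 for _ in range(n_msgs) if next(picker) in failed)


def health_aware_failures(nids, failed, n_msgs):
    """Same workload, but every send expects an ACK and a timeout lowers health."""
    health = {nid: MAX_HEALTH for nid in nids}
    sent = {nid: 0 for nid in nids}
    failures = 0
    for _ in range(n_msgs):
        best = max(health.values())
        # Among the healthiest NIDs, pick the least-used one.
        candidates = [nid for nid in nids if health[nid] == best]
        nid = min(candidates, key=lambda n: sent[n])
        sent[nid] += 1
        if nid in failed:
            # Response timeout: the message is lost and the NID is penalized.
            failures += 1
            health[nid] = max(0, health[nid] - HEALTH_PENALTY)
    return failures


if __name__ == "__main__":
    nids = ["10.0.0.1@o2ib", "10.0.0.2@o2ib"]  # hypothetical server NIDs
    failed = {"10.0.0.2@o2ib"}
    print(round_robin_failures(nids, failed, 1000))
    print(health_aware_failures(nids, failed, 1000))
```

With two interfaces and one failed, the health-blind sender aims half of its messages at the failed NID, while the health-aware sender stops using it after the first timeout. The cost of the second approach, as noted in the description, is the extra ACK traffic.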
| Comments |
| Comment by Chris Horn [ 27/May/20 ] |
|
MR forwarding proof of concept - https://review.whamcloud.com/#/c/38734/ |
| Comment by Andreas Dilger [ 29/May/20 ] |
|
I'd like Amir to comment on why it makes sense to have UDSP force sending to a bad destination interface on the router that it has no health information about? It seems to me that if "MR routing" was only used when the router detected a bad destination interface, then UDSP would be happy 99% of the time because the (working) destination interface would be selected as it desires, and if the destination interface is bad then it doesn't make sense to send the traffic there, regardless of what UDSP wanted? In that case, the router should instead send to the working server interface and we accept that the performance is not going to be as good when one interface is down? That keeps the health information within the subnet where the router and the destination have a direct communication channel, and avoids the need to propagate this to every peer in the network. In the end, UDSP would have had to be informed about the bad destination interface and make the same decision, so waiting for that to happen seems sub-optimal. Even with the Gossip implementation prototype, it was only monitoring health status between direct peers on the same LNet, and then forwarding server state to remote clients, rather than trying to account for all of the possible combinations of routes between every client and every server. I'm not against reviving the LNet Gossip implementation that was used in the original DAOS prototype to improve server/client health monitoring, but I don't think it makes sense to require the clients to be omniscient to make every decision about the route. |
| Comment by Amir Shehata (Inactive) [ 29/May/20 ] |
|
My quoted comments in the ticket description are out of context. Initially, when Chris and I were discussing this, the proposal I understood was to allow the gateway to do MR routing all the time. Hence my note about breaking UDSP policy. The other concern I had with this approach is that we wouldn't have a consistent rule in LNet. With the MR forwarding parameter on, the originator will look like it's selecting the interface, but that selection will be overwritten by the edge gateway. And in the case of policies, they'll look like they are working, but they won't really be. I'm not in support of having inconsistent behavior. I'm with Andreas: if we allow the edge gateways to overwrite the decision made by the originator only if there exists a healthier interface on the final destination, then that would be OK. I wouldn't add a parameter at all in this case. This would be LNet's default behavior, i.e. edge gateways will always honor the final destination in the message except when there exists an interface which is healthier. And in this case it has to log the change in behavior. The issue with this is if the originator is non-MR. The edge gateway might not have discovered that the originator of the message is non-MR. Let's take the situation when this is the first message the gateway is forwarding from the originator. The gateway doesn't know that the originator is non-MR. If it ends up forwarding it to an interface other than the one specified in the message, the entire RPC will fail. In this case I'm not convinced it is a better solution to introduce another path where we discover in the reverse direction, i.e. discover a node which we received from. We only discover nodes we're sending to. Why not simply look at the current state of the peer? A peer on the edge router is created for the originator as non-MR. However, when the edge router forwards messages to it, it'll discover it, at which point it will know whether it's MR or non-MR.
This way we err on the side of caution. I'm trying to avoid another special case in the code. The likelihood is that the clients' and the servers' MR status will be known to the routers during mount time anyway, so the window where the router will not select a healthy interface during a workload is not there. |
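The forwarding rule described in this comment can be sketched roughly as follows. This is a hypothetical illustration, not the code from the patch: `Peer`, `choose_final_nid`, and the health values are invented for the example and do not correspond to LNet structures or functions.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("mr_forwarding_sketch")


@dataclass
class Peer:
    """Hypothetical router-side view of a peer; not an LNet structure."""
    multi_rail: bool = False  # a peer is assumed non-MR until it is discovered
    health: dict = field(default_factory=dict)  # NID -> health value


def choose_final_nid(msg_dst_nid, originator, destination):
    """Honor the NID in the message unless a healthier one may safely be used."""
    # Err on the side of caution: an originator not known to be MR may drop
    # a response that arrives from a NID it did not address.
    if not originator.multi_rail:
        return msg_dst_nid
    # No health information for the requested NID: honor the message as sent.
    if msg_dst_nid not in destination.health:
        return msg_dst_nid
    best_nid = max(destination.health, key=destination.health.get)
    if destination.health[best_nid] > destination.health[msg_dst_nid]:
        # The override must be logged so the change in behavior is visible.
        log.warning("overriding destination %s with healthier %s",
                    msg_dst_nid, best_nid)
        return best_nid
    return msg_dst_nid


if __name__ == "__main__":
    server = Peer(multi_rail=True, health={"nid-a": 1000, "nid-b": 500})
    print(choose_final_nid("nid-b", Peer(multi_rail=True), server))
    print(choose_final_nid("nid-b", Peer(), server))
```

In this sketch an equally healthy alternative never overrides the originator's choice, which keeps the originator's selection (and any policy behind it) in effect whenever the requested interface is as healthy as the best one.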
| Comment by Alexey Lyashkov [ 29/May/20 ] |
|
> Another solution would be to require ACKs on all PUTs. Why not just return a "negative ACK" or a special message when the host is unreachable? An LNet message is routable, so any number of hops is not a problem in this case. |
| Comment by Chris Horn [ 01/Jun/20 ] |
I think I did a poor job of relating Amir's ideas and insights in the context of my MR forwarding proposal. Amir certainly was not suggesting that we should force sending to a bad destination. I apologize for the confusion.
This is a good point. There is indeed a very narrow window where a router cannot safely select a healthy interface. The discovery messages between client and server will not trigger discovery on the router as it forwards those messages, but subsequent traffic will, so the router should, in short order, be able to choose interfaces appropriately based on health. There may be additional benefits w.r.t. load balancing in allowing routers to choose new destination NIDs based on other criteria, credits, etc., but perhaps that benefit is minimal and not worth the potential headache it causes for UDSP.
I think this is another good idea and I've added it to the list. shadow, I know this is something you've been thinking about for a long time. Do you have a patch or code you can share? Can you estimate how much work is involved? I think in the short term, modifying the router forwarding logic in the manner described by Amir is a relatively easy fix. I propose we move forward with that approach while other enhancements in this area, e.g. negative ACK, more robust health info sharing, etc., can be explored. |
| Comment by Gerrit Updater [ 01/Jun/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38798 |
| Comment by Gerrit Updater [ 10/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38798/ |
| Comment by Peter Jones [ 11/Jul/20 ] |
|
Landed for 2.14 |