[LU-6851] LU-6060 patch breaks multi-hop routing by default. Created: 14/Jul/15 Updated: 20/Jan/17 Resolved: 02/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.4 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Change 531ef4d on b2_5_fe is the patch from LU-6060. The code makes this, to me, puzzling assumption:

    -	rtr->lr_downis = down + (ptl_status == LNET_NI_STATUS_DOWN);
    +	/* if @down is zero and this route is single-hop, it means
    +	 * we can't find NI for target network */
    +	if (down == 0 && rtr->lr_hops == 1)
    +		down = 1;

First of all, lr_hops == 1 is the default, and lnet has never required hops to be set even on multi-hop systems. You are explicitly introducing a constraint that never existed in the past, and you have failed to communicate that new constraint to your customers. Adding this new constraint, while having avoid_asym_router_failure enabled by default and hops defaulting to 1, is a bug, plain and simple.

Hops and priority were invented for fine grain routing. They are not, and never have been, required to be set in multi-hop lnet routing situations. The Lustre manual even shows them as optional.

And speaking of the manual, the documented grammar for setting hops and priority is almost certainly wrong:

    routes=dest_lnet [hop] [priority] router_NID@src_lnet; \
           dest_lnet [hop] [priority] router_NID@src_lnet

Since both hop and priority are numbers, I kind of doubt that there is any way to specify either one without the other. I imagine the grammar is really either this:

    routes=dest_lnet [hop [priority]] router_NID@src_lnet; \
           dest_lnet [hop [priority]] router_NID@src_lnet

or this:

    routes=dest_lnet [hop priority] router_NID@src_lnet; \
           dest_lnet [hop priority] router_NID@src_lnet

Finally, what happened with the review process? I see no reviews listed on the 531ef4d patch, and no reference to the patch on master from which this was backported. And there were additional changes in the backported patch not found in the original, so a review really should have been required. |
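To make the failure mode concrete, a minimal multi-hop configuration of the kind this check disables might look like the following (network names and NIDs are hypothetical, not from any real site config):

    # Client on tcp0; o2ib2 is reached through a chain of two routers, but
    # no hop count is given, so lr_hops silently defaults to 1:
    options lnet networks="tcp0(eth0)" routes="o2ib2 10.10.0.1@tcp0"

With avoid_asym_router_failure enabled, the client pings 10.10.0.1@tcp0, finds no NI on o2ib2 in the reply (the gateway only reaches o2ib2 through a second router), and the quoted check then marks the route down even though it works.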
| Comments |
| Comment by Peter Jones [ 14/Jul/15 ] |
|
Hi Chris,

Presumably this same logic flaw exists in the version landed to 2.7? IIRC this patch was first pushed against b2_5 and was then upstreamed to master after extensive testing by ORNL. Amir, can you comment on any differences between the 2.5.x and master versions of this patch?

Thanks

Peter |
| Comment by Christopher Morrone [ 14/Jul/15 ] |
|
Yes, I would guess that 2.7 and master have the same problem. |
| Comment by D. Marc Stearman (Inactive) [ 23/Jul/15 ] |
|
ORNL uses fine grained routing, and as far as I know does not do any multihop routing, so I wouldn't claim their testing was extensive. I believe that it fixed their issue, but that does not cover all cases. We have a few clusters that have multiple hops to go from IB < |
| Comment by Christopher Morrone [ 23/Jul/15 ] |
|
Actually, ORNL probably does have multiple hops as part of their fine grain routing. Fine grain routing is about (as far as I understand it) having multiple possible routes to a source of differing distance/performance, and prioritizing the best routes by default. So most of the time they are probably prioritizing the routes with the smallest number of hops. Because they use fine grain routing, they are probably using the optional hop and/or priority settings. It is likely that there was no testing done with the default values. |
| Comment by James A Simmons [ 23/Jul/15 ] |
|
Actually, for our test bed systems we have no fine grain routing. In fact, in my smallest non-Cray setup I don't have any routers, straight server to client, and I did test it there with no problems. Our next machine up for testing does have routers, but it is not a fine grain routing setup. It's a pretty basic setup and it also worked for that configuration. For our production system, yes, we do use fine grain routing, and I believe we tweak both the hops and priority settings. Can't say 100% since I don't set up the production machines. I just break them. |
| Comment by Matt Ezell [ 23/Jul/15 ] |
|
Concern about multi-hop systems was brought up (see comment 3 of http://review.whamcloud.com/#/c/13162/), but we don't use multi-hop at ORNL. Since our fine-grained routing configs existed before the priority parameter even existed, we leave priority at the default value. Our hops parameter does not specify the actual network hop count; it's really used as a priority. We should eventually switch to using priority instead of hops for FGR. |
| Comment by Amir Shehata (Inactive) [ 24/Jul/15 ] |
|
The logic here (as I understand it) is as follows. The problem is that there are two scenarios where the check in (6) will evaluate to true. From the point of view of the pinger, there is no way to distinguish between these two scenarios without knowing whether you're dealing with a multihop route or not.

The assumption (as far as I can tell) made in the fix is that if you're in a multihop system then you'll always define the number of hops in the route to be > 1. If you don't define the number of hops, it defaults to 1, and therefore the first hop has to have an interface on the destination network.

From what I could gather from the bug description, there exists a system which uses multihop routing, but whose routes don't define the # of hops value, since it is optional; on pinging the router, the logic determines that it's down, and traffic stops flowing. Is that true?

IMHO, 6060 should be the responsibility of the network admin to determine and fix (technically it's a network issue), since there is no way to differentiate between the two scenarios I identified above. The patch for 6060 would work if the # of hops in the route was mandatory, but it isn't.

Thoughts? |
| Comment by D. Marc Stearman (Inactive) [ 24/Jul/15 ] |
That would be a fair statement, except that the hop parameter has always been optional. LNET isn't like a traditional network with a default route. It is a fire-and-forget network. You hand the traffic to the router and hope that it has a route to get your data where you want it to go. The hop parameter was typically intended to provide backup routes that were further away, such as the way ORNL is using it for fine grained routing (before the priority value was added).

It is okay to change the default behavior, but not in a maintenance release, and certainly not without changing the documentation. I have no problem making this a system admin decision and requiring the use of the hops parameter when setting up your network, but when a minor release breaks your existing installation, that is simply not the right thing to do.

A better bit of logic would do the opposite of (4): if the NI statuses of all of its NIDs are up, then set all the routes to up. We handle that in our init scripts. We use nodediag to check that all of our interfaces are up and at the proper speed based on our config files. If the running config does not match the expected config, we stop LNET to prevent this from happening. |
| Comment by Amir Shehata (Inactive) [ 24/Jul/15 ] |
|
As I indicated in my previous comment, the patch pushed in for 6060 is not the correct approach, because the # of hops is not a mandatory parameter. So I agree that changing the behavior in a maintenance release is not a good idea.

I think the logic you're proposing is identical to simply removing this bit of code:

    /* if @down is zero and this route is single-hop, it means
     * we can't find NI for target network */
    if (down == 0 && rte->lr_hops == 1)
            down = 1;

The result is that the code will loop through all the gateway's NI statuses; rte->lr_downis will be set to the number of down NIs, and if all are up (or an up NI on the route's target network is found), it will be zero. This should resolve the issue LLNL is having.

However, this will effectively roll back the change done for ORNL, so I'd like to get agreement from ORNL on this change, since it'll impact them. Basically, there will be no way to determine that a router has no way to get a message to its final destination in the case where a router comes up without a particular interface. This can be resolved, as you mentioned, through scripts that check the running config against the expected config.

With regards to LNet being a fire-and-forget network, it seems like having the router pinger feature and the avoid_asym_router_failure option contradicts this view. |
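As a self-contained sketch of the resulting scan (stand-in types, and a hypothetical nid_net() helper modeled on LNET_NIDNET(); this is illustrative, not the verbatim Lustre function):

    #include <stdint.h>

    /* Minimal stand-ins for the LNet types involved; field names follow
     * the b2_5-era headers. */
    #define LNET_NI_STATUS_UP   0x15aac0de
    #define LNET_NI_STATUS_DOWN 0xdeadface
    #define MAX_RTR_NIS         16

    struct ni_status { uint64_t ns_nid; uint32_t ns_status; };
    struct ping_info { int pi_nnis; struct ni_status pi_ni[MAX_RTR_NIS]; };
    struct route     { uint32_t lr_net; uint32_t lr_hops; int lr_downis; };

    /* Hypothetical helper modeled on LNET_NIDNET(): the high 32 bits of
     * a NID identify the network. */
    static uint32_t nid_net(uint64_t nid) { return (uint32_t)(nid >> 32); }

    /* The scan as it behaves once the quoted hunk is removed: count down
     * NIs, but if an NI on the route's target network is up, the route
     * is considered up. */
    static void route_scan(struct route *rte, const struct ping_info *info)
    {
            int down = 0;
            int i;

            for (i = 0; i < info->pi_nnis && i < MAX_RTR_NIS; i++) {
                    const struct ni_status *stat = &info->pi_ni[i];

                    if (stat->ns_status == LNET_NI_STATUS_DOWN) {
                            down++;
                            continue;
                    }
                    if (stat->ns_status == LNET_NI_STATUS_UP &&
                        nid_net(stat->ns_nid) == rte->lr_net) {
                            down = 0;  /* gateway has a live NI on our net */
                            break;
                    }
            }
            /* No "down == 0 && lr_hops == 1" penalty any more, so a
             * multi-hop route with the default hop count is no longer
             * forced down. */
            rte->lr_downis = down;
    }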
| Comment by Christopher Morrone [ 24/Jul/15 ] |
|
lnet's whole approach to routes and asymmetric router failure detection seems pretty convoluted at this point. The approach in steps 4 and 5 might basically work, but they make my head hurt. I don't think we care what interfaces our peer node has; we care about what networks are reachable from our peer. It seems like reachable networks and hop counts could have been done completely automatically. Instead of specifying "routes" on nodes, we should really be calling them "gateways". We have sysadmins list "gateway" nodes, and then all of the routing should be pretty reasonable to build up using code rather than relying on error-prone humans to get it right. But fixing all of that is too involved for this ticket. Sigh. |
| Comment by Christopher Morrone [ 24/Jul/15 ] |
|
In the short term, you will need to revert the patch and work on a different approach to LU-6060. |
| Comment by James A Simmons [ 24/Jul/15 ] |
|
We will discuss the impact of reverting the LU-6060 patch. |
| Comment by Gerrit Updater [ 24/Jul/15 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/15719 |
| Comment by Andreas Dilger [ 05/Aug/15 ] |
|
Chris, is there significant complexity in changing the LLNL LNET config to have a hop count > 1? My understanding is that as long as the hop count for each route is the same on any client, it should not affect the routing priority, so it would be possible to change your config to specify the "correct" hop count of 2 or 3 for each remote network without affecting the operation with or without the LU-6060 patch.

Another alternative is to either revert the patch only from the LLNL CHAOS tree, or only apply it to the ORNL tree until a long-term solution can be found. Are there other sites besides LLNL that are using multi-hop routing?

I agree that changing the configuration requirements for multi-hop networks in a maintenance release is unpleasant, but I'm trying to find a path forward that allows proper functionality for both sites with a single code base. |
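Concretely, the suggested change is just adding the optional hop field from the routes grammar (NIDs hypothetical):

    # Before: hop count omitted, so it defaults to 1 even though o2ib2
    # is really two hops away.
    options lnet routes="o2ib2 10.10.0.1@tcp0"

    # After: declare the hop count explicitly, so the route is no longer
    # treated as single-hop by the LU-6060 check.
    options lnet routes="o2ib2 2 10.10.0.1@tcp0"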
| Comment by Christopher Morrone [ 05/Aug/15 ] |
|
I don't understand why you guys are so reluctant to revert an obviously buggy patch. Revert the buggy patch and work on a less buggy approach to ORNL's problem. Period. I'll even suggest one way to fix it once you do the Right Thing here. |
| Comment by James A Simmons [ 06/Aug/15 ] |
|
I think the reason is that reverting will also impact people. We don't know how many sites would be impacted by the revert; compared to that, in the current condition you appear to be the only one affected. ORNL has been in the situation where we were the only site having a problem, so I know what it is like. |
| Comment by Andreas Dilger [ 06/Aug/15 ] |
|
If the patch is reverted from b2_5_fe (since this is a maintenance release and shouldn't change behaviour), that will give LLNL lots of time to fix up their routing configs to list multi-hop routes with hop count > 1, and time to update the documentation for the "hop" parameter (or, ideally, to set hop > 1 automatically for such routes). Since that shouldn't affect current behaviour either way, would it be possible to make that change well in advance of LLNL updating to a 2.7 or later release? |
| Comment by Christopher Morrone [ 06/Aug/15 ] |
|
Whether or not LLNL or any other customer knows about a fundamental new configuration burden will depend on how you go about documenting the change and communicating with your customers. But I would argue that this is the wrong approach. Lustre is already more difficult than it needs to be to configure and get working; this makes that even worse. People will misconfigure the hop count, and we don't have adequate tools in place to make clear why things are not working when that happens. Are you comfortable with the additional support burden? I am not.

Here is my alternate suggestion: instead of having lr_hops default to 1, have it default to a magic number that means the value is not configured (e.g. -1). Then we can test whether the value has been configured before using it anywhere. Additionally, make clear in the documentation that voluntarily setting the hop count on routes to the correct value will unlock additional benefits.

That would preserve the long-standing lnet behavior (over a decade, I presume), and allow those who want to go to the extra configuration effort to take advantage of additional benefits. And maybe someday someone will make lnet figure out hop counts automatically. |
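A minimal sketch of that sentinel-default idea (helper names here are hypothetical; the patch that eventually landed as change 15719 took the same approach):

    #include <stdint.h>

    /* "Not configured" sentinel instead of silently meaning one hop. */
    #define LNET_UNDEFINED_HOPS ((uint32_t)-1)

    struct route { uint32_t lr_hops; };

    static void route_init(struct route *rte)
    {
            rte->lr_hops = LNET_UNDEFINED_HOPS;   /* default: unset */
    }

    static int hops_configured(const struct route *rte)
    {
            return rte->lr_hops != LNET_UNDEFINED_HOPS;
    }

    /* The LU-6060 check can then fire only for routes that an admin has
     * explicitly declared single-hop:
     *
     *     if (down == 0 && hops_configured(rte) && rte->lr_hops == 1)
     *             down = 1;
     */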
| Comment by Matt Ezell [ 07/Aug/15 ] |
lnet_ni_status_t has an unused 32-bit unsigned member that could be used to store a hop count. The router pings can send up to 16 NI statuses, so it could theoretically send all local NIs with hop count 0 and peer NIs with hop count 1 + (value returned from peer). After several pings, it should be possible for LNET to know about all NIs on the connected fabrics and the distance to them.

In our center, we have more than 16 NIs. So we would not be able to rely on this unless we increase the maximum ping RPC size. That doesn't sound desirable.

I had ideas to use the unused member for a load metric. ns_status uses hex magic numbers, so encoding it in there with a simple OR might not be straightforward. I guess hop count is more important.
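For reference, the structure in question looks roughly like this (paraphrased from the b2_5-era lnet-types.h; wire-packing attributes are omitted and the lnet_nid_t typedef is a stand-in):

    #include <stdint.h>

    typedef uint64_t lnet_nid_t;   /* stand-in; defined in the LNet headers */

    /* Per-NI status as carried in router ping replies. */
    typedef struct {
            lnet_nid_t ns_nid;      /* NID of this interface */
            uint32_t   ns_status;   /* one of the magic values below */
            uint32_t   ns_unused;   /* the spare 32-bit field discussed above */
    } lnet_ni_status_t;

    #define LNET_NI_STATUS_UP   0x15aac0de   /* the "hex magic numbers" */
    #define LNET_NI_STATUS_DOWN 0xdeadface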
| Comment by James A Simmons [ 07/Aug/15 ] |
|
| I was thinking about using that field to support a versioning system for LNet. |
| Comment by Amir Shehata (Inactive) [ 10/Aug/15 ] |
|
It would be nice to settle on the action for this patch. Is it okay with everyone to revert this on master and 2.5 FE, and then come up with a way for LNet to find out the number of hops on a specific path as a separate feature? In my opinion, that's the way to go. If there are no objections, I will update the patch and push it to Gerrit later today, so we can include this in 2.8. |
| Comment by James A Simmons [ 10/Aug/15 ] |
|
Reverting still leaves it in a broken state. We really need a fix more than anything. |
| Comment by Amir Shehata (Inactive) [ 10/Aug/15 ] |
|
Are you looking for a fix in 2.8, or can we track it through a different LU? |
| Comment by Amir Shehata (Inactive) [ 11/Aug/15 ] |
|
James, I had a question: what does ORNL use the hop count for? |
| Comment by Christopher Morrone [ 12/Aug/15 ] |
|
There is no way that an automatic method to identify reachable networks and their associated hop counts can reasonably be introduced in time for 2.8. That needs careful design and implementation work. I think it would be excellent to work on that for a subsequent release. Changing the default to be "value not set" instead of 1 seems fairly reasonable to me in the 2.8 time frame. |
| Comment by Amir Shehata (Inactive) [ 12/Aug/15 ] |
|
I can make the change that Chris suggested, but I'm unsure how it'll affect ORNL. I don't have clarity on how they are using the hop count. Do they assume that the default is 1 and expect the behavior to be the same as 6060? If so, then they'll still have the same problem with Chris' suggested solution.

Also, I think the use of the hop count should be considered. If ORNL is using the hop count as some sort of priority, rather than an actual hop count, then it is my opinion that this is not the intended usage of the hop count, and we shouldn't be making changes to appease a case which is outside the normal usage scenario. There exists a priority parameter which can be used instead. |
| Comment by Matt Ezell [ 13/Aug/15 ] |
I don't have much background on the original intent of the hop count parameter. The comment by the struct is just "how far I am", which is certainly open to interpretation.

One interpretation is the number of LNET hops (can the router talk to the destination NI directly, or might I pass through multiple routers to get there), which LNET could theoretically figure out automatically. A different interpretation is the number of hops within the underlying network, which we don't want LNET trying to figure out automatically. Consider an o2ib network: a router on the same leaf blade might have hop count 1, but a router on a different leaf blade that has to hop through a spine module might have hop count 3. Then consider something like Cray Gemini, where there might be many hops between a client and a router.

Our implementation on Titan is kind of a hybrid between the two. Our primary routes, which clients should always use in the absence of failure, use hop count 1. The backup routers are farther away (more Gemini network hops), but we use an arbitrary hop count of 10. This means that in the current state, our primary routers are checked to make sure all the required interfaces are present and working. Our backup routers will be unused if an interface is down, but clients will still attempt to use them if LNET came up without a required interface. That's non-ideal, but that double-failure pattern is extremely rare. Priority didn't exist when we implemented this. We can (and should) move to priority in the future.

We need to pick what "hops" is going to mean going forward and make sure the code and documentation all match that assumption. Our production routers have external monitoring to make sure our IB interfaces are healthy, so it's not the end of the world if the LU-6060 patch is reverted. |
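Illustrating the hybrid Titan scheme described above, the routes setup amounts to something like this (NIDs and network names are hypothetical):

    # Primary router: hop count 1, always preferred when healthy.
    # Backup router: farther away on the Gemini fabric, given an arbitrary
    # hop count of 10 so it is only chosen when the primary is unavailable.
    options lnet routes="o2ib0 1 10.10.0.1@gni0; o2ib0 10 10.10.0.2@gni0"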
| Comment by Amir Shehata (Inactive) [ 13/Aug/15 ] |
|
Matt, thanks for the explanation. I believe hop was added with the routing feature. The Lustre Manual has this on it: "The hop parameter specifies the number of hops to the destination." My assumption is that it refers to the number of LNet hops to get to the destination, based on the explanation in the manual and the fact that the hops parameter was added as part of the routing feature.

However, as far as I understand from your explanation, the hop count is explicitly set to 1 in the ORNL configuration. Given that, switching the default to -1 to identify whether the hop count was explicitly set or not should work for ORNL as well. I'll update the patch with the suggested solution. |
| Comment by Amir Shehata (Inactive) [ 15/Aug/15 ] |
|
So I pushed a patch to default the number of hops to -1. However, I'm not happy with this solution.

Currently the hop count is used to decide which route to select and to calculate the distance to the nearest peer. If the default is -1 for routes which don't explicitly specify the # of hops, how should these routes be considered in the comparison? Should they be preferred, or should they be least preferred? To maintain the existing behavior (which is essential in order not to cause existing sites that use routes to start misbehaving), the logic I implemented makes a route with a default hop count of -1 equivalent to a route with a hop count of 1. The benefit of the solution is that it avoids the 6060 issue, provided routes are explicitly declared with a hop count of 1.

I still maintain, from my reading of the code, that the hop count is not being used as intended. The intention is for it to define the number of LNet hops to the destination. It was implemented as an optional parameter to give the user the flexibility to steer the route a message takes based on how far it is from the destination. From my understanding, its usage has evolved to be more of a priority level. Ideally, hop counts should not be specified by the user, because that gives an inaccurate view of the network. Hop counts should be measured as traffic flows through the network, and routing should then be adjusted according to the hop counts discovered. However, this is not a simple feature to implement, and I'm not sure of the need for it.

Furthermore, the way we do asymmetric router failure discovery seems wrong to me, which is essentially the core of the issue in 6060. We depend on the ping message, and when we get the ping info from the router we make a decision locally on whether that gateway can get us to the destination network. However, this is error prone, as we can see in 6060, and compounding the issue is our dependence on the hop count, which is itself user specified and error prone.

The semantics we currently use are that we ask the gateway to tell us all its network interfaces (but no more than 16 of them, which is in itself a problem), and when we get the information we requested we try to determine whether the gateway in question can get us to the destination network. The semantics of the query to the gateway should rather be: "Can you get to network X?" This way the gateway is the one providing us with the answer, and the answer can be a simple yes or no. This frees us from the 16-network limitation in ping, we never have to depend on the number of hops, and we avoid convoluted logic to try to figure out whether the gateway is up or not; we just query each gateway on our list with the destination network.

For example, if a route is defined as tcp1 <ip>@tcp2, then we query <ip>@tcp2 with tcp1 and let the gateway tell us if it can get to tcp1. |
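For reference, the "-1 behaves like 1" rule he describes reduces to something like the following in the comparison paths (hypothetical helper names; the landed change 15719 implements the equivalent inside the route-selection and LNetDist() logic):

    #include <stdint.h>

    #define LNET_UNDEFINED_HOPS ((uint32_t)-1)

    /* An unset hop count behaves exactly like the historical default of
     * 1, so existing configurations keep their current route selection. */
    static uint32_t effective_hops(uint32_t lr_hops)
    {
            return lr_hops == LNET_UNDEFINED_HOPS ? 1 : lr_hops;
    }

    static int route_is_closer(uint32_t hops_a, uint32_t hops_b)
    {
            return effective_hops(hops_a) < effective_hops(hops_b);
    }

Meanwhile the asymmetric-router-failure check can still distinguish "explicitly 1" from "never set", which is what avoids the 6060 breakage for unconfigured multi-hop routes.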
| Comment by Christopher Morrone [ 17/Aug/15 ] |
|
I think that this is the best we can do to restore basic functionality in time for Lustre 2.8.0. I mean, seriously, our admins couldn't even mount Lustre any more due to the bug introduced by the LU-6060 patch.

Yes, the hops value was originally used by ORNL as more of a priority than an lnet hop count. Later on the priority value was added, and they probably should have shifted to using that, but old habits are hard to break. I completely understand how hard it can be to overcome inertia in a production setting.

I agree that the asymmetric router failure detection design has issues. I am happy to talk about how to make it better, and how to perhaps automate hop counts and/or detection of reachable lnet networks. But I think that you should start a separate ticket or lustre-devel mailing list thread for that discussion. |
| Comment by Amir Shehata (Inactive) [ 17/Aug/15 ] |
|
If you look at the Lustre Manual, section 36.2.1.3, the default value is documented. So other sites could already be depending on the fact that the hop count defaults to 1 for correct operation of their cluster. Given that I don't have enough data to determine how many sites that is, I think deviating from documented behavior is dangerous. To me, the least dangerous of the solutions is to simply revert 6060 (which doesn't impact ORNL that much, since they already have scripts which check the health of their interfaces). The issue here is to correctly get asymmetric router failure detection to work, not to change the existing documented LNet behavior. |
| Comment by Andreas Dilger [ 13/Oct/15 ] |
|
I think the best that we can do with this messy situation is as follows:
|
| Comment by James A Simmons [ 29/Oct/15 ] |
|
We had a discussion about the options, Andreas. We have no problem now with the revert of the LU-6060 patch. |
| Comment by Christopher Morrone [ 30/Oct/15 ] |
|
I would like to see the 15719 change finished and landed for 2.8.0. Just documenting the change is not sufficient for LLNL for 2.8. We do not think that the added requirement is a particularly wise configuration design choice. If you make it optional with 15719, then it won't bother us.

For Intel's 2.5 FE branch, yes, please revert the LU-6060 patch. |
| Comment by James A Simmons [ 05/Nov/15 ] |
|
Am I reading this right? Basically you are stating that Andreas' suggestion to fix up your scripts for the hop count is the wrong thing, since it is not what you do. Earlier for the We compromise on the 2.5 |
| Comment by Christopher Morrone [ 05/Nov/15 ] |
|
We don't have scripts to generate lnet configurations per se, just configuration files. I don't think you are quite reading it right.
The changes that you made to use priority didn't affect LLNL in any way that I can see, and don't have all that much to do with this issue. I apologize if something I said caused a misunderstanding that resulted in you doing work earlier than you had planned.

Landing change 15719 is the compromise position. It allows lnet routing configuration to continue to work by default as it has since its inception, while at the same time providing the benefits of the LU-6060 change to those who explicitly configure hop counts. Change 15719 won't harm you, and it won't require us to accept new configuration burdens. Change 15719 is what the LU-6060 patch should have been. Change 15719 puts right what once went wrong. |
| Comment by James A Simmons [ 05/Nov/15 ] |
|
Since the hops configuration requirement far exceeds the effort level a system administrator can handle, and according to your experience with multi-hop the best configuration is to ignore the hops setting, why don't we just rip out the hops code then? |
| Comment by Christopher Morrone [ 06/Nov/15 ] |
|
Because you and Cray wanted to build improvements to ARF on top of the hop count, and I'm all about compromise. But seriously, that is an interesting thought... I'm not entirely sure what other purpose it serves now that we have priority. Is there a real use case for having hops on top of priority for routing purposes? Can ARF be made to work under the additional required failure modes without using hops? Somebody would have to give that more thought. 15719 would still be reasonable to land in the meantime, even if we decided to take hops out at some point. |
| Comment by Olaf Weber [ 11/Nov/15 ] |
|
My attempt to summarize, based on the thread above and the 15719 review comments.
In 15719 there are three proposals on the table for fixing this:
Option (3) is the simplest to implement. It is also guaranteed to break multi-hop configurations where only the non-default hopcounts have been specified.

Option (2) requires a decision to be made on what LNetDist() should do when it sees an undefined hopcount.

Option (1) minimizes the change to behavior. I count that as an advantage, because an unanticipated change in behavior is what landed us in this mess in the first place. For that reason I strongly recommend going with this option. |
| Comment by Christopher Morrone [ 11/Nov/15 ] |
|
Thanks, Olaf, that is a very nice summary. I'm not sure about the accuracy of this sentence: "The code from

Option 3 was an interesting idea, but you are correct that there is guaranteed breakage. I also worry a bit about locking in the idea that a number in the valid hop count range doesn't actually represent the real lnet hop count.

Options 1 and 2 have the same primary solution, and just differ on the secondary implementation details. I have been suggesting that primary solution all along, so I am certainly still in favor of it. I, of course, recommended the secondary details in Option 2. But I'm not wed to them. I'll throw my support behind Option 1 as well. |
| Comment by Gerrit Updater [ 02/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15719/ |
| Comment by Joseph Gmitter (Inactive) [ 02/Dec/15 ] |
|
http://review.whamcloud.com/#/c/15719/ has landed for 2.8.0 |