[LU-5570] Better router selection in LNet Created: 02/Sep/14 Updated: 13/Feb/19 Resolved: 13/Feb/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Liang Zhen (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Rank (Obsolete): | 15533 | ||||||||
| Description |
|
LNet chooses routers based on queued bytes on routers, at the meanwhile, it normally takes tens of seconds to detect dead routers (we see failed completion event of outstanding tx/rx, then close connection and notify LNet peer is dead) , which means it is still possible to queue more messages to a potentially dead router if all other alive routers have long message queue. we may need to check aliveness timestamp as part of router evaluation, and avoid to choose those routers that are inactive for certain number of seconds as long as there are other active routers (it takes pretty long to mark a router as dead, we might prefer not to choose it before marking it as dead) |
| Comments |
| Comment by Liang Zhen (Inactive) [ 04/Sep/14 ] |
|
Isaac, Amir, I just posted an experimental patch: http://review.whamcloud.com/11748 |
| Comment by Isaac Huang (Inactive) [ 08/Sep/14 ] |
|
How is it different from setting router_ping_timeout to a lower value (and thus detecting dead/congested routers earlier)? |
| Comment by Liang Zhen (Inactive) [ 09/Sep/14 ] |
|
if so, then more likely we will drop already queued messages even the router is still alive? I also think It's helpful if user can see last_alive of routers on clients/servers. |
| Comment by Isaac Huang (Inactive) [ 09/Sep/14 ] |
|
In case of a false dead router due to low timeout, we don't abandon messages already queued to it, do we? I thought we just stop giving it more messages, and leave those already queued intact. |
| Comment by Liang Zhen (Inactive) [ 09/Sep/14 ] |
|
but when we finalise message and return credit, we may drop queued message because router is down? hmm... if without this patch, then this is only happen on router, so you are correct, this is not an issue. |
| Comment by Isaac Huang (Inactive) [ 09/Sep/14 ] |
|
The dead_router_check_interval can be changed at run time, not a big deal really. My point is, the added code complexity seemed to out weight the benefits from the patch. If you want to go forward, why not also avoid unnecessary pings (e.g. no need to ping if last_alive is very recent) - then the additional benefit of reduced pings would make it more worthwhile. |
| Comment by Isaac Huang (Inactive) [ 09/Sep/14 ] |
|
Also, with aliveness for routers, it'd be possible to fix |
| Comment by Liang Zhen (Inactive) [ 11/Sep/14 ] |
|
I would think we should always avoid configuration changes when it's possible, we already have too many tunables which is overkill and very hard for users to make them all corrects. I totally agree it 's better to fix |
| Comment by Alexey Lyashkov [ 11/Sep/14 ] |
|
I think we should don't queue any messages in case negative credits. In that case we will queue only data able to send and easy to put other data to the different routers. |
| Comment by Liang Zhen (Inactive) [ 11/Sep/14 ] |
|
In current lnet, if there is any router has positive credit, we will queue message to it not to router with negative credits. |
| Comment by Alexey Lyashkov [ 11/Sep/14 ] |
|
it's not true, i have a lots crash dumps with negative credits per destination when router dead. |
| Comment by Gerrit Updater [ 04/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11748/ |
| Comment by Jodi Levi (Inactive) [ 08/Jan/15 ] |
|
Patch landed to Master. If there is more work to be done in this ticket, please reopen the ticket. |
| Comment by Gerrit Updater [ 09/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13302 |
| Comment by Liang Zhen (Inactive) [ 09/Jan/15 ] |
|
I just requested Oleg to revert it, because this patch is not cleanly rebased, also we need Isaac to review it. |
| Comment by Gerrit Updater [ 09/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13302/ |
| Comment by Gerrit Updater [ 12/Jan/15 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/13342 |
| Comment by James A Simmons [ 12/Jan/15 ] |
|
Do you think this could help with the ARF problems we have been having? Earlier comment seem to point to that. |
| Comment by Liang Zhen (Inactive) [ 12/Jan/15 ] |
|
James, probably not, I still need your help to sample NI status on router (my last comment on |
| Comment by Amir Shehata (Inactive) [ 18/Dec/18 ] |
|
This won't be needed with the Router re-work I have done. LNet Health currently uses the legacy routing code. I reworked the routing code to bring it inline with the Multi-Rail. These changes should resolve the issue described in this ticket. |