[LU-9238] Enhancement for route failure detection Created: 21/Mar/17  Updated: 01/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature
Priority: Minor
Reporter: Chris Horn
Assignee: Amir Shehata (Inactive)
Resolution: Unresolved
Votes: 0
Labels: None


 Description   

I've been thinking about ways to enhance route failure detection, since asymmetric route failure detection doesn't do much for multi-hop configurations. The idea I had was to extend the LNet ping info to include route up/down status. That way, a peer could learn the route status of its next hop and use that information when selecting a next hop for future sends. Furthermore, in multi-hop configurations, a bad hop anywhere on a route should eventually percolate to all peers that use that route. This isn't an ideal solution, since it requires a wire protocol change, but I thought I would open this ticket to discuss it further; maybe we can come up with another option.
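
To make the idea concrete, here is a rough toy simulation of how advertised route status could percolate. This is only a sketch, not LNet code: the `Node` class, the `ping_reply` format, the `ping_round` helper, and the network and router names are all hypothetical and chosen purely for illustration. It assumes each node's ping reply lists the networks it can currently reach, and that a peer only uses a next hop whose reply lists the target network.

```python
class Node:
    """Toy model of an LNet node (hypothetical, for illustration only)."""

    def __init__(self, name, local_nets=(), routes=None):
        self.name = name
        self.local_nets = set(local_nets)   # directly attached networks
        self.routes = routes or {}          # remote net -> [gateway names]
        self.alive = True
        # Per-network route status this node advertises; starts optimistic.
        self.status = {net: True for net in self.routes}

    def ping_reply(self):
        """Assumed extended ping info: reachable-net map, or None if down."""
        if not self.alive:
            return None
        reply = {net: True for net in self.local_nets}
        reply.update(self.status)
        return reply


def advertises(reply, net):
    """True if a ping reply exists and reports `net` as reachable."""
    return reply is not None and reply.get(net, False)


def ping_round(nodes):
    """Every node pings its gateways once and refreshes its route status."""
    # Snapshot replies first so all nodes see the same pre-round state.
    replies = {name: node.ping_reply() for name, node in nodes.items()}
    for node in nodes.values():
        for net, gateways in node.routes.items():
            node.status[net] = any(advertises(replies[gw], net) for gw in gateways)


def usable_gateways(nodes, node, net):
    """Next hops whose current ping reply says `net` is reachable."""
    return [gw for gw in node.routes.get(net, [])
            if advertises(nodes[gw].ping_reply(), net)]


# Two-hop topology: client -> (R1 | R2) -> (R3 | R4) -> tcp2
nodes = {
    "client": Node("client", ["tcp0"], {"tcp2": ["R1", "R2"]}),
    "R1": Node("R1", ["tcp0", "tcp1"], {"tcp2": ["R3"]}),
    "R2": Node("R2", ["tcp0", "tcp1"], {"tcp2": ["R4"]}),
    "R3": Node("R3", ["tcp1", "tcp2"]),
    "R4": Node("R4", ["tcp1", "tcp2"]),
}

before = usable_gateways(nodes, nodes["client"], "tcp2")

nodes["R3"].alive = False   # second-hop router fails
ping_round(nodes)           # R1 notices and starts advertising tcp2 as down

after = usable_gateways(nodes, nodes["client"], "tcp2")
print("before:", before)
print("after: ", after)
```

In this toy model the client never pings R3 directly; it only stops using R1 because R1's reply no longer reports the remote network as reachable. With longer chains, each extra hop would need one more ping round before the failure becomes visible at the far end.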



 Comments   
Comment by Andreas Dilger [ 23/Mar/17 ]

This may overlap with the LNet Multi-Rail and/or Dynamic Discovery work, as well as the proposed LNet Resiliency project.

Amir, could you please comment when you have a chance.
