[LU-11382] Graceful Router Reboots Created: 14/Sep/18  Updated: 24/Jul/19  Resolved: 24/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Mahmoud Hanafi Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

In a cluster with more than one router, what is the most graceful way to reboot some or all of the routers without causing client/server connectivity issues?

 

 



 Comments   
Comment by Peter Jones [ 14/Sep/18 ]

Sonia

Could you please advise?

Thanks

Peter

Comment by Sonia Sharma (Inactive) [ 18/Sep/18 ]

Hi Mahmoud,

To reboot a router without affecting client-server connectivity, we would ideally need to ensure that an alternate routing path is available between clients and servers. It would also be good to take down the routes that use the router (remove the related routes configured on clients and servers) before rebooting that specific router. If multiple routers need to be rebooted, it would be better to reboot them one by one instead of all at once.
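As a rough illustration of the one-at-a-time approach described above, here is a minimal Python sketch that builds an ordered list of steps for a rolling reboot. It is only a planning aid under stated assumptions: the function name `rolling_reboot_plan`, the route tuples, and the example NIDs are hypothetical, and it assumes routes are managed dynamically with `lnetctl route add` / `lnetctl route del` on the clients and servers. It refuses to plan a reboot that would leave a remote network without any alternate gateway.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A route as seen from a client or server: (remote_net, gateway_nid).
Route = Tuple[str, str]


def rolling_reboot_plan(routes: List[Route], routers: List[str]) -> List[str]:
    """Return ordered steps for rebooting `routers` one at a time.

    Raises ValueError if taking a router down would leave some remote
    network with no other gateway (no alternate routing path).
    """
    gateways_by_net: Dict[str, Set[str]] = defaultdict(set)
    for net, gateway in routes:
        gateways_by_net[net].add(gateway)

    plan: List[str] = []
    for router in routers:
        # Remote networks that are currently reachable through this router.
        nets = sorted(n for n, gws in gateways_by_net.items() if router in gws)
        for net in nets:
            if len(gateways_by_net[net]) < 2:
                raise ValueError(
                    f"no alternate gateway to {net} if {router} goes down"
                )
        # Remove the routes first so no traffic is sent to the router.
        for net in nets:
            plan.append(f"lnetctl route del --net {net} --gateway {router}")
        plan.append(f"# reboot {router} and wait until it is routing again")
        # Restore the routes before moving on to the next router.
        for net in nets:
            plan.append(f"lnetctl route add --net {net} --gateway {router}")
    return plan


if __name__ == "__main__":
    # Hypothetical configuration: two routers serving the same remote net.
    example = [("o2ib1", "10.0.0.1@o2ib"), ("o2ib1", "10.0.0.2@o2ib")]
    for step in rolling_reboot_plan(example, ["10.0.0.1@o2ib", "10.0.0.2@o2ib"]):
        print(step)
```

The generated `lnetctl` lines would have to be run on every client and server that has the route configured; the sketch only orders the steps and does not execute anything.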

Apart from this general advice, it would help to know what the network configuration of the system looks like. There was a bug related to issues after router reboots that was fixed in Lustre 2.8, so whether or not a router reboot results in client-server connectivity issues also depends on the Lustre version running on the nodes.

The bug I was referring to is:
https://jira.whamcloud.com/browse/LU-7646 - This bug is seen with mlx5 cards: when a router is rebooted, a client's attempt to reconnect to the router can get stuck in a permanent connecting state. When the router comes back up and tries to create a connection to the client, that connection is rejected as CON RACE, and this repeats in an infinite loop.

Comment by Mahmoud Hanafi [ 24/Jul/19 ]

Please close.
