[LU-9427] Have Routers Manage Ping Rates via Ping Reply Created: 01/May/17  Updated: 02/May/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Doug Oucharek (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

Today, if a customer wants to change the router_checker ping rate, they need to change the module parameters on every client/server and reload LNet.  That is painful on a large production cluster.

This ticket proposes that routers include a timeout value in their ping replies, telling clients/servers when to ping next.  If that value can be controlled dynamically via lnetctl, customers can change the ping rate by issuing an lnetctl command on the handful of LNet routers, without changing clients/servers at all.
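
To make the proposal concrete, here is a minimal sketch of the peer-side logic, assuming the ping reply carried an optional "next ping in N seconds" hint. The field name next_ping_in, the clamp bounds, and the 60-second default are hypothetical placeholders for illustration, not existing LNet names or values.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder for the locally configured interval (today a module parameter).
DEFAULT_PING_INTERVAL = 60  # seconds, arbitrary value for this sketch


@dataclass
class PingReply:
    # Hypothetical field: seconds until the peer should ping this router again.
    # None models a router that does not send the hint.
    next_ping_in: Optional[int] = None


def next_ping_interval(reply: PingReply,
                       local_default: int = DEFAULT_PING_INTERVAL,
                       min_interval: int = 1,
                       max_interval: int = 3600) -> int:
    """Return how long the peer should wait before pinging the router again."""
    hint = reply.next_ping_in
    # Fall back to the local setting when the router sends no usable hint.
    if hint is None or hint <= 0:
        return local_default
    # Clamp the hint so a misconfigured router cannot stall or flood the peer.
    return max(min_interval, min(max_interval, hint))
```

Falling back to the local value keeps peers working with routers that do not send the hint, and the clamp is one possible guard against a bad value pushed from a router.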

This will also be useful later on with LNet Health, as congestion detection in a router could trigger a larger timeout value in all of that router's ping replies.



 Comments   
Comment by Doug Oucharek (Inactive) [ 02/May/17 ]

The OPA developers tell me that it is better to have all the pings "batch" up at the same time rather than to have them staggered out over time.  This feature could allow the routers to coordinate the intervals to get them synchronized for efficiency.
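
One way a router could line the pings up, sketched under assumptions: when answering a ping, it returns the time remaining until the next boundary of a fixed period on its own clock, so every peer that honors the reply pings again at roughly the same instant. The function name and parameters are hypothetical, and network latency is ignored.

```python
def seconds_to_next_batch(now: float, period: float) -> float:
    """Seconds from `now` until the next multiple of `period` on the router's clock."""
    if period <= 0:
        raise ValueError("period must be positive")
    # A reply sent exactly on a boundary schedules the peer for the following
    # boundary, so the returned delay is always in the range (0, period].
    return period - (now % period)
```

Because the reply carries a relative delay rather than an absolute time, peers would not need clocks synchronized with the router.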

Comment by Andreas Dilger [ 02/May/17 ]

Sounds very good. I'm definitely not a fan of the current "need all clients and servers to have the same tunables set" behavior of o2iblnd. The more dynamic (or at least accepting of differences between peers) that this code can be, the better.

The question of batching vs. staggering pings is interesting. Why is batching better? Power, jitter, something else?
