[LUDOC-494] Clarify when setting lnet route hops is required for Lustre 2.12 and Lustre 2.14
Created: 29/Jul/21
Updated: 08/Aug/23

Status: Open
Project: Lustre Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Olaf Faaland
Assignee: Serguei Smirnov
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In a discussion on https://review.whamcloud.com/#/c/43127/, in response to:

"Note that bit of code is requiring that hop count be set for some routes, when they did not need to be set before (in lustre 2.12)"

Chris said:

"Yes, good point. I think there was always an implicit requirement that hop count be set for multi-hop routes if the avoid_asym_route_failure feature was enabled, but we should make that explicit."

However, this isn't yet reflected in the manual or lnetctl(8).

 



 Comments   
Comment by Olaf Faaland [ 29/Jul/21 ]

Related to https://jira.whamcloud.com/browse/LU-14555

Comment by Peter Jones [ 07/Aug/21 ]

Serguei

Could you please advise on what changes should be made to the manual here?

Thanks

Peter

Comment by Serguei Smirnov [ 10/Aug/21 ]

Hi,

I reviewed the current related documentation. The recommended changes are listed below:

The lnetctl section of the Lustre manual and lnetctl man page should be updated to mention that the hop count defaults to 1 if not specified when adding a route with lnetctl.

Also, the manual should be updated to clarify that the "avoid_asym_router_failure" module parameter applies only to single-hop routers.

Also, the following passage from section 34.3.7, "LNet Peer Health", should be modified:

"A router is considered down if any of its NIDs are down. For example, router X has three NIDs: Xnid1, Xnid2, and Xnid3. A client is connected to the router via Xnid1. The client has router checker enabled. The router checker periodically sends a ping to the router via Xnid1. The router responds to the ping with the status of each of its NIDs. In this case, it responds with Xnid1=up, Xnid2=up, Xnid3=down. If avoid_asym_router_failure==1, the router is considered down if any of its NIDs are down, so router X is considered down and will not be used for routing messages. If avoid_asym_router_failure==0, router X will continue to be used for routing messages."  

The above sounds incorrect to me now, because the router shouldn't be considered down unless it cannot reach the remote net.
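
To make the difference concrete, here is a minimal Python sketch that contrasts the rule in the quoted manual passage with the behavior described in the sentence above. It is only an illustrative model, not LNet code: the Route class, both function names, and the net names (o2ib0, tcp1, tcp2) are hypothetical, and "cannot reach remote net" is assumed here to mean "has no NID that is up on the route's remote net". The hops default of 1 mirrors the lnetctl default mentioned in the first recommendation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical model of a route entry, not an LNet structure."""
    remote_net: str
    gateway: str
    hops: int = 1  # assumed default when no hop count is specified


# Router ping reply, modeled as: NID name -> (net the NID is on, is it up?)
NidStatus = Dict[str, Tuple[str, bool]]


def router_up_as_documented(nids: NidStatus,
                            avoid_asym_router_failure: bool) -> bool:
    """Rule from the quoted manual passage: any down NID marks the router down."""
    if not avoid_asym_router_failure:
        # The passage says the router continues to be used in this case.
        return True
    return all(up for _net, up in nids.values())


def route_usable_as_proposed(route: Route, nids: NidStatus,
                             avoid_asym_router_failure: bool) -> bool:
    """Assumed reading of the proposed rule: down only if the remote net is unreachable."""
    if not avoid_asym_router_failure or route.hops != 1:
        # The check is modeled as applying only to single-hop routes.
        return True
    return any(up for net, up in nids.values() if net == route.remote_net)


# Router X from the manual example; the nets are made up for illustration.
router_x: NidStatus = {
    "Xnid1": ("o2ib0", True),   # NID the client uses to reach the router
    "Xnid2": ("tcp1", True),
    "Xnid3": ("tcp2", False),
}

to_tcp1 = Route(remote_net="tcp1", gateway="Xnid1")
to_tcp2 = Route(remote_net="tcp2", gateway="Xnid1")
```

Under these assumptions, the documented rule marks router X down for every route because Xnid3 is down, while the proposed rule still allows the route to tcp1 and rejects only the route to tcp2.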

Thanks,

Serguei.

 

Comment by Olaf Faaland [ 13/Aug/21 ]

Hi Serguei,

We'll be happy to review the patches.

Thanks

Comment by Gerrit Updater [ 14/Sep/21 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44916
Subject: LUDOC-494 lnet: clarify use of route hopcount
Project: doc/manual
Branch: master
Current Patch Set: 1
Commit: 48588c2fcdc74e48caca530f7e38f3036143ea95

Comment by Gerrit Updater [ 08/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/doc/manual/+/44916/
Subject: LUDOC-494 lnet: clarify use of route hopcount
Project: doc/manual
Branch: master
Current Patch Set:
Commit: f7da09ba79b2522ca51d001c59ab1212d051309c

Generated at Sat Feb 10 03:43:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.