[LU-16999] Revert caf6095ade LU-15595 lnet: LNet peer aliveness broken Created: 27/Jul/23  Updated: 24/Aug/23  Resolved: 24/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Commit caf6095ade (LU-15595) restored the historic behavior of the LNet router peer health feature, but it did not account for the fact that the old LNet router checker behaved differently from the current implementation, which uses LNet discovery to perform the router checker pings. Because of this change to discovery, we can no longer guarantee that each router endpoint will be pinged within the peer aliveness window, and as a result the router may incorrectly determine that some peer NIs are not alive.

Just revert it for now.
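
The failure mode described above can be illustrated with a small model. This is only a sketch and not LNet code: the function names, the NI labels, the 60-second window and the timestamps are all hypothetical, and it assumes the simplest case, where a discovery-driven ping refreshes one NI of a peer while another NI goes unrefreshed for longer than the window.

```python
# Hypothetical model of a time-window aliveness check. Not LNet code:
# every name and number here is an illustrative assumption.

def is_alive(last_alive, now, window):
    """A peer NI counts as alive only if it was refreshed within the window."""
    return now - last_alive <= window


def check_peer(last_alive_by_ni, now, window):
    """Apply the window check to every NI of one peer."""
    return {ni: is_alive(t, now, window) for ni, t in last_alive_by_ni.items()}


WINDOW = 60   # assumed aliveness window, in seconds
NOW = 120     # time at which the router evaluates aliveness

# Old-style checker (assumed): every endpoint is pinged each interval,
# so both NIs carry a recent timestamp.
old_style = check_peer({"ni0": 100, "ni1": 100}, NOW, WINDOW)

# Discovery-driven pings (assumed): only ni0 was refreshed recently;
# ni1 is healthy but was last refreshed outside the window.
with_discovery = check_peer({"ni0": 100, "ni1": 10}, NOW, WINDOW)

print(old_style)
print(with_discovery)
```

In this model ni1 is reported as not alive in the second case even though nothing is wrong with it, which is the kind of false negative the description refers to.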



 Comments   
Comment by Gerrit Updater [ 27/Jul/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51791
Subject: LU-16999 lnet: Restore lpni aliveness check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 79b097fd3ae13e071f490a6eeb768c583bdf49a0

Comment by Gerrit Updater [ 24/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51791/
Subject: LU-16999 lnet: Restore lpni aliveness check
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 993d27d9ecc86bd030ca788bf9249485b11cdf8a

Comment by Peter Jones [ 24/Aug/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:31:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.