[LU-16214] Minimize dropping kfilnd messages at target due to stale peer Created: 05/Oct/22  Updated: 19/Jan/23  Resolved: 19/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When LNet is restarted at a target, the kfilnd peer is marked as stale. An incoming message from the initiator peer is then silently dropped and a peer handshake exchange is started.

While this works as designed, since kfilnd is connectionless, it is not optimal because kfilnd clients must rely on their error handling to detect that a message was dropped and then retry.

The purpose of this story is to determine how to minimize the occurrence of dropped kfilnd messages.

One thought is to have kfilnd proactively do a hello handshake on a send if a message hasn't been received from a peer for some period of time, similar to what it does when it knows the peer is stale.
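
A rough sketch of that idea in plain userspace C, for illustration only. None of the names below (struct peer, peer_prepare_send, PEER_IDLE_LIMIT_SEC, the state names) are taken from the kfilnd source, and the 30 second threshold is an arbitrary assumption. The point is only that the send path could treat "not heard from recently" the same way it treats "known stale".

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical peer states; not the actual kfilnd definitions. */
enum peer_state {
	PEER_UPTODATE,    /* handshake completed, peer believed current */
	PEER_STALE,       /* known stale: must handshake before sending */
	PEER_HANDSHAKING  /* hello sent, response not yet received */
};

struct peer {
	enum peer_state state;
	uint64_t last_alive_sec; /* when a message was last received from peer */
};

/* Assumed idle threshold; a real value would need tuning. */
#define PEER_IDLE_LIMIT_SEC 30

enum send_action { SEND_MESSAGE, SEND_HELLO_FIRST, WAIT_FOR_HELLO };

/* True if the peer is known stale, or has been silent for too long. */
static bool peer_needs_handshake(const struct peer *p, uint64_t now_sec)
{
	assert(p != NULL);
	if (p->state == PEER_STALE)
		return true;
	return p->state == PEER_UPTODATE &&
	       now_sec >= p->last_alive_sec &&
	       now_sec - p->last_alive_sec > PEER_IDLE_LIMIT_SEC;
}

/* Decide what the send path should do before transmitting to this peer. */
static enum send_action peer_prepare_send(struct peer *p, uint64_t now_sec)
{
	if (p->state == PEER_HANDSHAKING)
		return WAIT_FOR_HELLO;
	if (peer_needs_handshake(p, now_sec)) {
		p->state = PEER_HANDSHAKING;
		return SEND_HELLO_FIRST;
	}
	return SEND_MESSAGE;
}

/* Record a completed handshake and refresh the liveness timestamp. */
static void peer_handshake_done(struct peer *p, uint64_t now_sec)
{
	p->state = PEER_UPTODATE;
	p->last_alive_sec = now_sec;
}
```

With this shape, a peer that has been quiet longer than the threshold costs one extra hello exchange on the next send, instead of a dropped message that the client has to detect and retry.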



 Comments   
Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48785
Subject: LU-16214 kfilnd: Keep stale peer entries
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52f14a75c73fe21a5828a8de00f05dabe20bae34

Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48786
Subject: LU-16214 kfilnd: Proactively handshake old peers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 05dc08cb00cd29d1adbe3532fec75e8d15cd6d82

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48785/
Subject: LU-16214 kfilnd: Keep stale peer entries
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c1f7eaa24f14aa567b80d99676c765db2b331d40

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48786/
Subject: LU-16214 kfilnd: Proactively handshake old peers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: de2536850ed2ecc2169dec4ccc458589314b2896

Comment by Peter Jones [ 19/Jan/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:25:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.