[LU-16451] kfilnd: Enhance TN state machine to handle peer in "failed" state Created: 06/Jan/23  Updated: 27/Jan/23  Resolved: 27/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

If a send request (immediate or bulk) fails with the EHOSTUNREACH error number, it means the cxi retry handler failed to send a message, which indicates an issue with the peer or the fabric.

If a kfilnd transaction (TN) fails with the EHOSTUNREACH error number, update the peer to a new "failed" state.

When a peer is in this failed state, require a completed HELLO before sending any more packets to that peer.

The idea is to minimize the number of outstanding messages (which consume cxi resources) until either the peer recovers or the timeouts expire.
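
For illustration only, the gating described above could be sketched roughly as follows. This is not the kfilnd code from the patches: the type, state, and function names (peer_state, peer_tn_done, peer_next_action, peer_hello_done) are hypothetical, and the sketch assumes completion status is reported as 0 or a negative errno value. It also assumes that traffic to a failed peer is simply deferred while a single HELLO is outstanding; the actual throttling policy may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical peer states; names are illustrative, not kfilnd's. */
enum peer_state {
	PEER_UP,		/* normal traffic is allowed */
	PEER_FAILED,		/* a TN failed with EHOSTUNREACH */
	PEER_HELLO_PENDING,	/* HELLO sent, completion not yet seen */
};

struct peer {
	enum peer_state state;
};

/* What the sender should do for the next message to this peer. */
enum send_action {
	SEND_MSG,	/* peer is healthy: send the message */
	SEND_HELLO,	/* peer is failed: send a HELLO first */
	SEND_DEFER,	/* HELLO in flight: hold the message back */
};

/* A regular transaction completed; status is 0 or a negative errno. */
static void peer_tn_done(struct peer *p, int status)
{
	assert(p != NULL);
	/* Only a healthy peer moves to "failed"; a pending HELLO decides the rest. */
	if (status == -EHOSTUNREACH && p->state == PEER_UP)
		p->state = PEER_FAILED;
}

/* Decide what may be sent to the peer right now. */
static enum send_action peer_next_action(struct peer *p)
{
	assert(p != NULL);
	switch (p->state) {
	case PEER_UP:
		return SEND_MSG;
	case PEER_FAILED:
		/* Allow exactly one HELLO to probe the peer. */
		p->state = PEER_HELLO_PENDING;
		return SEND_HELLO;
	case PEER_HELLO_PENDING:
		return SEND_DEFER;
	}
	return SEND_DEFER;
}

/* The HELLO completed; success restores the peer, failure keeps it failed. */
static void peer_hello_done(struct peer *p, int status)
{
	assert(p != NULL);
	if (p->state != PEER_HELLO_PENDING)
		return;
	p->state = (status == 0) ? PEER_UP : PEER_FAILED;
}
```

In this sketch at most one HELLO is outstanding per failed peer, so a peer that stays unreachable consumes resources for a single message rather than for every queued send.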



 Comments   
Comment by Gerrit Updater [ 10/Jan/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49589
Subject: LU-16451 kfilnd: Improve CQ error logging
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3b04712ef0957b3a885c151c87adf5a2cc9aebeb

Comment by Gerrit Updater [ 10/Jan/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49591
Subject: LU-16451 kfilnd: Throttle traffic to down peers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 19b0120001224d4813e67ba20c6589ee5bc1b086

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49591/
Subject: LU-16451 kfilnd: Throttle traffic to down peers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c64484bb87eab9834e825927fbf18e658a2a8d57

Comment by Peter Jones [ 19/Jan/23 ]

Landed for 2.16

Comment by Peter Jones [ 19/Jan/23 ]

Oops - still one more patch left

Comment by Gerrit Updater [ 27/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49589/
Subject: LU-16451 kfilnd: Improve CQ error logging
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e500f49c302c6f10ba3b701e83db4da2b4b68a11

Comment by Peter Jones [ 27/Jan/23 ]

Now all appear to be merged

Generated at Sat Feb 10 03:27:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.