[LU-16451] kfilnd: Enhance TN state machine to handle peer in "failed" state Created: 06/Jan/23 Updated: 27/Jan/23 Resolved: 27/Jan/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
If a send request (immediate or bulk) fails with the EHOSTUNREACH error number, it means the cxi retry handler failed to send a message, which indicates an issue with the peer or the fabric. If a kfilnd transaction (TN) fails with the EHOSTUNREACH error number, update the peer to a new "failed" state. While a peer is in this failed state, require a completed HELLO before sending any more packets to that peer. The idea is to minimize the number of outstanding messages (which consume cxi resources) until either the peer recovers or the timeouts expire. |
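The gating described above can be sketched as a small state machine. This is a minimal illustration only, not the kfilnd implementation: every name here (`peer_state`, `peer_tn_complete`, `peer_may_send`, `peer_hello_complete`) is hypothetical, and the sketch assumes the kernel convention of reporting failures as negative errno values.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical peer states; the names are illustrative, not kfilnd's. */
enum peer_state {
	PEER_UP,         /* normal traffic is allowed */
	PEER_FAILED,     /* a TN failed with EHOSTUNREACH */
	PEER_HELLO_SENT  /* a HELLO is in flight and has not completed yet */
};

/* Hypothetical message types. */
enum msg_type { MSG_HELLO, MSG_IMMEDIATE, MSG_BULK };

struct peer {
	enum peer_state state;
};

/* Record the completion status of a TN (assumed to be 0 or -errno). */
static inline void peer_tn_complete(struct peer *p, int status)
{
	if (status == -EHOSTUNREACH)
		p->state = PEER_FAILED;
}

/*
 * Decide whether a message may be posted to the peer now. A failed peer
 * accepts only a HELLO, and only one HELLO is outstanding at a time.
 */
static inline bool peer_may_send(struct peer *p, enum msg_type type)
{
	switch (p->state) {
	case PEER_UP:
		return true;
	case PEER_FAILED:
		if (type == MSG_HELLO) {
			p->state = PEER_HELLO_SENT;
			return true;
		}
		return false;
	case PEER_HELLO_SENT:
		return false;
	}
	return false;
}

/* A completed HELLO restores the peer; a failed one leaves it failed. */
static inline void peer_hello_complete(struct peer *p, int status)
{
	if (p->state != PEER_HELLO_SENT)
		return;
	p->state = (status == 0) ? PEER_UP : PEER_FAILED;
}
```

In this sketch, refusing everything except a single HELLO while the peer is failed is what keeps the number of outstanding messages low. How the real TN state machine queues, retries, or times out the refused sends is not described in this ticket and is left out here.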
| Comments |
| Comment by Gerrit Updater [ 10/Jan/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49589 |
| Comment by Gerrit Updater [ 10/Jan/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49591 |
| Comment by Gerrit Updater [ 19/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49591/ |
| Comment by Peter Jones [ 19/Jan/23 ] |
|
Landed for 2.16 |
| Comment by Peter Jones [ 19/Jan/23 ] |
|
Oops - still one more patch left |
| Comment by Gerrit Updater [ 27/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49589/ |
| Comment by Peter Jones [ 27/Jan/23 ] |
|
Now all appear to be merged |