[LU-16213] kfilnd: Optimize issuing of hello messages to a peer Created: 05/Oct/22  Updated: 19/Jan/23  Resolved: 19/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The kfilnd <-> kfabric <-> kcxi_prov <-> Cassini stack has the following issue. If LNet is trying to send to a kfilnd peer which is down, the Cassini retry handler can take up to 60 seconds to cancel the corresponding message. While retrying, the Cassini retry handler takes control of hardware resources and does not release them until the retries are complete. If enough of these messages are sent to down peers, the Cassini retry handler can take control of all available hardware resources. Once this happens, Cassini cannot process new RDMA commands and back pressure starts occurring in kcxi_prov, which is propagated to kfilnd as -EAGAIN. As seen on JT, this can result in single RDMA operations taking minutes to complete.

To help prevent this issue, kfilnd should only send to peers it knows are up.

Looking at the kfilnd code today, I believe we can have multiple hello messages in flight to a single peer. If the peer is down, this can result in a buildup of hello messages, where each hello message will take the CXI retry handler 60 seconds to complete. There should only be a single hello message in flight per peer.
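
As a rough illustration of the "single hello in flight per peer" rule, the following Python model guards hello sends with a per-peer flag. It is only a sketch, not the kfilnd implementation: the `Peer` class, `try_claim_hello`, `hello_done`, and `maybe_send_hello` are hypothetical names, and the real code is kernel C with its own locking.

```python
import threading


class Peer:
    """Hypothetical stand-in for a kfilnd peer; not the real struct kfilnd_peer."""

    def __init__(self, nid):
        self.nid = nid
        self._lock = threading.Lock()
        self._hello_pending = False  # True while one hello is outstanding

    def try_claim_hello(self):
        """Return True only for the one caller that may send the hello."""
        with self._lock:
            if self._hello_pending:
                return False
            self._hello_pending = True
            return True

    def hello_done(self):
        """Clear the flag when the hello completes, successfully or not."""
        with self._lock:
            self._hello_pending = False


def maybe_send_hello(peer, send):
    """Send a hello unless one is already in flight; report whether we sent."""
    if not peer.try_claim_hello():
        return False
    send(peer.nid)
    return True
```

With this guard, a second caller that finds a hello already outstanding skips the send, so a down peer ties up at most one retry-handler slot for hello traffic at a time.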

As part of this work, kfilnd transactions may have to be queued until a hello message comes back. If the hello message fails, all queued transactions should be finalized.
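
The queueing behavior described here could look roughly like the Python model below. This is a sketch under assumptions, not the landed patch: `PeerQueue`, its state names, and the `log` list are hypothetical and exist only to show the flow (queue while the hello is outstanding, then either send or finalize everything that was queued).

```python
from collections import deque


class PeerQueue:
    """Hypothetical model of a peer that holds transactions until a hello returns."""

    def __init__(self):
        self.state = "new"      # "new" -> "hello_sent" -> "up", or back to "new"
        self.pending = deque()  # transactions waiting on the hello
        self.log = []           # (action, transaction) records, for illustration

    def submit(self, txn):
        """Send immediately if the peer is known up, otherwise queue."""
        if self.state == "up":
            self.log.append(("sent", txn))
            return
        self.pending.append(txn)
        if self.state == "new":
            # Only the first queued transaction triggers a hello.
            self.state = "hello_sent"
            self.log.append(("hello", None))

    def hello_complete(self, ok):
        """Drain the queue: send on success, finalize on failure."""
        self.state = "up" if ok else "new"
        action = "sent" if ok else "finalized"
        while self.pending:
            self.log.append((action, self.pending.popleft()))
```

In this model a failed hello returns the peer to the "new" state, so the next submitted transaction triggers a fresh hello rather than being sent to a peer that may still be down.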



 Comments   
Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48780
Subject: LU-16213 kfilnd: Rename struct kfilnd_peer members
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 44087d1f7a2110d12cba7e19fe34c3b4ecadf9d9

Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48781
Subject: LU-16213 kfilnd: Add peer info to some debug statements
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c37a74c68a47c830ba662d223229acf2c88e8ae0

Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48782
Subject: LU-16213 kfilnd: Fail sends of particular message type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0c7af25572427b7b6c46e0e8548ef6856fb69e09

Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48783
Subject: LU-16213 kfilnd: Allow one HELLO in-flight per peer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4071f4148546da4d4496d7d929600c90b32aab46

Comment by Gerrit Updater [ 05/Oct/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48784
Subject: LU-16213 kfilnd: Finalize replay TNs with deleted peer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 78d58c06f3f90cc81f9327b7fd614056ad2a4fea

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48780/
Subject: LU-16213 kfilnd: Rename struct kfilnd_peer members
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 679e73db770d188f43aa4d50592d65e337ad135e

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48781/
Subject: LU-16213 kfilnd: Add peer info to some debug statements
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ba0e08cfdc5cfc1b7f1fc368916ff14e229e0b29

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48782/
Subject: LU-16213 kfilnd: Fail sends of particular message type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 35747c871df1c2e97d415cb7c3601e045a58c8e6

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48783/
Subject: LU-16213 kfilnd: Allow one HELLO in-flight per peer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 11a32d886b3c9b7c3c9a6ec5a6ebdc2786ef1c71

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48784/
Subject: LU-16213 kfilnd: Finalize replay TNs with deleted peer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 08bbe9e562c403f247a74e99101d238398df6351

Comment by Peter Jones [ 19/Jan/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:25:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.