[LU-14566] Skip discovery in LNetPrimaryNID when lnet_peer_discovery_disabled is set Created: 26/Mar/21  Updated: 15/Jul/21  Resolved: 28/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14668 LNet: do discovery in the background Resolved

 Description   

If discovery is disabled locally then the discovery thread will not
modify any peer objects as a result of the discovery process. Thus,
the primary NID of any peer we're asked to discover will not change
as a result of discovery. Therefore, we do not need to actually
perform discovery in LNetPrimaryNID() if discovery is disabled
locally. Since this routine can result in long client mount times
when a Lustre server is down, we should avoid this unnecessary
discovery.
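
The reasoning above can be illustrated with a minimal sketch. This is not the code from the merged patch: every type and function name ending in _sketch is a hypothetical stand-in, and the flag only mirrors the lnet_peer_discovery_disabled tunable named in the summary. It assumes that discovery, when enabled, is a blocking call that may update the peer's primary NID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real LNet types. */
typedef uint64_t nid_sketch_t;

struct peer_sketch {
	nid_sketch_t primary_nid;    /* primary NID currently recorded */
	nid_sketch_t discovered_nid; /* what a discovery round would report */
	int discover_calls;          /* number of blocking discovery attempts */
};

/* Assumed to mirror the lnet_peer_discovery_disabled tunable. */
static bool discovery_disabled_sketch;

/* Stand-in for a blocking discovery round; in a real system this is
 * the step that can stall for a long time when a server is down. */
static void discover_sketch(struct peer_sketch *peer)
{
	peer->discover_calls++;
	peer->primary_nid = peer->discovered_nid;
}

/* Return the peer's primary NID. When discovery is disabled locally,
 * the result cannot change, so the blocking step is skipped. */
static nid_sketch_t primary_nid_sketch(struct peer_sketch *peer)
{
	assert(peer != NULL);

	if (!discovery_disabled_sketch)
		discover_sketch(peer);

	return peer->primary_nid;
}
```

With the flag set, primary_nid_sketch() returns the recorded NID without calling discover_sketch() at all, which is the saving the description argues for.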



 Comments   
Comment by Gerrit Updater [ 26/Mar/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43141
Subject: LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eebda0deedd9a053f432b9638c70db2ce1cdf8ca

Comment by Amir Shehata (Inactive) [ 27/Mar/21 ]

Do you see this issue even with:

 LU-13972 o2iblnd: Don't retry indefinitely

?

Comment by Chris Horn [ 27/Mar/21 ]

That change will only impact local peers. Remote clients will still have to wait for the full LNet transaction timeout.

Comment by Gerrit Updater [ 28/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43141/
Subject: LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 16264da9e3c43a6368a25b6ded4113e8cfa57427

Comment by Peter Jones [ 28/Apr/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 04/May/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43537
Subject: LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 58cab53ba8c387722e638c097c0496af007f091f

Comment by Etienne Aujames [ 04/May/21 ]

We encountered an issue with 2 LNet routes missing on the server side (OSS): the clients could communicate with the servers, but the servers could not answer.

The clients periodically tried to connect to the servers, which kept the missing peers in the discovery list (the_lnet.ln_dc_working). As a consequence, the ll_ostXX_XXX threads waited indefinitely for peer discovery, and this progressively tied up all the available threads (the clients kept sending connection requests).

The server became unavailable for all the clients.

"LNet discovery" and "LNet health" are both disabled on the clients and on the servers.
