[LU-14827] Allow Lnet peer entries to be updated if peer's NIDs change Created: 07/Jul/21  Updated: 07/Jul/21  Resolved: 07/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Uwe Sauter Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

My knowledge about Lustre is limited so please correct me where necessary.

Imagine the following situation: You have a Lustre (2.14) file system running and the clients can access Lustre.

Now you want to resolve some issues with the startup order on the clients, and in doing so you get wrong the order in which the lnet and lustre modules are loaded and configured. In my particular case, the lustre module was loaded before the LNet configuration for InfiniBand was done, so the lustre module configured an LNet network on Ethernet, yet there is no Ethernet connection between the clients and the Lustre servers.

This resulted in two NIs (@tcp and @o2ib) being configured per client, with @tcp as the primary NID. The Lustre servers happily accept these peer configurations, but Lustre operation gets slower because the servers try to reach the clients via @tcp first (and vice versa).

Having spotted that mistake and corrected the order in which LNet is configured and the lustre module is loaded, the clients then get only one NI configured (@o2ib), which naturally is the primary NID. But the Lustre servers do not update the LNet peer entries they have already discovered and keep a primary NID of @tcp for the clients. Thus the servers still try to connect to the clients using @tcp.
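
To illustrate how such stale entries could be spotted on a server, here is a minimal Python sketch that scans peer-table text for peers whose primary NID is still on a @tcp network. The sample text is an abbreviated, made-up excerpt modeled on the YAML that `lnetctl peer show` prints; the addresses are hypothetical and the exact fields may differ between Lustre versions. The helper names (`parse_peers`, `suspicious_peers`) are invented for this example.

```python
import re

# Hypothetical sample modeled on `lnetctl peer show` output (addresses made up).
SAMPLE = """\
peer:
    - primary nid: 192.168.1.10@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.1.10@tcp
          state: NA
        - nid: 10.0.0.10@o2ib
          state: NA
    - primary nid: 10.0.0.11@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.0.0.11@o2ib
          state: NA
"""


def parse_peers(text):
    """Return a list of {'primary': nid, 'nids': [nid, ...]} dicts."""
    peers = []
    for line in text.splitlines():
        m = re.match(r"\s*-\s*primary nid:\s*(\S+)", line)
        if m:
            # A new peer entry starts here.
            peers.append({"primary": m.group(1), "nids": []})
            continue
        m = re.match(r"\s*-\s*nid:\s*(\S+)", line)
        if m and peers:
            # A NID belonging to the most recently started peer entry.
            peers[-1]["nids"].append(m.group(1))
    return peers


def suspicious_peers(peers, unwanted_net="tcp"):
    """Peers whose primary NID sits on a network starting with unwanted_net."""
    return [p for p in peers
            if p["primary"].split("@", 1)[1].startswith(unwanted_net)]


for peer in suspicious_peers(parse_peers(SAMPLE)):
    print("stale primary NID:", peer["primary"], "known NIDs:", peer["nids"])
```

On a real server the text would come from the actual `lnetctl peer show` output instead of `SAMPLE`.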

A fool's resolution would be to just remove the peer entry on the Lustre servers and immediately add back a correct entry. But this leads to hiccups that affect the whole file system, possibly leading to reboots of the Lustre servers.
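
For reference, the remove-and-re-add approach could be sketched as the following dry run, which only prints the commands and does not execute anything. It assumes the `lnetctl peer del` and `lnetctl peer add` subcommands with `--prim_nid` and `--nid` options; the flag spellings, and whether `peer add` accepts a primary NID without additional NIDs, are assumptions that should be checked against the lnetctl version installed on the servers. The NIDs and the function name are hypothetical.

```python
import shlex


def rebuild_peer_commands(stale_primary, new_primary, extra_nids=()):
    """Build (but do not run) the commands to replace a stale peer entry.

    Flag names are assumptions; verify them with `lnetctl peer --help`.
    """
    # Remove the whole peer entry identified by its stale primary NID.
    delete = ["lnetctl", "peer", "del", "--prim_nid", stale_primary]
    # Re-create the peer with the correct primary NID.
    add = ["lnetctl", "peer", "add", "--prim_nid", new_primary]
    if extra_nids:
        # Additional NIDs are passed as one comma-separated list.
        add += ["--nid", ",".join(extra_nids)]
    return [delete, add]


# Dry run: print the commands for inspection only.
for argv in rebuild_peer_commands("192.168.1.10@tcp", "10.0.0.10@o2ib"):
    print(shlex.join(argv))
```

Printing instead of executing is deliberate, given the file-system-wide hiccups described above.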

So the solutions to this situation that I can think of are:

  • allow lnetctl to remove primary NIDs from peer entries
  • dynamically update a peer entry if a peer reconnects with a different configuration
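
The second proposal can be illustrated with a toy in-memory model. This is not how LNet stores peers internally; the `PeerTable` class and its methods are hypothetical and only show the requested behavior: when a known peer reconnects and advertises a different NID list, the old entry is replaced and the first advertised NID becomes the primary NID.

```python
class PeerTable:
    """Toy model of a peer table; not LNet's actual data structure."""

    def __init__(self):
        # Maps every known NID to the (shared) NID list of its peer.
        self._by_nid = {}

    def add(self, nids):
        entry = list(nids)
        for nid in entry:
            self._by_nid[nid] = entry
        return entry

    def knows(self, nid):
        return nid in self._by_nid

    def primary(self, nid):
        # In this model the first NID of an entry is the primary NID.
        return self._by_nid[nid][0]

    def on_reconnect(self, advertised):
        """Replace the entry of a known peer with its advertised NID list."""
        old = next((self._by_nid[n] for n in advertised if n in self._by_nid),
                   None)
        if old is not None:
            # Drop every NID of the stale entry, including the old primary.
            for nid in old:
                self._by_nid.pop(nid, None)
        return self.add(advertised)


table = PeerTable()
table.add(["192.168.1.10@tcp", "10.0.0.10@o2ib"])   # misconfigured client
table.on_reconnect(["10.0.0.10@o2ib"])              # client after the fix
print(table.primary("10.0.0.10@o2ib"))
print(table.knows("192.168.1.10@tcp"))
```

In this model the stale @tcp NID disappears as soon as the client reconnects with only its @o2ib NID, which is the behavior the request asks for.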

Is one or the other possible or is a primary NID more than just "the first interface for that peer"?

Is there another way to remove wrong entries in a Lustre server's peer configuration (other than rebooting)?

 

Thanks,
Uwe



 Comments   
Comment by Peter Jones [ 07/Jul/21 ]

As none of the cloud offerings that use this project use Lustre 2.14, I am guessing that this issue is intended to be in the LU project and will move it accordingly.

Comment by Uwe Sauter [ 07/Jul/21 ]

Yes, that was my mistake while creating the ticket. Thank you.

Comment by Uwe Sauter [ 07/Jul/21 ]

I must also correct my assumption that I was using 2.14; this was actually 2.12.6 as modified by DDN.

 

Comment by Peter Jones [ 07/Jul/21 ]

Then please open a ticket through DDN support channels. If the bug affects the community releases then the fix will be upstreamed in due course.
