[LU-14827] Allow Lnet peer entries to be updated if peer's NIDs change Created: 07/Jul/21 Updated: 07/Jul/21 Resolved: 07/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Uwe Sauter | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Description |
|
My knowledge of Lustre is limited, so please correct me where necessary.
Imagine the following situation: you have a Lustre (2.14) file system running and the clients can access it. Now you want to resolve some issues with the startup order on the clients, and in doing so you get the order wrong in which the lnet and lustre modules are loaded and configured. In my particular case the lustre module was loaded before the LNet configuration for InfiniBand was done, so the lustre module configured an LNet network on Ethernet, even though there is no Ethernet connection between the clients and the Lustre servers. This resulted in two NIs (@tcp and @o2ib) being configured per client, with @tcp as the primary NID. The Lustre servers will happily accept these peer configurations, but Lustre operation gets slower because the servers try to reach the clients via @tcp first.
After I spotted that mistake and corrected the order in which LNet is configured and the lustre module is loaded, the clients get only one NI configured (@o2ib), which naturally is the primary NID. But the Lustre servers do not update the LNet peer entries they have already discovered and keep @tcp as the primary NID for the clients. Thus the servers still try to connect to the clients using @tcp.
A fool's resolution would be to just remove the peer entry on the Lustre servers and immediately add back a correct entry. But this leads to hiccups that affect the whole file system, possibly leading to reboots of the Lustre servers.
So the solutions to this situation that I can think of are:
Is one or the other possible or is a primary NID more than just "the first interface for that peer"? Is there another way to remove wrong entries in a Lustre server's peer configuration (other than rebooting)?
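To make the question concrete, here is a toy model in Python of the behaviour I am asking about. It is only a sketch under my own assumption that the primary NID is simply the first NID recorded for a peer; it is not LNet code. The names `PeerEntry`, `is_stale` and `refresh` and the NID addresses are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeerEntry:
    """Toy stand-in for a server-side peer entry: an ordered list of NIDs."""
    nids: List[str] = field(default_factory=list)

    @property
    def primary_nid(self) -> Optional[str]:
        # Assumption from this ticket: the primary NID is the first NID recorded.
        return self.nids[0] if self.nids else None


def is_stale(entry: PeerEntry, advertised: List[str]) -> bool:
    """Return True if the cached primary NID is no longer advertised by the peer."""
    return entry.primary_nid not in advertised


def refresh(entry: PeerEntry, advertised: List[str]) -> PeerEntry:
    """Hypothetical in-place update: replace the cached NIDs with the advertised ones."""
    entry.nids = list(advertised)
    return entry


# Made-up addresses: the server cached the entry while the client still had @tcp.
cached = PeerEntry(["10.0.0.5@tcp", "10.1.0.5@o2ib"])
advertised_now = ["10.1.0.5@o2ib"]

if is_stale(cached, advertised_now):
    refresh(cached, advertised_now)
print(cached.primary_nid)
```

In this model the stale @tcp entry is detected and replaced without the entry ever being removed, which is the kind of update I would hope the servers could perform.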
Thanks,
Uwe |
| Comments |
| Comment by Peter Jones [ 07/Jul/21 ] |
|
As none of the cloud offerings that use this project use Lustre 2.14, I am guessing that this issue was intended for the LU project and will move it accordingly. |
| Comment by Uwe Sauter [ 07/Jul/21 ] |
|
Yes, that was my mistake while creating the ticket. Thank you. |
| Comment by Uwe Sauter [ 07/Jul/21 ] |
|
I must also correct my assumption that I was using 2.14; this was actually 2.12.6 as modified by DDN.
|
| Comment by Peter Jones [ 07/Jul/21 ] |
|
Then please open a ticket through DDN support channels. If the bug affects the community releases then the fix will be upstreamed in due course. |