[LU-16823] add LNet and OBD connect flags for IPv6 peers Created: 11/May/23 Updated: 07/Jan/24 Resolved: 03/Jan/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | IPv6 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When nodes are connecting to peers, the sender should set a bit in the connection request indicating that it supports IPv6 (large) NIDs. This would inform LNet discovery whether it is safe to reply with large NIDs in connection or ping replies. At the Lustre level, new clients should set an OBD_CONNECT2_LARGE_NID=0x100000000ULL flag in obd_connect_data so that the MGS knows whether it can safely reply with large NIDs in mgs_nidtbl_entry. That avoids the need to backport a patch (ala |
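A minimal sketch of the intended negotiation, assuming the usual connect-flag pattern (OBD_CONNECT2_LARGE_NID, struct obd_connect_data, exp_connect_flags2() and mgs_nidtbl_entry are existing names from the description; the helper functions and call sites below are illustrative assumptions, not the landed change):
/* Sketch only: how the flag could be advertised and checked */
#define OBD_CONNECT2_LARGE_NID	0x100000000ULL

/* client side: advertise large-NID support in the connect data */
static void client_set_large_nid_flag(struct obd_connect_data *ocd)
{
	ocd->ocd_connect_flags |= OBD_CONNECT_FLAGS2;
	ocd->ocd_connect_flags2 |= OBD_CONNECT2_LARGE_NID;
}

/* MGS side: only include large NIDs in mgs_nidtbl_entry replies when the
 * connecting client has advertised support for them
 */
static bool mgs_client_supports_large_nid(struct obd_export *exp)
{
	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LARGE_NID);
}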
| Comments |
| Comment by Andreas Dilger [ 11/May/23 ] |
|
I haven't looked into all of the details here, but this would potentially allow a better alternative to patching old clients to ignore large NIDs. |
| Comment by James A Simmons [ 17/Jun/23 ] |
|
https://review.whamcloud.com/#/c/fs/lustre-release/+/51108 covers the goal of this ticket. |
| Comment by James A Simmons [ 20/Jun/23 ] |
|
While 51108 landed, we still need one more patch here to communicate that the MGS supports large NIDs. |
| Comment by Andreas Dilger [ 21/Jun/23 ] |
|
James, adding the handling for the OBD_CONNECT2_LARGE_NID feature is relatively straightforward:
When the client and MGS support for handling large NIDs is finished, a patch should be pushed that:
We don't want to land the "supported" patch until the code is (substantially) working; otherwise a client or server might advertise support for this feature but not actually work yet. |
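Very roughly, the "supported" patch would follow the usual connect-flag masking pattern, so the feature only becomes active once both sides actually have working large-NID code. A sketch under that assumption (MGS_CONNECT_SUPPORTED2 below is an assumed name patterned on the existing *_CONNECT_SUPPORTED masks, not necessarily the landed change):
/* Sketch only: the target masks the client-requested flags against what it
 * actually supports, so OBD_CONNECT2_LARGE_NID only ends up on the export
 * when both peers advertise it
 */
#define MGS_CONNECT_SUPPORTED2	(OBD_CONNECT2_LARGE_NID)

static void mgs_mask_connect_flags(struct obd_connect_data *data)
{
	data->ocd_connect_flags2 &= MGS_CONNECT_SUPPORTED2;
}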
| Comment by James A Simmons [ 08/Sep/23 ] |
|
I was looking at adding the final touches for this work and I found one place where it's not an easy replacement. In lmv_setup() we call LNetGetId(), which only returns small (IPv4) NIDs. I don't see an easy way to get the connect flag here. Any suggestions? Should we move to another setup function that has an export as a parameter? |
| Comment by Andreas Dilger [ 08/Sep/23 ] |
|
James, are you referring to this hunk of code to initialize the lmv_qos_rr_index starting value:
/*
 * initialize rr_index to lower 32bit of netid, so that client
 * can distribute subdirs evenly from the beginning.
 */
while (LNetGetId(i++, &lnet_id, false) != -ENOENT) {
	if (!nid_is_lo0(&lnet_id.nid)) {
		lmv->lmv_qos_rr_index = ntohl(lnet_id.nid.nid_addr[0]);
		break;
	}
}
That code doesn't really need the full NID. The main goal is that each client is initialized to a well-balanced starting value (instead of 0) so that clients don't all start creating subdirectories on MDT0000 and move in lockstep across MDTs. Using the NID is deterministic and likely gives a more uniform distribution than a purely random number, because it normally only changes in the low bits among nearby clients in a single job. It could use any other kind of value that changes slowly from client to client, but it would need to be available early during the mount process.
I don't know whether IPv6 NIDs have the same property, or whether they have so much "random" stuff in them that they would be wildly imbalanced. We could potentially have the MDS assign each client a sequential "client number" for this purpose in the mount reply, but that might be imbalanced for clients that are allocated into the same job, because it would only be uniform "globally" (though still better than a purely random number). In any case, lmv_qos_rr_index doesn't have to be perfect, as it drifts over time anyway, but it avoids the "thundering herd" problem at initial mount time.
A similar solution is needed for lmv_select_statfs_mdt() to select an MDT to send MDS_STATFS RPCs to:
/* choose initial MDT for this client */
for (i = 0;; i++) {
	struct lnet_processid lnet_id;

	if (LNetGetId(i, &lnet_id, false) == -ENOENT)
		break;

	if (!nid_is_lo0(&lnet_id.nid)) {
		/* We don't need a full 64-bit modulus, just enough
		 * to distribute the requests across MDTs evenly.
		 */
		lmv->lmv_statfs_start = nidhash(&lnet_id.nid) %
					lmv->lmv_mdt_count;
		break;
	}
}
and it probably makes sense that these both use the same mechanism instead of fetching the NIDs each time. |
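For example, a single helper along these lines could fetch the first non-loopback NID once and derive both starting values from it (a sketch only; lmv_init_start_values() is a made-up name, while LNetGetId(), nid_is_lo0() and nidhash() are the helpers already used in the hunks above):
/* Sketch only: derive both starting values from one NID lookup */
static void lmv_init_start_values(struct lmv_obd *lmv)
{
	struct lnet_processid lnet_id;
	u64 hash;
	int i = 0;

	while (LNetGetId(i++, &lnet_id, false) != -ENOENT) {
		if (nid_is_lo0(&lnet_id.nid))
			continue;

		/* nidhash() covers both small and large NIDs */
		hash = nidhash(&lnet_id.nid);
		lmv->lmv_qos_rr_index = (u32)hash;
		lmv->lmv_statfs_start = (u32)hash % lmv->lmv_mdt_count;
		break;
	}
}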
| Comment by James A Simmons [ 11/Sep/23 ] |
|
The tendency is for IPv6 addresses to be very random on a network. I need to think about a solution for this. Looking at lmv_setup(), it is processing an lcfg that contains an lmv_desc. Perhaps in mgs_llog.c we can create a record for lmv_desc with new info for the index to be used? |
| Comment by Gerrit Updater [ 10/Dec/23 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53398 |
| Comment by Gerrit Updater [ 03/Jan/24 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53398/ |
| Comment by James A Simmons [ 03/Jan/24 ] |
|
Work is complete. |