James, are you referring to this hunk of code to initialize the lmv_qos_rr_index starting value:
	/*
	 * initialize rr_index to lower 32bit of netid, so that client
	 * can distribute subdirs evenly from the beginning.
	 */
	while (LNetGetId(i++, &lnet_id, false) != -ENOENT) {
		if (!nid_is_lo0(&lnet_id.nid)) {
			lmv->lmv_qos_rr_index = ntohl(lnet_id.nid.nid_addr[0]);
			break;
		}
	}
That code doesn't really need the full NID. The main goal is that each client is initialized to a well-balanced starting value (instead of 0) so that clients don't all start creating subdirectories on MDT0000 and then move in lockstep across the MDTs. Using the NID is deterministic and likely gives us a more uniform distribution than a purely random number, because normally only the low bits change among nearby clients in a single job. It could use any other kind of value that changes slowly from one client to the next, but it would need to be available early during the mount process.
I don't know whether IPv6 NIDs have the same property, or whether they contain so many "random" bits that the starting values would be wildly imbalanced.
We could potentially have the MDS assign each client a sequential "client number" for this purpose in the mount reply, but that might be imbalanced for clients that are allocated into the same job because it would only "globally" be uniform (though still better than a purely random number).
In any case, the lmv_qos_rr_index doesn't have to be perfect, as it drifts over time anyway, but it avoids the "thundering herd" problem at initial mount time.
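To illustrate the point about lockstep, here is a small user-space sketch (not Lustre code). It assumes that the first MDT a client picks is simply its starting index modulo the MDT count, and that clients in a job have consecutive 32-bit addresses. The names first_mdt() and count_first_mdt() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the first MDT chosen by a client whose
 * round-robin index starts at rr_start (assumes a plain modulus).
 */
static unsigned int first_mdt(uint32_t rr_start, unsigned int mdt_count)
{
	assert(mdt_count > 0);
	return rr_start % mdt_count;
}

/* Count how many of nr_clients clients pick each MDT first.
 * Clients are assumed to have consecutive addresses starting at
 * base_addr. If seed_from_addr is zero, every client starts at
 * index 0, which models the uninitialized case.
 * hits[] must have room for mdt_count entries.
 */
static void count_first_mdt(uint32_t base_addr, unsigned int nr_clients,
			    unsigned int mdt_count, unsigned int *hits,
			    int seed_from_addr)
{
	unsigned int i;

	for (i = 0; i < mdt_count; i++)
		hits[i] = 0;

	for (i = 0; i < nr_clients; i++) {
		uint32_t start = seed_from_addr ? base_addr + i : 0;

		hits[first_mdt(start, mdt_count)]++;
	}
}
```

With 8 clients and 4 MDTs, seeding from the address puts 2 clients on each MDT for the first create, while starting everyone at 0 puts all 8 on the first MDT.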
A similar solution is needed for lmv_select_statfs_mdt() to select an MDT to send MDS_STATFS RPCs to:
	for (i = 0;; i++) {
		struct lnet_processid lnet_id;

		if (LNetGetId(i, &lnet_id, false) == -ENOENT)
			break;

		if (!nid_is_lo0(&lnet_id.nid)) {
			/* We don't need a full 64-bit modulus, just enough
			 * to distribute the requests across MDTs evenly.
			 */
			lmv->lmv_statfs_start = nidhash(&lnet_id.nid) %
						lmv->lmv_mdt_count;
			break;
		}
	}
and it probably makes sense that these both use the same mechanism instead of fetching the NIDs each time.
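As a rough sketch of what a shared mechanism could look like, the user-space example below walks a NID list once, skips loopback entries, and derives both starting values from the same seed. Everything here is hypothetical: struct demo_nid, struct demo_lmv, and demo_seed_from_nids() are stand-ins for illustration, and the plain modulus replaces nidhash() only to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a NID: one 32-bit address word in
 * network byte order, plus a flag marking the loopback interface.
 */
struct demo_nid {
	uint32_t addr_be;
	int is_loopback;
};

/* Hypothetical stand-in for the fields of interest in the LMV. */
struct demo_lmv {
	uint32_t rr_index;	/* starting value for round-robin */
	uint32_t statfs_start;	/* first MDT to send statfs to */
	uint32_t mdt_count;
};

/* Scan the NIDs once and seed both starting values from the first
 * non-loopback entry. Returns 0 on success, or -1 if no usable NID
 * was found or mdt_count is zero.
 */
static int demo_seed_from_nids(struct demo_lmv *lmv,
			       const struct demo_nid *nids, size_t count)
{
	size_t i;

	assert(lmv != NULL);
	if (lmv->mdt_count == 0)
		return -1;

	for (i = 0; i < count; i++) {
		uint32_t seed;

		if (nids[i].is_loopback)
			continue;

		seed = ntohl(nids[i].addr_be);
		lmv->rr_index = seed;
		lmv->statfs_start = seed % lmv->mdt_count;
		return 0;
	}

	return -1;
}
```

The point of the sketch is only that the NID lookup happens in one place, so a later change of seed source (for example for IPv6 NIDs) would affect both users consistently.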