Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
A multi-rail peer can have multiple local NIDs, but LNetDist() will only identify a NID as local if it belongs to the first NI on that net returned by lnet_get_next_ni_locked().
Here's the code:
while ((ni = lnet_get_next_ni_locked(NULL, ni))) {
	if (ni->ni_nid == dstnid) {
		if (srcnidp != NULL)
			*srcnidp = dstnid;
		if (orderp != NULL) {
			if (dstnid == LNET_NID_LO_0)
				*orderp = 0;
			else
				*orderp = 1;
		}
		lnet_net_unlock(cpt);
		return local_nid_dist_zero ? 0 : 1;
	}
	if (LNET_NIDNET(ni->ni_nid) == dstnet) {
		/* Check if ni was originally created in
		 * current net namespace.
		 * If not, assign order above 0xffff0000,
		 * to make this ni not a priority.
		 */
		if (current->nsproxy &&
		    !net_eq(ni->ni_net_ns, current->nsproxy->net_ns))
			order += 0xffff0000;
		if (srcnidp != NULL)
			*srcnidp = ni->ni_nid;
		if (orderp != NULL)
			*orderp = order;
		lnet_net_unlock(cpt);
		return 1;
	}
	order++;
}
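If a peer has two NIDs on the same net, x@o2ib and y@o2ib, then LNetDist() will return 0 for one of the NIDs and 1 for the other, even though both NIDs are local. The same-net check on the first NI returns before the loop ever reaches the NI whose NID matches exactly.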
This is evidenced by lctl which_nid always returning the first NI that is configured, regardless of the order of its arguments:
sles15c01:~ # lctl list_nids
192.168.2.38@tcp
192.168.2.39@tcp
sles15c01:~ # lctl which_nid 192.168.2.38@tcp 192.168.2.39@tcp
192.168.2.38@tcp
sles15c01:~ # lctl which_nid 192.168.2.39@tcp 192.168.2.38@tcp
192.168.2.38@tcp
sles15c01:~ #
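The first-match behavior described above can be reproduced in a small user-space model. The sketch below is not LNet code: struct model_nid, dist_single_pass() and dist_two_pass() are hypothetical names invented for illustration, the model assumes local_nid_dist_zero is set, and it returns -1 where the real function would go on to consult routes. dist_two_pass() shows one possible way to avoid the problem (check every NI for an exact match before considering same-net matches); it is not necessarily the change that resolved this ticket.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a NID: a (net, addr) pair. Not the LNet type. */
struct model_nid {
	uint32_t net;
	uint32_t addr;
};

static int nid_eq(struct model_nid a, struct model_nid b)
{
	return a.net == b.net && a.addr == b.addr;
}

/*
 * Mirrors the shape of the loop quoted in the description: the exact-NID
 * test and the same-net test both run per NI, so a same-net hit on an
 * earlier NI returns 1 before a later NI can match dst exactly.
 * Returns 0 for a local NID, 1 for a NID on a local net, -1 otherwise.
 */
static int dist_single_pass(const struct model_nid *nis, size_t n,
			    struct model_nid dst)
{
	for (size_t i = 0; i < n; i++) {
		if (nid_eq(nis[i], dst))
			return 0;
		if (nis[i].net == dst.net)
			return 1;
	}
	return -1;
}

/*
 * Assumed alternative: look for an exact match across all NIs first,
 * and only then fall back to the same-net test.
 */
static int dist_two_pass(const struct model_nid *nis, size_t n,
			 struct model_nid dst)
{
	for (size_t i = 0; i < n; i++)
		if (nid_eq(nis[i], dst))
			return 0;

	for (size_t i = 0; i < n; i++)
		if (nis[i].net == dst.net)
			return 1;

	return -1;
}
```

With two modeled NIs on the same net, for example {1, 38} and {1, 39}, dist_single_pass() returns 0 for the first and 1 for the second, matching the which_nid output above, while dist_two_pass() returns 0 for both.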