[LU-14649] LNetDist() may not return 0 for local NID Created: 28/Apr/21  Updated: 22/Jun/21  Resolved: 22/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A multi-rail peer can have multiple local NIDs, but LNetDist() only identifies a NID as local if it belongs to the first NI on its net returned by lnet_get_next_ni_locked().

Here's the code:

        while ((ni = lnet_get_next_ni_locked(NULL, ni))) {
                if (ni->ni_nid == dstnid) {
                        if (srcnidp != NULL)
                                *srcnidp = dstnid;
                        if (orderp != NULL) {
                                if (dstnid == LNET_NID_LO_0)
                                        *orderp = 0;
                                else
                                        *orderp = 1;
                        }
                        lnet_net_unlock(cpt);

                        return local_nid_dist_zero ? 0 : 1;
                }

                if (LNET_NIDNET(ni->ni_nid) == dstnet) {
                        /* Check if ni was originally created in
                         * current net namespace.
                         * If not, assign order above 0xffff0000,
                         * to make this ni not a priority. */
                        if (current->nsproxy &&
                            !net_eq(ni->ni_net_ns, current->nsproxy->net_ns))
                                        order += 0xffff0000;
                        if (srcnidp != NULL)
                                *srcnidp = ni->ni_nid;
                        if (orderp != NULL)
                                *orderp = order;
                        lnet_net_unlock(cpt);
                        return 1;
                }

                order++;
        }

If a peer has two NIDs on the same net, x@o2ib and y@o2ib, then LNetDist() will return 0 for one of the NIDs and 1 for the other, even though both NIDs are local.

This is evidenced by lctl which_nid always returning the first NI that is configured, regardless of the order of arguments:

sles15c01:~ # lctl list_nids
192.168.2.38@tcp
192.168.2.39@tcp
sles15c01:~ # lctl which_nid 192.168.2.38@tcp 192.168.2.39@tcp
192.168.2.38@tcp
sles15c01:~ # lctl which_nid 192.168.2.39@tcp 192.168.2.38@tcp
192.168.2.38@tcp
sles15c01:~ #
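
The behaviour can be reproduced outside the kernel with a simplified model. The sketch below is not the patch that landed for this ticket: the nid_t type, the MK_NID/NID_NET macros and both function names are hypothetical stand-ins, the NI list is a plain array, and local_nid_dist_zero is assumed to be set. It contrasts the single-pass loop quoted in the description with one possible alternative that looks for an exact NID match across all NIs before falling back to a same-net match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for LNet types; NOT the real kernel definitions. */
typedef uint64_t nid_t;
#define MK_NID(net, addr) (((nid_t)(net) << 32) | (uint32_t)(addr))
#define NID_NET(nid)      ((uint32_t)((nid) >> 32))

/*
 * Model of the loop quoted in the ticket: one pass over the local NIs.
 * A same-net NI that appears before the exact match returns distance 1,
 * so a later local NID is never recognised as local.
 * Returns -1 when no local NI matches (the real code goes on to routing).
 */
static int dist_single_pass(const nid_t *nis, size_t n, nid_t dst, nid_t *src)
{
	for (size_t i = 0; i < n; i++) {
		if (nis[i] == dst) {
			if (src != NULL)
				*src = dst;
			return 0;	/* assumes local_nid_dist_zero is set */
		}
		if (NID_NET(nis[i]) == NID_NET(dst)) {
			if (src != NULL)
				*src = nis[i];
			return 1;
		}
	}
	return -1;
}

/*
 * Hypothetical alternative: scan every NI for an exact match first,
 * and only then fall back to the first NI on the same net.
 */
static int dist_two_pass(const nid_t *nis, size_t n, nid_t dst, nid_t *src)
{
	for (size_t i = 0; i < n; i++) {
		if (nis[i] == dst) {
			if (src != NULL)
				*src = dst;
			return 0;
		}
	}
	for (size_t i = 0; i < n; i++) {
		if (NID_NET(nis[i]) == NID_NET(dst)) {
			if (src != NULL)
				*src = nis[i];
			return 1;
		}
	}
	return -1;
}
```

With the two NIs from the transcript modelled as MK_NID(1, 38) and MK_NID(1, 39), dist_single_pass() returns 0 for the first NID and 1 for the second, while dist_two_pass() returns 0 for both.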


 Comments   
Comment by Gerrit Updater [ 29/Apr/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43498
Subject: LU-14649 lnet: Correct distance calculation of local NIDs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a95973cc5ab1c121f2603f61697ba044327dcf20

Comment by Gerrit Updater [ 21/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43498/
Subject: LU-14649 lnet: Correct distance calculation of local NIDs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3b263dd80ee56efae922e2cfcab375dbe2cb273a

Comment by Peter Jones [ 22/Jun/21 ]

Landed for 2.15
