[LU-15478] Regression in 005bd7075c LU-10391 lnet: Change lnet_send() to take large-addr nids
Created: 24/Jan/22  Updated: 07/Jan/24  Resolved: 31/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Blocker
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Fixed
Votes: 0
Labels: IPv6

Issue Links: Related: is related to LU-10391 LNET: Support IPv6 (Reopened)
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Routed sends that do not specify a source NID (source-any sends) were broken by https://review.whamcloud.com/43599

lnet_handle_find_routed_path() calls lnet_find_route_locked() passing LNET_NID_NET(src_nid) as an argument.

                best_route = lnet_find_route_locked(best_rnet,
                                                    LNET_NID_NET(src_nid),
                                                    sd->sd_best_lpni,
                                                    &last_route, &gwni);

This network ID is in turn passed to lnet_find_best_lpni() where it is compared against LNET_NET_ANY:

static inline struct lnet_peer_ni *
lnet_find_best_lpni(struct lnet_ni *lni, lnet_nid_t dst_nid,
                    struct lnet_peer *peer, __u32 net_id)
{
        struct lnet_peer_net *peer_net;

        /* find the best_lpni on any local network */
        if (net_id == LNET_NET_ANY) {

Where:

#define LNET_NET_ANY LNET_NIDNET(LNET_NID_ANY)
 == LNET_NIDNET(-1)
 == 0xffffffff

When no source NID is specified, the network ID passed to lnet_find_best_lpni() is equal to

LNET_NID_NET(&LNET_ANY_NID)

Where:

static inline __u32 LNET_NID_NET(const struct lnet_nid *nid)
{
        return LNET_MKNET(nid->nid_type, __be16_to_cpu(nid->nid_num));
}
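
To make the mismatch concrete, here is a minimal user-space sketch. It rests on two assumptions that are not quoted in this ticket: that LNET_ANY_NID sets every field of the NID to all-ones, and that LNET_MKNET(type, num) puts the type above bit 16 and the number in the low 16 bits. The struct layout and every sketch_ name are stand-ins for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct lnet_nid; the field layout is an assumption. */
struct sketch_nid {
	uint8_t  nid_size;
	uint8_t  nid_type;
	uint16_t nid_num;	/* big-endian in LNet; all-ones reads the same in either order */
	uint32_t nid_addr[4];
};

/* Assumed shape of LNET_MKNET(): type above bit 16, number in the low 16 bits. */
#define SKETCH_MKNET(type, num) ((((uint32_t)(type)) << 16) | ((uint32_t)(num)))

/* LNET_NET_ANY as expanded in the description. */
#define SKETCH_NET_ANY 0xffffffffu

/* Assumed contents of LNET_ANY_NID: every field all-ones. */
static const struct sketch_nid sketch_any_nid = {
	0xFF, 0xFF, 0xFFFF,
	{ 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu }
};

/* Mirrors LNET_NID_NET() from the description, without the byte swap. */
static inline uint32_t sketch_nid_net(const struct sketch_nid *nid)
{
	return SKETCH_MKNET(nid->nid_type, nid->nid_num);
}

/* Aborts unless the "any" NID misses LNET_NET_ANY under these assumptions. */
static inline void sketch_show_mismatch(void)
{
	uint32_t net = sketch_nid_net(&sketch_any_nid);

	assert(net == 0x00ffffffu);	/* (0xFF << 16) | 0xFFFF */
	assert(net != SKETCH_NET_ANY);	/* so net_id == LNET_NET_ANY is false */
}
```

Under these assumptions the 8-bit type can only fill bits 16 to 23, so the top byte of the result stays zero and the value can never equal 0xffffffff.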

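That value is not equal to LNET_NET_ANY, so the net_id == LNET_NET_ANY check in lnet_find_best_lpni() fails for source-any sends. I think we need an "extended nid" version of LNET_NET_ANY (call it LNET_ANY_NET) such that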

#define LNET_ANY_NET LNET_NID_NET(&LNET_ANY_NID)

or LNET_NID_NET could be modified to check for LNET_ANY_NID and return LNET_NET_ANY.
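
The second option could look roughly like the following user-space sketch. It is not the merged patch: the struct layout, the all-ones contents of the "any" NID, the shape of the MKNET macro, and the sketch_nid_is_any() helper are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct lnet_nid; the field layout is an assumption. */
struct sketch_nid {
	uint8_t  nid_size;
	uint8_t  nid_type;
	uint16_t nid_num;	/* kept in host order here; LNet stores it big-endian */
	uint32_t nid_addr[4];
};

/* Assumed shape of LNET_MKNET(): type above bit 16, number in the low 16 bits. */
#define SKETCH_MKNET(type, num) ((((uint32_t)(type)) << 16) | ((uint32_t)(num)))

/* LNET_NET_ANY as expanded in the description. */
#define SKETCH_NET_ANY 0xffffffffu

/* Assumed contents of LNET_ANY_NID: every field all-ones. */
static const struct sketch_nid sketch_any_nid = {
	0xFF, 0xFF, 0xFFFF,
	{ 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu }
};

/* An ordinary NID with type 2 and number 1; values chosen only for illustration. */
static const struct sketch_nid sketch_plain_nid = { 0, 2, 1, { 0, 0, 0, 0 } };

/* Hypothetical "is any" test: size, type and number are all-ones. */
static inline int sketch_nid_is_any(const struct sketch_nid *nid)
{
	return nid->nid_size == 0xFF && nid->nid_type == 0xFF &&
	       nid->nid_num == 0xFFFF;
}

/* LNET_NID_NET() variant that maps the "any" NID onto LNET_NET_ANY. */
static inline uint32_t sketch_nid_net(const struct sketch_nid *nid)
{
	if (sketch_nid_is_any(nid))
		return SKETCH_NET_ANY;
	return SKETCH_MKNET(nid->nid_type, nid->nid_num);
}

/* Aborts unless both the special case and the ordinary case behave as intended. */
static inline void sketch_self_check(void)
{
	assert(sketch_nid_net(&sketch_any_nid) == SKETCH_NET_ANY);
	assert(sketch_nid_net(&sketch_plain_nid) == 0x00020001u);
}
```

Handling the special case inside the accessor means existing comparisons against LNET_NET_ANY keep working without a second constant, whereas the first option would require callers to compare against the new LNET_ANY_NET instead.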



Comments
Comment by Gerrit Updater [ 24/Jan/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46292
Subject: LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7e21df0eaaf29326e51a2dc5dfccff1689adb9e1

Comment by Gerrit Updater [ 31/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46292/
Subject: LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fbbc1258a057ff718dd9ba41dc32faf2aadc3a90

Comment by Peter Jones [ 31/Jan/22 ]

Landed for 2.15

Comment by Chris Horn [ 31/Jan/22 ]

Test report for LU-15478

Build without the fix:

[hornc@ct7-adm lustre-filesystem]$ git reset --hard 78be823f33
HEAD is now at 78be823f33 LU-15218 quota: delete unused quota ID
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...

Show bug:

[root@ct7-adm tests]# lctl list_nids
10.73.10.10@tcp
[root@ct7-adm tests]# lctl show_route
net               tcp1 hops 4294967295 gw                  10.73.10.11@tcp up pri 0
[root@ct7-adm tests]# lctl ping 10.73.10.12@tcp1
failed to ping 10.73.10.12@tcp1: Input/output error
[root@ct7-adm tests]# dmesg | tail
[11269.739430] Lustre: DEBUG MARKER: == sanity-lnet test complete, duration 7 sec ============= 01:17:25 (1643419045)
[11270.753951] LNet: Removed LNI 10.73.10.10@tcp
[11271.691683] LNet: Removed LNI 10.73.10.10@tcp1
[12797.050427] LNet: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 1
[12797.066287] alg: No test for adler32 (adler32-zlib)
[12797.853703] Lustre: DEBUG MARKER: Starting lnet
[12797.902831] LNet: Added LNI 10.73.10.10@tcp [8/256/0/180]
[12797.903911] LNet: Accept secure, port 988
[12797.920874] LNet: 30464:0:(router.c:798:lnet_add_route()) Use hops = 1 for a single-hop route when avoid_asym_router_failure feature is enabled
[12815.937019] LNetError: 30522:0:(lib-move.c:2285:lnet_handle_find_routed_path()) no route to 10.73.10.12@tcp1 from <?>
[root@ct7-adm tests]#

Apply fix:

[hornc@ct7-adm lustre-filesystem]$ git reset --hard fbbc1258a0
HEAD is now at fbbc1258a0 LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...

Now ping works as expected:

[root@ct7-adm tests]# lctl ping 10.73.10.12@tcp1
12345-0@lo
12345-10.73.10.12@tcp1
[root@ct7-adm tests]#