[LU-15478] Regression in 005bd7075c LU-10391 lnet: Change lnet_send() to take large-addr nids
| Created: 24/Jan/22 | Updated: 07/Jan/24 | Resolved: 31/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | IPv6 |
| Severity: | 3 |
| Description |
|
Routed, source-any sends were broken by https://review.whamcloud.com/43599

lnet_handle_find_routed_path() calls lnet_find_route_locked(), passing LNET_NID_NET(src_nid) as an argument:

    best_route = lnet_find_route_locked(best_rnet,
                                        LNET_NID_NET(src_nid),
                                        sd->sd_best_lpni,
                                        &last_route, &gwni);

This network ID is in turn passed to lnet_find_best_lpni(), where it is compared against LNET_NET_ANY:

    static inline struct lnet_peer_ni *
    lnet_find_best_lpni(struct lnet_ni *lni, lnet_nid_t dst_nid,
                        struct lnet_peer *peer, __u32 net_id)
    {
            struct lnet_peer_net *peer_net;
            /* find the best_lpni on any local network */
            if (net_id == LNET_NET_ANY) {
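
where

    #define LNET_NET_ANY LNET_NIDNET(LNET_NID_ANY)

which evaluates to LNET_NIDNET(-1) == 0xffffffff.

When no source NID is specified, the network ID passed to lnet_find_best_lpni() is instead LNET_NID_NET(&LNET_ANY_NID), where:

    static inline __u32 LNET_NID_NET(const struct lnet_nid *nid)
    {
            return LNET_MKNET(nid->nid_type, __be16_to_cpu(nid->nid_num));
    }

That value is built by LNET_MKNET() from the NID's type and number fields rather than by LNET_NIDNET(), so the comparison against LNET_NET_ANY above does not match for source-any sends.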

I think we need an "extended NID" version of LNET_NET_ANY (call it LNET_ANY_NET) such that

    #define LNET_ANY_NET LNET_NID_NET(&LNET_ANY_NID)

Alternatively, LNET_NID_NET could be modified to check for LNET_ANY_NID and return LNET_NET_ANY. |
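
The mismatch and the second option can be sketched in standalone C. This is only an illustration, not the Lustre source: every `sk_`/`SK_` name is hypothetical, the struct is a simplified stand-in for `struct lnet_nid`, and the all-ones value used for the "any" NID is an assumption. The byte swap done by `__be16_to_cpu()` is left out, so `nid_num` is treated as host order here.

```c
#include <stdint.h>

/* Hypothetical stand-in for LNET_MKNET(): type in the high 16 bits, number in the low 16. */
#define SK_MKNET(type, num) ((((uint32_t)(type)) << 16) | (uint32_t)(num))

/* LNET_NET_ANY as given in the description: 0xffffffff. */
#define SK_NET_ANY 0xffffffffu

/* Simplified stand-in for struct lnet_nid (field layout is an assumption). */
struct sk_nid {
	uint8_t  nid_size;
	uint8_t  nid_type;
	uint16_t nid_num;	/* host order in this sketch */
	uint32_t nid_addr[4];
};

/* Assumed "any" NID: every field set to all-ones. */
static const struct sk_nid sk_any_nid = {
	0xff, 0xff, 0xffff, { ~0u, ~0u, ~0u, ~0u }
};

/* Hypothetical equivalent of an "is this the any NID" check. */
static int sk_nid_is_any(const struct sk_nid *nid)
{
	return nid->nid_size == 0xff &&
	       nid->nid_type == 0xff &&
	       nid->nid_num == 0xffff;
}

/* Behaviour described in the report: the net is always built from type and number. */
static uint32_t sk_nid_net_old(const struct sk_nid *nid)
{
	return SK_MKNET(nid->nid_type, nid->nid_num);
}

/* Second option from the report: map the "any" NID to the "any" net explicitly. */
static uint32_t sk_nid_net_fixed(const struct sk_nid *nid)
{
	if (sk_nid_is_any(nid))
		return SK_NET_ANY;
	return SK_MKNET(nid->nid_type, nid->nid_num);
}
```

Under these assumptions, `sk_nid_net_old(&sk_any_nid)` can never equal `SK_NET_ANY`, because an 8-bit type shifted left by 16 leaves the top byte zero, while `sk_nid_net_fixed(&sk_any_nid)` returns `SK_NET_ANY` and ordinary NIDs are unaffected. The title of the fix commit shown in the test report below ("Check LNET_NID_IS_ANY in LNET_NID_NET") suggests the landed change took this second approach.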
| Comments |
| Comment by Gerrit Updater [ 24/Jan/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46292 |
| Comment by Gerrit Updater [ 31/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46292/ |
| Comment by Peter Jones [ 31/Jan/22 ] |
|
Landed for 2.15 |
| Comment by Chris Horn [ 31/Jan/22 ] |
|
Test report for Build w/o the fix:

[hornc@ct7-adm lustre-filesystem]$ git reset --hard 78be823f33
HEAD is now at 78be823f33 LU-15218 quota: delete unused quota ID
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...

Show bug:

[root@ct7-adm tests]# lctl list_nids
10.73.10.10@tcp
[root@ct7-adm tests]# lctl show_route
net tcp1 hops 4294967295 gw 10.73.10.11@tcp up pri 0
[root@ct7-adm tests]# lctl ping 10.73.10.12@tcp1
failed to ping 10.73.10.12@tcp1: Input/output error
[root@ct7-adm tests]# dmesg | tail
[11269.739430] Lustre: DEBUG MARKER: == sanity-lnet test complete, duration 7 sec ============= 01:17:25 (1643419045)
[11270.753951] LNet: Removed LNI 10.73.10.10@tcp
[11271.691683] LNet: Removed LNI 10.73.10.10@tcp1
[12797.050427] LNet: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 1
[12797.066287] alg: No test for adler32 (adler32-zlib)
[12797.853703] Lustre: DEBUG MARKER: Starting lnet
[12797.902831] LNet: Added LNI 10.73.10.10@tcp [8/256/0/180]
[12797.903911] LNet: Accept secure, port 988
[12797.920874] LNet: 30464:0:(router.c:798:lnet_add_route()) Use hops = 1 for a single-hop route when avoid_asym_router_failure feature is enabled
[12815.937019] LNetError: 30522:0:(lib-move.c:2285:lnet_handle_find_routed_path()) no route to 10.73.10.12@tcp1 from <?>
[root@ct7-adm tests]#

Apply fix:

[hornc@ct7-adm lustre-filesystem]$ git reset --hard fbbc1258a0
HEAD is now at fbbc1258a0 LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...

Now ping works as expected:

[root@ct7-adm tests]# lctl ping 10.73.10.12@tcp1
12345-0@lo
12345-10.73.10.12@tcp1
[root@ct7-adm tests]# |