[LU-11143] Multi-Rail/Dynamic Discovery break LNet router checker and asymmetric route failure detection Created: 11/Jul/18 Updated: 27/Jan/23 Resolved: 27/Jan/23 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
| Description |
|
The LNet router checker needs to ping the interface defined in the route table, but Multi-Rail (MR) can choose a different interface for those pings. Reproduced on a three-node VM setup.
Router: sles15build01:/tmp # lctl list_nids
192.168.2.20@tcp99
192.168.2.20@tcp1
192.168.2.20@tcp
sles15build01:~ # lnetctl route show -v
sles15build01:~ # lnetctl peer show -v
peer:
- primary nid: 192.168.2.22@tcp99
Multi-Rail: True
peer ni:
- nid: 192.168.2.22@tcp1
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: 7
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 7
refcount: 1
statistics:
send_count: 22
recv_count: 22
drop_count: 0
- nid: 192.168.2.22@tcp99
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: 7
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 7
refcount: 1
statistics:
send_count: 18
recv_count: 18
drop_count: 0
- primary nid: 192.168.2.21@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.2.21@tcp
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: 7
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 7
refcount: 1
statistics:
send_count: 41
recv_count: 41
drop_count: 0
sles15build01:~ #
Client: sles15c01:/tmp # lctl list_nids
192.168.2.22@tcp99
192.168.2.22@tcp1
sles15c01:/tmp # lctl show_route
net tcp hops 4294967295 gw 192.168.2.20@tcp1 up pri 0
sles15c01:/tmp #
sles15c01:~ # lnetctl route show -v
route:
- net: tcp
gateway: 192.168.2.20@tcp1
hop: -1
priority: 0
state: up
sles15c01:~ # lnetctl peer show -v
peer:
- primary nid: 192.168.2.20@tcp99
Multi-Rail: True
peer ni:
- nid: 192.168.2.20@tcp1
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: 7
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 4
statistics:
send_count: 23
recv_count: 23
drop_count: 0
- nid: 192.168.2.20@tcp99
state: NA
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: 7
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 1
statistics:
send_count: 18
recv_count: 18
drop_count: 0
- nid: 192.168.2.20@tcp
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
- primary nid: 192.168.2.21@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.2.21@tcp
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sles15c01:~ #
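The client output above already contains the precondition for the problem: the route's gateway NID (192.168.2.20@tcp1) is a non-primary interface of a Multi-Rail peer whose primary NID is 192.168.2.20@tcp99. A minimal Python sketch of that check follows. It is not part of LNet; the function names and the hand-copied peer table are hypothetical, and the regular expression only assumes the `lctl show_route` line layout shown in this ticket.

```python
import re

# Matches lines such as:
#   net tcp hops 4294967295 gw 192.168.2.20@tcp1 up pri 0
ROUTE_RE = re.compile(
    r"net\s+(?P<net>\S+)\s+hops\s+(?P<hops>\d+)\s+gw\s+(?P<gw>\S+)"
    r"\s+(?P<state>up|down)\s+pri\s+(?P<pri>\d+)"
)


def parse_show_route(text):
    """Return one dict per route line found in `lctl show_route` output."""
    routes = []
    for line in text.splitlines():
        match = ROUTE_RE.search(line)
        if match:
            routes.append(match.groupdict())
    return routes


def gateways_not_primary(routes, peers):
    """List (net, gateway, primary) where the gateway is a non-primary peer NI.

    `peers` maps a primary NID to the NIDs listed under it by
    `lnetctl peer show -v` (copied by hand here, not parsed).
    """
    flagged = []
    for route in routes:
        for primary, nids in peers.items():
            if route["gw"] in nids and route["gw"] != primary:
                flagged.append((route["net"], route["gw"], primary))
    return flagged


# Data copied from the client (sles15c01) output in this ticket.
client_routes = parse_show_route(
    "net tcp hops 4294967295 gw 192.168.2.20@tcp1 up pri 0"
)
client_peers = {
    "192.168.2.20@tcp99": [
        "192.168.2.20@tcp1",
        "192.168.2.20@tcp99",
        "192.168.2.20@tcp",
    ],
    "192.168.2.21@tcp": ["192.168.2.21@tcp"],
}

print(gateways_not_primary(client_routes, client_peers))
```

For the client data this flags the tcp route, since its gateway 192.168.2.20@tcp1 belongs to the peer with primary NID 192.168.2.20@tcp99. Note that the hop count 4294967295 printed by `lctl show_route` is the unsigned 32-bit form of the `hop: -1` reported by `lnetctl route show -v`.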
Server: sles15s01:/tmp # lctl list_nids
192.168.2.21@tcp
sles15s01:/tmp # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.20@tcp up pri 0
net tcp99 hops 4294967295 gw 192.168.2.20@tcp down pri 0
sles15s01:~ # lnetctl route show -v
route:
- net: tcp1
gateway: 192.168.2.20@tcp
hop: -1
priority: 0
state: up
- net: tcp99
gateway: 192.168.2.20@tcp
hop: -1
priority: 0
state: up
sles15s01:~ # lnetctl peer show -v
peer:
- primary nid: 192.168.2.20@tcp99
Multi-Rail: True
peer ni:
- nid: 192.168.2.20@tcp
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: 7
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: 8
refcount: 5
statistics:
send_count: 42
recv_count: 42
drop_count: 0
- nid: 192.168.2.20@tcp99
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
- nid: 192.168.2.20@tcp1
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
- primary nid: 192.168.2.22@tcp99
Multi-Rail: True
peer ni:
- nid: 192.168.2.22@tcp99
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
- nid: 192.168.2.22@tcp1
state: NA
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sles15s01:~ #
Here we see that the router checker thread on the client needs to ping the @tcp1 NID, but lnet_select_pathway() chooses a different NID for the router.
00000400:00000200:2.0:1531326577.044094:0:29632:0:(router.c:1099:lnet_ping_router_locked()) Check: 12345-192.168.2.20@tcp1
00000400:00000200:2.0:1531326577.044100:0:29632:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.20@tcp1
00000400:00000200:2.0:1531326577.044213:0:29632:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.22@tcp99(192.168.2.22@tcp99:<?>) -> 192.168.2.20@tcp99(192.168.2.20@tcp1:192.168.2.20@tcp99) : GET |
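The mismatch in the TRACE line above can be spotted mechanically. The following Python sketch parses such a line and reports whether the selected destination differs from the requested one. It rests on an assumption about the log format: that in the parenthesized pair after the arrow, the first NID is the one the caller asked for and the second is the peer NI that was actually selected, which is consistent with the "Check:" and "LNetGet" lines preceding it. The field names (`req_dst`, `sel_dst`, and so on) are labels invented here, not LNet identifiers.

```python
import re

# Assumed layout of an lnet_select_pathway() TRACE line:
#   TRACE: <hdr_src>(<local_ni>:<req_src>) -> <hdr_dst>(<req_dst>:<sel_dst>) : <type>
TRACE_RE = re.compile(
    r"TRACE:\s*(?P<hdr_src>\S+?)\((?P<local_ni>[^:()]+):(?P<req_src>[^()]+)\)"
    r"\s*->\s*(?P<hdr_dst>\S+?)\((?P<req_dst>[^:()]+):(?P<sel_dst>[^()]+)\)"
    r"\s*:\s*(?P<msg_type>\w+)"
)


def check_trace(line):
    """Parse one TRACE line; return its fields plus a 'mismatch' flag.

    Returns None when the line does not contain a TRACE record.
    """
    match = TRACE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The router checker needs req_dst to be the NID that is pinged.
    fields["mismatch"] = fields["req_dst"] != fields["sel_dst"]
    return fields


# The TRACE line from this ticket, copied verbatim.
log_line = (
    "00000400:00000200:2.0:1531326577.044213:0:29632:0:"
    "(lib-move.c:2172:lnet_select_pathway()) TRACE: "
    "192.168.2.22@tcp99(192.168.2.22@tcp99:<?>) -> "
    "192.168.2.20@tcp99(192.168.2.20@tcp1:192.168.2.20@tcp99) : GET"
)

result = check_trace(log_line)
print(result["req_dst"], result["sel_dst"], result["mismatch"])
```

On the line from this ticket the requested destination is 192.168.2.20@tcp1 and the selected one is 192.168.2.20@tcp99, so the mismatch flag is set. The same check could be run over a full debug log to count how often router pings are redirected.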
| Comments |
| Comment by Peter Jones [ 08/Aug/18 ] |
|
Sonia, any comment? Peter |
| Comment by Amir Shehata (Inactive) [ 08/Aug/18 ] |
|
I address both of them here: https://wiki.whamcloud.com/display/LNet/Routing+and+MR+integration It might be a good idea to use that link for feedback on the proposals. |
| Comment by Cory Spitz [ 26/Jul/19 ] |
| Comment by Amir Shehata (Inactive) [ 26/Jul/19 ] |
|
I believe this issue has been resolved in the new routing code. |
| Comment by Cory Spitz [ 26/Aug/19 ] |
|
ashehata, will you be resolving this issue then? Can you point at a specific commit or LU that resolved it? |
| Comment by Chris Horn [ 27/Jan/23 ] |
|
Resolved with the MR routing feature |