[LU-12411] Hang on lnetctl route del Created: 10/Jun/19 Updated: 24/Apr/20 Resolved: 25/Jun/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Hit with master. Steps to reproduce:
sles15build01:/home/hornc/lustre-filesystem # insmod ./libcfs/libcfs/libcfs.ko
sles15build01:/home/hornc/lustre-filesystem # insmod ./lnet/lnet/lnet.ko
sles15build01:/home/hornc/lustre-filesystem # cd lnet/utils/
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl lnet configure
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl export
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 0
interrupts: 0
dropped: 0
aborted: 0
no route: 0
timeouts: 0
error: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
dev cpt: 0
tcp bonding: 0
CPT: "[0,1,2,3]"
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net gni4 --gateway 10.12.0.[1-4]@o2ib40
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl export
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 0
interrupts: 0
dropped: 0
aborted: 0
no route: 0
timeouts: 0
error: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
dev cpt: 0
tcp bonding: 0
CPT: "[0,1,2,3]"
route:
- net: gni4
gateway: 10.12.0.4@o2ib40
hop: -1
priority: 0
health_sensitivity: 1
state: down
- net: gni4
gateway: 10.12.0.3@o2ib40
hop: -1
priority: 0
health_sensitivity: 1
state: down
- net: gni4
gateway: 10.12.0.2@o2ib40
hop: -1
priority: 0
health_sensitivity: 1
state: down
- net: gni4
gateway: 10.12.0.1@o2ib40
hop: -1
priority: 0
health_sensitivity: 1
state: down
peer:
- primary nid: 10.12.0.1@o2ib40
Multi-Rail: True
peer ni:
- nid: 10.12.0.1@o2ib40
state: up
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
- primary nid: 10.12.0.2@o2ib40
Multi-Rail: True
peer ni:
- nid: 10.12.0.2@o2ib40
state: up
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
- primary nid: 10.12.0.4@o2ib40
Multi-Rail: True
peer ni:
- nid: 10.12.0.4@o2ib40
state: up
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
- primary nid: 10.12.0.3@o2ib40
Multi-Rail: True
peer ni:
- nid: 10.12.0.3@o2ib40
state: up
max_ni_tx_credits: 0
available_tx_credits: 0
min_tx_credits: 0
tx_q_num_of_buf: 0
available_rtr_credits: 0
min_rtr_credits: 0
refcount: 2
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route del --gateway 10.12.0.[1-4]@o2ib4 --net gni4
^^^^ Command hangs
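For reference, the `route add` step above takes the gateway expression `10.12.0.[1-4]@o2ib40`, and the export that follows lists four `gni4` routes, one per address. The sketch below shows that expansion for the single bracketed numeric range used here. It is a hypothetical illustration, not lnetctl's parser: the function name `expand_nid_range` is an assumption, and it handles only one `[lo-hi]` range per expression.

```python
import re


def expand_nid_range(expr):
    """Expand one bracketed numeric range in a NID expression.

    Hypothetical helper for illustration only; it does not reproduce
    lnetctl's own parsing rules. An expression without a range is
    returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        return [expr]
    prefix, suffix = match.group(1), match.group(4)
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"empty range in {expr!r}")
    # One NID per value in the inclusive range.
    return [f"{prefix}{n}{suffix}" for n in range(low, high + 1)]


if __name__ == "__main__":
    for nid in expand_nid_range("10.12.0.[1-4]@o2ib40"):
        print(nid)
```

Calling `expand_nid_range("10.12.0.[1-4]@o2ib40")` yields the four gateway NIDs that appear in the route list of the second export.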
|
| Comments |
| Comment by Chris Horn [ 11/Jun/19 ] |
|
I hit this issue on a route add and was able to grab the dmesg from the node.
sles15build01:/home/hornc/lustre-filesystem # insmod libcfs/libcfs/libcfs.ko
sles15build01:/home/hornc/lustre-filesystem # insmod lnet/lnet/lnet.ko
sles15build01:/home/hornc/lustre-filesystem # cd lnet/utils/
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl lnet configure
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gw 485@gni4
route add: add a route
--net: net name (e.g. tcp0)
--gateway: gateway nid (e.g. 10.1.1.2@tcp)
--hop: number to final destination (1 < hops < 255)
--priority: priority of route (0 - highest prio
--health_sensitivity: gateway health sensitivity (>= 1)
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway 485@gni4
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # lnetctl route show
route:
- net: o2ib40
gateway: 485@gni4
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route show
route:
- net: o2ib40
gateway: 485@gni4
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl peer show
peer:
- primary nid: 485@gni4
Multi-Rail: True
peer ni:
- nid: 485@gni4
state: up
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway 486@gni4,487@gni4,488@gni,10.12.0.[1-4]@o2ib,192.168.0.[20-24]@tcp
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route show
route:
- net: o2ib40
gateway: 10.12.0.1@o2ib
- net: o2ib40
gateway: 488@gni
- net: o2ib40
gateway: 487@gni4
- net: o2ib40
gateway: 192.168.0.23@tcp
- net: o2ib40
gateway: 192.168.0.21@tcp
- net: o2ib40
gateway: 192.168.0.20@tcp
- net: o2ib40
gateway: 192.168.0.24@tcp
- net: o2ib40
gateway: 10.12.0.2@o2ib
- net: o2ib40
gateway: 486@gni4
- net: o2ib40
gateway: 192.168.0.22@tcp
- net: o2ib40
gateway: 485@gni4
- net: o2ib40
gateway: 10.12.0.3@o2ib
- net: o2ib40
gateway: 10.12.0.4@o2ib
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl peer show
peer:
- primary nid: 488@gni
Multi-Rail: True
peer ni:
- nid: 488@gni
state: up
- primary nid: 10.12.0.1@o2ib
Multi-Rail: True
peer ni:
- nid: 10.12.0.1@o2ib
state: up
- primary nid: 10.12.0.4@o2ib
Multi-Rail: True
peer ni:
- nid: 10.12.0.4@o2ib
state: up
- primary nid: 192.168.0.21@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.0.21@tcp
state: up
- primary nid: 192.168.0.24@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.0.24@tcp
state: up
- primary nid: 486@gni4
Multi-Rail: True
peer ni:
- nid: 486@gni4
state: up
- primary nid: 10.12.0.2@o2ib
Multi-Rail: True
peer ni:
- nid: 10.12.0.2@o2ib
state: up
- primary nid: 192.168.0.22@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.0.22@tcp
state: up
- primary nid: 487@gni4
Multi-Rail: True
peer ni:
- nid: 487@gni4
state: up
- primary nid: 192.168.0.20@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.0.20@tcp
state: up
- primary nid: 485@gni4
Multi-Rail: True
peer ni:
- nid: 485@gni4
state: up
- primary nid: 10.12.0.3@o2ib
Multi-Rail: True
peer ni:
- nid: 10.12.0.3@o2ib
state: up
- primary nid: 192.168.0.23@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.0.23@tcp
state: up
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway [486-489]@gni4
add:
- route:
errno: -1
descr: "Cannot parse nid: [486-489]@gni4"
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway [485,486,93,94]@gni4
add:
- route:
errno: -1
descr: "Cannot parse nid: [485,486,93,94]@gni4"
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib --gateway 192.168.2.24@tcp,192.168.2.25@tcp
^^^ Command hangs
|
| Comment by Chris Horn [ 11/Jun/19 ] |
|
FWIW, these are the module options on this node:
sles15build01:~ # cat /etc/modprobe.d/lnet.conf
options lnet forwarding=enabled
options lnet networks="tcp(eth0)"
sles15build01:~ # |
| Comment by Chris Horn [ 11/Jun/19 ] |
|
Attached a dump from the route add case mentioned above https://jira.whamcloud.com/secure/attachment/32770/LU-12411.dump.tar.bz2 |
| Comment by Chris Horn [ 11/Jun/19 ] |
|
Might be related to the fact that I wasn't configuring any local interfaces before adding the routes? I see another hang doing the following:
sles15build01:/home/hornc/lustre-filesystem # insmod libcfs/libcfs/libcfs.ko
sles15build01:/home/hornc/lustre-filesystem # insmod lnet/lnet/lnet.ko^C
sles15build01:/home/hornc/lustre-filesystem # vim /etc/modprobe.d/lnet.conf
sles15build01:/home/hornc/lustre-filesystem # insmod lnet/lnet/lnet.ko
sles15build01:/home/hornc/lustre-filesystem # cd lnet/utils/
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl lnet configure
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib --gateway 192.168.2.24@tcp,192.168.2.25@tcp
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route del --net o2ib --gateway 192.168.2.24@tcp,192.168.2.25@tcp
del:
- route:
errno: -8
descr: "Success"
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route show
route:
- net: o2ib
gateway: 192.168.2.25@tcp
- net: o2ib
gateway: 192.168.2.24@tcp
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl net add --net tcp --if eth0
^^^ Command hangs
|
| Comment by Chris Horn [ 11/Jun/19 ] |
|
This might also be a bug exposed when adding any route where the gateway is on an unreachable lnet. We probably shouldn't allow that to happen. I will push a patch to prevent this. |
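One way to picture the check proposed above is sketched below: split the gateway NID into address and net, and refuse the route unless that net is one of the locally configured nets. This is an illustrative sketch only, not the logic of the patch pushed for this ticket. The helper names (`normalize_net`, `nid_net`, `gateway_on_local_net`) are hypothetical, and it assumes a net name without a trailing number (e.g. `tcp`) means net number 0 (`tcp0`).

```python
import re


def normalize_net(net):
    """Return the net name with an explicit number, e.g. 'tcp' -> 'tcp0'.

    Assumption: a missing trailing number means net number 0.
    """
    match = re.fullmatch(r"([a-z0-9]*?[a-z])(\d*)", net)
    if match is None:
        raise ValueError(f"cannot parse net: {net!r}")
    return f"{match.group(1)}{int(match.group(2) or 0)}"


def nid_net(nid):
    """Extract the normalized net from a NID of the form 'addr@net'."""
    addr, sep, net = nid.rpartition("@")
    if not sep or not addr or not net:
        raise ValueError(f"cannot parse nid: {nid!r}")
    return normalize_net(net)


def gateway_on_local_net(gateway_nid, local_nets):
    """True if the gateway's net is among the locally configured nets."""
    local = {normalize_net(net) for net in local_nets}
    return nid_net(gateway_nid) in local


if __name__ == "__main__":
    # Only the loopback net is configured, so a tcp gateway is unreachable.
    print(gateway_on_local_net("192.168.2.24@tcp", ["lo"]))
    # Once a tcp net exists locally, the same gateway passes the check.
    print(gateway_on_local_net("192.168.2.24@tcp", ["lo", "tcp"]))
```

With only `lo` configured, as in the transcripts above, `gateway_on_local_net("192.168.2.24@tcp", ["lo"])` returns False, so such a route would be rejected rather than added.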
| Comment by Gerrit Updater [ 11/Jun/19 ] |
|
Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35198 |
| Comment by Amir Shehata (Inactive) [ 12/Jun/19 ] |
|
Steps to reproduce:
[root@lustre01 ~]# modprobe lnet
[root@lustre01 ~]# lnetctl lnet configure
[root@lustre01 ~]# lnetctl route add --net tcp1 --gateway 192.168.122.[106-107]@tcp
[root@lustre01 ~]# lnetctl route del --net tcp1 --gateway 192.168.122.[106-107]@tcp |
| Comment by Gerrit Updater [ 25/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35198/ |
| Comment by Peter Jones [ 25/Jun/19 ] |
|
Landed for 2.13 |
| Comment by Chris Horn [ 25/Jun/19 ] |
|
Amir, is there another bug here to chase, or is it sufficient to just prevent the behavior that leads to breakage? |
| Comment by Chris Horn [ 26/Jul/19 ] |
|
pjones FYI, I discovered that this is a regression that was originally introduced in 2.10.0:
commit 376633ab5c487a2e9497e118ce351c4b1597bf33
Author: Amir Shehata <amir.shehata@intel.com>
Date: Mon Jul 4 14:51:06 2016 -0700
LU-7734 lnet: Routing fixes part 1
Unfortunately, my fix that landed under this ticket contained a flaw. I pushed a patch for that under |
| Comment by Peter Jones [ 26/Jul/19 ] |
|
ok thanks. I've flagged the second one for the LTS so we'll pick it up when that lands to master. |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36870 |
| Comment by Gerrit Updater [ 12/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36870/ |