[LU-12411] Hang on lnetctl route del Created: 10/Jun/19  Updated: 24/Apr/20  Resolved: 25/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug
Priority: Major
Reporter: Chris Horn
Assignee: Amir Shehata (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Attachments: LU-12411.dump.tar.bz2 (File), dmesg.txt (Text File)
Issue Links: is related to LU-12595 Attempt to add route using non-local ... (Resolved)
Severity: 3

 Description   

Hit this on master.

Steps to reproduce:

sles15build01:/home/hornc/lustre-filesystem # insmod ./libcfs/libcfs/libcfs.ko
sles15build01:/home/hornc/lustre-filesystem # insmod ./lnet/lnet/lnet.ko
sles15build01:/home/hornc/lustre-filesystem # cd lnet/utils/
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl lnet configure
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1,2,3]"
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net gni4 --gateway 10.12.0.[1-4]@o2ib40
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1,2,3]"
route:
    - net: gni4
      gateway: 10.12.0.4@o2ib40
      hop: -1
      priority: 0
      health_sensitivity: 1
      state: down
    - net: gni4
      gateway: 10.12.0.3@o2ib40
      hop: -1
      priority: 0
      health_sensitivity: 1
      state: down
    - net: gni4
      gateway: 10.12.0.2@o2ib40
      hop: -1
      priority: 0
      health_sensitivity: 1
      state: down
    - net: gni4
      gateway: 10.12.0.1@o2ib40
      hop: -1
      priority: 0
      health_sensitivity: 1
      state: down
peer:
    - primary nid: 10.12.0.1@o2ib40
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.1@o2ib40
          state: up
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
    - primary nid: 10.12.0.2@o2ib40
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.2@o2ib40
          state: up
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
    - primary nid: 10.12.0.4@o2ib40
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.4@o2ib40
          state: up
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
    - primary nid: 10.12.0.3@o2ib40
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.3@o2ib40
          state: up
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route del --gateway 10.12.0.[1-4]@o2ib4 --net gni4
^^^^ Command hangs


 Comments   
Comment by Chris Horn [ 11/Jun/19 ]

I hit this issue on a route add and was able to grab the dmesg from the node.

sles15build01:/home/hornc/lustre-filesystem # insmod libcfs/libcfs/libcfs.ko
sles15build01:/home/hornc/lustre-filesystem # insmod lnet/lnet/lnet.ko
sles15build01:/home/hornc/lustre-filesystem # cd lnet/utils/
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl lnet configure
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gw 485@gni4
route add: add a route
	--net: net name (e.g. tcp0)
	--gateway: gateway nid (e.g. 10.1.1.2@tcp)
	--hop: number to final destination (1 < hops < 255)
	--priority: priority of route (0 - highest prio
	--health_sensitivity: gateway health sensitivity (>= 1)

sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway 485@gni4
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # lnetctl route show
route:
    - net: o2ib40
      gateway: 485@gni4
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route show
route:
    - net: o2ib40
      gateway: 485@gni4
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl peer show
peer:
    - primary nid: 485@gni4
      Multi-Rail: True
      peer ni:
        - nid: 485@gni4
          state: up
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway 486@gni4,487@gni4,488@gni,10.12.0.[1-4]@o2ib,192.168.0.[20-24]@tcp
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route show
route:
    - net: o2ib40
      gateway: 10.12.0.1@o2ib
    - net: o2ib40
      gateway: 488@gni
    - net: o2ib40
      gateway: 487@gni4
    - net: o2ib40
      gateway: 192.168.0.23@tcp
    - net: o2ib40
      gateway: 192.168.0.21@tcp
    - net: o2ib40
      gateway: 192.168.0.20@tcp
    - net: o2ib40
      gateway: 192.168.0.24@tcp
    - net: o2ib40
      gateway: 10.12.0.2@o2ib
    - net: o2ib40
      gateway: 486@gni4
    - net: o2ib40
      gateway: 192.168.0.22@tcp
    - net: o2ib40
      gateway: 485@gni4
    - net: o2ib40
      gateway: 10.12.0.3@o2ib
    - net: o2ib40
      gateway: 10.12.0.4@o2ib
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl peer show
peer:
    - primary nid: 488@gni
      Multi-Rail: True
      peer ni:
        - nid: 488@gni
          state: up
    - primary nid: 10.12.0.1@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.1@o2ib
          state: up
    - primary nid: 10.12.0.4@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.4@o2ib
          state: up
    - primary nid: 192.168.0.21@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.0.21@tcp
          state: up
    - primary nid: 192.168.0.24@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.0.24@tcp
          state: up
    - primary nid: 486@gni4
      Multi-Rail: True
      peer ni:
        - nid: 486@gni4
          state: up
    - primary nid: 10.12.0.2@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.2@o2ib
          state: up
    - primary nid: 192.168.0.22@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.0.22@tcp
          state: up
    - primary nid: 487@gni4
      Multi-Rail: True
      peer ni:
        - nid: 487@gni4
          state: up
    - primary nid: 192.168.0.20@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.0.20@tcp
          state: up
    - primary nid: 485@gni4
      Multi-Rail: True
      peer ni:
        - nid: 485@gni4
          state: up
    - primary nid: 10.12.0.3@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.12.0.3@o2ib
          state: up
    - primary nid: 192.168.0.23@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.0.23@tcp
          state: up
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway [486-489]@gni4
add:
    - route:
          errno: -1
          descr: "Cannot parse nid: [486-489]@gni4"
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib40 --gateway [485,486,93,94]@gni4
add:
    - route:
          errno: -1
          descr: "Cannot parse nid: [485,486,93,94]@gni4"
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib --gateway 192.168.2.24@tcp,192.168.2.25@tcp
^^^ Command hangs
Comment by Chris Horn [ 11/Jun/19 ]

FWIW, these are the module options on this node:

sles15build01:~ # cat /etc/modprobe.d/lnet.conf
options lnet forwarding=enabled

options lnet networks="tcp(eth0)"

sles15build01:~ #
Comment by Chris Horn [ 11/Jun/19 ]

Attached a dump from the route add case mentioned above: https://jira.whamcloud.com/secure/attachment/32770/LU-12411.dump.tar.bz2

Comment by Chris Horn [ 11/Jun/19 ]

Might be related to the fact that I wasn't configuring any local interfaces before adding the routes? I see another hang doing the following:

sles15build01:/home/hornc/lustre-filesystem # insmod libcfs/libcfs/libcfs.ko
sles15build01:/home/hornc/lustre-filesystem # insmod lnet/lnet/lnet.ko^C
sles15build01:/home/hornc/lustre-filesystem # vim /etc/modprobe.d/lnet.conf
sles15build01:/home/hornc/lustre-filesystem # insmod lnet/lnet/lnet.ko
sles15build01:/home/hornc/lustre-filesystem # cd lnet/utils/
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl lnet configure
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route add --net o2ib --gateway 192.168.2.24@tcp,192.168.2.25@tcp
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route del --net o2ib --gateway 192.168.2.24@tcp,192.168.2.25@tcp
del:
    - route:
          errno: -8
          descr: "Success"
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl route show
route:
    - net: o2ib
      gateway: 192.168.2.25@tcp
    - net: o2ib
      gateway: 192.168.2.24@tcp
sles15build01:/home/hornc/lustre-filesystem/lnet/utils # ./lnetctl net add --net tcp --if eth0
^^^ Command hangs
Comment by Chris Horn [ 11/Jun/19 ]

This might also be a bug exposed when adding any route where the gateway is on an unreachable lnet. We probably shouldn't allow that to happen. I will push a patch to prevent this.
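The check proposed in this comment can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the patch: the helper names `normalize_net`, `nid_net` and `gateway_is_local` are hypothetical, and passing the locally configured nets in as a plain collection of names is an assumption made for the example. It also assumes that a net name without a trailing number means net number 0, so that `tcp` and `tcp0` compare equal.

```python
import re
from typing import Iterable


def normalize_net(net: str) -> str:
    """Append '0' to a net name that has no trailing number (assumption: 'tcp' == 'tcp0')."""
    return net if re.search(r"\d$", net) else net + "0"


def nid_net(nid: str) -> str:
    """Return the normalized net part of a NID such as '192.168.2.24@tcp'."""
    addr, sep, net = nid.partition("@")
    if not sep or not addr or not net:
        raise ValueError(f"malformed NID: {nid!r}")
    return normalize_net(net)


def gateway_is_local(gateway_nid: str, local_nets: Iterable[str]) -> bool:
    """True if the gateway NID is on one of the locally configured nets."""
    return nid_net(gateway_nid) in {normalize_net(n) for n in local_nets}


if __name__ == "__main__":
    # Only the loopback net is configured, as in the first reproduction:
    # a gateway on tcp would be rejected instead of being added.
    print(gateway_is_local("192.168.2.24@tcp", ["lo"]))    # False
    # With tcp configured locally, the same gateway would be accepted.
    print(gateway_is_local("192.168.2.24@tcp", ["tcp"]))   # True
```

Under these assumptions, a route add would fail early with an error whenever `gateway_is_local` returns False, rather than creating a route whose gateway can never be reached.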

Comment by Gerrit Updater [ 11/Jun/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35198
Subject: LU-12411 lnet: Do not allow gateways on remote nets
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 203acca356f3793c1f1c469af1b83790c814c9ad

Comment by Amir Shehata (Inactive) [ 12/Jun/19 ]

Steps to reproduce

[root@lustre01 ~]# modprobe lnet
[root@lustre01 ~]# lnetctl lnet configure
[root@lustre01 ~]# lnetctl route add --net tcp1 --gateway 192.168.122.[106-107]@tcp
[root@lustre01 ~]# lnetctl route del --net tcp1 --gateway 192.168.122.[106-107]@tcp
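The gateway argument in these steps uses a bracketed range. In the `lnetctl export` output in the description, `10.12.0.[1-4]@o2ib40` produced one route per gateway NID. A minimal Python sketch of that expansion is shown here for readers who want to enumerate the affected NIDs; the function name `expand_nid_range` is hypothetical, only a single `[lo-hi]` range is handled, and the sketch does not reproduce lnetctl's own parser (which, as shown in an earlier comment, rejected `[486-489]@gni4`).

```python
import re
from typing import List


def expand_nid_range(expr: str) -> List[str]:
    """Expand one '[lo-hi]' range in a NID expression into individual NIDs.

    An expression without a range is returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        return [expr]
    prefix, suffix = match.group(1), match.group(4)
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"empty range in {expr!r}")
    return [f"{prefix}{n}{suffix}" for n in range(low, high + 1)]


if __name__ == "__main__":
    for nid in expand_nid_range("192.168.122.[106-107]@tcp"):
        print(nid)
```

For the reproduction above this yields `192.168.122.106@tcp` and `192.168.122.107@tcp`, the two gateways that the route add and route del commands operate on.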
Comment by Gerrit Updater [ 25/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35198/
Subject: LU-12411 lnet: Do not allow gateways on remote nets
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43b35351e9ca258773e89c2d68047e939fb822fb

Comment by Peter Jones [ 25/Jun/19 ]

Landed for 2.13

Comment by Chris Horn [ 25/Jun/19 ]

Amir, is there another bug here to chase, or is it sufficient to just prevent the behavior that leads to breakage?

Comment by Chris Horn [ 26/Jul/19 ]

pjones, FYI: I discovered that this is a regression that was originally introduced in 2.10.0 by the following commit:

commit 376633ab5c487a2e9497e118ce351c4b1597bf33
Author: Amir Shehata <amir.shehata@intel.com>
Date:   Mon Jul 4 14:51:06 2016 -0700
 
    LU-7734 lnet: Routing fixes part 1

Unfortunately, my fix that landed under this ticket contained a flaw. I pushed a patch for that under LU-12595. Assuming LU-12595 lands, you might consider landing both of these patches to LTS.

Comment by Peter Jones [ 26/Jul/19 ]

OK, thanks. I've flagged the second one for the LTS, so we'll pick it up when it lands on master.

Comment by Gerrit Updater [ 26/Nov/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36870
Subject: LU-12411 lnet: Do not allow gateways on remote nets
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: cb829cc7683e66ecb95db7a80e925d716b6560e9

Comment by Gerrit Updater [ 12/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36870/
Subject: LU-12411 lnet: Do not allow gateways on remote nets
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c6c9084c959ac972af557da100f251eccc79d2f7
