[LU-16836] LNet: initial ni status is "up" if starting with link disconnected
Created: 18/May/23  Updated: 29/Oct/23  Resolved: 28/Jun/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Serguei Smirnov
Assignee: Serguei Smirnov
Resolution: Fixed
Votes: 0
Labels: ksocklnd, lnet, o2iblnd

Issue Links:
Related
is related to LU-17235 kernel panic on kiblnd_startup with l... Resolved
Severity: 3

 Description   

Adding an NI while its underlying interface is disconnected results in the NI being listed as "up", although it is expected to be "down". For example:

7: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:a3:1a:44 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.239/24 brd 192.168.122.255 scope global noprefixroute dynamic bond0
       valid_lft 3264sec preferred_lft 3264sec
    inet6 fe80::5054:ff:fea3:1a44/64 scope link 
       valid_lft forever preferred_lft forever
# lnetctl net add --net tcp --if bond0
# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.239@tcp
          status: up
          interfaces:
              0: bond0
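
For comparison, the kernel's own view of the link is available in sysfs: the operstate attribute (/sys/class/net/<if>/operstate) carries the same "state DOWN" that ip prints above. The sketch below is minimal and assumes only that attribute. The helper name expected_ni_status and its optional sysfs-root argument (present only so it can be tested against a fake directory) are hypothetical; they are not part of lnetctl.

```shell
# Hypothetical helper: print the NI status one would expect LNet to report
# for an interface, based on the kernel's operstate in sysfs.
#   $1 = interface name
#   $2 = optional sysfs root (defaults to /sys/class/net)
expected_ni_status() {
    iface=$1
    root=${2:-/sys/class/net}
    if ! state=$(cat "$root/$iface/operstate" 2>/dev/null); then
        echo "no such interface: $iface" >&2
        return 1
    fi
    case $state in
        up)      echo up ;;
        unknown) echo unknown ;;  # e.g. lo: the kernel does not track operstate
        *)       echo down ;;     # down, lowerlayerdown, dormant, ...
    esac
}
```

On the host from the description, expected_ni_status bond0 would be expected to print down while lnetctl net show reports "status: up", which is the mismatch this ticket is about.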

 



 Comments   
Comment by Serguei Smirnov [ 19/May/23 ]

The issue here is that LNet may be able to initialize the LND but never receive a link status update, so it is not notified that the link is down. The new NI then stays stuck in the "up" status until an event refreshes it, such as plugging or unplugging the cable.
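
The failure mode can be pictured as a status that only link notifications update, so whatever value the NI starts with persists until the next event. The toy model below is hypothetical: it is not LNet code and the function names are invented for illustration. It contrasts marking the NI "up" unconditionally at startup with applying the current link state once at startup, which is what the patch subject "ensure dev notification on lnd startup" suggests the fix aims for.

```shell
# Toy model (hypothetical names, not LNet code). NI_STATUS changes only
# when a link-state notification is applied, so its initial value
# persists until the next notification arrives.
NI_STATUS=

# Apply a link-state notification: $1 is 1 (carrier) or 0 (no carrier).
ni_apply_link_state() {
    if [ "$1" = 1 ]; then NI_STATUS=up; else NI_STATUS=down; fi
}

# Pre-fix behaviour in this model: mark the NI up and wait for events.
ni_startup_old() {
    NI_STATUS=up
}

# Post-fix idea in this model: apply the current link state once at
# startup, as if a notification had just arrived. $1 = current carrier.
ni_startup_new() {
    ni_apply_link_state "$1"
}
```

With no carrier, ni_startup_old leaves NI_STATUS at up, while ni_startup_new 0 sets it to down; a later ni_apply_link_state 1 (cable plugged in) moves either one to up.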

This can happen in the following cases:

  • The cable is disconnected from the interface (socklnd, o2iblnd).
  • All slave links of a bonded interface are down (the example in the ticket description). The slave links may be administratively down, so the bond interface is considered down but still retains its IP address, which allows the LND to be initialized.
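
Why a bond that still has an IP address can be "down": in the Linux bonding driver, outside 802.3ad mode, the bond has carrier only when at least one slave's link is up, and addresses configured on the bond are not removed when carrier is lost. The check below is simplified. It assumes the sysfs attribute /sys/class/net/<bond>/bonding/slaves and approximates each slave's link state with its operstate (the driver tracks slave links itself via miimon or ARP monitoring, and 802.3ad mode applies stricter rules). The helper name is hypothetical.

```shell
# Hypothetical helper: succeed if at least one slave of bond $1 has its
# link up, approximated by the slave's operstate.
#   $2 = optional sysfs root (defaults to /sys/class/net)
# Simplification: mirrors the non-802.3ad carrier rule only.
bond_any_slave_up() {
    bond=$1
    root=${2:-/sys/class/net}
    slaves=$(cat "$root/$bond/bonding/slaves" 2>/dev/null) || return 1
    for s in $slaves; do
        [ "$(cat "$root/$s/operstate" 2>/dev/null)" = up ] && return 0
    done
    # No slaves, or every slave is down: the bond has no carrier.
    return 1
}
```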

Apart from taking down the slave links of a bond as described above, the "ifdown" and "ip link set dev <if> down" commands do not appear to be useful for simulating this issue.
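
Based on the bonded case, a reproducer on a test VM might look like the sketch below. The interface names (bond0, eth1), the active-backup mode and the static address (taken from the description) are assumptions. The ip commands are standard iproute2 and the lnetctl commands are the ones used in the description. By default the script only prints the commands; set RUN=1 to execute them as root.

```shell
# Reproducer sketch (hypothetical interface names and address).
# Prints each command unless RUN=1, in which case it executes it.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# $1 = bond name, $2 = space-separated list of slave interfaces
reproduce() {
    bond=${1:-bond0}
    slaves=${2:-eth1}
    run ip link add "$bond" type bond mode active-backup miimon 100
    for s in $slaves; do
        run ip link set "$s" down           # bring the slave down before enslaving
        run ip link set "$s" master "$bond"
    done
    run ip link set "$bond" up
    run ip addr add 192.168.122.239/24 dev "$bond"
    # Administratively down every slave: the bond loses carrier but keeps its IP.
    for s in $slaves; do
        run ip link set "$s" down
    done
    run lnetctl net add --net tcp --if "$bond"
    run lnetctl net show                    # affected builds list the NI as "up"
}
```

Calling reproduce bond0 "eth1 eth2" with RUN=1 set on a build without the fix would be expected to show the tcp NI with "status: up", as in the description.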

Comment by Gerrit Updater [ 19/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51057
Subject: LU-16836 lnet: ensure dev notification on lnd startup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fcb6d8de5eb5519d3eafd123719427c21ba0d6da

Comment by Gerrit Updater [ 28/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51057/
Subject: LU-16836 lnet: ensure dev notification on lnd startup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 09c6e2b872287c847d15620788f6cf50b3a9f30b

Comment by Peter Jones [ 28/Jun/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:30:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.