[LU-15622] LNET/o2ib doesn't know LNET state properly in RoCE configuration Created: 07/Mar/22  Updated: 24/May/23  Resolved: 24/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: Cyril Bordage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-14790 Local NI status should reflect whethe... Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Here is the RoCE setup:

[root@es200nvx-vm1 ~]# ibstat
CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.30.1004
	Hardware version: 0
	Node GUID: 0x0c42a10300ae2a4e
	System image GUID: 0x0c42a10300ae2a4e
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0e42a1fffeae2a4e
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.30.1004
	Hardware version: 0
	Node GUID: 0x0c42a10300ae2a4f
	System image GUID: 0x0c42a10300ae2a4e
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0e42a1fffeae2a4f
		Link layer: Ethernet

[root@es200nvx-vm1 ~]# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:00:70:ea:1e:d1 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
4: ens1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:ae:2a:4e brd ff:ff:ff:ff:ff:ff
5: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:ae:2a:4f brd ff:ff:ff:ff:ff:ff
[root@es200nvx-vm1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib12
      local NI(s):
        - nid: 192.168.11.232@o2ib12
          status: up
          interfaces:
              0: ens1
        - nid: 192.168.11.242@o2ib12
          status: up
          interfaces:

Then one network interface (ens2, on mlx5_1) was physically turned off:

[root@es200nvx-vm1 ~]# ibstat
CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.30.1004
	Hardware version: 0
	Node GUID: 0x0c42a10300ae2a4e
	System image GUID: 0x0c42a10300ae2a4e
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0e42a1fffeae2a4e
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.30.1004
	Hardware version: 0
	Node GUID: 0x0c42a10300ae2a4f
	System image GUID: 0x0c42a10300ae2a4e
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0e42a1fffeae2a4f
		Link layer: Ethernet

[root@es200nvx-vm1 ~]# ip link 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:00:70:ea:1e:d1 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
4: ens1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:ae:2a:4e brd ff:ff:ff:ff:ff:ff
5: ens2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:ae:2a:4f brd ff:ff:ff:ff:ff:ff

The NID status never changes from "up" to "down", even though the physical network interface is down. LNet still reports both NIs as up:

[root@es200nvx-vm1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib12
      local NI(s):
        - nid: 192.168.11.232@o2ib12
          status: up
          interfaces:
              0: ens1
        - nid: 192.168.11.242@o2ib12
          status: up
          interfaces:
              0: ens2
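The mismatch can be detected by cross-checking the `ip link` output against `lnetctl net show`. The following is a minimal sketch. It assumes the text formats exactly as captured in this report, and it uses line-based parsing so that it needs only the standard library (a real tool might parse the YAML from `lnetctl` instead). The function names are hypothetical and are not part of Lustre or lnetctl. The sketch flags interfaces whose link has no carrier while LNet still reports the corresponding NI as `up`, which is the behavior described above.

```python
import re

# Abridged copies of the outputs captured in this report (after ens2 went down).
IP_LINK = """\
4: ens1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
5: ens2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
"""

LNET_SHOW = """\
net:
    - net type: o2ib12
      local NI(s):
        - nid: 192.168.11.232@o2ib12
          status: up
          interfaces:
              0: ens1
        - nid: 192.168.11.242@o2ib12
          status: up
          interfaces:
              0: ens2
"""


def link_states(ip_link_text):
    """Hypothetical helper: map interface -> True if state is UP and NO-CARRIER is absent."""
    states = {}
    pattern = r"^\d+: ([^:\s]+): <([^>]*)>.*? state (\S+)"
    for m in re.finditer(pattern, ip_link_text, re.M):
        name, flags, state = m.group(1), m.group(2).split(","), m.group(3)
        states[name] = state == "UP" and "NO-CARRIER" not in flags
    return states


def lnet_ni_states(lnet_text):
    """Hypothetical helper: map interface -> LNet NI status from `lnetctl net show` text."""
    result = {}
    status = None
    for line in lnet_text.splitlines():
        s = line.strip()
        if s.startswith("- nid:"):
            status = None  # a new NI entry starts; forget the previous status
        elif s.startswith("status:"):
            status = s.split(":", 1)[1].strip()
        elif re.match(r"^\d+: \S+$", s) and status is not None:
            # Interface lines look like "0: ens2" under "interfaces:".
            result[s.split(": ", 1)[1]] = status
    return result


def mismatches(ip_link_text, lnet_text):
    """Return interfaces whose link is down while LNet still reports the NI as up."""
    links = link_states(ip_link_text)
    return sorted(
        iface
        for iface, status in lnet_ni_states(lnet_text).items()
        if status == "up" and links.get(iface) is False
    )


if __name__ == "__main__":
    print(mismatches(IP_LINK, LNET_SHOW))  # prints ['ens2']
```

With a fix such as the one tracked in LU-14790, the NI on ens2 would be expected to stop showing `status: up`, and this check would then return an empty list.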


 Comments   
Comment by Chris Horn [ 07/Mar/22 ]

What Lustre version is being tested? Could this be a dupe of https://jira.whamcloud.com/browse/LU-14790 ?

Comment by Cyril Bordage [ 15/Mar/22 ]

I checked on a RoCE setup that showed the error. Indeed, it is a dupe of LU-14790.

Comment by Cyril Bordage [ 26/May/22 ]

Shuichi, the patch has been ported. Is it okay?

Generated at Sat Feb 10 03:19:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.