Details
Bug - Resolution: Fixed - Major - EL9.3 - 3
Description
I was testing today's master branch to bring up Lustre servers on EL9.3 for a new project, and it looks like a regression was introduced: I can no longer ping the local NID.
Lustre 2.15.59_32 works as expected:
[root@elm-rcf-io2-s2 ~]# rpm -q lustre
lustre-2.15.59_32_g1bb972b-1.el9.x86_64
[root@elm-rcf-io2-s2 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5238
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5238
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5239
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5239
        Link layer: Ethernet
[root@elm-rcf-io2-s2 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib9
      local NI(s):
        - nid: 10.4.0.24@o2ib9
          status: up
          interfaces:
              0: ens2f0np0
[root@elm-rcf-io2-s2 ~]# lctl list_nids
10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lctl ping 10.4.0.24@o2ib9
12345-0@lo
12345-10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lnetctl ping 10.4.0.24@o2ib9
ping:
    - primary nid: 10.4.0.24@o2ib9
      Multi-Rail: False
      peer ni:
        - nid: 10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lnetctl ping 0@lo
ping:
    - primary nid: 0@lo
      Multi-Rail: False
      peer ni:
        - nid: 10.4.0.24@o2ib9
But Lustre 2.15.61_225 is now broken:
[root@elm-rcf-io2-s2 ~]# rpm -q lustre
lustre-2.15.61_225_gbb6a2d2-1.el9.x86_64
[root@elm-rcf-io2-s2 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib9
      local NI(s):
        - nid: 10.4.0.24@o2ib9
          status: up
          interfaces:
              0: ens2f0np0
[root@elm-rcf-io2-s2 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5238
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5238
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5239
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5239
        Link layer: Ethernet
[root@elm-rcf-io2-s2 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib9
      local NI(s):
        - nid: 10.4.0.24@o2ib9
          status: up
          interfaces:
              0: ens2f0np0
[root@elm-rcf-io2-s2 ~]# lctl list_nids
10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lctl ping 10.4.0.24@o2ib9
failed to ping 10.4.0.24@o2ib9: Protocol error
[root@elm-rcf-io2-s2 ~]# lnetctl ping 10.4.0.24@o2ib9
manage:
    - ping:
          errno: -71
          descr: ! 'failed to ping 10.4.0.24@o2ib9: Protocol error'
[root@elm-rcf-io2-s2 ~]# lnetctl ping 0@lo
ping:
    - primary nid: 0@lo
      Multi-Rail: false
      peer_ni:
        - nid: 10.4.0.24@o2ib9
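To compare the two builds without reading the transcripts by eye, the `lctl ping` output shown above can be classified mechanically. The following is only a minimal sketch: the function name `parse_lctl_ping` and the regular expressions are hypothetical helpers written for this report, and they assume the two output shapes seen here (one `<pid>-<nid>` line per reply on success, a single `failed to ping ...` line on failure).

```python
import re

# Success lines look like "12345-10.4.0.24@o2ib9": an LNet process ID, a dash, a NID.
_REPLY_LINE = re.compile(r"^(\d+)-(\S+@\S+)$")
# Failure lines look like "failed to ping <nid>: <reason>".
_FAILURE_LINE = re.compile(r"^failed to ping (\S+): (.+)$")


def parse_lctl_ping(output):
    """Return (ok, nids, error) for text captured from `lctl ping <nid>`."""
    nids = []
    for raw in output.splitlines():
        line = raw.strip()
        failure = _FAILURE_LINE.match(line)
        if failure:
            # Report the reason text, e.g. "Protocol error".
            return False, [], failure.group(2)
        reply = _REPLY_LINE.match(line)
        if reply:
            nids.append(reply.group(2))
    # No recognised lines at all is treated as a failed ping.
    return bool(nids), nids, None


# Output copied from the 2.15.59_32 and 2.15.61_225 sessions in this report.
good = "12345-0@lo\n12345-10.4.0.24@o2ib9\n"
bad = "failed to ping 10.4.0.24@o2ib9: Protocol error\n"

print(parse_lctl_ping(good))
print(parse_lctl_ping(bad))
```

Feeding the captured output of both builds through the same helper gives a quick pass/fail signal when bisecting between 2.15.59_32 and 2.15.61_225.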
Lustre debug shows:
00000400:00020000:4.0:1712120315.285318:0:101829:0:(api-ni.c:9146:lnet_ping()) 12345-10.4.0.24@o2ib9: Unexpected magic 00000000
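The message above suggests the ping reply buffer came back with its first word still zero, so the magic check rejected it. The sketch below illustrates that kind of check under stated assumptions: the constant `0x70696E67` (ASCII "ping") is what I believe `LNET_PROTO_PING_MAGIC` is defined as in the LNet headers, `71` is the Linux value of `EPROTO` matching the `errno: -71` reported by `lnetctl`, and `check_ping_magic` is a hypothetical stand-in, not the actual logic of `lnet_ping()` in `api-ni.c`.

```python
import struct

LNET_PROTO_PING_MAGIC = 0x70696E67  # ASCII "ping"; assumed value, not taken from this report
EPROTO = 71  # Linux errno for "Protocol error", matching "errno: -71" above


def swab32(value):
    """Byte-swap a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def check_ping_magic(reply):
    """Classify the first 4 bytes of a ping reply buffer (illustrative only)."""
    (magic,) = struct.unpack_from("=I", reply, 0)
    if magic == LNET_PROTO_PING_MAGIC:
        return 0  # reply is in native byte order
    if magic == swab32(LNET_PROTO_PING_MAGIC):
        return 0  # reply came from a peer with the opposite byte order
    # Anything else, including an all-zero buffer, cannot be parsed.
    return -EPROTO


# A reply buffer that was never filled in starts with magic 00000000.
print(check_ping_magic(b"\x00\x00\x00\x00"))  # -71
print(check_ping_magic(struct.pack("=I", LNET_PROTO_PING_MAGIC)))  # 0
```

If this reading is right, the question for the regression is why the reply buffer for the local o2ib NID is left unpopulated, while the `0@lo` ping still returns a valid reply.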