Details
Bug
Resolution: Fixed
Major
EL9.3
Description
I was testing today's master branch to bring up Lustre servers on EL9.3 for a new project, and it looks like a regression was introduced: I can no longer ping the local NID.
Lustre 2.15.59_32 works as expected:
[root@elm-rcf-io2-s2 ~]# rpm -q lustre
lustre-2.15.59_32_g1bb972b-1.el9.x86_64
[root@elm-rcf-io2-s2 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5238
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5238
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5239
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5239
        Link layer: Ethernet
[root@elm-rcf-io2-s2 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib9
      local NI(s):
        - nid: 10.4.0.24@o2ib9
          status: up
          interfaces:
              0: ens2f0np0
[root@elm-rcf-io2-s2 ~]# lctl list_nids
10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lctl ping 10.4.0.24@o2ib9
12345-0@lo
12345-10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lnetctl ping 10.4.0.24@o2ib9
ping:
    - primary nid: 10.4.0.24@o2ib9
      Multi-Rail: False
      peer ni:
        - nid: 10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lnetctl ping 0@lo
ping:
    - primary nid: 0@lo
      Multi-Rail: False
      peer ni:
        - nid: 10.4.0.24@o2ib9
But Lustre 2.15.61_225 is now broken:
[root@elm-rcf-io2-s2 ~]# rpm -q lustre
lustre-2.15.61_225_gbb6a2d2-1.el9.x86_64
[root@elm-rcf-io2-s2 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib9
      local NI(s):
        - nid: 10.4.0.24@o2ib9
          status: up
          interfaces:
              0: ens2f0np0
[root@elm-rcf-io2-s2 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5238
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5238
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.38.1002
    Hardware version: 0
    Node GUID: 0x58a2e103003e5239
    System image GUID: 0x58a2e103003e5238
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x5aa2e1fffe3e5239
        Link layer: Ethernet
[root@elm-rcf-io2-s2 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib9
      local NI(s):
        - nid: 10.4.0.24@o2ib9
          status: up
          interfaces:
              0: ens2f0np0
[root@elm-rcf-io2-s2 ~]# lctl list_nids
10.4.0.24@o2ib9
[root@elm-rcf-io2-s2 ~]# lctl ping 10.4.0.24@o2ib9
failed to ping 10.4.0.24@o2ib9: Protocol error
[root@elm-rcf-io2-s2 ~]# lnetctl ping 10.4.0.24@o2ib9
manage:
    - ping:
          errno: -71
          descr: ! 'failed to ping 10.4.0.24@o2ib9: Protocol error'
[root@elm-rcf-io2-s2 ~]# lnetctl ping 0@lo
ping:
    - primary nid: 0@lo
      Multi-Rail: false
      peer_ni:
        - nid: 10.4.0.24@o2ib9
Lustre debug shows:
00000400:00020000:4.0:1712120315.285318:0:101829:0:(api-ni.c:9146:lnet_ping()) 12345-10.4.0.24@o2ib9: Unexpected magic 00000000
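For readers connecting the two symptoms: errno -71 is EPROTO ("Protocol error") on Linux, and the debug line indicates that the ping reply was rejected because its magic field was zero. The following is only an illustrative Python sketch of that kind of check, not the code in api-ni.c. The constant value (the ASCII bytes 'ping') and the names `PING_MAGIC`, `swab32`, and `check_ping_magic` are assumptions made for this example.

```python
import errno
import struct

# Assumption: the expected ping magic is the ASCII bytes 'ping' (0x70696E67).
PING_MAGIC = 0x70696E67


def swab32(value: int) -> int:
    """Return the 32-bit value with its byte order reversed."""
    return struct.unpack("<I", struct.pack(">I", value & 0xFFFFFFFF))[0]


def check_ping_magic(magic: int) -> int:
    """Return 0 if the reply magic is acceptable, else a negative errno."""
    if magic == PING_MAGIC:
        return 0  # reply is in the receiver's byte order
    if magic == swab32(PING_MAGIC):
        return 0  # reply is in the opposite byte order and would need swabbing
    return -errno.EPROTO  # anything else, including 0, is a protocol error


if __name__ == "__main__":
    # A zero magic, as in the debug line above, is rejected.
    rc = check_ping_magic(0x00000000)
    print(rc, errno.errorcode[-rc])
```

Under these assumptions, a reply buffer that was never filled in (all zeros) would fail both comparisons, which is consistent with the "Unexpected magic 00000000" message and the Protocol error reported by `lctl ping` and `lnetctl ping`.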