Details
Bug
Resolution: Not a Bug
Major
None
None
Lustre tagged 2.15.62
3
Description
We tried the latest tag, 2.15.62, on top of EL9.3 today, and LNet o2ib is broken. This is not related to LU-17700, which I tested separately.
This appears to be a regression introduced somewhere in the following commit range, between the last commit where LNet o2ib loads and the 2.15.62 tag:
9b2cb4e208 New tag 2.15.62 <- TESTED: lnet o2ib broken
b53aa99294 LU-17690 quota: uninitialized var in qmt_lgd_extend_cb()
fac57894c7 LU-17683 lnet: ksocknal_startup() leaks iface table
53c3585ac3 LU-17673 obdclass: properly free opts string
37e1316050 LU-17685 utils: Allow nocompr flag in lfs mirror extend
b4d2566ea7 LU-930 ofd: improve orphan cleaning message
9f7b60739e LU-8191 selftest: restore BUILD_BUGs
02aa540250 LU-17678 quota: fix memleak in qmt_setup_lqe_gd()
552a813a22 LU-17673 llite: ll_options() to release the string
4ae823762d LU-17261 lov: unlink can handle bogus striping
90ec7361b7 LU-17665 lnet: lock primary NID only on lustre-built peer
94d05d0737 LU-17379 mgc: try MGS nodes faster
1ec1858f64 LU-16724 ptlrpc: refactor page pools patch 1
c660bdef1d LU-16694 tests: rewrite socket client, server in python
1e9a36b00c LU-15981 tests: add missing close in cascading_rw
7bdbe5d8e3 LU-17680 ldlm: fix ldlm_res_hop_hash() argument
a9704ed2d8 LU-17675 tests: sanity-flr/61a set atime_diff=1 for statx
6b4ef5cab9 LU-17674 build: use nop_mnt_idmap in inode_owner_or_capable
fa08092d9a LU-6142 llite: Fix style issues under lustre/llite
4c809f7621 LU-12452 o2iblnd: allow setting IP ToS value (RoCE)
56af81e1aa LU-10391 lnet: update Netlink commands functionality
7151881aa3 LU-13814 clio: remove cp_state usage for DIO pages
9ef186b71b LU-16692 tests: remove force_new_seq from some test suites
f00d2467fc LU-16692 osp: do not assert on seq got over network
bb6a2d2e80 LU-17053 libcfs: make a debugfs equivalent for markers <- TESTED: lnet o2ib loads
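With one known-bad and one known-good endpoint, the range above could be narrowed by bisection rather than by testing every commit. The following is only a sketch of that search, assuming the history between the two endpoints is linear and that each commit is either fully broken or fully working. The `is_broken` callback is hypothetical: in practice it would build the given commit and try to bring up LNet o2ib.

```python
from typing import Callable, Sequence

# Commit hashes from the list above, newest first.
# COMMITS[0] was tested and is broken; COMMITS[-1] was tested and loads.
COMMITS = [
    "9b2cb4e208", "b53aa99294", "fac57894c7", "53c3585ac3", "37e1316050",
    "b4d2566ea7", "9f7b60739e", "02aa540250", "552a813a22", "4ae823762d",
    "90ec7361b7", "94d05d0737", "1ec1858f64", "c660bdef1d", "1e9a36b00c",
    "7bdbe5d8e3", "a9704ed2d8", "6b4ef5cab9", "fa08092d9a", "4c809f7621",
    "56af81e1aa", "7151881aa3", "9ef186b71b", "f00d2467fc", "bb6a2d2e80",
]


def first_bad_commit(commits: Sequence[str],
                     is_broken: Callable[[str], bool]) -> str:
    """Return the oldest broken commit in a newest-first list.

    Assumes commits[0] is known bad and commits[-1] is known good,
    so neither endpoint is tested again.
    """
    bad, good = 0, len(commits) - 1
    # Invariant: commits[bad] is broken, commits[good] works.
    while good - bad > 1:
        mid = (bad + good) // 2
        if is_broken(commits[mid]):
            bad = mid      # regression is at mid or older
        else:
            good = mid     # regression is newer than mid
    return commits[bad]

# Usage (is_broken is a hypothetical build-and-test helper):
#   culprit = first_bad_commit(COMMITS, is_broken)
```

For the 23 untested commits here this needs roughly five build-and-test cycles. On a real checkout, `git bisect start 9b2cb4e208 bb6a2d2e80` performs the same search.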
What happens with 2.15.62 is this (tested on ConnectX-6 RoCE):
[root@elm-rcf-md1-s2 ~]# cat /etc/lnet.conf
# lnet.conf - configuration file for lnet routes to be imported by lnetctl
net:
    - net type: o2ib9
      local NI(s):
        - nid:
          interfaces:
              0: eno12399np0

[root@elm-rcf-md1-s2 ~]# systemctl status lnet
× lnet.service - lnet management
     Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2024-04-09 14:40:39 PDT; 12s ago
    Process: 3239 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
    Process: 3301 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, status=0/SUCCESS)
    Process: 3304 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, status=234)
   Main PID: 3304 (code=exited, status=234)
        CPU: 789ms

Apr 09 14:40:38 elm-rcf-md1-s2 systemd[1]: Starting lnet management...
Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: ---
Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: add:
Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]:   - import:
Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]:     errno: -22
Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]:     descr: ! "unsupported NID"
Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: ...
Apr 09 14:40:39 elm-rcf-md1-s2 systemd[1]: lnet.service: Main process exited, code=exited, status=234/n/a
Apr 09 14:40:39 elm-rcf-md1-s2 systemd[1]: lnet.service: Failed with result 'exit-code'.
Apr 09 14:40:39 elm-rcf-md1-s2 systemd[1]: Failed to start lnet management.

[root@elm-rcf-md1-s2 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4127
    Number of ports: 1
    Firmware version: 26.38.1002
    Hardware version: 0
    Node GUID: 0x946dae03006d972a
    System image GUID: 0x946dae03006d972a
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 25
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x966daefffe6d972a
        Link layer: Ethernet
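A note on reading the log above: the `status=234` reported by systemd and the `errno: -22` printed by lnetctl are consistent with each other if lnetctl returns the negative errno as its exit code, since only the low 8 bits of an exit code survive. That lnetctl does this is an assumption based on the numbers matching, not something confirmed from its source. The helper names below are illustrative only.

```python
import errno


def exit_status_from_return(rc: int) -> int:
    """Low 8 bits of a process return value, as seen by the parent."""
    return rc & 0xFF


def errno_from_exit_status(status: int) -> int:
    """Inverse mapping, assuming the status is a negative errno
    that wrapped around to an unsigned 8-bit value."""
    return status - 256 if status > 127 else status


# -22 wraps to 234, and 234 maps back to -22.
wrapped = exit_status_from_return(-22)
recovered = errno_from_exit_status(234)

# Symbolic name for errno 22 on this platform (EINVAL on Linux).
name = errno.errorcode.get(-recovered, "unknown")
```

Under that assumption, the failure of `lnetctl import` is an invalid-argument error raised while parsing the NID configuration, which matches the "unsupported NID" description.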