Details
Type: Bug
Resolution: Duplicate
Priority: Major
Affects Version/s: Lustre 2.10.4
Environment: CentOS 7.5, x86_64
Severity: 3
Description
It looks like the latest kernel update from CentOS/RedHat prevents LNet from working on InfiniBand interfaces (mlx5).
Symptoms
LNet communication fails entirely; even a self-ping doesn't work:
# lctl list_nids
10.9.101.60@o2ib4
# lctl ping 10.9.101.60@o2ib4
failed to ping 10.9.101.60@o2ib4: Input/output error
Communicating with other nodes is impossible, as is mounting filesystems.
The same node with the same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64.
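The self-ping check above can be repeated for every local NID after each kernel change. The following is a minimal sketch that uses only the `lctl list_nids` and `lctl ping` commands shown in this report. The function name `self_ping_nids` and the `LCTL` override variable are hypothetical, introduced here for illustration.

```shell
# Self-ping every NID configured on this node.
# Prints "OK <nid>" or "FAIL <nid>" per NID and returns non-zero
# if any ping fails. LCTL is a hypothetical override for the
# lctl binary path; it defaults to "lctl".
self_ping_nids() {
    _rc=0
    for _nid in $("${LCTL:-lctl}" list_nids); do
        if "${LCTL:-lctl}" ping "$_nid" >/dev/null 2>&1; then
            echo "OK $_nid"
        else
            echo "FAIL $_nid"
            _rc=1
        fi
    done
    return $_rc
}
```

Running `self_ping_nids` as root on the affected node would be expected to report a failure for 10.9.101.60@o2ib4 under the newer kernel, assuming the behaviour described above.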
Versions
# uname -r
3.10.0-862.11.6.el7.x86_64
# cat /sys/fs/lustre/version
2.10.4
HW
# ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.21.3012
    Hardware version: 0
    Node GUID: 0x7cfe900300268c04
    System image GUID: 0x7cfe900300268c04
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 72
        LMC: 0
        SM lid: 6
        Capability mask: 0x2651e848
        Port GUID: 0x7cfe900300268c04
        Link layer: InfiniBand
Kernel logs
[ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22
[ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject
[ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
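The -22 in these messages is a negated Linux errno; 22 is EINVAL (invalid argument). To pull the failing function and error code out of a longer kernel log, a small filter along these lines may help. This is a sketch only: the function name `o2iblnd_errors` is hypothetical, and the pattern assumes the message layout shown above.

```shell
# Read kernel log text on stdin and, for each o2iblnd message that
# carries a negative error code, print "<function> <errno>".
# Lines without a negative code (such as the "rejected" message) are skipped.
o2iblnd_errors() {
    grep 'o2iblnd' |
        sed -n 's/.*:\([a-z_0-9]*\)()).*[ :]\(-[0-9][0-9]*\).*/\1 \2/p'
}
```

It could be fed from the kernel ring buffer, for example `dmesg | o2iblnd_errors`.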