- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: Lustre 2.10.4
- Fix Version/s: None
- Labels: None
- Environment: CentOS 7.5, x86_64
- Severity: 3
- Rank (Obsolete): 9223372036854775807
It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).
Symptoms
No LNet communication, self-ping doesn't work:
# lctl list_nids
10.9.101.60@o2ib4
# lctl ping 10.9.101.60@o2ib4
failed to ping 10.9.101.60@o2ib4: Input/output error
Communicating with other nodes is impossible, as is mounting filesystems.
The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64
Versions
# uname -r
3.10.0-862.11.6.el7.x86_64
# cat /sys/fs/lustre/version
2.10.4
HW
# ibstat
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.21.3012
	Hardware version: 0
	Node GUID: 0x7cfe900300268c04
	System image GUID: 0x7cfe900300268c04
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 72
		LMC: 0
		SM lid: 6
		Capability mask: 0x2651e848
		Port GUID: 0x7cfe900300268c04
		Link layer: InfiniBand
Kernel logs
[ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22
[ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject
[ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
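The -22 in these messages is a negated Linux errno value, and on Linux 22 is EINVAL ("Invalid argument"). As a reading aid only, here is a minimal Python sketch that pulls such codes out of log text and maps them to their symbolic names. The helper name `decode_errnos` and the regular expression are illustrative assumptions, not part of Lustre or LNet, and the mapping assumes the script runs on a Linux host so that Python's `errno` table matches the kernel's numbering.

```python
import errno
import os
import re

# Assumption: return codes appear either as ": -NN" at the end of a message
# or as "Error -NN" inside it, as in the two log lines quoted in this ticket.
ERRNO_RE = re.compile(r"(?:: |Error )-(\d+)\b")


def decode_errnos(log_text):
    """Return (code, symbolic name, message) for each negative errno found."""
    results = []
    for match in ERRNO_RE.finditer(log_text):
        code = int(match.group(1))
        # errno.errorcode maps numeric values to names for the host platform.
        name = errno.errorcode.get(code, "UNKNOWN")
        results.append((code, name, os.strerror(code)))
    return results


sample = (
    "LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) "
    "Can't accept 10.9.101.60@o2ib4: -22\n"
    "LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) "
    "Error -22 sending reject\n"
)

for code, name, message in decode_errnos(sample):
    print(f"-{code} = {name} ({message})")
```

Knowing that the code is EINVAL only narrows the failure to a call that rejected its arguments during connection setup; it does not by itself identify which call in the o2iblnd path failed.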