[LU-11257] RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet
Created: 16/Aug/18  Updated: 23/Aug/19  Resolved: 23/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Stanford Research Computing Center
Assignee: Peter Jones
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

CentOS 7.5, x86_64


Issue Links:
Duplicate
is duplicated by LU-11261 ko2iblnd lnet not working after 3.10.... Resolved
Related
is related to LU-11253 kernel update [RHEL7.5 3.10.0-862.11.... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).

Symptoms

No LNet communication, self-ping doesn't work:

# lctl list_nids
10.9.101.60@o2ib4
# lctl ping 10.9.101.60@o2ib4
failed to ping 10.9.101.60@o2ib4: Input/output error

Communicating with other nodes is impossible, as is mounting filesystems.
The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64
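
The self-ping check above can be scripted across nodes. The following is a minimal sketch, assuming `lctl ping` either exits non-zero or prints "failed to ping" when it fails; the helper names `run_lctl` and `self_ping_failures` are hypothetical and not part of Lustre.

```python
import subprocess


def run_lctl(*args):
    """Run lctl with the given arguments; return (exit_code, combined output)."""
    proc = subprocess.run(["lctl", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def self_ping_failures(run=run_lctl):
    """Ping every local NID and return the NIDs that did not answer.

    `run` is injectable so the logic can be exercised without lctl installed.
    """
    code, out = run("list_nids")
    if code != 0:
        raise RuntimeError(f"lctl list_nids failed: {out.strip()}")
    nids = [line.strip() for line in out.splitlines() if line.strip()]

    failed = []
    for nid in nids:
        code, out = run("ping", nid)
        # Assumption: a failed ping is signalled by the exit code or the message.
        if code != 0 or "failed to ping" in out:
            failed.append(nid)
    return failed
```

On a node showing the symptom described here, `self_ping_failures()` would be expected to return the node's own NID; an empty list means every local NID answered.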

 Versions

# uname -r
3.10.0-862.11.6.el7.x86_64
# cat /sys/fs/lustre/version
2.10.4

HW

# ibstat
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.21.3012
        Hardware version: 0
        Node GUID: 0x7cfe900300268c04
        System image GUID: 0x7cfe900300268c04
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 72
                LMC: 0
                SM lid: 6
                Capability mask: 0x2651e848
                Port GUID: 0x7cfe900300268c04
                Link layer: InfiniBand

Kernel logs

[ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22 
[ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject 
[ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
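
To spot this signature across many nodes, the log lines above can be matched with a small script. This is a sketch only: the regular expression is written against the messages quoted in this ticket and is an assumption, not something provided by Lustre.

```python
import re

# Matches the kiblnd_passive_connect() error line quoted in this ticket.
# The pattern is illustrative and may need adjusting for other Lustre versions.
PASSIVE_CONNECT_RE = re.compile(
    r"LNetError: .*kiblnd_passive_connect\(\)\) "
    r"Can't accept (?P<nid>\S+): (?P<errno>-\d+)"
)


def find_rejected_nids(log_text):
    """Return (nid, errno) pairs for each failed passive connect in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = PASSIVE_CONNECT_RE.search(line)
        if match:
            hits.append((match.group("nid"), int(match.group("errno"))))
    return hits


SAMPLE = """\
[ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22
[ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject
[ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
"""

print(find_rejected_nids(SAMPLE))  # [('10.9.101.60@o2ib4', -22)]
```

The errno of -22 corresponds to EINVAL. Feeding the function the output of `dmesg` or a syslog file gives a quick list of peers whose connections were refused.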


 Comments   
Comment by Stanford Research Computing Center [ 16/Aug/18 ]

A more detailed changelog about that kernel is at https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/kernel/3.10.0-862.11.6.el7/x86_64/fd431d51/package-changelog, if that's of any help.

Comment by SC Admin (Inactive) [ 16/Aug/18 ]

we see the same thing on our OPA network.

ksocklnd reportedly seems ok with this kernel on our TCP networks (in VMs mostly), so I suspect it's ko2iblnd related.

below is syslog from
john98 # lctl ping warble@o2ib44

Aug 17 01:58:10 john98 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 36, npartitions: 2
Aug 17 01:58:10 john98 kernel: alg: No test for adler32 (adler32-zlib)
Aug 17 01:58:11 john98 kernel: Lustre: Lustre: Build Version: 2.10.4
Aug 17 01:58:11 john98 kernel: LNet: Using FMR for registration
Aug 17 01:58:11 john98 kernel: LNet: Added LNI 192.168.44.198@o2ib44 [128/2048/0/180]
Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) 192.168.44.198@o2ib44: REJECTED 28
Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) Skipped 3 previous similar messages
Aug 17 02:06:08 john98 kernel: LNetError: 204:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 192.168.44.198@o2ib44: -22
Aug 17 02:06:08 john98 kernel: LNet: 204:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject
Aug 17 02:06:08 john98 kernel: LNetError: 204:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 192.168.44.198@o2ib44 rejected: consumer defined fatal error

2.10.4 was dkms rebuilt for this kernel.

cheers,
robin

Comment by SC Admin (Inactive) [ 16/Aug/18 ]

actually, ib_send_bw and ibv_rc_pingpong don't seem to work on this new kernel, so I suspect RHEL has broken all IB RDMA?

do they work for you?

IPoIB works ok.

cheers,
robin

Comment by Stanford Research Computing Center [ 16/Aug/18 ]

Good observation, indeed: perf tests such as ib_{read,send}_bw and ibv_rc_pingpong fail with errors like:

Failed to modify QP to RTR
Couldn't connect to remote QP 

or

Failed to modify QP 386 to RTR
Unable to Connect the HCA's through the link

Comment by Stanford Research Computing Center [ 16/Aug/18 ]

I submitted RHEL bug #1618452 to report the issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1618452

It seems to be marked "private" by the Red Hat Bugzilla, with no way for me to make it "public" on my end.

Comment by Peter Jones [ 16/Aug/18 ]

I think that you have to ask them to open it up.

Comment by Stanford Research Computing Center [ 16/Aug/18 ]

RHEL's reply:

https://bugzilla.redhat.com/show_bug.cgi?id=1618452

--- Comment #3 from Don Dutile <ddutile@redhat.com> ---
Already reported and being actively fixed.

Cannot make this public, as the patch that caused it was due to embargo'd
security fix.

This issue has highest priority for resolution.
Revert to 3.10.0-862.11.5.el7 in the mean time.

This bug has been marked as a duplicate of bug 1616346

Comment by Peter Jones [ 16/Aug/18 ]

Thanks for the info!

Comment by Stanford Research Computing Center [ 22/Aug/18 ]

Still no update from Red Hat. 

We're getting more info via The Register:
https://www.theregister.co.uk/2018/08/21/fix_for_julys_spectrelike_bug_is_breaking_some_supers/

“The problem will be fixed in kernel-3.10.0-862.13.1 which is currently being reviewed by Red Hat Enterprise Linux Engineering.”

But no ETA yet.

Comment by SC Admin (Inactive) [ 22/Aug/18 ]

hmm. comments attached to that article point to a fix in CentOS, potentially just a misplaced semicolon. but OTOH the CentOS bug seems to be talking about IPoIB, and that works fine. perhaps the fix is right and the bug report is wrong?

cheers,
robin

Comment by Jeremy Filizetti [ 05/Sep/18 ]

I wish I had looked here first when digging into the same thing, instead of wasting a day trying to figure out the culprit. For now I've opened another Red Hat bug, since I didn't come across anything when searching their Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1625620

Comment by Minh Diep [ 07/Sep/18 ]

FYI https://bugs.centos.org/view.php?id=15193

Comment by SC Admin (Inactive) [ 13/Sep/18 ]

yeah, we gave up waiting and just built our own ib_core.ko module with the 1-character patch from centos.
works fine now.

cheers,
robin

Comment by Stanford Research Computing Center [ 26/Sep/18 ]

Kernel 3.10.0-862.14.4.el7 has been released, which fixes the issue:

 https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/kernel/3.10.0-862.14.4.el7/x86_64/fd431d51/package

Comment by Bob Glossman [ 26/Sep/18 ]

The fix has also been noted in the CentOS bug report: https://bugs.centos.org/view.php?id=15193. The updated .rpm isn't available on CentOS mirrors yet, though.

Comment by Bob Glossman [ 28/Sep/18 ]

The kernel update to 3.10.0-862.14.4 is now available on CentOS mirrors.

Comment by Jian Yu [ 10/Oct/18 ]

RHEL 7.5 kernel update to 3.10.0-862.14.4.el7 is tracked in LU-11448.

Comment by Peter Jones [ 23/Aug/19 ]

It seems like this was fixed in the next RHEL/CentOS update
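
For anyone auditing nodes against this ticket, the kernel versions named in the thread can be put in a small lookup. This is a sketch that covers only the three releases explicitly reported here; the function name `kernel_status` is hypothetical, and any other release is reported as "unknown" rather than guessed.

```python
import platform
import re

# Status of each kernel release as reported in this ticket, nothing more.
KNOWN_KERNELS = {
    "3.10.0-862.9.1.el7": "works",    # reported working in the description
    "3.10.0-862.11.6.el7": "broken",  # the release this ticket is about
    "3.10.0-862.14.4.el7": "fixed",   # reported as fixing the issue
}


def kernel_status(release):
    """Classify a `uname -r` string using only the releases named in the ticket."""
    base = re.sub(r"\.x86_64$", "", release.strip())  # drop the arch suffix
    return KNOWN_KERNELS.get(base, "unknown")


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, "->", kernel_status(running))
```

For example, `kernel_status("3.10.0-862.11.6.el7.x86_64")` returns `"broken"`.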

Generated at Sat Feb 10 02:42:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.