[LU-11257] RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet Created: 16/Aug/18 Updated: 23/Aug/19 Resolved: 23/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stanford Research Computing Center | Assignee: | Peter Jones |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.5, x86_64 |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Description |
| Comments |
| Comment by Stanford Research Computing Center [ 16/Aug/18 ] |
|
A more detailed changelog about that kernel is at https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/kernel/3.10.0-862.11.6.el7/x86_64/fd431d51/package-changelog, if that's of any help. |
| Comment by SC Admin (Inactive) [ 16/Aug/18 ] |
|
we see the same thing on our OPA network. ksocklnd reportedly seems ok with this kernel on our TCP networks (in VMs mostly), so I suspect it's ko2iblnd related. below is syslog from
Aug 17 01:58:10 john98 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 36, npartitions: 2
Aug 17 01:58:10 john98 kernel: alg: No test for adler32 (adler32-zlib)
Aug 17 01:58:11 john98 kernel: Lustre: Lustre: Build Version: 2.10.4
Aug 17 01:58:11 john98 kernel: LNet: Using FMR for registration
Aug 17 01:58:11 john98 kernel: LNet: Added LNI 192.168.44.198@o2ib44 [128/2048/0/180]
Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) 192.168.44.198@o2ib44: REJECTED 28
Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) Skipped 3 previous similar messages
Aug 17 02:06:08 john98 kernel: LNetError: 204:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 192.168.44.198@o2ib44: -22
Aug 17 02:06:08 john98 kernel: LNet: 204:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject
Aug 17 02:06:08 john98 kernel: LNetError: 204:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 192.168.44.198@o2ib44 rejected: consumer defined fatal error
2.10.4 was dkms rebuilt for this kernel. cheers, |
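For anyone triaging similar logs, the following is a minimal sketch of how the o2iblnd lines in a syslog excerpt like the one above could be picked out. The regular expression is inferred only from the sample lines in this ticket, not from the Lustre source, and the names `O2IBLND_RE` and `find_o2iblnd_events` are hypothetical helpers introduced for illustration.

```python
import re

# Pattern inferred from the sample lines quoted in this ticket only
# (assumption: other Lustre versions may format these messages differently).
O2IBLND_RE = re.compile(
    r"kernel: (?P<level>LNetError|LNet): \d+:\d+:"
    r"\(o2iblnd_cb\.c:\d+:(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def find_o2iblnd_events(lines):
    """Return (level, function, message) for each o2iblnd_cb.c log line."""
    events = []
    for line in lines:
        match = O2IBLND_RE.search(line)
        if match:
            events.append(
                (match.group("level"), match.group("func"), match.group("msg"))
            )
    return events


# Three lines copied from the syslog excerpt in this ticket.
SAMPLE = [
    "Aug 17 01:58:11 john98 kernel: LNet: Using FMR for registration",
    "Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:"
    "(o2iblnd_cb.c:3061:kiblnd_cm_callback()) 192.168.44.198@o2ib44: REJECTED 28",
    "Aug 17 02:06:08 john98 kernel: LNetError: 204:0:"
    "(o2iblnd_cb.c:2513:kiblnd_passive_connect()) "
    "Can't accept 192.168.44.198@o2ib44: -22",
]

for level, func, msg in find_o2iblnd_events(SAMPLE):
    print(level, func, msg)
```

Lines that do not come from `o2iblnd_cb.c`, such as the "Using FMR for registration" message, are skipped, which leaves only the connection rejections and accept failures.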
| Comment by SC Admin (Inactive) [ 16/Aug/18 ] |
|
actually, ib_send_bw and ibv_rc_pingpong don't seem to work on this new kernel, so I suspect RHEL has broken all IB RDMA? do they work for you? IPoIB works ok. cheers, |
| Comment by Stanford Research Computing Center [ 16/Aug/18 ] |
|
Good observation, indeed: perf tests such as ib_{read,send}_bw and ibv_rc_pingpong fail with errors like:
Failed to modify QP to RTR
Couldn't connect to remote QP
or
Failed to modify QP 386 to RTR
Unable to Connect the HCA's through the link |
| Comment by Stanford Research Computing Center [ 16/Aug/18 ] |
|
I submitted RHEL bug #1618452 to report the issue. It seems to be marked "private" by the Red Hat Bugzilla, with no way for me to mark it "public" on my end. |
| Comment by Peter Jones [ 16/Aug/18 ] |
|
I think that you have to ask them to open it up. |
| Comment by Stanford Research Computing Center [ 16/Aug/18 ] |
|
RHEL's reply:
|
| Comment by Peter Jones [ 16/Aug/18 ] |
|
Thanks for the info! |
| Comment by Stanford Research Computing Center [ 22/Aug/18 ] |
|
Still no update from Red Hat. We're getting more info via The Register:
But no ETA yet. |
| Comment by SC Admin (Inactive) [ 22/Aug/18 ] |
|
hmm. comments attached to that article point to a fix in centos - potentially just a misplaced semi-colon. but OTOH the centos bug seems to be talking about IPoIB and that works fine. perhaps the fix is right and the bug report is wrong? cheers, |
| Comment by Jeremy Filizetti [ 05/Sep/18 ] |
|
I wish I had looked here first when digging into the same thing, instead of wasting a day trying to figure out the culprit. For now I've opened another Red Hat bug, since I didn't come across anything when searching their Bugzilla: |
| Comment by Minh Diep [ 07/Sep/18 ] |
| Comment by SC Admin (Inactive) [ 13/Sep/18 ] |
|
yeah, we gave up waiting and just built our own ib_core.ko module with the 1-character patch from centos. cheers, |
| Comment by Stanford Research Computing Center [ 26/Sep/18 ] |
|
Kernel 3.10.0-862.14 has been released, which fixes the issue: |
| Comment by Bob Glossman [ 26/Sep/18 ] |
|
The fix has also been noted in the CentOS bug report: https://bugs.centos.org/view.php?id=15193. The update .rpm isn't available on CentOS mirrors yet, though.
|
| Comment by Bob Glossman [ 28/Sep/18 ] |
|
The kernel update to 3.10.0-862.14.4 is now available on CentOS mirrors.
|
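As a rough aid for checking hosts against the kernel versions named in this ticket, here is a minimal sketch that classifies a kernel release string. It assumes only what the comments above report: 3.10.0-862.11.6 is broken and 3.10.0-862.14 and later are fixed. Releases older than 862.11.6 are deliberately reported as not covered, since the ticket does not test them. The names `KNOWN_BAD`, `FIXED_FROM`, `release_tuple` and `classify` are hypothetical.

```python
import re

KNOWN_BAD = (3, 10, 0, 862, 11, 6)  # reported broken in this ticket
FIXED_FROM = (3, 10, 0, 862, 14)    # reported as fixing the issue


def release_tuple(release):
    """Turn '3.10.0-862.11.6.el7.x86_64' into (3, 10, 0, 862, 11, 6)."""
    match = re.match(r"\d+(?:[.-]\d+)*", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


def classify(release):
    """Classify a release using only the versions named in this ticket."""
    version = release_tuple(release)
    if version == KNOWN_BAD:
        return "known-bad"
    if version >= FIXED_FROM:
        return "fixed"
    return "not covered by this ticket"


print(classify("3.10.0-862.11.6.el7.x86_64"))
print(classify("3.10.0-862.14.4.el7.x86_64"))
```

On a live host the input string could come from `platform.release()` in the Python standard library, which returns the same value as `uname -r` on Linux.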
| Comment by Jian Yu [ 10/Oct/18 ] |
|
RHEL 7.5 kernel update to 3.10.0-862.14.4.el7 is tracked in |
| Comment by Peter Jones [ 23/Aug/19 ] |
|
It seems like this was fixed in the next RHEL/CentOS update |